Compare commits

52 commits

205b0731db
bf0ed943ec
f5c1033fc9
ca302bb8db
34d07a4fa1
e86863a8ae
e4fca0d67e
fe1c69f3ba
0416cc159a
52d3b8d9e9
3d8ab95f11
a8dc809787
099ef7d37a
f69eaed5a6
7be5dfb9b1
95b644d431
bed11e83f1
dafaf93d50
31f475dcdd
a76b52642d
0744caad6f
adc0d4ec4e
253c7c4f2b
db2dcce2ff
712d88cf0d
ffa6617fff
e207bb6435
c90a7e42ab
3294a44f76
174448a2b0
ae55c96506
5a2b2c2311
179bb65253
a7611c6e6f
80ee1387f7
c92d4e1c2c
c169b2ae30
213ef57abe
4dc41ee02c
47e8b485a5
93d5b503af
f7d015004e
6f7392cfaa
0472fe6e0c
8edfbc030c
d212e7a8a3
b04664f9d5
4751d96a1d
4011883ef2
e290f2c05f
b7ef2be02e
c1f0e8ac61
7 changed files with 624 additions and 90 deletions
.github/workflows/build-container.yaml (vendored): 10 changes

@@ -9,11 +9,12 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Set up QEMU
-        uses: docker/setup-qemu-action@v2
+        uses: docker/setup-qemu-action@v3
       - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v2
+        uses: docker/setup-buildx-action@v3
       - name: Login to GHCR
-        uses: docker/login-action@v2
+        uses: docker/login-action@v3
         if: github.event_name != 'pull_request'
         with:
           registry: ghcr.io
@@ -21,9 +22,10 @@ jobs:
           password: ${{ secrets.GITHUB_TOKEN }}
       - name: Build and push
         id: docker_build
-        uses: docker/build-push-action@v4
+        uses: docker/build-push-action@v5
         with:
           push: true
           platforms: linux/amd64,linux/arm64
           tags: |
             ghcr.io/${{ github.repository_owner }}/fedifetcher:${{ github.ref_name }}
             ghcr.io/${{ github.repository_owner }}/fedifetcher:latest
.github/workflows/get_context.yml (vendored): 12 changes

@@ -12,17 +12,17 @@ jobs:
     environment: mastodon
     steps:
       - name: Checkout original repository
-        uses: actions/checkout@v3
+        uses: actions/checkout@v4
         with:
           fetch-depth: 0
       - name: Set up Python
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@v5
         with:
           python-version: '3.10'
           cache: 'pip' # caching pip dependencies
       - run: pip install -r requirements.txt
       - name: Download all workflow run artifacts
-        uses: dawidd6/action-download-artifact@v2
+        uses: dawidd6/action-download-artifact@v3
         with:
           name: artifacts
           workflow: get_context.yml
@@ -32,12 +32,12 @@ jobs:
         run: ls -lR
       - run: python find_posts.py --lock-hours=0 --access-token=${{ secrets.ACCESS_TOKEN }} -c="./config.json"
       - name: Upload artifacts
-        uses: actions/upload-artifact@v3
+        uses: actions/upload-artifact@v4
         with:
           name: artifacts
           path: |
             artifacts
       - name: Checkout user's forked repository for keeping workflow alive
-        uses: actions/checkout@v3
+        uses: actions/checkout@v4
       - name: Keep workflow alive
         uses: gautamkrishnar/keepalive-workflow@v1
README.md: 17 changes

@@ -25,7 +25,7 @@ For detailed information on the how and why, please read the [FediFetcher for Ma

 FediFetcher makes use of the Mastodon API. It'll run against any instance implementing this API, and whilst it was built for Mastodon, it's been [confirmed working against Pleroma](https://fed.xnor.in/objects/6bd47928-704a-4cb8-82d6-87471d1b632f) as well.

-FediFetcher will pull in posts and profiles from any server that implements the Mastodon API, including Mastodon, Pleroma, Akkoma, Pixelfed, and probably others.
+FediFetcher will pull in posts and profiles from any servers running the following software: Mastodon, Pleroma, Akkoma, Pixelfed, Hometown, Misskey, Firefish (Calckey), Foundkey, and Lemmy.

 ## Setup

@@ -82,10 +82,6 @@ When using a cronjob, we are using file based locking to avoid multiple overlapp
 >
 > If you are running FediFetcher locally, my recommendation is to run it manually once, before turning on the cron job: The first run will be significantly slower than subsequent runs, and that will help you prevent overlapping during that first run.

 > **Note**
 >
 > If you wish to run FediFetcher using Windows Task Scheduler, you can rename the script to the `.pyw` extension instead of `.py`, and it will run silently, without opening a console window.

 #### To run FediFetcher from a container:

 FediFetcher is also available in a pre-packaged container, [FediFetcher](https://github.com/nanos/FediFetcher/pkgs/container/fedifetcher) - Thank you [@nikdoof](https://github.com/nikdoof).

@@ -101,10 +97,16 @@ Persistent files are stored in `/app/artifacts` within the container, so you may

 An [example Kubernetes CronJob](./examples/k8s-cronjob.yaml) for running the container is included in the `examples` folder.

+An [example Docker Compose Script](./examples/docker-compose.yaml) for running the container periodically is included in the `examples` folder.
+
 ### Configuration options

 FediFetcher has quite a few configuration options, so here is my quick configuration advice, that should probably work for most people:

+> **Warning**
+>
+> **Do NOT** include your `access-token` in the `config.json` when running FediFetcher as GitHub Action. When running FediFetcher as GitHub Action **ALWAYS** [set the Access Token as an Action Secret](#to-run-fedifetcher-as-a-github-action).

 ```json
 {
   "access-token": "Your access token",
@@ -117,10 +119,6 @@ FediFetcher has quite a few configura

 If you configure FediFetcher this way, it'll fetch missing remote replies to the last 200 posts in your home timeline. It'll additionally backfill profiles of the last 80 people you followed, and of every account who appeared in your notifications during the past hour.

-> **Warning**
->
-> **Do NOT** include your `access-token` in the `config.json` when running FediFetcher as GitHub Action. When running FediFetcher as GitHub Action **ALWAYS** [set the Access Token as an Action Secret](#to-run-fedifetcher-as-a-github-action).

 #### Advanced Options

 Please find the list of all configuration options, including descriptions, below:

@@ -140,6 +138,7 @@ Option | Required? | Notes |
 |`backfill-with-context` | No | Set to `0` to disable fetching remote replies while backfilling profiles. This is enabled by default, but you can disable it, if it's too slow for you.
 |`backfill-mentioned-users` | No | Set to `0` to disable backfilling any mentioned users when fetching the home timeline. This is enabled by default, but you can disable it, if it's too slow for you.
 | `remember-users-for-hours` | No | How long between back-filling attempts for non-followed accounts? Defaults to `168`, i.e. one week.
+| `remember-hosts-for-days` | No | How long should FediFetcher cache host info for? Defaults to `30`.
 | `http-timeout` | No | The timeout for any HTTP requests to the Mastodon API in seconds. Defaults to `5`.
 | `lock-hours` | No | Determines after how many hours a lock file should be discarded. Not relevant when running the script as GitHub Action, as concurrency is prevented using a different mechanism. Recommended value: `24`.
 | `lock-file` | No | Location for the lock file. If not specified, will use `lock.lock` under the state directory. Not relevant when running the script as GitHub Action.
examples/docker-compose.yaml (new file): 19 additions

@@ -0,0 +1,19 @@
+name: fedifetcher
+services:
+  fedifetcher:
+    stdin_open: true
+    tty: true
+    image: ghcr.io/nanos/fedifetcher:latest
+    command: "--access-token=<TOKEN> --server=<SERVER>"
+    # Persist our data
+    volumes:
+      - ./data:/app/artifacts
+    # Use the `deploy` option to enable `restart_policy`
+    deploy:
+      # Don't go above 1 replica to avoid multiple overlapping executions of the script
+      replicas: 1
+      restart_policy:
+        # The `any` condition means even after successful runs, we'll restart the script
+        condition: any
+        # Specify how often the script should run - for example; after 1 hour.
+        delay: 1h
examples/k8s-cronjob.yaml

@@ -14,7 +14,7 @@ spec:
 apiVersion: batch/v1
 kind: CronJob
 metadata:
-  name: FediFetcher
+  name: fedifetcher
 spec:
   # Run every 2 hours
   schedule: "0 */2 * * *"
@@ -30,7 +30,7 @@ spec:
              persistentVolumeClaim:
                claimName: fedifetcher-pvc
          containers:
-           - name: FediFetcher
+           - name: fedifetcher
              image: ghcr.io/nanos/fedifetcher:latest
              args:
                - --server=your.server.social
find_posts.py: 648 changes

@@ -12,6 +12,7 @@ import requests
 import time
 import argparse
 import uuid
+import defusedxml.ElementTree as ET

 argparser=argparse.ArgumentParser()

@@ -28,6 +29,7 @@ argparser.add_argument('--max-bookmarks', required = False, type=int, default=0,
 argparser.add_argument('--max-favourites', required = False, type=int, default=0, help="Fetch remote replies to the API key owners Favourites. We'll fetch replies to at most this many favourites")
 argparser.add_argument('--from-notifications', required = False, type=int, default=0, help="Backfill accounts of anyone appearing in your notifications, during the last hours")
 argparser.add_argument('--remember-users-for-hours', required=False, type=int, default=24*7, help="How long to remember users that you aren't following for, before trying to backfill them again.")
+argparser.add_argument('--remember-hosts-for-days', required=False, type=int, default=30, help="How long to remember host info for, before checking again.")
 argparser.add_argument('--http-timeout', required = False, type=int, default=5, help="The timeout for any HTTP requests to your own, or other instances.")
 argparser.add_argument('--backfill-with-context', required = False, type=int, default=1, help="If enabled, we'll fetch remote replies when backfilling profiles. Set to `0` to disable.")
 argparser.add_argument('--backfill-mentioned-users', required = False, type=int, default=1, help="If enabled, we'll backfill any mentioned users when fetching remote replies to timeline posts. Set to `0` to disable.")
@@ -65,17 +67,17 @@ def get_favourites(server, access_token, max):
         "Authorization": f"Bearer {access_token}",
     })

-def add_user_posts(server, access_token, followings, know_followings, all_known_users, seen_urls):
+def add_user_posts(server, access_token, followings, known_followings, all_known_users, seen_urls, seen_hosts):
     for user in followings:
         if user['acct'] not in all_known_users and not user['url'].startswith(f"https://{server}/"):
-            posts = get_user_posts(user, know_followings, server)
+            posts = get_user_posts(user, known_followings, server, seen_hosts)

             if(posts != None):
                 count = 0
                 failed = 0
                 for post in posts:
-                    if post['reblog'] == None and post['url'] != None and post['url'] not in seen_urls:
-                        added = add_post_with_context(post, server, access_token, seen_urls)
+                    if post.get('reblog') is None and post.get('renoteId') is None and post.get('url') is not None and post.get('url') not in seen_urls:
+                        added = add_post_with_context(post, server, access_token, seen_urls, seen_hosts)
                         if added is True:
                             seen_urls.add(post['url'])
                             count += 1
@@ -83,61 +85,159 @@ def add_user_posts(server, access_token, followings, know_followings, all_known_
                             failed += 1
                 log(f"Added {count} posts for user {user['acct']} with {failed} errors")
                 if failed == 0:
-                    know_followings.add(user['acct'])
+                    known_followings.add(user['acct'])
                 all_known_users.add(user['acct'])

-def add_post_with_context(post, server, access_token, seen_urls):
+def add_post_with_context(post, server, access_token, seen_urls, seen_hosts):
     added = add_context_url(post['url'], server, access_token)
     if added is True:
         seen_urls.add(post['url'])
-        if (post['replies_count'] or post['in_reply_to_id']) and arguments.backfill_with_context > 0:
+        if ('replies_count' in post or 'in_reply_to_id' in post) and getattr(arguments, 'backfill_with_context', 0) > 0:
             parsed_urls = {}
             parsed = parse_url(post['url'], parsed_urls)
             if parsed == None:
                 return True
-            known_context_urls = get_all_known_context_urls(server, [post],parsed_urls)
+            known_context_urls = get_all_known_context_urls(server, [post],parsed_urls, seen_hosts)
             add_context_urls(server, access_token, known_context_urls, seen_urls)
         return True

     return False

-def get_user_posts(user, know_followings, server):
+def get_user_posts(user, known_followings, server, seen_hosts):
     parsed_url = parse_user_url(user['url'])

     if parsed_url == None:
         # We are adding it as 'known' anyway, because we won't be able to fix this.
-        know_followings.add(user['acct'])
+        known_followings.add(user['acct'])
         return None

     if(parsed_url[0] == server):
         log(f"{user['acct']} is a local user. Skip")
-        know_followings.add(user['acct'])
+        known_followings.add(user['acct'])
         return None

+    post_server = get_server_info(parsed_url[0], seen_hosts)
+    if post_server is None:
+        log(f'server {parsed_url[0]} not found for post')
+        return None
+
+    if post_server['mastodonApiSupport']:
+        return get_user_posts_mastodon(parsed_url[1], post_server['webserver'])
+
+    if post_server['lemmyApiSupport']:
+        return get_user_posts_lemmy(parsed_url[1], user['url'], post_server['webserver'])
+
+    if post_server['misskeyApiSupport']:
+        return get_user_posts_misskey(parsed_url[1], post_server['webserver'])
+
+    log(f'server api unknown for {post_server["webserver"]}, cannot fetch user posts')
+    return None
+
+def get_user_posts_mastodon(userName, webserver):
     try:
-        user_id = get_user_id(parsed_url[0], parsed_url[1])
+        user_id = get_user_id(webserver, userName)
     except Exception as ex:
-        log(f"Error getting user ID for user {user['acct']}: {ex}")
+        log(f"Error getting user ID for user {userName}: {ex}")
         return None

     try:
-        url = f"https://{parsed_url[0]}/api/v1/accounts/{user_id}/statuses?limit=40"
+        url = f"https://{webserver}/api/v1/accounts/{user_id}/statuses?limit=40"
         response = get(url)

         if(response.status_code == 200):
             return response.json()
         elif response.status_code == 404:
             raise Exception(
-                f"User {user['acct']} was not found on server {parsed_url[0]}"
+                f"User {userName} was not found on server {webserver}"
             )
         else:
             raise Exception(
                 f"Error getting URL {url}. Status code: {response.status_code}"
             )
     except Exception as ex:
-        log(f"Error getting posts for user {user['acct']}: {ex}")
+        log(f"Error getting posts for user {userName}: {ex}")
         return None

+def get_user_posts_lemmy(userName, userUrl, webserver):
+    # community
+    if re.match(r"^https:\/\/[^\/]+\/c\/", userUrl):
+        try:
+            url = f"https://{webserver}/api/v3/post/list?community_name={userName}&sort=New&limit=50"
+            response = get(url)
+
+            if(response.status_code == 200):
+                posts = [post['post'] for post in response.json()['posts']]
+                for post in posts:
+                    post['url'] = post['ap_id']
+                return posts
+
+        except Exception as ex:
+            log(f"Error getting community posts for community {userName}: {ex}")
+            return None
+
+    # user
+    if re.match(r"^https:\/\/[^\/]+\/u\/", userUrl):
+        try:
+            url = f"https://{webserver}/api/v3/user?username={userName}&sort=New&limit=50"
+            response = get(url)
+
+            if(response.status_code == 200):
+                comments = [post['post'] for post in response.json()['comments']]
+                posts = [post['post'] for post in response.json()['posts']]
+                all_posts = comments + posts
+                for post in all_posts:
+                    post['url'] = post['ap_id']
+                return all_posts
+
+        except Exception as ex:
+            log(f"Error getting user posts for user {userName}: {ex}")
+            return None
+
+def get_user_posts_misskey(userName, webserver):
+    # query user info via search api
+    # we could filter by host but there's no way to limit that to just the main host on firefish currently
+    # on misskey it works if you supply '.' as the host but firefish does not
+    userId = None
+    try:
+        url = f'https://{webserver}/api/users/search-by-username-and-host'
+        resp = post(url, { 'username': userName })
+
+        if resp.status_code == 200:
+            res = resp.json()
+            for user in res:
+                if user['host'] is None:
+                    userId = user['id']
+                    break
+        else:
+            log(f"Error finding user {userName} from {webserver}. Status Code: {resp.status_code}")
+            return None
+    except Exception as ex:
+        log(f"Error finding user {userName} from {webserver}. Exception: {ex}")
+        return None
+
+    if userId is None:
+        log(f'Error finding user {userName} from {webserver}: user not found on server in search')
+        return None
+
+    try:
+        url = f'https://{webserver}/api/users/notes'
+        resp = post(url, { 'userId': userId, 'limit': 40 })
+
+        if resp.status_code == 200:
+            notes = resp.json()
+            for note in notes:
+                if note.get('url') is None:
+                    # add this to make it look like Mastodon status objects
+                    note.update({ 'url': f"https://{webserver}/notes/{note['id']}" })
+            return notes
+        else:
+            log(f"Error getting posts by user {userName} from {webserver}. Status Code: {resp.status_code}")
+            return None
+    except Exception as ex:
+        log(f"Error getting posts by user {userName} from {webserver}. Exception: {ex}")
+        return None
+
+
 def get_new_follow_requests(server, access_token, max, known_followings):
     """Get any new follow requests for the specified user, up to the max number provided"""
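For orientation, the dispatch above boils down to a handful of per-platform calls. The following cheat sheet is only an illustration: the hosts, usernames and IDs are hypothetical, while the paths, parameters and limits are taken from the helpers in this hunk.

```python
# Hypothetical hosts/IDs; paths and limits mirror get_user_posts_mastodon/_lemmy/_misskey above.
endpoints = [
    # Mastodon-compatible servers: resolve the account ID first, then page its statuses.
    ("GET", "https://mastodon.example/api/v1/accounts/{user_id}/statuses?limit=40", None),
    # Lemmy: users and communities use separate endpoints; each item carries its
    # ActivityPub URL in `ap_id`, which the helpers copy into `url`.
    ("GET", "https://lemmy.example/api/v3/user?username=alice&sort=New&limit=50", None),
    ("GET", "https://lemmy.example/api/v3/post/list?community_name=fediverse&sort=New&limit=50", None),
    # Misskey/Firefish: JSON POST bodies; the user ID comes from the search endpoint.
    ("POST", "https://misskey.example/api/users/search-by-username-and-host", {"username": "alice"}),
    ("POST", "https://misskey.example/api/users/notes", {"userId": "9abc123", "limit": 40}),
]

for method, url, body in endpoints:
    print(method, url, body or "")
```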
@@ -356,21 +456,24 @@ def get_reply_toots(user_id, server, access_token, seen_urls, reply_since):
     )


-def get_all_known_context_urls(server, reply_toots,parsed_urls):
+def get_all_known_context_urls(server, reply_toots, parsed_urls, seen_hosts):
     """get the context toots of the given toots from their original server"""
-    known_context_urls = set(
-        filter(
-            lambda url: not url.startswith(f"https://{server}/"),
-            itertools.chain.from_iterable(
-                get_toot_context(*parse_url(toot["url"] if toot["reblog"] is None else toot["reblog"]["url"],parsed_urls), toot["url"])
-                for toot in filter(
-                    lambda toot: toot_has_parseable_url(toot,parsed_urls),
-                    reply_toots
-                )
-            ),
-        )
-    )
+    known_context_urls = set()
+
+    for toot in reply_toots:
+        if toot_has_parseable_url(toot, parsed_urls):
+            url = toot["url"] if toot["reblog"] is None else toot["reblog"]["url"]
+            parsed_url = parse_url(url, parsed_urls)
+            context = get_toot_context(parsed_url[0], parsed_url[1], url, seen_hosts)
+            if context is not None:
+                for item in context:
+                    known_context_urls.add(item)
+            else:
+                log(f"Error getting context for toot {url}")
+
+    known_context_urls = set(filter(lambda url: not url.startswith(f"https://{server}/"), known_context_urls))
     log(f"Found {len(known_context_urls)} known context toots")

     return known_context_urls


@@ -435,6 +538,11 @@ def parse_user_url(url):
     if match is not None:
         return match

+    match = parse_lemmy_profile_url(url)
+    if match is not None:
+        return match
+
     # Pixelfed profile paths do not use a subdirectory, so we need to match for them last.
     match = parse_pixelfed_profile_url(url)
     if match is not None:
         return match
@@ -448,17 +556,32 @@ def parse_url(url, parsed_urls):
         match = parse_mastodon_url(url)
         if match is not None:
             parsed_urls[url] = match

+    if url not in parsed_urls:
+        match = parse_mastodon_uri(url)
+        if match is not None:
+            parsed_urls[url] = match
+
     if url not in parsed_urls:
         match = parse_pleroma_url(url)
         if match is not None:
             parsed_urls[url] = match

+    if url not in parsed_urls:
+        match = parse_lemmy_url(url)
+        if match is not None:
+            parsed_urls[url] = match
+
     if url not in parsed_urls:
         match = parse_pixelfed_url(url)
         if match is not None:
             parsed_urls[url] = match

+    if url not in parsed_urls:
+        match = parse_misskey_url(url)
+        if match is not None:
+            parsed_urls[url] = match
+
     if url not in parsed_urls:
         log(f"Error parsing toot URL {url}")
         parsed_urls[url] = None
@@ -468,7 +591,7 @@ def parse_url(url, parsed_urls):
 def parse_mastodon_profile_url(url):
     """parse a Mastodon Profile URL and return the server and username"""
     match = re.match(
-        r"https://(?P<server>.*)/@(?P<username>.*)", url
+        r"https://(?P<server>[^/]+)/@(?P<username>[^/]+)", url
     )
     if match is not None:
         return (match.group("server"), match.group("username"))
@@ -477,23 +600,31 @@ def parse_mastodon_profile_url(url):
 def parse_mastodon_url(url):
     """parse a Mastodon URL and return the server and ID"""
     match = re.match(
-        r"https://(?P<server>.*)/@(?P<username>.*)/(?P<toot_id>.*)", url
+        r"https://(?P<server>[^/]+)/@(?P<username>[^/]+)/(?P<toot_id>[^/]+)", url
     )
     if match is not None:
         return (match.group("server"), match.group("toot_id"))
     return None

+def parse_mastodon_uri(uri):
+    """parse a Mastodon URI and return the server and ID"""
+    match = re.match(
+        r"https://(?P<server>[^/]+)/users/(?P<username>[^/]+)/statuses/(?P<toot_id>[^/]+)", uri
+    )
+    if match is not None:
+        return (match.group("server"), match.group("toot_id"))
+    return None
+
 def parse_pleroma_url(url):
     """parse a Pleroma URL and return the server and ID"""
-    match = re.match(r"https://(?P<server>.*)/objects/(?P<toot_id>.*)", url)
+    match = re.match(r"https://(?P<server>[^/]+)/objects/(?P<toot_id>[^/]+)", url)
     if match is not None:
         server = match.group("server")
         url = get_redirect_url(url)
         if url is None:
             return None

-        match = re.match(r"/notice/(?P<toot_id>.*)", url)
+        match = re.match(r"/notice/(?P<toot_id>[^/]+)", url)
         if match is not None:
             return (server, match.group("toot_id"))
         return None
@@ -501,7 +632,7 @@ def parse_pleroma_url(url):

 def parse_pleroma_profile_url(url):
     """parse a Pleroma Profile URL and return the server and username"""
-    match = re.match(r"https://(?P<server>.*)/users/(?P<username>.*)", url)
+    match = re.match(r"https://(?P<server>[^/]+)/users/(?P<username>[^/]+)", url)
     if match is not None:
         return (match.group("server"), match.group("username"))
     return None
@@ -509,7 +640,16 @@ def parse_pleroma_profile_url(url):
 def parse_pixelfed_url(url):
     """parse a Pixelfed URL and return the server and ID"""
     match = re.match(
-        r"https://(?P<server>.*)/p/(?P<username>.*)/(?P<toot_id>.*)", url
+        r"https://(?P<server>[^/]+)/p/(?P<username>[^/]+)/(?P<toot_id>[^/]+)", url
     )
     if match is not None:
         return (match.group("server"), match.group("toot_id"))
     return None

+def parse_misskey_url(url):
+    """parse a Misskey URL and return the server and ID"""
+    match = re.match(
+        r"https://(?P<server>[^/]+)/notes/(?P<toot_id>[^/]+)", url
+    )
+    if match is not None:
+        return (match.group("server"), match.group("toot_id"))
@@ -517,11 +657,26 @@ def parse_pixelfed_url(url):

 def parse_pixelfed_profile_url(url):
     """parse a Pixelfed Profile URL and return the server and username"""
-    match = re.match(r"https://(?P<server>.*)/(?P<username>.*)", url)
+    match = re.match(r"https://(?P<server>[^/]+)/(?P<username>[^/]+)", url)
     if match is not None:
         return (match.group("server"), match.group("username"))
     return None

+def parse_lemmy_url(url):
+    """parse a Lemmy URL and return the server, and ID"""
+    match = re.match(
+        r"https://(?P<server>[^/]+)/(?:comment|post)/(?P<toot_id>[^/]+)", url
+    )
+    if match is not None:
+        return (match.group("server"), match.group("toot_id"))
+    return None
+
+def parse_lemmy_profile_url(url):
+    """parse a Lemmy Profile URL and return the server and username"""
+    match = re.match(r"https://(?P<server>[^/]+)/(?:u|c)/(?P<username>[^/]+)", url)
+    if match is not None:
+        return (match.group("server"), match.group("username"))
+    return None
+
 def get_redirect_url(url):
     """get the URL given URL redirects to"""
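As a quick sanity check of the new parsers, here is a small self-contained sketch using the regular expressions added above; the sample URLs are made up.

```python
import re

# Patterns copied from the parsers added in this hunk; sample URLs below are hypothetical.
patterns = {
    "Mastodon URI":       r"https://(?P<server>[^/]+)/users/(?P<username>[^/]+)/statuses/(?P<toot_id>[^/]+)",
    "Misskey note":       r"https://(?P<server>[^/]+)/notes/(?P<toot_id>[^/]+)",
    "Lemmy post/comment": r"https://(?P<server>[^/]+)/(?:comment|post)/(?P<toot_id>[^/]+)",
    "Lemmy profile":      r"https://(?P<server>[^/]+)/(?:u|c)/(?P<username>[^/]+)",
}

samples = {
    "Mastodon URI":       "https://mastodon.example/users/alice/statuses/110123456789012345",
    "Misskey note":       "https://misskey.example/notes/9f0a1b2c3d",
    "Lemmy post/comment": "https://lemmy.example/post/123456",
    "Lemmy profile":      "https://lemmy.example/c/fediverse",
}

for name, pattern in patterns.items():
    match = re.match(pattern, samples[name])
    # Each parser returns the server plus either the status ID or the user/community name.
    print(name, "->", match.groupdict() if match else None)
```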
@@ -546,20 +701,37 @@ def get_redirect_url(url):
     return None


-def get_all_context_urls(server, replied_toot_ids):
+def get_all_context_urls(server, replied_toot_ids, seen_hosts):
     """get the URLs of the context toots of the given toots"""
     return filter(
         lambda url: not url.startswith(f"https://{server}/"),
         itertools.chain.from_iterable(
-            get_toot_context(server, toot_id, url)
+            get_toot_context(server, toot_id, url, seen_hosts)
             for (url, (server, toot_id)) in replied_toot_ids
         ),
     )


-def get_toot_context(server, toot_id, toot_url):
+def get_toot_context(server, toot_id, toot_url, seen_hosts):
     """get the URLs of the context toots of the given toot"""
-    url = f"https://{server}/api/v1/statuses/{toot_id}/context"

+    post_server = get_server_info(server, seen_hosts)
+    if post_server is None:
+        log(f'server {server} not found for post')
+        return []
+
+    if post_server['mastodonApiSupport']:
+        return get_mastodon_urls(post_server['webserver'], toot_id, toot_url)
+    if post_server['lemmyApiSupport']:
+        return get_lemmy_urls(post_server['webserver'], toot_id, toot_url)
+    if post_server['misskeyApiSupport']:
+        return get_misskey_urls(post_server['webserver'], toot_id, toot_url)
+
+    log(f'unknown server api for {server}')
+    return []
+
+def get_mastodon_urls(webserver, toot_id, toot_url):
+    url = f"https://{webserver}/api/v1/statuses/{toot_id}/context"
     try:
         resp = get(url)
     except Exception as ex:
@@ -574,17 +746,119 @@ def get_toot_context(server, toot_id, toot_url):
     except Exception as ex:
         log(f"Error parsing context for toot {toot_url}. Exception: {ex}")
         return []
-    elif resp.status_code == 429:
-        reset = datetime.strptime(resp.headers['x-ratelimit-reset'], '%Y-%m-%dT%H:%M:%S.%fZ')
-        log(f"Rate Limit hit when getting context for {toot_url}. Waiting to retry at {resp.headers['x-ratelimit-reset']}")
-        time.sleep((reset - datetime.now()).total_seconds() + 1)
-        return get_toot_context(server, toot_id, toot_url)

     log(
         f"Error getting context for toot {toot_url}. Status code: {resp.status_code}"
     )
     return []

+def get_lemmy_urls(webserver, toot_id, toot_url):
+    if toot_url.find("/comment/") != -1:
+        return get_lemmy_comment_context(webserver, toot_id, toot_url)
+    if toot_url.find("/post/") != -1:
+        return get_lemmy_comments_urls(webserver, toot_id, toot_url)
+    else:
+        log(f'unknown lemmy url type {toot_url}')
+        return []
+
+def get_lemmy_comment_context(webserver, toot_id, toot_url):
+    """get the URLs of the context toots of the given toot"""
+    comment = f"https://{webserver}/api/v3/comment?id={toot_id}"
+    try:
+        resp = get(comment)
+    except Exception as ex:
+        log(f"Error getting comment {toot_id} from {toot_url}. Exception: {ex}")
+        return []
+
+    if resp.status_code == 200:
+        try:
+            res = resp.json()
+            post_id = res['comment_view']['comment']['post_id']
+            return get_lemmy_comments_urls(webserver, post_id, toot_url)
+        except Exception as ex:
+            log(f"Error parsing context for comment {toot_url}. Exception: {ex}")
+            return []
+
+def get_lemmy_comments_urls(webserver, post_id, toot_url):
+    """get the URLs of the comments of the given post"""
+    urls = []
+    url = f"https://{webserver}/api/v3/post?id={post_id}"
+    try:
+        resp = get(url)
+    except Exception as ex:
+        log(f"Error getting post {post_id} from {toot_url}. Exception: {ex}")
+        return []
+
+    if resp.status_code == 200:
+        try:
+            res = resp.json()
+            if res['post_view']['counts']['comments'] == 0:
+                return []
+            urls.append(res['post_view']['post']['ap_id'])
+        except Exception as ex:
+            log(f"Error parsing post {post_id} from {toot_url}. Exception: {ex}")
+
+    url = f"https://{webserver}/api/v3/comment/list?post_id={post_id}&sort=New&limit=50"
+    try:
+        resp = get(url)
+    except Exception as ex:
+        log(f"Error getting comments for post {post_id} from {toot_url}. Exception: {ex}")
+        return []
+
+    if resp.status_code == 200:
+        try:
+            res = resp.json()
+            list_of_urls = [comment_info['comment']['ap_id'] for comment_info in res['comments']]
+            log(f"Got {len(list_of_urls)} comments for post {toot_url}")
+            urls.extend(list_of_urls)
+            return urls
+        except Exception as ex:
+            log(f"Error parsing comments for post {toot_url}. Exception: {ex}")
+
+    log(f"Error getting comments for post {toot_url}. Status code: {resp.status_code}")
+    return []
+
+def get_misskey_urls(webserver, post_id, toot_url):
+    """get the URLs of the comments of a given misskey post"""
+
+    urls = []
+    url = f"https://{webserver}/api/notes/children"
+    try:
+        resp = post(url, { 'noteId': post_id, 'limit': 100, 'depth': 12 })
+    except Exception as ex:
+        log(f"Error getting post {post_id} from {toot_url}. Exception: {ex}")
+        return []
+
+    if resp.status_code == 200:
+        try:
+            res = resp.json()
+            log(f"Got children for misskey post {toot_url}")
+            list_of_urls = [f'https://{webserver}/notes/{comment_info["id"]}' for comment_info in res]
+            urls.extend(list_of_urls)
+        except Exception as ex:
+            log(f"Error parsing post {post_id} from {toot_url}. Exception: {ex}")
+    else:
+        log(f"Error getting post {post_id} from {toot_url}. Status Code: {resp.status_code}")
+
+    url = f"https://{webserver}/api/notes/conversation"
+    try:
+        resp = post(url, { 'noteId': post_id, 'limit': 100 })
+    except Exception as ex:
+        log(f"Error getting post {post_id} from {toot_url}. Exception: {ex}")
+        return []
+
+    if resp.status_code == 200:
+        try:
+            res = resp.json()
+            log(f"Got conversation for misskey post {toot_url}")
+            list_of_urls = [f'https://{webserver}/notes/{comment_info["id"]}' for comment_info in res]
+            urls.extend(list_of_urls)
+        except Exception as ex:
+            log(f"Error parsing post {post_id} from {toot_url}. Exception: {ex}")
+    else:
+        log(f"Error getting post {post_id} from {toot_url}. Status Code: {resp.status_code}")
+
+    return urls
+
 def add_context_urls(server, access_token, context_urls, seen_urls):
     """add the given toot URLs to the server"""
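Summarising the helpers above: each platform assembles its reply context from different calls. The mapping below is illustrative data only; the paths match the code, the structure itself is not part of the diff.

```python
# Illustrative summary of where get_toot_context() now gets reply context from.
context_sources = {
    # Mastodon API: one call returns both ancestors and descendants of a status.
    "mastodon-compatible": ["GET /api/v1/statuses/{toot_id}/context"],
    # Lemmy: a comment is resolved to its post first, then the post and all of its
    # comments are listed and their `ap_id` URLs collected.
    "lemmy": [
        "GET /api/v3/comment?id={toot_id}",
        "GET /api/v3/post?id={post_id}",
        "GET /api/v3/comment/list?post_id={post_id}&sort=New&limit=50",
    ],
    # Misskey/Firefish: replies (children) and ancestors (conversation) are fetched
    # separately and merged into one list of note URLs.
    "misskey": ["POST /api/notes/children", "POST /api/notes/conversation"],
}

for software, calls in context_sources.items():
    print(software, "->", calls)
```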
@@ -625,11 +899,6 @@ def add_context_url(url, server, access_token):
             "Make sure you have the read:search scope enabled for your access token."
         )
         return False
-    elif resp.status_code == 429:
-        reset = datetime.strptime(resp.headers['x-ratelimit-reset'], '%Y-%m-%dT%H:%M:%S.%fZ')
-        log(f"Rate Limit hit when adding url {search_url}. Waiting to retry at {resp.headers['x-ratelimit-reset']}")
-        time.sleep((reset - datetime.now()).total_seconds() + 1)
-        return add_context_url(url, server, access_token)
     else:
         log(
             f"Error adding url {search_url} to server {server}. Status code: {resp.status_code}"
@@ -666,12 +935,30 @@ def get_paginated_mastodon(url, max, headers = {}, timeout = 0, max_tries = 5):
     if(isinstance(max, int)):
         while len(result) < max and 'next' in response.links:
             response = get(response.links['next']['url'], headers, timeout, max_tries)
-            result = result + response.json()
+            if response.status_code != 200:
+                raise Exception(
+                    f"Error getting URL {response.url}. \
+                        Status code: {response.status_code}"
+                )
+            response_json = response.json()
+            if isinstance(response_json, list):
+                result += response_json
+            else:
+                break
     else:
-        while parser.parse(result[-1]['created_at']) >= max and 'next' in response.links:
+        while result and parser.parse(result[-1]['created_at']) >= max \
+                and 'next' in response.links:
             response = get(response.links['next']['url'], headers, timeout, max_tries)
-            result = result + response.json()

+            if response.status_code != 200:
+                raise Exception(
+                    f"Error getting URL {response.url}. \
+                        Status code: {response.status_code}"
+                )
+            response_json = response.json()
+            if isinstance(response_json, list):
+                result += response_json
+            else:
+                break
     return result
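The pagination change above adds status and payload checks to the existing Link-header walk. A minimal sketch of that walk, assuming a hypothetical server and a placeholder token, looks roughly like this:

```python
import requests

url = "https://mastodon.example/api/v1/timelines/home?limit=40"  # hypothetical server
headers = {"Authorization": "Bearer <token>"}                    # placeholder token
max_items = 200                                                  # illustrative cap

results = []
while url and len(results) < max_items:
    response = requests.get(url, headers=headers, timeout=5)
    if response.status_code != 200:
        raise Exception(f"Error getting URL {response.url}. Status code: {response.status_code}")
    payload = response.json()
    if not isinstance(payload, list):
        break  # error payloads are dicts, not lists, so stop paging
    results += payload
    # requests exposes the parsed RFC 5988 Link header as response.links
    url = response.links.get("next", {}).get("url")
```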
@@ -697,9 +984,61 @@ def get(url, headers = {}, timeout = 0, max_tries = 5):
         raise Exception(f"Maximum number of retries exceeded for rate limited request {url}")
     return response

+def post(url, json, headers = {}, timeout = 0, max_tries = 5):
+    """A simple wrapper to make a post request while providing our user agent, and respecting rate limits"""
+    h = headers.copy()
+    if 'User-Agent' not in h:
+        h['User-Agent'] = 'FediFetcher (https://go.thms.uk/mgr)'
+
+    if timeout == 0:
+        timeout = arguments.http_timeout
+
+    response = requests.post( url, json=json, headers= h, timeout=timeout)
+    if response.status_code == 429:
+        if max_tries > 0:
+            reset = parser.parse(response.headers['x-ratelimit-reset'])
+            now = datetime.now(datetime.now().astimezone().tzinfo)
+            wait = (reset - now).total_seconds() + 1
+            log(f"Rate Limit hit requesting {url}. Waiting {wait} sec to retry at {response.headers['x-ratelimit-reset']}")
+            time.sleep(wait)
+            return post(url, json, headers, timeout, max_tries - 1)
+
+        raise Exception(f"Maximum number of retries exceeded for rate limited request {url}")
+    return response
+
 def log(text):
     print(f"{datetime.now()} {datetime.now().astimezone().tzinfo}: {text}")

+class ServerList:
+    def __init__(self, iterable):
+        self._dict = {}
+        for item in iterable:
+            if('last_checked' in iterable[item]):
+                iterable[item]['last_checked'] = parser.parse(iterable[item]['last_checked'])
+            self.add(item, iterable[item])
+
+    def add(self, key, item):
+        self._dict[key] = item
+
+    def get(self, key):
+        return self._dict[key]
+
+    def pop(self,key):
+        return self._dict.pop(key)
+
+    def __contains__(self, item):
+        return item in self._dict
+
+    def __iter__(self):
+        return iter(self._dict)
+
+    def __len__(self):
+        return len(self._dict)
+
+    def toJSON(self):
+        return json.dumps(self._dict,default=str)
+
+
 class OrderedSet:
     """An ordered set implementation over a dict"""
@@ -742,8 +1081,161 @@ class OrderedSet:
         return len(self._dict)

     def toJSON(self):
-        return json.dump(self._dict, f, default=str)
+        return json.dumps(self._dict,default=str)
+
+def get_server_from_host_meta(server):
+    url = f'https://{server}/.well-known/host-meta'
+    try:
+        resp = get(url, timeout = 30)
+    except Exception as ex:
+        log(f"Error getting host meta for {server}. Exception: {ex}")
+        return None
+
+    if resp.status_code == 200:
+        try:
+            hostMeta = ET.fromstring(resp.text)
+            lrdd = hostMeta.find('.//{http://docs.oasis-open.org/ns/xri/xrd-1.0}Link[@rel="lrdd"]')
+            url = lrdd.get('template')
+            match = re.match(
+                r"https://(?P<server>[^/]+)/", url
+            )
+            if match is not None:
+                return match.group("server")
+            else:
+                raise Exception(f'server not found in lrdd for {server}')
+            return None
+        except Exception as ex:
+            log(f'Error parsing host meta for {server}. Exception: {ex}')
+            return None
+    else:
+        log(f'Error getting host meta for {server}. Status Code: {resp.status_code}')
+        return None
+
+def get_nodeinfo(server, seen_hosts, host_meta_fallback = False):
+    url = f'https://{server}/.well-known/nodeinfo'
+    try:
+        resp = get(url, timeout = 30)
+    except Exception as ex:
+        log(f"Error getting host node info for {server}. Exception: {ex}")
+        return None
+
+    # if well-known nodeinfo isn't found, try to check host-meta for a webfinger URL
+    # needed on servers where the display domain is different than the web domain
+    if resp.status_code != 200 and not host_meta_fallback:
+        # not found, try to check host-meta as a fallback
+        log(f'nodeinfo for {server} not found, checking host-meta')
+        new_server = get_server_from_host_meta(server)
+        if new_server is not None:
+            if new_server == server:
+                log(f'host-meta for {server} did not get a new server.')
+                return None
+            else:
+                return get_nodeinfo(new_server, seen_hosts, True)
+        else:
+            return None
+
+    if resp.status_code == 200:
+        nodeLoc = None
+        try:
+            nodeInfo = resp.json()
+            for link in nodeInfo['links']:
+                if link['rel'] in [
+                    'http://nodeinfo.diaspora.software/ns/schema/2.0',
+                    'http://nodeinfo.diaspora.software/ns/schema/2.1',
+                ]:
+                    nodeLoc = link['href']
+                    break
+        except Exception as ex:
+            log(f'error getting server {server} info from well-known node info. Exception: {ex}')
+            return None
+    else:
+        log(f'Error getting well-known host node info for {server}. Status Code: {resp.status_code}')
+        return None
+
+    if nodeLoc is None:
+        log(f'could not find link to node info in well-known nodeinfo of {server}')
+        return None
+
+    # regrab server from nodeLoc, again in the case of different display and web domains
+    match = re.match(
+        r"https://(?P<server>[^/]+)/", nodeLoc
+    )
+    if match is None:
+        log(f"Error getting web server name from {server}.")
+        return None
+
+    server = match.group('server')
+
+    # return early if the web domain has been seen previously (in cases with host-meta lookups)
+    if server in seen_hosts:
+        return seen_hosts.get(server)
+
+    try:
+        resp = get(nodeLoc, timeout = 30)
+    except Exception as ex:
+        log(f"Error getting host node info for {server}. Exception: {ex}")
+        return None
+
+    if resp.status_code == 200:
+        try:
+            nodeInfo = resp.json()
+            if 'activitypub' not in nodeInfo['protocols']:
+                log(f'server {server} does not support activitypub, skipping')
+                return None
+            return {
+                'webserver': server,
+                'software': nodeInfo['software']['name'],
+                'version': nodeInfo['software']['version'],
+                'rawnodeinfo': nodeInfo,
+            }
+        except Exception as ex:
+            log(f'error getting server {server} info from nodeinfo. Exception: {ex}')
+            return None
+    else:
+        log(f'Error getting host node info for {server}. Status Code: {resp.status_code}')
+        return None
+
+def get_server_info(server, seen_hosts):
+    if server in seen_hosts:
+        serverInfo = seen_hosts.get(server)
+        if('info' in serverInfo and serverInfo['info'] == None):
+            return None
+        return serverInfo
+
+    nodeinfo = get_nodeinfo(server, seen_hosts)
+    if nodeinfo is None:
+        seen_hosts.add(server, {
+            'info': None,
+            'last_checked': datetime.now()
+        })
+    else:
+        set_server_apis(nodeinfo)
+        seen_hosts.add(server, nodeinfo)
+        if server is not nodeinfo['webserver']:
+            seen_hosts.add(nodeinfo['webserver'], nodeinfo)
+    return nodeinfo
+
+def set_server_apis(server):
+    # support for new server software should be added here
+    software_apis = {
+        'mastodonApiSupport': ['mastodon', 'pleroma', 'akkoma', 'pixelfed', 'hometown', 'iceshrimp'],
+        'misskeyApiSupport': ['misskey', 'calckey', 'firefish', 'foundkey', 'sharkey'],
+        'lemmyApiSupport': ['lemmy']
+    }
+
+    # software that has specific API support but is not compatible with FediFetcher for various reasons:
+    # * gotosocial - All Mastodon APIs require access token (https://github.com/superseriousbusiness/gotosocial/issues/2038)
+
+    for api, softwareList in software_apis.items():
+        server[api] = server['software'] in softwareList
+
+    # search `features` list in metadata if available
+    if 'metadata' in server['rawnodeinfo'] and 'features' in server['rawnodeinfo']['metadata'] and type(server['rawnodeinfo']['metadata']['features']) is list:
+        features = server['rawnodeinfo']['metadata']['features']
+        if 'mastodon_api' in features:
+            server['mastodonApiSupport'] = True
+
+    server['last_checked'] = datetime.now()

 if __name__ == "__main__":
     start = datetime.now()
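The host detection added here follows the standard nodeinfo discovery flow. A minimal, self-contained sketch of that flow, with a hypothetical host and without the error handling, host-meta fallback and caching of the real code:

```python
import requests

host = "social.example"  # hypothetical host

# Step 1: the well-known index lists links to the concrete nodeinfo documents.
index = requests.get(f"https://{host}/.well-known/nodeinfo", timeout=30).json()
node_loc = next(
    link["href"]
    for link in index["links"]
    if link["rel"] in (
        "http://nodeinfo.diaspora.software/ns/schema/2.0",
        "http://nodeinfo.diaspora.software/ns/schema/2.1",
    )
)

# Step 2: the linked document names the software, which decides which API family to use.
nodeinfo = requests.get(node_loc, timeout=30).json()
print(nodeinfo["software"]["name"], nodeinfo["software"]["version"])
```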
@@ -821,6 +1313,7 @@ if __name__ == "__main__":
     REPLIED_TOOT_SERVER_IDS_FILE = os.path.join(arguments.state_dir, "replied_toot_server_ids")
     KNOWN_FOLLOWINGS_FILE = os.path.join(arguments.state_dir, "known_followings")
     RECENTLY_CHECKED_USERS_FILE = os.path.join(arguments.state_dir, "recently_checked_users")
+    SEEN_HOSTS_FILE = os.path.join(arguments.state_dir, "seen_hosts")


     seen_urls = OrderedSet([])
@@ -854,6 +1347,22 @@ if __name__ == "__main__":

     all_known_users = OrderedSet(list(known_followings) + list(recently_checked_users))

+    if os.path.exists(SEEN_HOSTS_FILE):
+        with open(SEEN_HOSTS_FILE, "r", encoding="utf-8") as f:
+            seen_hosts = ServerList(json.load(f))
+
+        for host in list(seen_hosts):
+            serverInfo = seen_hosts.get(host)
+            if 'last_checked' in serverInfo:
+                serverAge = datetime.now(serverInfo['last_checked'].tzinfo) - serverInfo['last_checked']
+                if(serverAge.total_seconds() > arguments.remember_hosts_for_days * 24 * 60 * 60 ):
+                    seen_hosts.pop(host)
+                elif('info' in serverInfo and serverInfo['info'] == None and serverAge.total_seconds() > 60 * 60 ):
+                    # Don't cache failures for more than 24 hours
+                    seen_hosts.pop(host)
+    else:
+        seen_hosts = ServerList({})
+
     if(isinstance(arguments.access_token, str)):
         setattr(arguments, 'access_token', [arguments.access_token])
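To make the new persistence concrete, this is a hypothetical example of the kind of entry that ends up in the seen_hosts state file; the keys mirror those set by get_nodeinfo() and set_server_apis() above, while the values are invented.

```python
from datetime import datetime

# Hypothetical entry, keyed by hostname inside the ServerList that is saved to SEEN_HOSTS_FILE.
example_seen_host = {
    "webserver": "social.example",   # invented host
    "software": "mastodon",
    "version": "4.2.1",
    "rawnodeinfo": {
        "software": {"name": "mastodon", "version": "4.2.1"},
        "protocols": ["activitypub"],
    },
    # Flags derived by set_server_apis() from the software name (and metadata.features, if present):
    "mastodonApiSupport": True,
    "misskeyApiSupport": False,
    "lemmyApiSupport": False,
    "last_checked": datetime.now(),
}

# Hosts whose lookup failed are stored as {'info': None, 'last_checked': ...} so they are
# retried only after the failure entry expires.
print(example_seen_host["software"], example_seen_host["mastodonApiSupport"])
```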
@@ -866,19 +1375,19 @@ if __name__ == "__main__":
         reply_toots = get_all_reply_toots(
             arguments.server, user_ids, token, seen_urls, arguments.reply_interval_in_hours
         )
-        known_context_urls = get_all_known_context_urls(arguments.server, reply_toots,parsed_urls)
+        known_context_urls = get_all_known_context_urls(arguments.server, reply_toots,parsed_urls, seen_hosts)
         seen_urls.update(known_context_urls)
         replied_toot_ids = get_all_replied_toot_server_ids(
             arguments.server, reply_toots, replied_toot_server_ids, parsed_urls
         )
-        context_urls = get_all_context_urls(arguments.server, replied_toot_ids)
+        context_urls = get_all_context_urls(arguments.server, replied_toot_ids, seen_hosts)
         add_context_urls(arguments.server, token, context_urls, seen_urls)


     if arguments.home_timeline_length > 0:
         """Do the same with any toots on the key owner's home timeline """
         timeline_toots = get_timeline(arguments.server, token, arguments.home_timeline_length)
-        known_context_urls = get_all_known_context_urls(arguments.server, timeline_toots,parsed_urls)
+        known_context_urls = get_all_known_context_urls(arguments.server, timeline_toots,parsed_urls, seen_hosts)
         add_context_urls(arguments.server, token, known_context_urls, seen_urls)

         # Backfill any post authors, and any mentioned users

@@ -900,40 +1409,40 @@ if __name__ == "__main__":
             if user not in mentioned_users and user['acct'] not in all_known_users:
                 mentioned_users.append(user)

-        add_user_posts(arguments.server, token, filter_known_users(mentioned_users, all_known_users), recently_checked_users, all_known_users, seen_urls)
+        add_user_posts(arguments.server, token, filter_known_users(mentioned_users, all_known_users), recently_checked_users, all_known_users, seen_urls, seen_hosts)

     if arguments.max_followings > 0:
         log(f"Getting posts from last {arguments.max_followings} followings")
         user_id = get_user_id(arguments.server, arguments.user, token)
         followings = get_new_followings(arguments.server, user_id, arguments.max_followings, all_known_users)
-        add_user_posts(arguments.server, token, followings, known_followings, all_known_users, seen_urls)
+        add_user_posts(arguments.server, token, followings, known_followings, all_known_users, seen_urls, seen_hosts)

     if arguments.max_followers > 0:
         log(f"Getting posts from last {arguments.max_followers} followers")
         user_id = get_user_id(arguments.server, arguments.user, token)
         followers = get_new_followers(arguments.server, user_id, arguments.max_followers, all_known_users)
-        add_user_posts(arguments.server, token, followers, recently_checked_users, all_known_users, seen_urls)
+        add_user_posts(arguments.server, token, followers, recently_checked_users, all_known_users, seen_urls, seen_hosts)

     if arguments.max_follow_requests > 0:
         log(f"Getting posts from last {arguments.max_follow_requests} follow requests")
         follow_requests = get_new_follow_requests(arguments.server, token, arguments.max_follow_requests, all_known_users)
-        add_user_posts(arguments.server, token, follow_requests, recently_checked_users, all_known_users, seen_urls)
+        add_user_posts(arguments.server, token, follow_requests, recently_checked_users, all_known_users, seen_urls, seen_hosts)

     if arguments.from_notifications > 0:
         log(f"Getting notifications for last {arguments.from_notifications} hours")
         notification_users = get_notification_users(arguments.server, token, all_known_users, arguments.from_notifications)
-        add_user_posts(arguments.server, token, notification_users, recently_checked_users, all_known_users, seen_urls)
+        add_user_posts(arguments.server, token, notification_users, recently_checked_users, all_known_users, seen_urls, seen_hosts)

     if arguments.max_bookmarks > 0:
         log(f"Pulling replies to the last {arguments.max_bookmarks} bookmarks")
         bookmarks = get_bookmarks(arguments.server, token, arguments.max_bookmarks)
-        known_context_urls = get_all_known_context_urls(arguments.server, bookmarks,parsed_urls)
+        known_context_urls = get_all_known_context_urls(arguments.server, bookmarks,parsed_urls, seen_hosts)
         add_context_urls(arguments.server, token, known_context_urls, seen_urls)

     if arguments.max_favourites > 0:
         log(f"Pulling replies to the last {arguments.max_favourites} favourites")
         favourites = get_favourites(arguments.server, token, arguments.max_favourites)
-        known_context_urls = get_all_known_context_urls(arguments.server, favourites,parsed_urls)
+        known_context_urls = get_all_known_context_urls(arguments.server, favourites,parsed_urls, seen_hosts)
         add_context_urls(arguments.server, token, known_context_urls, seen_urls)

     with open(KNOWN_FOLLOWINGS_FILE, "w", encoding="utf-8") as f:

@@ -946,7 +1455,10 @@ if __name__ == "__main__":
         json.dump(dict(list(replied_toot_server_ids.items())[-10000:]), f)

     with open(RECENTLY_CHECKED_USERS_FILE, "w", encoding="utf-8") as f:
-        recently_checked_users.toJSON()
+        f.write(recently_checked_users.toJSON())
+
+    with open(SEEN_HOSTS_FILE, "w", encoding="utf-8") as f:
+        f.write(seen_hosts.toJSON())

     os.remove(LOCK_FILE)
requirements.txt

@@ -2,7 +2,9 @@ certifi==2022.12.7
 charset-normalizer==3.0.1
 docutils==0.19
 idna==3.4
+python-dateutil==2.8.2
 requests==2.28.2
 six==1.16.0
 smmap==5.0.0
 urllib3==1.26.14
-python-dateutil==2.8.2
+defusedxml==0.7.1