Allow us to backfill posts from users that are mentioned in notifications

This commit is contained in:
nanos 2023-03-28 07:04:03 +01:00
parent c5208568b5
commit 9b7093e478
4 changed files with 155 additions and 55 deletions

View file

@@ -33,7 +33,7 @@ jobs:
path: artifacts
- name: Get Directory structure
run: ls -lR
- run: python find_posts.py --lock-hours=0 --access-token=${{ secrets.ACCESS_TOKEN }} --server=${{ vars.MASTODON_SERVER }} --reply-interval-in-hours=${{ vars.REPLY_INTERVAL_IN_HOURS || 0 }} --home-timeline-length=${{ vars.HOME_TIMELINE_LENGTH || 0 }} --max-followings=${{ vars.MAX_FOLLOWINGS || 0 }} --user=${{ vars.USER }} --max-followers=${{ vars.MAX_FOLLOWERS || 0 }} --http-timeout=${{ vars.HTTP_TIMEOUT || 5 }} --max-follow-requests=${{ vars.MAX_FOLLOW_REQUESTS || 0 }} --on-fail=${{ vars.ON_FAIL }} --on-start=${{ vars.ON_START }} --on-done=${{ vars.ON_DONE }} --max-bookmarks=${{ vars.MAX_BOOKMARKS || 0 }}
- run: python find_posts.py --lock-hours=0 --access-token=${{ secrets.ACCESS_TOKEN }} --server=${{ vars.MASTODON_SERVER }} --reply-interval-in-hours=${{ vars.REPLY_INTERVAL_IN_HOURS || 0 }} --home-timeline-length=${{ vars.HOME_TIMELINE_LENGTH || 0 }} --max-followings=${{ vars.MAX_FOLLOWINGS || 0 }} --user=${{ vars.USER }} --max-followers=${{ vars.MAX_FOLLOWERS || 0 }} --http-timeout=${{ vars.HTTP_TIMEOUT || 5 }} --max-follow-requests=${{ vars.MAX_FOLLOW_REQUESTS || 0 }} --on-fail=${{ vars.ON_FAIL }} --on-start=${{ vars.ON_START }} --on-done=${{ vars.ON_DONE }} --max-bookmarks=${{ vars.MAX_BOOKMARKS || 0 }} --remember-users-for-hours=${{ vars.REMEMBER_USERS_FOR_HOURS || 168 }} --from-notifications=${{ vars.FROM_NOTIFICATIONS || 0 }}
- name: Upload artifacts
uses: actions/upload-artifact@v3
with:

View file

@@ -7,6 +7,7 @@ This GitHub repository provides a simple script that can pull missing posts into
2. fetch missing replies to the most recent posts in your home timeline,
3. fetch missing replies to your bookmarks.
2. It can also backfill profiles on your instance. In particular it can
1. fetch missing recent posts from users that have recently appeared in your notifications,
1. fetch missing recent posts from users that you have recently followed,
2. fetch missing recent posts from users that have recently followed you,
3. fetch missing recent posts from users that have recently sent you a follow request.
@@ -26,7 +27,7 @@ Regardless of how you want to run FediFetcher, you must first get an access toke
1. In Mastodon go to Preferences > Development > New Application
1. give it a nice name
2. Enable the required scopes for your options. See below for details, but if you want to use all parts of this script, you'll need these scopes: `read:search`, `read:statuses`, `read:follows`, `read:bookmarks`, and `admin:read:accounts`
2. Enable the required scopes for your options. You could tick `read` and `admin:read:accounts`, or see below for a list of which scopes are required for which options.
3. Save
4. Copy the value of `Your access token`
@@ -54,7 +55,7 @@ If you want to, you can of course also run FediFetcher locally as a cron job:
1. To get started, clone this repository.
2. Install requirements: `pip install -r requirements.txt`
3. Then simply run this script like so: `python find_posts.py --access-token=<TOKEN> --server=<SERVER>` etc. (Read below, or run `python find_posts.py -h` to get a list of all options)
3. Then simply run this script like so: `python find_posts.py --access-token=<TOKEN> --server=<SERVER>` etc. An example script can be found in the [`examples`](https://github.com/nanos/FediFetcher/tree/main/examples) folder. (Read below, or run `python find_posts.py -h` to get a list of all options.)
When using a cronjob, we use file-based locking to avoid multiple overlapping executions of the script. The timeout period for the lock can be configured using `--lock-hours`.
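The locking scheme can be sketched roughly like this (a minimal illustration, not the script's actual code; the file path and helper name are made up here):

```python
import os
from datetime import datetime, timedelta

LOCK_FILE = 'artifacts/lock.lock'  # illustrative path, not necessarily the script's

def acquire_lock(lock_hours):
    """Return True if we may run; False if a recent enough lock file exists."""
    if os.path.exists(LOCK_FILE):
        with open(LOCK_FILE, 'r', encoding='utf-8') as f:
            locked_at = datetime.fromisoformat(f.read().strip())
        if datetime.now() - locked_at < timedelta(hours=lock_hours):
            return False  # a previous run is presumably still in progress
        # otherwise the lock is older than --lock-hours: treat it as stale
    with open(LOCK_FILE, 'w', encoding='utf-8') as f:
        f.write(datetime.now().isoformat())
    return True
```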
@@ -75,21 +76,36 @@ An example Kubernetes CronJob for running the container is included in the [`exa
### Configuration options
Please see below for a list of configuration options. Use the 'Environment Variable Name' if you are running FediFetcher as a GitHub Action, otherwise use the 'Command line flag'.
FediFetcher has quite a few configuration options, so here is my quick configuration advice that should probably work for most people (use the *Environment Variable Name* if you are running FediFetcher as a GitHub Action, otherwise use the *Command line flag*):
| Environment Variable Name | Command line flag | Recommended Value |
|:-------------------------|:-------------------|:-----------|
| -- | `--access-token` | (Your access token) |
| `MASTODON_SERVER`|`--server` | (your Mastodon server name) |
| `HOME_TIMELINE_LENGTH` | `--home-timeline-length` | `200` |
| `MAX_FOLLOWINGS` | `--max-followings` | `80` |
| `FROM_NOTIFICATIONS` | `--from-notifications` | `1` |
If you configure FediFetcher this way, it'll fetch missing remote replies to the last 200 posts in your home timeline. It'll additionally backfill profiles of the last 80 people you followed, and of every account who appeared in your notifications during the past hour.
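With those values, a local invocation would look something like `python find_posts.py --access-token=<TOKEN> --server=<SERVER> --user=<USER> --home-timeline-length=200 --max-followings=80 --from-notifications=1` (`--user` is needed alongside `--max-followings`, as the table below explains).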
#### Advanced Options
Please find the list of all configuration options, including descriptions, below:
| Environment Variable Name | Command line flag | Required? | Notes |
|:---------------------------------------------------|:----------------------------------------------------|-----------|:------|
| -- | `--access-token` | Yes | The access token. If using the GitHub Action, this needs to be provided as a Secret called `ACCESS_TOKEN` |
|`MASTODON_SERVER`|`--server`|Yes|The domain only of your Mastodon server (without the `https://` prefix), e.g. `mstdn.thms.uk`. |
| `HOME_TIMELINE_LENGTH` | `--home-timeline-length` | No | Provide to fetch remote replies to posts in the API-Key owner's home timeline. Determines how many posts we'll fetch replies for. (An integer number, e.g. `200`)
| `REPLY_INTERVAL_IN_HOURS` | `--reply-interval-in-hours` | No | Provide to fetch remote replies to posts that have received replies from users on your own instance. Determines how far back in time we'll go to find posts that have received replies. (An integer number, e.g. `24`.) Requires an access token with `admin:read:accounts`
| `USER` | `--user` | See Notes | Required together with `MAX_FOLLOWERS` or `MAX_FOLLOWINGS`: The username of the user whose followers or followings you want to backfill (e.g. `michael` for the user `@michael@thms.uk`).
| `MAX_FOLLOWINGS` | `--max-followings` | No | Provide to backfill profiles for your most recent followings. Determines how many of your last followings you want to backfill. (An integer number, e.g. `80`. Ensure you also provide `USER`).
| `MAX_FOLLOWERS` | `--max-followers` | No | Provide to backfill profiles for your most recent followers. Determines how many of your last followers you want to backfill. (An integer number, e.g. `80`. Ensure you also provide `USER`).
| `MAX_FOLLOW_REQUESTS` | `--max-follow-requests` | No | Provide to backfill profiles for the API key owner's most recent pending follow requests. Determines how many of your last follow requests you want to backfill. (An integer number, e.g. `80`.). Requires an access token with `read:follows` scope.
| `MAX_BOOKMARKS` | `--max-bookmarks` | No | Provide to fetch remote replies to any posts you have bookmarked. Determines how many of your bookmarks you want to get replies to. (An integer number, e.g. `80`.). Requires an access token with `read:bookmarks` scope.
| `HOME_TIMELINE_LENGTH` | `--home-timeline-length` | No | Provide to fetch remote replies to posts in the API key owner's home timeline. Determines how many posts we'll fetch replies for. Recommended value: `200`.
| `REPLY_INTERVAL_IN_HOURS` | `--reply-interval-in-hours` | No | Provide to fetch remote replies to posts that have received replies from users on your own instance. Determines how far back in time we'll go to find posts that have received replies. Recommended value: `0` (disabled). Requires an access token with `admin:read:accounts`.
| `MAX_BOOKMARKS` | `--max-bookmarks` | No | Provide to fetch remote replies to any posts you have bookmarked. Determines how many of your bookmarks you want to get replies to. Recommended value: `80`. Requires an access token with `read:bookmarks` scope.
| `MAX_FOLLOWINGS` | `--max-followings` | No | Provide to backfill profiles for your most recent followings. Determines how many of your last followings you want to backfill. Recommended value: `80`.
| `MAX_FOLLOWERS` | `--max-followers` | No | Provide to backfill profiles for your most recent followers. Determines how many of your last followers you want to backfill. Recommended value: `80`.
| `MAX_FOLLOW_REQUESTS` | `--max-follow-requests` | No | Provide to backfill profiles for the API key owner's most recent pending follow requests. Determines how many of your last follow requests you want to backfill. Recommended value: `80`.
| `FROM_NOTIFICATIONS` | `--from-notifications` | No | Provide to backfill profiles of anyone mentioned in your recent notifications. Determines how many hours of notifications you want to look at. Requires an access token with `read:notifications` scope. Recommended value: `1`, unless you run FediFetcher less than once per hour.
| `REMEMBER_USERS_FOR_HOURS` | `--remember-users-for-hours` | No | How many hours to wait between backfilling attempts for accounts you don't follow. Defaults to `168`, i.e. one week.
| `HTTP_TIMEOUT` | `--http-timeout` | No | The timeout for any HTTP requests to the Mastodon API in seconds. Defaults to `5`.
| -- | `--lock-hours` | No | Determines after how many hours a lock file should be discarded. Not relevant when running the script as GitHub Action, as concurrency is prevented using a different mechanism.
| -- | `--lock-hours` | No | Determines after how many hours a lock file should be discarded. Not relevant when running the script as GitHub Action, as concurrency is prevented using a different mechanism. Recommended value: `24`.
| `ON_START` | `--on-start` | No | Optionally provide a callback URL that will be pinged when processing is starting. A query parameter `rid={uuid}` will automatically be appended to uniquely identify each execution. This can be used to monitor your script using a service such as healthchecks.io.
| `ON_DONE` | `--on-done` | No | Optionally provide a callback URL that will be called when processing is finished. A query parameter `rid={uuid}` will automatically be appended to uniquely identify each execution. This can be used to monitor your script using a service such as healthchecks.io.
| `ON_FAIL` | `--on-fail` | No | Optionally provide a callback URL that will be called when processing has failed. A query parameter `rid={uuid}` will automatically be appended to uniquely identify each execution. This can be used to monitor your script using a service such as healthchecks.io.
@@ -99,12 +115,15 @@ Please see below for a list of configuration options. Use the 'Environment Varia
- For all actions, your access token must include these scopes:
- `read:search`
- `read:statuses`
- `read:accounts`
- If you are supplying `REPLY_INTERVAL_IN_HOURS` / `--reply-interval-in-hours` you must additionally enable this scope:
- `admin:read:accounts`
- If you are supplying `MAX_FOLLOW_REQUESTS` / `--max-follow-requests` you must additionally enable this scope:
- `read:follows`
- If you are supplying `MAX_BOOKMARKS` / `--max-bookmarks` you must additionally enable this scope:
- `read:bookmarks`
- If you are supplying `FROM_NOTIFICATIONS` / `--from-notifications` you must additionally enable this scope:
- `read:notifications`
## Acknowledgments

View file

@@ -37,12 +37,10 @@ spec:
- --access-token=TOKEN
- --home-timeline-length
- "200"
- --reply-interval-in-hours
- "24"
- --max-followings
- "80"
- --max-followers
- "80"
- --from-notifications
- "4"
volumeMounts:
- name: artifacts
mountPath: /app/artifacts

View file

@@ -23,6 +23,8 @@ argparser.add_argument('--max-followings', required = False, type=int, default=0
argparser.add_argument('--max-followers', required = False, type=int, default=0, help="Backfill posts for new accounts following --user. We'll backfill at most this many followers' posts")
argparser.add_argument('--max-follow-requests', required = False, type=int, default=0, help="Backfill posts of the API key owner's pending follow requests. We'll backfill at most this many requesters' posts")
argparser.add_argument('--max-bookmarks', required = False, type=int, default=0, help="Fetch remote replies to the API key owner's bookmarks. We'll fetch replies to at most this many bookmarks")
argparser.add_argument('--from-notifications', required = False, type=int, default=0, help="Backfill accounts of anyone appearing in your notifications within the last given number of hours")
argparser.add_argument('--remember-users-for-hours', required=False, type=int, default=24*7, help="How long to remember users that you aren't following for, before trying to backfill them again.")
argparser.add_argument('--http-timeout', required = False, type=int, default=5, help="The timeout for any HTTP requests to your own, or other instances.")
argparser.add_argument('--lock-hours', required = False, type=int, default=24, help="The lock timeout in hours.")
argparser.add_argument('--on-done', required = False, default=None, help="Provide a url that will be pinged when processing has completed. You can use this for 'dead man switch' monitoring of your task")
@@ -41,11 +43,15 @@ def pull_context(
known_followings,
max_followers,
max_follow_requests,
max_bookmarks
max_bookmarks,
recently_checked_users,
from_notifications
):
parsed_urls = {}
all_known_users = OrderedSet(list(known_followings) + list(recently_checked_users))
if reply_interval_hours > 0:
"""pull the context toots of toots user replied to, from their
original server, and add them to the local server."""
@@ -68,22 +74,27 @@ def pull_context(
known_context_urls = get_all_known_context_urls(server, timeline_toots,parsed_urls)
add_context_urls(server, access_token, known_context_urls, seen_urls)
if max_followings > 0 and backfill_followings_for_user != '':
log(f"Getting posts from {backfill_followings_for_user}'s last {max_followings} followings")
user_id = get_user_id(server, backfill_followings_for_user)
followings = get_new_followings(server, user_id, max_followings, known_followings)
add_following_posts(server, access_token, followings, known_followings, seen_urls)
if max_followings > 0:
log(f"Getting posts from last {max_followings} followings")
user_id = get_user_id(server, backfill_followings_for_user, access_token)
followings = get_new_followings(server, user_id, max_followings, all_known_users)
add_user_posts(server, access_token, followings, known_followings, all_known_users, seen_urls)
if max_followers > 0 and backfill_followings_for_user != '':
log(f"Getting posts from {backfill_followings_for_user}'s last {max_followers} followers")
user_id = get_user_id(server, backfill_followings_for_user)
followers = get_new_followers(server, user_id, max_followers, known_followings)
add_following_posts(server, access_token, followers, known_followings, seen_urls)
if max_followers > 0:
log(f"Getting posts from last {max_followers} followers")
user_id = get_user_id(server, backfill_followings_for_user, access_token)
followers = get_new_followers(server, user_id, max_followers, all_known_users)
add_user_posts(server, access_token, followers, recently_checked_users, all_known_users, seen_urls)
if max_follow_requests > 0:
log(f"Getting posts from last {max_follow_requests} follow requests")
follow_requests = get_new_follow_requests(server, access_token, max_follow_requests, known_followings)
add_following_posts(server, access_token, follow_requests, known_followings, seen_urls)
follow_requests = get_new_follow_requests(server, access_token, max_follow_requests, all_known_users)
add_user_posts(server, access_token, follow_requests, recently_checked_users, all_known_users, seen_urls)
if from_notifications > 0:
log(f"Getting notifications for last {from_notifications} hours")
notification_users = get_notification_users(server, access_token, all_known_users, from_notifications)
add_user_posts(server, access_token, notification_users, recently_checked_users, all_known_users, seen_urls)
if max_bookmarks > 0:
log(f"Pulling replies to the last {max_bookmarks} bookmarks")
@@ -91,12 +102,29 @@ def pull_context(
known_context_urls = get_all_known_context_urls(server, bookmarks,parsed_urls)
add_context_urls(server, access_token, known_context_urls, seen_urls)
def get_notification_users(server, access_token, known_users, max_age):
since = datetime.now(datetime.now().astimezone().tzinfo) - timedelta(hours=max_age)
notifications = get_paginated_mastodon(f"https://{server}/api/v1/notifications", since, headers={
"Authorization": f"Bearer {access_token}",
})
notification_users = []
for notification in notifications:
notificationDate = parser.parse(notification['created_at'])
if(notificationDate >= since and notification['account'] not in notification_users):
notification_users.append(notification['account'])
new_notification_users = filter_known_users(notification_users, known_users)
log(f"Found {len(notification_users)} users in notifications, {len(new_notification_users)} of which are new")
return new_notification_users
def get_bookmarks(server, access_token, max):
return get_paginated_mastodon(f"https://{server}/api/v1/bookmarks", max, {
"Authorization": f"Bearer {access_token}",
})
def add_following_posts(server, access_token, followings, know_followings, seen_urls):
def add_user_posts(server, access_token, followings, know_followings, all_known_users, seen_urls):
for user in followings:
posts = get_user_posts(user, know_followings, server)
@@ -114,6 +142,7 @@ def add_following_posts(server, access_token, followings, know_followings, seen_
log(f"Added {count} posts for user {user['acct']} with {failed} errors")
if failed == 0:
know_followings.add(user['acct'])
all_known_users.add(user['acct'])
def get_user_posts(user, know_followings, server):
parsed_url = parse_user_url(user['url'])
@@ -160,24 +189,24 @@ def get_new_follow_requests(server, access_token, max, known_followings):
})
# Remove any we already know about
new_follow_requests = list(filter(
lambda user: user['acct'] not in known_followings,
follow_requests
))
new_follow_requests = filter_known_users(follow_requests, known_followings)
log(f"Got {len(follow_requests)} follow_requests, {len(new_follow_requests)} of which are new")
return new_follow_requests
def filter_known_users(users, known_users):
return list(filter(
lambda user: user['acct'] not in known_users,
users
))
def get_new_followers(server, user_id, max, known_followers):
"""Get any new followings for the specified user, up to the max number provided"""
followers = get_paginated_mastodon(f"https://{server}/api/v1/accounts/{user_id}/followers", max)
# Remove any we already know about
new_followers = list(filter(
lambda user: user['acct'] not in known_followers,
followers
))
new_followers = filter_known_users(followers, known_followers)
log(f"Got {len(followers)} followers, {len(new_followers)} of which are new")
@@ -188,22 +217,29 @@ def get_new_followings(server, user_id, max, known_followings):
following = get_paginated_mastodon(f"https://{server}/api/v1/accounts/{user_id}/following", max)
# Remove any we already know about
new_followings = list(filter(
lambda user: user['acct'] not in known_followings,
following
))
new_followings = filter_known_users(following, known_followings)
log(f"Got {len(following)} followings, {len(new_followings)} of which are new")
return new_followings
def get_user_id(server, user):
def get_user_id(server, user = None, access_token = None):
"""Get the user id from the server, using a username"""
headers = {}
if user != None and user != '':
url = f"https://{server}/api/v1/accounts/lookup?acct={user}"
elif access_token != None:
url = f"https://{server}/api/v1/accounts/verify_credentials"
headers = {
"Authorization": f"Bearer {access_token}",
}
else:
raise Exception('You must supply either a user name or an access token, to get a user ID')
response = get(url)
response = get(url, headers=headers)
if response.status_code == 200:
return response.json()['id']
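# Usage sketch for the two lookup modes above (illustrative only; assumes
# `server` and `access_token` hold the values supplied on the command line):
#
#   my_id  = get_user_id(server, 'michael')                  # resolve a username via /accounts/lookup
#   own_id = get_user_id(server, access_token=access_token)  # resolve the API key owner via verify_credentials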
@@ -620,7 +656,12 @@ def add_context_url(url, server, access_token):
def get_paginated_mastodon(url, max, headers = {}, timeout = 0, max_tries = 5):
"""Make a paginated request to mastodon"""
response = get(f"{url}?limit={max}", headers, timeout, max_tries)
if(isinstance(max, int)):
furl = f"{url}?limit={max}"
else:
furl = url
response = get(furl, headers, timeout, max_tries)
if response.status_code != 200:
if response.status_code == 401:
@@ -640,9 +681,14 @@ def get_paginated_mastodon(url, max, headers = {}, timeout = 0, max_tries = 5):
result = response.json()
if(isinstance(max, int)):
while len(result) < max and 'next' in response.links:
response = get(response.links['next']['url'], headers, timeout, max_tries)
result = result + response.json()
else:
while len(result) > 0 and parser.parse(result[-1]['created_at']) >= max and 'next' in response.links:
response = get(response.links['next']['url'], headers, timeout, max_tries)
result = result + response.json()
return result
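# Usage sketch for the two pagination modes above (illustrative; this mirrors
# how get_bookmarks and get_notification_users call the function):
#
#   headers = {"Authorization": f"Bearer {access_token}"}
#   # An int caps how many items are fetched:
#   bookmarks = get_paginated_mastodon(f"https://{server}/api/v1/bookmarks", 80, headers)
#   # A timezone-aware datetime pages back until items are older than it:
#   since = datetime.now(datetime.now().astimezone().tzinfo) - timedelta(hours=1)
#   notifications = get_paginated_mastodon(f"https://{server}/api/v1/notifications", since, headers)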
@@ -677,12 +723,28 @@ class OrderedSet:
def __init__(self, iterable):
self._dict = {}
if isinstance(iterable, dict):
for item in iterable:
if isinstance(iterable[item], str):
self.add(item, parser.parse(iterable[item]))
else:
self.add(item, iterable[item])
else:
for item in iterable:
self.add(item)
def add(self, item):
def add(self, item, time = None):
if item not in self._dict:
self._dict[item] = None
if(time == None):
self._dict[item] = datetime.now(datetime.now().astimezone().tzinfo)
else:
self._dict[item] = time
def pop(self, item):
self._dict.pop(item)
def get(self, item):
return self._dict[item]
def update(self, iterable):
for item in iterable:
@@ -697,6 +759,9 @@ class OrderedSet:
def __len__(self):
return len(self._dict)
def toJSON(self):
return json.dumps(self._dict, default=str)
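# Round-trip sketch for the timestamped set (illustrative): toJSON serialises
# the datetime values via default=str, and the dict branch of __init__ parses
# those strings back into datetimes when the file is loaded again:
#
#   users = OrderedSet([])
#   users.add('michael@thms.uk')                        # stamped with the current time
#   restored = OrderedSet(json.loads(users.toJSON()))   # timestamps survive the round trip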
if __name__ == "__main__":
start = datetime.now()
@@ -751,6 +816,7 @@ if __name__ == "__main__":
SEEN_URLS_FILE = "artifacts/seen_urls"
REPLIED_TOOT_SERVER_IDS_FILE = "artifacts/replied_toot_server_ids"
KNOWN_FOLLOWINGS_FILE = "artifacts/known_followings"
RECENTLY_CHECKED_USERS_FILE = "artifacts/recently_checked_users"
SEEN_URLS = OrderedSet([])
@@ -768,6 +834,18 @@ if __name__ == "__main__":
with open(KNOWN_FOLLOWINGS_FILE, "r", encoding="utf-8") as f:
KNOWN_FOLLOWINGS = OrderedSet(f.read().splitlines())
RECENTLY_CHECKED_USERS = OrderedSet({})
if os.path.exists(RECENTLY_CHECKED_USERS_FILE):
with open(RECENTLY_CHECKED_USERS_FILE, "r", encoding="utf-8") as f:
RECENTLY_CHECKED_USERS = OrderedSet(json.load(f))
# Remove any users whose last check is too long in the past from the list
for user in list(RECENTLY_CHECKED_USERS):
lastCheck = RECENTLY_CHECKED_USERS.get(user)
userAge = datetime.now(lastCheck.tzinfo) - lastCheck
if(userAge.total_seconds() > arguments.remember_users_for_hours * 60 * 60):
RECENTLY_CHECKED_USERS.pop(user)
pull_context(
arguments.server,
arguments.access_token,
@@ -780,7 +858,9 @@ if __name__ == "__main__":
KNOWN_FOLLOWINGS,
arguments.max_followers,
arguments.max_follow_requests,
arguments.max_bookmarks
arguments.max_bookmarks,
RECENTLY_CHECKED_USERS,
arguments.from_notifications,
)
with open(KNOWN_FOLLOWINGS_FILE, "w", encoding="utf-8") as f:
@@ -792,6 +872,9 @@ if __name__ == "__main__":
with open(REPLIED_TOOT_SERVER_IDS_FILE, "w", encoding="utf-8") as f:
json.dump(dict(list(REPLIED_TOOT_SERVER_IDS.items())[-10000:]), f)
with open(RECENTLY_CHECKED_USERS_FILE, "w", encoding="utf-8") as f:
f.write(RECENTLY_CHECKED_USERS.toJSON())
os.remove(LOCK_FILE)
if(arguments.on_done != None and arguments.on_done != ''):