Rework host diffing
It now detects whether there are any changes to a host's closure at all, lists
build failures as such, and handles newly added or removed hosts.
https://github.com/chaos-jetzt/chaos-jetzt-nixfiles/actions/runs/5770703946
shows the intended behavior when hosts are added or removed, builds fail, or
changes are made.
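Conceptually, the check only needs the toplevel derivation path of every host
on both revisions; a minimal Nix sketch of that mapping (the actual workflow
code differs, this is just the idea):

    # Evaluated once on the base revision and once on the PR revision;
    # comparing the two attrsets reveals added hosts, removed hosts and
    # changed closures (hosts whose drvPath differs).
    flake:
    builtins.mapAttrs
      (name: host: host.config.system.build.toplevel.drvPath)
      flake.nixosConfigurations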
Various smaller changes and cleanups that, to me, wouldn't warrant a PR on
their own. Besides addressing some TODOs (namely the one in flake.nix), the
goals included reducing redundant and ambiguous code and statements (e.g. the
isDev detection) and reducing (visual) complexity, making the code easier to
follow, understand and review.
Some variables that were intended to be used were in fact unused (e.g.
allTargets), but they will be needed as soon as we have a second non-dev
host in our nixfiles.
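To illustrate the deduplication, a single source of truth for both could look
roughly like this (the attribute names and hostname convention are made up,
not the actual code):

    { lib, self, ... }:
    let
      # One definition of what counts as a dev host instead of several
      # ad-hoc checks scattered through the config.
      isDev = name: lib.hasSuffix "-dev" name;
      # All non-dev deploy targets; currently a single host, but ready
      # for the second non-dev host mentioned above.
      allTargets = lib.filterAttrs (name: _: !isDev name) self.nixosConfigurations;
    in
      allTargets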
Also updated the CI triggers to only build on pushes to main, since everything
else will eventually end up as a PR to main anyway; that way we can ditch the
avoid-duplicates action.
With this, all https://tickets.chaos.jetzt/shortcode links will redirect to
the appropriate ticket shop without a need for us to place manual
redirect links.
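A minimal sketch of one such redirect as a NixOS nginx vhost (the shortcode
and the target shop URL here are placeholders):

    services.nginx.virtualHosts."tickets.chaos.jetzt".locations = {
      # One location per shortcode, answering with a temporary redirect.
      "= /example-event".return = "302 https://tickets.example.org/example-event/";
    };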
With https://github.com/chaos-jetzt/website_pelican/pull/33, a lot of
orphans are to be expected, which will take up space on our servers. This
introduces a timer which runs once a week and deletes any
website generations older than 28 days.
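A sketch of what such a unit pair could look like, assuming the website is
installed into its own Nix profile (the profile path is an assumption):

    { pkgs, ... }: {
      systemd.services.website-gc = {
        serviceConfig.Type = "oneshot";
        # Remove website profile generations older than 28 days; a later
        # nix-collect-garbage run can then free the actual store paths.
        script = "${pkgs.nix}/bin/nix-env --profile /nix/var/nix/profiles/website --delete-generations 28d";
      };
      systemd.timers.website-gc = {
        wantedBy = [ "timers.target" ];
        timerConfig.OnCalendar = "weekly";
      };
    }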
The actual 404 page will be generated by Pelican. log_not_found was turned off
for privacy reasons (since we don't have a favicon, every request would
otherwise still get logged with its full IP due to the 404).
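In nginx terms this boils down to two directives (the vhost name is assumed):

    services.nginx.virtualHosts."chaos.jetzt".extraConfig = ''
      # Serve the Pelican-generated page for missing paths.
      error_page 404 /404.html;
      # Stop logging file-not-found errors, so 404s (e.g. the missing
      # favicon) no longer leave full client IPs in the error log.
      log_not_found off;
    '';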
To reduce the amount of redundant rebuilds, Cachix is used to store build
outputs. The cache should be accessible in the Cachix UI to everyone in the
@chaos-jetzt/infra team.
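Consuming the cache on a machine then only needs the substituter and its
signing key; a sketch, assuming the cache is named chaos-jetzt (the public key
below is a placeholder, not the real one):

    nix.settings = {
      substituters = [ "https://chaos-jetzt.cachix.org" ];
      # Placeholder key; the real one is shown in the Cachix UI.
      trusted-public-keys = [ "chaos-jetzt.cachix.org-1:0000000000000000000000000000000000000000000=" ];
    };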
I didn't notice this was missing in #5 until after deploying it. Since
the ports on the monitoring network interface (ens10) were not open,
scraping would fail and thus generate alerts.
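The fix amounts to a per-interface firewall rule along these lines (the exact
port list is an assumption; 9090 and 9093 are the Prometheus and Alertmanager
defaults):

    networking.firewall.interfaces.ens10.allowedTCPPorts = [
      9090 # prometheus
      9093 # alertmanager
    ];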
The goal is to create a monitoring setup where each server monitors
itself when it comes to failing systemd services, disks or RAM filling up,
and the like. In addition, each Prometheus will monitor the remote Prometheus
and Alertmanager instances for signs of failure (e.g. being unreachable,
errors in notification delivery, dropped alerts).
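A sketch of the cross-monitoring part (the peer hostnames are made up; in the
real setup they would be derived from the other hosts in the flake):

    services.prometheus.scrapeConfigs = [
      {
        job_name = "prometheus-peers";
        static_configs = [{ targets = [ "host-a.example.org:9090" "host-b.example.org:9090" ]; }];
      }
      {
        job_name = "alertmanager-peers";
        static_configs = [{ targets = [ "host-a.example.org:9093" "host-b.example.org:9093" ]; }];
      }
    ];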
A lot of metrics (especially histograms from Prometheus or Alertmanager)
are dropped before ingestion to save on disk space and memory.
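The dropping happens per scrape job via metric_relabel_configs; extending the
sketch above (the regex is an example, not the actual drop list):

    {
      job_name = "prometheus-peers";
      metric_relabel_configs = [
        {
          source_labels = [ "__name__" ];
          # Histogram buckets are the biggest offenders; drop them
          # before they reach the TSDB.
          regex = ".*_duration_seconds_bucket";
          action = "drop";
        }
      ];
      static_configs = [{ targets = [ "host-a.example.org:9090" ]; }];
    }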
Depending on how many servers we may or may not have in the future, this
could probably use some kind of overhaul, since right now we have n^2
monitoring peer relationships (not even speaking of possibly duplicated
alerts).