## Table of Contents

- OpenTofu
- Ansible
- Architecture overview
- Hardware
- Backups
- Maintenance Protocol
- Git
- DB
  - Troubleshooting
- Ceph
- User Management
- Secret Store
- CI/CD
- HAProxy
  - General
  - Debugging commands
  - Installation
- fail2ban
  - Architecture
  - SSH Protection (4-Layer System)
  - HAProxy Protection (3-Layer System)
  - Useful Commands
  - Firewall Integration
  - Docker Compatibility
- Maintenance
- Networking
- Monitoring
- Discourse
## OpenTofu
OpenTofu is used for everything related to infrastructure provisioning: servers, network, storage, DNS, etc.
Environments live in `environments/`; each has its own state file stored in S3.
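A typical workflow against one environment looks roughly like this (a sketch; `prod` stands in for whichever environment you are targeting):

```bash
cd environments/prod
tofu init    # fetches providers and attaches to the S3-stored state
tofu plan    # preview infrastructure changes
tofu apply   # apply them
```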
## Ansible

- Environment-specific inventories are stored in `environments/`.
- Playbooks are stored in `playbooks/<env>`.
- Roles are stored in `roles/<env>`.
- Collections are stored in `collections/<env>`.
Playbooks can be executed locally via the Justfile recipes:

- `pb <playbook name> <env>`: executes a playbook locally.
- `pb-dry <playbook name> <env>`: executes a playbook in "check mode".
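For example, to run the `gatus` playbook (referenced in the maintenance protocol below) against prod:

```bash
# Apply the playbook
just pb gatus prod

# Same playbook in check mode; reports what would change without applying
just pb-dry gatus prod
```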
## Architecture overview

- Postgres DB (HA) via autobase
- "Storage" (images, LFS, packages) outsourced to S3 (Hetzner)
- Avatars stay on disk (as loading from a CDN takes too long)
- Backups (repos, avatars, ssh keys, etc.) backed up to S3 (Hetzner & Backblaze) via `restic`
- Hetzner servers are used
## Hardware

| Name | Env | CPU | Mem | Disk | OS | GB6 SC | GB6 MC | Used for | Costs/m (€) |
|---|---|---|---|---|---|---|---|---|---|
| minerva | prod | Intel XEON E-2176G | 64 GB DDR4 ECC | 2x 960 GB SAMSUNG MZQLB960HAJR-00007 NVME | Alma10 | 1749 | 7352 | Git, DB | 36.7 |
| hades | prod | Intel XEON E-2276G | 64 GB DDR4 ECC | 2x 960 GB SAMSUNG MZQLB960HAJR-00007 NVME | Alma10 | 1749 | 7352 | Git, DB | 37.7 |
| demeter | prod | Intel XEON E-2176G | 64 GB DDR4 ECC | 2x 960 GB SAMSUNG MZQLB960HAJR-00007 NVME | Alma10 | 1749 | 7352 | Git, DB | 40.7 |
| artemis | prod | AMD Ryzen 7 PRO 8700GE | 64 GB DDR5 ECC | 2x 500 GB SAMSUNG MZVL2512HCJQ-00B0 NVME | Alma9 | 2676 | 11864 | CI/CD | 47.3 |
| gaia | prod | Apple M4 | 32 GB DDR5 | 1x 500 GB APPLE SSD AP0512Z | macOS 26 | 3781 | 14858 | CI/CD | - |
| misc | prod | ARMv8 | 8 GB DDR5 | 80 GB NVME SSD | Alma9 | 1079 | 3490 | CI/CD, status, Forum | 6.49 |
| cf-dev | dev | ARMv8 | 4 GB DDR5 | 40 GB NVME SSD | Alma9 | 1035 | 1869 | | 3.79 |
| misc-dev | dev | ARMv8 | 4 GB DDR5 | 40 GB NVME SSD | Alma9 | 1035 | 1869 | CI/CD, status, Forum | 3.79 |
Read/written bytes on the disks when acquired:

- hades:
  - NVME0: 159 TB (read), 1.3 PB (write)
  - NVME1: 63 TB (read), 1.3 PB (write)
- minerva:
  - NVME0: 97 TB (read), 55 TB (write)
  - NVME1: 85 TB (read), 62 TB (write)
- demeter:
  - NVME0: 95 TB (read), 51 TB (write)
  - NVME1: 139 TB (read), 47 TB (write)
### Disk benchmark

#### Sequential Operations (1M block size)
| Server | Read BW (MB/s) | Read Lat Avg (ms) | Read Lat 99th (ms) | Write BW (MB/s) | Write Lat Avg (ms) | Write Lat 99th (ms) |
|---|---|---|---|---|---|---|
| gaia | 5125.30 | 0.20 | 0.49 | 2721.74 | 0.37 | 0.89 |
| artemis | 3488.88 | 0.29 | 0.37 | 1036.53 | 0.96 | 2.09 |
| demeter | 1943.37 | 0.51 | 0.73 | 1282.39 | 0.78 | 1.34 |
| pgnode2-dev | 1537.45 | 0.65 | 1.20 | 1101.16 | 0.90 | 1.35 |
#### Random Operations (4k block size)
| Server | Read IOPS | Read Lat Avg (ms) | Read Lat 99th (ms) | Write IOPS | Write Lat Avg (ms) | Write Lat 99th (ms) |
|---|---|---|---|---|---|---|
| demeter | 12130 | 0.08 | 0.31 | 74743 | 0.01 | 0.02 |
| gaia | 14448 | 0.07 | 0.08 | 7740 | 0.13 | 1.50 |
| artemis | 12765 | 0.08 | 0.09 | 43692 | 0.02 | 0.02 |
| pgnode2-dev | 6252 | 0.16 | 0.33 | 8257 | 0.12 | 0.33 |
Benchmark script:

```bash
#!/bin/bash
# Enhanced FIO Disk Benchmark Script with Markdown Output to .txt

TEST_FILE="/tmp/fio_test_file"   # Change path if needed
TEST_SIZE="2G"                   # Size of test file
RUNTIME="30"                     # Duration of each test in seconds
BLOCK_SIZE_SEQ="1M"              # Block size for sequential tests
BLOCK_SIZE_RAND="4k"             # Block size for random tests
NUMJOBS="1"                      # Number of parallel jobs
RESULT_FILE="fio_results.txt"    # Markdown log file

# Ensure jq is installed
if ! command -v jq &> /dev/null; then
    echo "Error: jq is not installed. Install it with: sudo dnf install jq"
    exit 1
fi

# Store results for Markdown table
declare -a TABLE_ROWS

run_test() {
    local name=$1
    local rw=$2
    local bs=$3
    local mode=$4  # read or write, for JSON parsing

    fio --name="$name" --rw="$rw" --bs="$bs" --size="$TEST_SIZE" \
        --numjobs="$NUMJOBS" --time_based --runtime="$RUNTIME" \
        --group_reporting --filename="$TEST_FILE" --direct=1 \
        --output-format=json > "${name}.json"

    # Extract metrics from JSON and round to 2 decimals
    BW=$(jq -r ".jobs[0].$mode.bw_bytes/1048576" "${name}.json" | awk '{printf "%.2f", $1}')
    IOPS=$(jq -r ".jobs[0].$mode.iops" "${name}.json" | awk '{printf "%.2f", $1}')
    LAT_AVG=$(jq -r ".jobs[0].$mode.lat_ns.mean/1000000" "${name}.json" | awk '{printf "%.2f", $1}')
    LAT_99=$(jq -r ".jobs[0].$mode.clat_ns.percentile[\"99.000000\"]/1000000" "${name}.json" | awk '{printf "%.2f", $1}')

    # Save row for Markdown table
    TABLE_ROWS+=("| $name | $BW | $IOPS | $LAT_AVG | $LAT_99 |")
}

echo "=== Running FIO Disk Benchmark ==="
run_test "SeqRead" "read" "$BLOCK_SIZE_SEQ" "read"
run_test "SeqWrite" "write" "$BLOCK_SIZE_SEQ" "write"
run_test "RandRead" "randread" "$BLOCK_SIZE_RAND" "read"
run_test "RandWrite" "randwrite" "$BLOCK_SIZE_RAND" "write"

# Cleanup
rm -f "$TEST_FILE"

# Prepare Markdown table
{
    echo "### FIO Benchmark Results ($(date +'%Y-%m-%d'))"
    echo "| Test Type | BW (MB/s) | IOPS | Avg Lat (ms) | 99th Lat (ms) |"
    echo "|-----------|-----------|------|--------------|---------------|"
    for row in "${TABLE_ROWS[@]}"; do
        echo "$row"
    done
    echo
} | tee -a "$RESULT_FILE"

echo "Benchmark complete. Results appended to $RESULT_FILE"
```
### Postgres benchmark

```bash
sudo -u postgres createdb pgbench_test
sudo -u postgres pgbench -i -s 10 pgbench_test
sudo -u postgres pgbench -c 30 -j 4 -T 120 pgbench_test     # write
sudo -u postgres pgbench -S -c 30 -j 4 -T 120 pgbench_test  # read
```
| Node | Write TPS | Read TPS | Read Latency (ms) | Write Latency (ms) |
|---|---|---|---|---|
| demeter | 19971 | 160344 | 0.187 | 0.256 |
## Backups
Static assets are backed up via restic.
If possible, backups are stored in the nbg1 region (and assets in the fsn1 region).
- Each backup task has its own CRON/`systemd` timer
- Scripts are stored in `/opt/restic/`
- "packages" backups are separate as the source lives in S3 already. A mirror of the bucket is synced every hour via `rclone` (see the sketch below).
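A minimal sketch of what such an hourly mirror job could look like; the remote and bucket names here are placeholders, not the actual configuration:

```bash
# Mirror the packages bucket to a second remote (names are hypothetical)
rclone sync hetzner:packages backup-remote:packages-mirror --fast-list
```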
### Restore

Restic backups are configured per-host in `host_vars/`. To restore from a backup:

1. Set up the environment:

   ```bash
   export RESTIC_REPOSITORY="s3:https://nbg1.your-objectstorage.com/<repo>"
   export RESTIC_PASSWORD="<restic-password>"
   export AWS_ACCESS_KEY_ID="<s3-access-key>"
   export AWS_SECRET_ACCESS_KEY="<s3-secret-key>"
   export AWS_DEFAULT_REGION="nbg1"
   ```

2. List available snapshots:

   ```bash
   restic snapshots
   ```

3. Restore a specific snapshot:

   ```bash
   # Restore to original location
   restic restore <snapshot-id> --target /

   # Or restore to a different location
   restic restore <snapshot-id> --target /tmp/restore

   # Restore only specific files/paths
   restic restore <snapshot-id> --target /tmp/restore --include <path>
   ```
Useful restic commands:

```bash
# Check repository integrity
restic check

# Show differences between snapshots
restic diff <snapshot1-id> <snapshot2-id>
```
## Maintenance Protocol

(Copy to an issue and apply step by step)

1. Run `just pb restic prod` with `restic_backup_now: true` to create fresh backups of all important data
2. Enable maintenance mode:
   1. HAProxy: run `haproxy-maintenance enable git 30` (adjust the time as needed)
   2. Gatus: tweak the `maintenance-windows` setting in the host_vars config file and enable it by applying `just pb gatus prod`
3. Reboot nodes (`codefloe`, `misc`, PG replicas, PG master)
4. Additional tasks
5. If finished early, remove maintenance mode: `haproxy-maintenance disable git`. Both Gatus and HAProxy maintenance windows expire automatically, but users can't interact with the service until the maintenance window ends.
## Git

### Custom icons
- repo: `carbon:repo-source-code`
- org: `carbon:building`
- comment: `mdi-light:comment`
- PR: `carbon:pull-request`
- tag: `carbon:tag`
- settings: `carbon:settings`
- merge: `carbon:pull-request`
- mirror: `octicon:mirror`
- bell: `carbon:notification`
- plus: `mdi:plus`
- trash: `ph:trash-light`
- lock: `carbon:lock`
- unlock: `carbon:unlock`
- pin: `mdi-light:pin`
- pin-slash: `mdi-light:pin-off`
- mute: `mdi-light:volume-mute`
- unmute: `mdi-light:volume-high`
- key: `material-symbols-light:key-outline`
- copy: `octicon:copy-24`
- git-merge: `carbon:merge`
- smiley: `ph:smiley-wink-light`
- repo-forked: `carbon:fork`
- star: `carbon:star`
- eye: `lineicons:eye`
- pulse: `ph:pulse-light`
- question: `material-symbols-light:help-outline`
- tools: `carbon:settings`
- issue-opened: `octicon:issue-opened-24`
- issue-closed: `octicon:issue-closed-24`
- code: `material-symbols-light:code`
- database: `material-symbols-light:database-outline`
- git-branch: `carbon:branch`
- history: `material-symbols-light:history`
- milestone: `octicon:milestone-24`
- search: `material-symbols-light:search`
- sign-out: `carbon:logout`
- book: `carbon:book`
- pencil: `material-symbols-light:ink-pen-outline`
- light-bulb: `octicon:light-bulb-24`
- info: `carbon:information`
- report: `carbon:warning`
- person: `carbon:user`
- server: `circum:server`
- project-symlink: `si:projects-line`
- package: `material-symbols-light:package-2-outline`
- calendar: `mdi-light:calendar`
- people: `carbon:group`
- container: `octicon:container-24`
- download: `material-symbols-light:download`
- cpu: `carbon:chip`
- rss: `mdi-light:rss`
- terminal: `material-symbols-light:terminal`
- globe: `material-symbols-light:globe`
- filter: `material-symbols-light:filter-list`
- repo-push: `octicon:repo-push-24`
- file-zip: `material-symbols-light:folder-zip-outline`
- clock: `mdi-light:clock`
- apps: `material-symbols-light:apps`
- note: `material-symbols-light:notes-rounded`
Notes:
- Icons must be changed in the source code directly as they are included during the UI build
- Icon sizes can only be changed by adjusting the `size` param in the svg component, e.g. `<svg-icon name="octicon-repo" :size="20" class="tw-ml-1 tw-mt-0.5"/>`
- The respective icon-name class must be added to the SVGs, e.g. `class="svg octicon octicon-comment-discussion"`
### Changelog panel in home view

- Custom `dashboard_tmpl.html`
- Custom `discourse-embed.js`, served as a local asset. The default one from Discourse is super minimal and inflexible.
- Custom CSS (`discourse-topics.css`) which aligns the embedded content with the Forgejo UI style.
- Discourse docs on this topic
### CodeFloe-specific mods

| Date | PR | Purpose | Merged into Forgejo | Merge Date | FJ refs |
|---|---|---|---|---|---|
| 2025-07-24 | #2 | Improved commit-history view on mobile | ❌️ | | Issue |
| 2025-07-24 | #8 | Version helper for semver version in footer from release branches | ❓️ | | |
| 2025-08-11 | #11 | Support for file icon sets | ❓️ | | |
## DB

Postgres HA, self-managed on Hetzner (Cloud) VMs. While the Hetzner NVMe disks on cloud VMs are not the most performant on the market, they provide a good balance between performance and cost. Scaling up is easily possible, up to 32 GB of memory. Due to (HAProxy) load balancing plus connection pooling, the DB shouldn't become a performance bottleneck for quite some time.

Backups are performed daily (diff) and weekly (full) through pgbackrest.
### Setup

- HAProxy load balancer as the single point of entry. Forwards to connection poolers (`pgbouncer`) for the primary and read replicas (see the sketch below).
- All read queries are load balanced across the read replicas. The primary is used for writes. Forgejo will support splitting read/write queries starting in v12.
- PgBouncer "transaction" mode (which would be a bit faster) does not work with Forgejo/Gitea; forced to use "session" mode instead.
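As a hedged illustration of how the pieces connect: Forgejo would point at the HAProxy entry (primary on port 5000, per the Connecting section below). The config path and database name here are assumptions, not taken from the repo:

```bash
# Hypothetical Forgejo DB settings; path and database name are assumptions
cat >> /etc/forgejo/app.ini <<'EOF'
[database]
DB_TYPE = postgres
HOST    = 10.10.5.2:5000  ; HAProxy -> pgbouncer -> primary
NAME    = forgejo
EOF
```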
### Backup

- Via CRON and `pgbackrest` (see pgbackrest for details) to S3 (`/etc/cron.d/pgbackrest-codefloe`)
- Full backup once a week (`00 3 * * 0`)
- Diff backup daily (`00 3 * * 1-6`)
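The schedule above suggests `/etc/cron.d/pgbackrest-codefloe` contains something along these lines (a reconstruction; the exact flags and user may differ):

```bash
# Weekly full backup (Sunday 03:00), daily diff otherwise
00 3 * * 0   postgres  pgbackrest --stanza=codefloe --type=full backup
00 3 * * 1-6 postgres  pgbackrest --stanza=codefloe --type=diff backup
```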
Check the backup status:

```bash
su postgres
pgbackrest info
```

Point-in-time recovery via autobase:

```bash
ansible-playbook deploy_pgcluster.yml -t point_in_time_recovery -e "disable_archive_command=false"
```
Manual:

```bash
su postgres

# restore
pgbackrest --stanza=codefloe --set=20250413-030002F restore --delta

# start PG
/usr/pgsql-17/bin/pg_ctl start \
  -D /var/lib/pgsql/17/data \
  -w -t 3600 \
  -o "--config-file=/var/lib/pgsql/17/data/postgresql.conf" \
  -o "-c restore_command='pgbackrest --stanza=codefloe archive-get %f %p'" \
  -o "-c archive_command=/bin/true"
```
### Major upgrade

1. Check compatibility with the new version: `ansible-playbook -D -e "pg_old_version=16 pg_new_version=17" --tags 'pre-checks,upgrade-check' -i inventory pg_upgrade.yml`
2. Perform the upgrade: `ansible-playbook -e "pg_old_version=16 pg_new_version=17" -i inventory -D pg_upgrade.yml`
3. Update the variable `postgresql_version` to the new version
### Connecting

```bash
export PGHOST=10.10.5.2
export PGPORT=5000  # primary
export PGPORT=5001  # replicas
export PGUSER=postgres
export PGDATABASE=postgres
export PGPASSWORD=
```
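With these variables exported, a quick sanity check distinguishes primary from replica:

```bash
# Returns 'f' when connected via the primary port (5000), 't' via a replica (5001)
psql -c "SELECT pg_is_in_recovery();"
```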
### Troubleshooting

#### etcd
If etcd is unhealthy, e.g. due to inconsistent certificates, the easiest fix is to wipe the cluster and start fresh:

1. Run `rm -rf /var/lib/etcd/*` on each node
2. Run the `etcd_cluster` autobase playbook via

   ```yaml
   - name: Run Autobase etcd_cluster
     ansible.builtin.import_playbook: vitabaks.autobase.etcd_cluster
   ```

This will recreate all certs and restore the `etcd` cluster. It will NOT wipe any `patroni` data; Patroni will continue to work as before.
#### patroni

If patroni becomes unhealthy, it might also be because the certs referenced in `/etc/patroni/patroni.yml` do not match all hostnames and nodes.
To regenerate them, run the `config_pgcluster` autobase playbook with `-e tls_cert_regenerate=true`.
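A sketch of the invocation, following the playbook call style used above (inventory path assumed):

```bash
ansible-playbook -i inventory -e tls_cert_regenerate=true \
  vitabaks.autobase.config_pgcluster
```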
## Ceph
WIP
## User Management

Two-fold: UNIX users on the hosts, and users for accessing hosted services (secret store, monitoring, cloud).

Access to hosts is declared transparently in the `active_users` dictionary in `group_vars/all.yml` through the `hosts` variable, including the month when access was granted.
## Secret Store

OpenBao (WIP)
## CI/CD

- One `amd64` and one `arm64` runner are provided globally
- Users can add their own runners (for both Crow and Actions)
### Forgejo Runner Setup

The infrastructure runs Forgejo Actions runners for CI/CD workflows, supporting both amd64 and arm64 architectures.
#### Architecture

Runner Deployment:

- Deployed via the Ansible role `devxy.cicd.forgejo_runner`
- Playbook: `playbooks/playbook-forgejo-runner.yaml`
- Runs on hosts in the `ci_agent` inventory group
- Runner version managed via Renovate (currently v11.3.1)

Container Runtime:

- Uses Docker-in-Docker (DinD) for job isolation
- Supports both IPv4 and IPv6 networking
- Custom network subnets for container isolation

Cache Architecture:

- Distributed cache system for Actions artifacts and dependencies
- Cache host: `artemis` (10.10.5.5) runs the cache server on port 4001
- Cache proxy: each runner node runs a local proxy on port 4000
- Runners access the cache via the Docker bridge gateway: `http://{{ ansible_docker0.ipv4.address }}:4000`
- Cache directory: `/opt/data/forgejo-actions-cache` (on `artemis`)
- Shared cache secret for authentication across all runners

Image Registry:

All container images are mirrored from `data.forgejo.org/oci/` for reliability and reduced external dependencies.
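For instance, jobs would reference the mirror rather than docker.io directly (the image name and tag here are illustrative):

```bash
docker pull data.forgejo.org/oci/alpine:3.20
```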
#### Network Configuration

IPv4/IPv6 Dual Stack:

- DinD network: `172.80.0.0/16` (IPv4), `fd00:d0ca:2:1::/80` (IPv6)
- Internal network: `fd00:d0ca:2:2::/80` (IPv6)
- Host networks: runners can access Forgejo instances on internal IPs

Docker Configuration:

- Docker bridge gateway dynamically resolved via `ansible_docker0.ipv4.address`
- Default bridge: `172.17.0.1` (typically, but queried dynamically)
- Custom address pools prevent IP conflicts across multiple runners (see the sketch below)
## HAProxy

### General
- IPv4 and IPv6
- HTTP3 via quic-enabled WolfSSL
### Debugging commands

```bash
# View stick table entries
echo "show table per_ip_rates" | socat stdio /var/lib/haproxy/stats

# Watch a specific IP
echo "show table per_ip_rates data.http_req_rate" | socat stdio /var/lib/haproxy/stats | grep <IP>

# Clear an IP from the rate limit table
echo "clear table per_ip_rates key <IP>" | socat stdio /var/lib/haproxy/stats
```
### Installation

Alma9 ships HAProxy 2.4 (2023), hence we install from source, which is needed anyway to provide HTTP3 support using openssl/quictls. zenetys/rpm-haproxy provides an approach to easily build HAProxy from source, though it lacks some aarch64 libs, and bundling HAProxy with a custom SSL lib requires building from source in any case.
## fail2ban

fail2ban provides multi-layered intrusion prevention for both SSH and HAProxy, using a firewall backend that adapts to the host's environment. Bans persist across reboots via an SQLite database.
Architecture
- Firewall Backend:
firewalldwith rich rules on non-Kubernetes nodes (modern RHEL 9 approach using nftables)iptableson Kubernetes nodes (k3s) to avoid conflicts with kube-router network policies- Automatic detection via
which kubectlcommand
- Log Backend: systemd journal for real-time log monitoring
- Ban Persistence: SQLite database (
/var/lib/fail2ban/fail2ban.sqlite3) with 24-hour purge age - Ports Protected: SSH on ports 22 and 2222, HTTP/HTTPS on ports 80 and 443
### SSH Protection (4-Layer System)

SSH protection uses 4 complementary jails targeting different attack patterns:

| Jail | Purpose | Max Retry | Find Time | Ban Time | Target |
|---|---|---|---|---|---|
| `sshd` | Basic brute-force | 5 failures | 10 min | 1 hour | Standard login attempts |
| `sshd-aggressive` | Scanner detection | 10 failures | 5 min | 24 hours | Persistent scanners |
| `sshd-ddos` | DoS/flooding | 20 failures | 1 min | 2 hours | High-frequency attacks |
| `sshd-long-term` | Slow attacks | 15 failures | 1 hour | 24 hours | Patient attackers |

Configuration: `/etc/fail2ban/jail.d/sshd.conf`
Design rationale:
- Multiple jails with different time windows catch attackers using various strategies
- Short windows (1-10 min) catch brute-force attempts
- Long window (1 hour) catches slow, persistent attacks that spread attempts over time
- Graduated ban times: temporary bans for quick attempts, 24-hour bans for persistent threats
### HAProxy Protection (3-Layer System)

HAProxy protection monitors bad requests, DoS attempts, and scanning behavior:

| Jail | Purpose | Max Retry | Find Time | Ban Time | Target |
|---|---|---|---|---|---|
| `haproxy-badreq` | Bad requests | 10 requests | 10 min | 1 hour | Malformed HTTP |
| `haproxy-ddos` | DoS flooding | 100 requests | 1 min | 2 hours | High-volume attacks |
| `haproxy-scanner` | Scanner detection | 10 requests | 5 min | 24 hours | Vulnerability scanning |

Configuration: `/etc/fail2ban/jail.d/haproxy.conf`

All HAProxy jails use the same filter (`haproxy-badreq`) at `/etc/fail2ban/filter.d/haproxy-badreq.conf`, which detects `<BADREQ>` entries in HAProxy logs.
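A minimal reconstruction of what that filter might contain; the deployed `failregex` is likely different in detail:

```bash
# Hypothetical reconstruction only; the real filter is managed by Ansible
cat > /etc/fail2ban/filter.d/haproxy-badreq.conf <<'EOF'
[Definition]
# Ban hosts whose HTTP request HAProxy could not parse ("<BADREQ>")
failregex = ^.*haproxy\[\d+\]: <HOST>:\d+ .*<BADREQ>.*$
journalmatch = _SYSTEMD_UNIT=haproxy.service
EOF
```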
### Useful Commands

```bash
# View all active jails
fail2ban-client status

# View specific jail status and banned IPs
fail2ban-client status sshd
fail2ban-client status haproxy-badreq

# Unban a specific IP
fail2ban-client unban <IP>

# Check the ban database
sqlite3 /var/lib/fail2ban/fail2ban.sqlite3 "SELECT * FROM bans;"

# View recent bans
journalctl -u fail2ban -n 100 | grep Ban

# Reload configuration (preserves bans)
fail2ban-client reload
```
### Firewall Integration

On non-Kubernetes nodes (using firewalld):

```bash
# View fail2ban rich rules
firewall-cmd --list-rich-rules

# View all firewalld zones
firewall-cmd --list-all-zones

# Check if an IP is blocked
firewall-cmd --query-rich-rule='rule family="ipv4" source address="<IP>" reject'
```
On Kubernetes nodes (using iptables):

```bash
# View fail2ban chains
iptables -L -n --line-numbers

# View a specific fail2ban chain
iptables -L f2b-sshd -n -v
```
### Docker Compatibility

The fail2ban configuration is designed to work seamlessly with Docker:

- `firewalld` uses a `docker-forwarding` policy that allows Docker container traffic
- Docker manages its own `docker0` interface and bridge network
- No manual zone assignment for Docker interfaces (avoids ZONE_CONFLICT errors)
- On k3s nodes, `firewalld` is completely disabled to prevent conflicts with kube-router
## Maintenance

To announce a planned maintenance, use `forgejo-notification`:

```bash
export FORGEJO_HOST="cf"
export FORGEJO_DATA_DIR="/opt/data/forgejo/custom"

forgejo-notification add \
  --title "General Maintenance" \
  --message "Regular server and database maintenance. Estimated duration: 30 minutes" \
  --start "2025-10-28 20:00 CET"

forgejo-notification list
```
When the maintenance starts, use `assets/haproxy/maintenance-mode.sh` (deployed to every server running HAProxy and added to `$PATH`) to enable maintenance mode:

- `haproxy-maintenance enable git 30`
- `haproxy-maintenance add-bypass 192.168.1.100`
- `haproxy-maintenance status`
- `haproxy-maintenance disable git`

See also `haproxy-maintenance -h`.
## Networking

> Note: Not in use right now

To optimize latency between regions, private WireGuard networks were created between the nodes running HAProxy. On all nodes besides the main node running git, the SSH port was changed and port 22 is watched by HAProxy. This way, HAProxy can redirect traffic to the main node running git over the internal network.
### Internal networking

#### VSwitch setup

1. Create a VLAN interface on the robot server:

   ```bash
   # this becomes the private ipv4 of the robot server
   nmcli connection add type vlan \
     con-name vswitch4023 \
     mtu 1400 \
     dev eno1 \
     id 4023 \
     ip4 10.10.5.3/24 \
     gw4 10.10.5.1 \
     ipv4.routes "10.10.0.0/16 10.10.5.1"
   nmcli connection up vswitch4023
   ```

2. Route all 10.x requests from the robot servers through the vswitch:

   ```bash
   ip route add 10.10.0.0/16 via 10.10.5.1
   ```

Troubleshooting:

- `ip route get <ip>`
- `traceroute -n <ip>`
### Multi Region latency

https://jawher.me/wireguard-ansible-systemd-ubuntu/

How to test: spin up a remote server next to the region proxy and measure the time when connecting directly to the domain (without the regional proxy running), then compare with the time when connecting through the proxy (= internal network). A sketch of such a measurement is shown below.
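One way to take such a measurement (assuming `curl`; the exact command used for the numbers below was not recorded):

```bash
# Compare these timings with and without the regional proxy in the path
curl -o /dev/null -s -w 'connect=%{time_connect}s total=%{time_total}s\n' https://<domain>
```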
- US (east)-DE: 0.211 s (32% - 63% speedup)
  - without WG: `0.313783s - 0.562196s` (depending on how congested the route is)
  - with WG: `0.211686s` (quite stable)
## Monitoring

- Via the Grafana devXY instance (public dashboard sharing doesn't work right now)
- PSI support must be manually enabled for pressure information through `node-exporter`
- Node exporter running on each node
- Postgres exporter running on the DB nodes
- HAProxy exporter running on each node
  - Upstream source: https://grafana.com/grafana/dashboards/12693-haproxy-2-full/
- Git:
  - Upstream source: https://grafana.com/grafana/dashboards/17802-gitea-dashbaord/
## Discourse

### Installation

Backups: daily to S3.

```bash
git clone https://github.com/discourse/discourse_docker.git /var/discourse
cd /var/discourse
chmod 700 containers
```

- Copy `samples/standalone.yml` and create `containers/app.yml`
- Edit `containers/app.yml` and comment out the ports and default nginx templates
- Set the domain name and configure mail
- Run `/var/discourse/launcher rebuild app`
Everything will be bundled in one container named `app`. Additional webserver config is required to point to `unix@/var/discourse/shared/standalone/nginx.http.sock` (the socket of the bundled Discourse nginx).

To alter the config, edit `/var/discourse/containers/app.yml`, then run `./launcher rebuild app`.
We initially started out with the Bitnami installation. However, sidekiq was not working properly, and there is little support for the Bitnami image while there is a lot for the official one. Downside: the official setup is a big monolith, and one cannot really selectively choose the individual components (e.g. PG version or Redis provider). On the plus side, the official setup also ships some default config tweaks which make the instance feel smoother.