IaC for CodeFloe

Table of Contents

OpenTofu

OpenTofu is used for everything related to infrastructure provisioning: servers, network, storage, DNS, etc.

Environments live in environments/. Each has its own state file stored in S3.
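As a rough sketch (assuming each directory under environments/ is a standalone root module with its own backend configuration), working on one environment looks like this:

cd environments/prod
tofu init    # pulls providers and attaches the S3-backed state for this environment
tofu plan    # preview infrastructure changes
tofu apply   # provision servers, network, storage, DNS, ...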

Ansible

  • Environment-specific inventories are stored in environments/.
  • Playbooks are stored in playbooks/<env>.
  • Roles are stored in roles/<env>.
  • Collections are stored in collections/<env>.

Playbooks can be executed locally via the Justfile's rules:

  • pb <playbook name> <env>: Executes a playbook locally.
  • pb-dry <playbook name> <env>: Executes a playbook in "check mode".
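Under the hood these recipes are thin wrappers around ansible-playbook. Roughly (a sketch only; the exact inventory and playbook paths are defined in the Justfile):

# just pb <playbook name> <env>
ansible-playbook -i environments/<env> playbooks/<env>/<playbook name>.yml

# just pb-dry <playbook name> <env> additionally runs in check mode with a diff
ansible-playbook -i environments/<env> playbooks/<env>/<playbook name>.yml --check --diff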

Architecture overview

  • Postgres DB (HA) via autobase
  • "Storage" (images, LFS, packages) outsourced to S3 (Hetzner)
  • Avatars stay on disk (as loading from CDN takes too long)
  • Backups (repos, avatars, ssh keys, etc.) backed up to S3 (Hetzner & Backblaze) via restic
  • Hetzner servers are used (dedicated Robot machines and Cloud VMs)

Hardware

Hardware table
| Name | Env | CPU | Mem | Disk | OS | GB6 SC | GB6 MC | Used for | Costs/m (€) |
|------|-----|-----|-----|------|----|--------|--------|----------|-------------|
| minerva | prod | Intel XEON E-2176G | 64 GB DDR4 ECC | 2x 960 GB SAMSUNG MZQLB960HAJR-00007 NVME | Alma10 | 1749 | 7352 | Git, DB | 36.7 |
| hades | prod | Intel XEON E-2276G | 64 GB DDR4 ECC | 2x 960 GB SAMSUNG MZQLB960HAJR-00007 NVME | Alma10 | 1749 | 7352 | Git, DB | 37.7 |
| demeter | prod | Intel XEON E-2176G | 64 GB DDR4 ECC | 2x 960 GB SAMSUNG MZQLB960HAJR-00007 NVME | Alma10 | 1749 | 7352 | Git, DB | 40.7 |
| artemis | prod | AMD Ryzen 7 PRO 8700GE | 64 GB DDR5 ECC | 2x 500 GB SAMSUNG MZVL2512HCJQ-00B0 NVME | Alma9 | 2676 | 11864 | CI/CD | 47.3 |
| gaia | prod | Apple M4 | 32 GB DDR5 | 1x 500 GB APPLE SSD AP0512Z | macOS 26 | 3781 | 14858 | CI/CD | - |
| misc | prod | ARMv8 | 8 GB DDR5 | 80 GB NVME SSD | Alma9 | 1079 | 3490 | CI/CD, status, Forum | 6.49 |
| cf-dev | dev | ARMv8 | 4 GB DDR5 | 40 GB NVME SSD | Alma9 | 1035 | 1869 | | 3.79 |
| misc-dev | dev | ARMv8 | 4 GB DDR5 | 40 GB NVME SSD | Alma9 | 1035 | 1869 | CI/CD, status, Forum | 3.79 |

GB6 SC/MC: Geekbench 6 single-core / multi-core score.

Data read/written on the disks when they were acquired:

  • hades:
    • NVME0: 159 TB (read), 1.3 PB (write)
    • NVME1: 63 TB (read), 1.3 PB (write)
  • minerva:
    • NVME0: 97 TB (read), 55 TB (write)
    • NVME1: 85 TB (read), 62 TB (write)
  • demeter:
    • NVME0: 95 TB (read), 51 TB (write)
    • NVME1: 139 TB (read), 47 TB (write)

Disk benchmark

Disk benchmark

Sequential Operations (1M block size)

Server Read BW (MB/s) Read Lat Avg (ms) Read Lat 99th (ms) Write BW (MB/s) Write Lat Avg (ms) Write Lat 99th (ms)
gaia 5125.30 0.20 0.49 2721.74 0.37 0.89
artemis 3488.88 0.29 0.37 1036.53 0.96 2.09
demeter 1943.37 0.51 0.73 1282.39 0.78 1.34
pgnode2-dev 1537.45 0.65 1.20 1101.16 0.90 1.35

Random Operations (4k block size)

Server Read IOPS Read Lat Avg (ms) Read Lat 99th (ms) Write IOPS Write Lat Avg (ms) Write Lat 99th (ms)
demeter 12130 0.08 0.31 74743 0.01 0.02
gaia 14448 0.07 0.08 7740 0.13 1.50
artemis 12765 0.08 0.09 43692 0.02 0.02
pgnode2-dev 6252 0.16 0.33 8257 0.12 0.33
#!/bin/bash
# Enhanced FIO Disk Benchmark Script with Markdown Output to .txt

TEST_FILE="/tmp/fio_test_file" # Change path if needed
TEST_SIZE="2G"                 # Size of test file
RUNTIME="30"                   # Duration of each test in seconds
BLOCK_SIZE_SEQ="1M"            # Block size for sequential tests
BLOCK_SIZE_RAND="4k"           # Block size for random tests
NUMJOBS="1"                    # Number of parallel jobs
RESULT_FILE="fio_results.txt"  # Markdown log file

# Ensure jq is installed
if ! command -v jq &> /dev/null; then
    echo "Error: jq is not installed. Install it with: sudo apt install jq"
    exit 1
fi

# Store results for Markdown table
declare -a TABLE_ROWS

run_test() {
    local name=$1
    local rw=$2
    local bs=$3
    local mode=$4 # read or write for JSON parsing

    fio --name="$name" --rw="$rw" --bs="$bs" --size="$TEST_SIZE" \
        --numjobs="$NUMJOBS" --time_based --runtime="$RUNTIME" \
        --group_reporting --filename="$TEST_FILE" --direct=1 \
        --output-format=json > "${name}.json"

    # Extract metrics from JSON and round to 2 decimals
    BW=$(jq -r ".jobs[0].$mode.bw_bytes/1048576" "${name}.json" | awk '{printf "%.2f", $1}')
    IOPS=$(jq -r ".jobs[0].$mode.iops" "${name}.json" | awk '{printf "%.2f", $1}')
    LAT_AVG=$(jq -r ".jobs[0].$mode.lat_ns.mean/1000000" "${name}.json" | awk '{printf "%.2f", $1}')
    LAT_99=$(jq -r ".jobs[0].$mode.clat_ns.percentile[\"99.000000\"]/1000000" "${name}.json" | awk '{printf "%.2f", $1}')

    # Save row for Markdown table
    TABLE_ROWS+=("| $name | $BW | $IOPS | $LAT_AVG | $LAT_99 |")
}

echo "=== Running FIO Disk Benchmark ==="
run_test "SeqRead" "read" "$BLOCK_SIZE_SEQ" "read"
run_test "SeqWrite" "write" "$BLOCK_SIZE_SEQ" "write"
run_test "RandRead" "randread" "$BLOCK_SIZE_RAND" "read"
run_test "RandWrite" "randwrite" "$BLOCK_SIZE_RAND" "write"

# Cleanup
rm -f "$TEST_FILE"

# Prepare Markdown table
{
    echo "### FIO Benchmark Results ($(date +'%Y-%m-%d'))"
    echo "| Test Type | BW (MB/s) | IOPS | Avg Lat (ms) | 99th Lat (ms) |"
    echo "|-----------|-----------|------|--------------|---------------|"
    for row in "${TABLE_ROWS[@]}"; do
        echo "$row"
    done
    echo
} | tee -a "$RESULT_FILE"

echo "Benchmark complete. Results appended to $RESULT_FILE"

Postgres Benchmark

Postgres benchmark
sudo -u postgres createdb pgbench_test
sudo -u postgres pgbench -i -s 10 pgbench_test
sudo -u postgres pgbench -c 30 -j 4 -T 120 pgbench_test # write
sudo -u postgres pgbench -S -c 30 -j 4 -T 120 pgbench_test # read
Node Write TPS Read TPS Read Latency (ms) Write Latency (ms)
demeter 19971 160344 0.187 0.256

Backups

Static assets are backed up via restic. If possible, backups are stored in the nbg1 region (and assets in the fsn1 region).

  • Each backup task has its own CRON systemd timer
  • Scripts are stored in /opt/restic/
  • "packages" backups are separate as the source lives in S3 already. A mirror of the bucket is synced every hour via rclone.

Restore

Restore instructions

Restic backups are configured per-host in host_vars/. To restore from a backup:

  1. Set up env

    export RESTIC_REPOSITORY="s3:https://nbg1.your-objectstorage.com/<repo>"
    export RESTIC_PASSWORD="<restic-password>"
    export AWS_ACCESS_KEY_ID="<s3-access-key>"
    export AWS_SECRET_ACCESS_KEY="<s3-secret-key>"
    export AWS_DEFAULT_REGION="nbg1"
    
  2. List available snapshots

    restic snapshots
    
  3. Restore a specific snapshot

    # Restore to original location
    restic restore <snapshot-id> --target /
    
    # Or restore to a different location
    restic restore <snapshot-id> --target /tmp/restore
    
    # Restore only specific files/paths
    restic restore <snapshot-id> --target /tmp/restore --include <path>
    

Useful restic commands

# Check repository integrity
restic check

# Show differences between snapshots
restic diff <snapshot1-id> <snapshot2-id>

Maintenance Protocol

(Copy to issue and apply step by step)

  • 1. Run just pb restic prod with restic_backup_now: true to create fresh backups of all important data
  • 2. Enable maintenance mode:
    • 2.1. Haproxy: run haproxy-maintenance enable git 30 (adjust time as needed)
    • 2.2. Gatus: Tweak the maintenance-windows setting in the host_vars config file and enable it by applying just pb gatus prod
  • 3. Reboot nodes (codefloe, misc, PG replicas, PG master)
  • 4. Additional tasks
  • 5. If finished early, disable maintenance mode manually: haproxy-maintenance disable git. Both Gatus and HAProxy maintenance windows expire automatically, but users can't interact with the service until the maintenance window is over.

Git

Custom icons
  • repo carbon:repo-source-code
  • org carbon:building
  • comment mdi-light:comment
  • PR carbon:pull-request
  • tag carbon:tag
  • settings carbon:settings
  • merge: carbon:pull-request
  • mirror octicon:mirror
  • bell carbon:notification
  • plus mdi:plus
  • trash ph:trash-light
  • lock carbon:lock
  • unlock carbon:unlock
  • pin: mdi-light:pin
  • pin-slash: mdi-light:pin-off
  • mute: mdi-light:volume-mute
  • unmute: mdi-light:volume-high
  • key: material-symbols-light:key-outline
  • copy: octicon:copy-24
  • git-merge: carbon:merge
  • smiley: ph:smiley-wink-light
  • repo-forked: carbon:fork
  • star: carbon:star
  • eye: lineicons:eye
  • pulse: ph:pulse-light
  • question: material-symbols-light:help-outline
  • tools: carbon:settings
  • issue-opened: octicon:issue-opened-24
  • issue-closed: octicon:issue-closed-24
  • code: material-symbols-light:code
  • database: material-symbols-light:database-outline
  • git-branch: carbon:branch
  • history: material-symbols-light:history
  • milestone: octicon:milestone-24
  • search: material-symbols-light:search
  • sign-out: carbon:logout
  • book: carbon:book
  • pencil: material-symbols-light:ink-pen-outline
  • light-bulb: octicon:light-bulb-24
  • info: carbon:information
  • report: carbon:warning
  • person: carbon:user
  • server: circum:server
  • project-symlink: si:projects-line
  • package: material-symbols-light:package-2-outline
  • calendar: mdi-light:calendar
  • people: carbon:group
  • container: octicon:container-24
  • download: material-symbols-light:download
  • cpu: carbon:chip
  • rss: mdi-light:rss
  • terminal: material-symbols-light:terminal
  • globe: material-symbols-light:globe
  • filter: material-symbols-light:filter-list
  • repo-push: octicon:repo-push-24
  • file-zip: material-symbols-light:folder-zip-outline
  • clock: mdi-light:clock
  • apps: material-symbols-light:apps
  • note: material-symbols-light:notes-rounded

Notes:

  • Icons must be changed in the source code directly as they are included during the UI build
  • Icon sizes can only be changed by adjusting the size param in the svg component, e.g. <svg-icon name="octicon-repo" :size="20" class="tw-ml-1 tw-mt-0.5"/>
  • The respective icon-name class must be added to the SVGs, e.g. class="svg octicon octicon-comment-discussion"

Changelog panel in home view

CodeFloe-specific mods

Date PR Purpose Merged into Forgejo Merge Date FJ refs
2025-07-24 #2 Improved commit-history view on mobile Issue
2025-07-24 #8 Version helper for semver version in footer from release branches
2025-08-11 #11 Support for file icon sets

DB

Postgres HA, self-managed on Hetzner (Cloud) VMs. While the NVMe disks of Hetzner cloud VMs are not the most performant on the market, they offer a good balance between performance and cost. Scaling up to 32 GB of memory is easily possible. Thanks to (HAProxy) load balancing and connection pooling, the DB shouldn't become a performance bottleneck for quite some time.

Backups are performed every day (diff) and weekly (full) through pgbackrest.

Setup

  • HAProxy load balancer as single point of entry. Forwards to connection poolers (pgbouncer) for primary and read replicas.
  • All read queries are load balanced across the read-replicas. Primary is used for writes. Forgejo will have support for splitting read/write queries starting in v12.
  • PGBouncer "transaction" mode (which would be a bit faster) does not work with Forgejo/Gitea. Forced to use "session" mode instead.

Backup

Backup details
  • Via CRON and pgbackrest (see pgbackrest for details) to S3 - /etc/cron.d/pgbackrest-codefloe
  • Full backup once a week (00 3 * * 0)
  • Diff backup daily (00 3 * * 1-6)
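The entries in /etc/cron.d/pgbackrest-codefloe look roughly like this (a sketch; the deployed file is generated by Ansible and may carry additional options):

# full backup, Sundays at 03:00
0 3 * * 0 postgres pgbackrest --stanza=codefloe --type=full backup
# diff backup, Monday to Saturday at 03:00
0 3 * * 1-6 postgres pgbackrest --stanza=codefloe --type=diff backup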
su postgres
pgbackrest info
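# point-in-time recovery via the autobase deploy_pgcluster playbook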
ansible-playbook deploy_pgcluster.yml -t point_in_time_recovery -e "disable_archive_command=false"

Manual:

su postgres
# restore
pgbackrest --stanza=codefloe --set=20250413-030002F restore --delta
# start PG
/usr/pgsql-17/bin/pg_ctl start \
  -D /var/lib/pgsql/17/data \
  -w -t 3600 \
  -o "--config-file=/var/lib/pgsql/17/data/postgresql.conf" \
  -o "-c restore_command='pgbackrest --stanza=codefloe archive-get %f %p'" \
  -o "-c archive_command=/bin/true"

Restore: https://autobase.tech/docs/management/restore

Major upgrade

  1. Check for compatibility with the new version: ansible-playbook -e "pg_old_version=16 pg_new_version=17" --tags 'pre-checks,upgrade-check' -i inventory -D pg_upgrade.yml
  2. Perform upgrade: ansible-playbook -e "pg_old_version=16 pg_new_version=17" -i inventory -D pg_upgrade.yml
  3. Update the postgresql_version variable to the new version

Connecting

export PGHOST=10.10.5.2
export PGPORT=5000 # primary
export PGPORT=5001 # replicas
export PGUSER=postgres
export PGDATABASE=postgres
export PGPASSWORD=
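To verify which endpoint was reached (primary on port 5000 vs. a read replica on port 5001):

# returns f on the primary, t on a replica
psql -tAc "SELECT pg_is_in_recovery();"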

Troubleshooting

etcd

If etcd is unhealthy, e.g. due to inconsistent certificates, the easiest fix is to wipe the cluster and start fresh:

  1. rm -rf /var/lib/etcd/* on each node

  2. Run the etcd_cluster autobase playbook via

    - name: Run Autobase etcd_cluster
      ansible.builtin.import_playbook: vitabaks.autobase.etcd_cluster
    

    This will recreate all certs and restore the etcd cluster. It will NOT wipe any patroni data. Patroni will continue to work as before.

patroni

If patroni becomes unhealthy, it might also be because the certs referenced in /etc/patroni/patroni.yml do not match all hostnames and nodes. To regenerate them, run the config_pgcluster autobase playbook with -e tls_cert_regenerate=true.
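For example (a sketch; the inventory path depends on the environment, and this assumes the playbook is exposed by the vitabaks.autobase collection like etcd_cluster above):

# regenerate TLS certs for all patroni nodes
ansible-playbook -i environments/prod vitabaks.autobase.config_pgcluster -e tls_cert_regenerate=true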

Ceph

WIP

User Management

Two-fold: UNIX user on the hosts and users for accessing hosted services (secret store, monitoring, cloud).

Access to hosts is declared transparently in the active_users dictionary in group_vars/all.yml through the hosts variable, including the month when access was granted.

Secret Store

OpenBao (WIP)

CI/CD

  • One amd64 and one arm64 runner are provided globally
  • Users can add their own runners (for both Crow and Actions)

Forgejo Runner Setup

The infrastructure runs Forgejo Actions runners for CI/CD workflows, supporting both amd64 and arm64 architectures.

Architecture

Runner Deployment:

  • Deployed via Ansible role devxy.cicd.forgejo_runner
  • Playbook: playbooks/playbook-forgejo-runner.yaml
  • Runs on hosts in the ci_agent inventory group
  • Runner version managed via Renovate (currently v11.3.1)

Container Runtime:

  • Uses Docker-in-Docker (DinD) for job isolation
  • Supports both IPv4 and IPv6 networking
  • Custom network subnets for container isolation

Cache Architecture:

  • Distributed cache system for Actions artifacts and dependencies
  • Cache host: artemis (10.10.5.5) runs the cache server on port 4001
  • Cache proxy: Each runner node runs a local proxy on port 4000
  • Runners access cache via Docker bridge gateway: http://{{ ansible_docker0.ipv4.address }}:4000
  • Cache directory: /opt/data/forgejo-actions-cache (on artemis)
  • Shared cache secret for authentication across all runners

Image Registry:

All container images are mirrored from data.forgejo.org/oci/ for reliability and reduced external dependencies.

Network Configuration

IPv4/IPv6 Dual Stack:

  • DinD network: 172.80.0.0/16 (IPv4), fd00:d0ca:2:1::/80 (IPv6)
  • Internal network: fd00:d0ca:2:2::/80 (IPv6)
  • Host networks: Runners can access Forgejo instances on internal IPs

Docker Configuration:

  • Docker bridge gateway dynamically resolved via ansible_docker0.ipv4.address
  • Default bridge: 172.17.0.1 (typically, but queried dynamically)
  • Custom address pools prevent IP conflicts across multiple runners
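To check the gateway address the runners actually use on a given node (a quick sketch; the cache proxy check assumes the default bridge address):

# what ansible_docker0.ipv4.address resolves to
ip -4 addr show docker0
# quick reachability check of the local cache proxy
curl -sI http://172.17.0.1:4000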

HAProxy

General

  • IPv4 and IPv6
  • HTTP/3 via QUIC-enabled wolfSSL

Debugging commands

# View stick table entries
echo "show table per_ip_rates" | socat stdio /var/lib/haproxy/stats

# Watch specific IP
echo "show table per_ip_rates data.http_req_rate" | socat stdio /var/lib/haproxy/stats | grep <IP>

# Clear an IP from rate limit table
echo "clear table per_ip_rates key <IP>" | socat stdio /var/lib/haproxy/stats

Installation

Alma9 ships HAProxy 2.4 (2023), hence it is installed from source. Building from source is needed anyway to provide HTTP/3 support using openssl/quictls.

zenetys/rpm-haproxy provides an easy way to build HAProxy from source, though it lacks some aarch64 libs. To bundle HAProxy with a custom SSL lib, it needs to be built from source anyway.
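A build sketch for the wolfSSL/QUIC combination (flags and paths depend on the HAProxy and wolfSSL versions in use; see the HAProxy INSTALL notes):

# build HAProxy with QUIC/HTTP3 against a locally installed wolfSSL (paths are illustrative)
make -j"$(nproc)" TARGET=linux-glibc USE_QUIC=1 USE_OPENSSL_WOLFSSL=1 \
    SSL_INC=/opt/wolfssl/include SSL_LIB=/opt/wolfssl/lib
sudo make install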

fail2ban

fail2ban provides multi-layered intrusion prevention for both SSH and HAProxy, using a firewall backend that adapts to the host's environment. Bans persist across reboots via an SQLite database.

Architecture

  • Firewall Backend:
    • firewalld with rich rules on non-Kubernetes nodes (modern RHEL 9 approach using nftables)
    • iptables on Kubernetes nodes (k3s) to avoid conflicts with kube-router network policies
    • Automatic backend detection via the which kubectl command
  • Log Backend: systemd journal for real-time log monitoring
  • Ban Persistence: SQLite database (/var/lib/fail2ban/fail2ban.sqlite3) with 24-hour purge age
  • Ports Protected: SSH on ports 22 and 2222, HTTP/HTTPS on ports 80 and 443

SSH Protection (4-Layer System)

SSH protection uses 4 complementary jails targeting different attack patterns:

Jail Purpose Max Retry Find Time Ban Time Target
sshd Basic brute-force 5 failures 10 min 1 hour Standard login attempts
sshd-aggressive Scanner detection 10 failures 5 min 24 hours Persistent scanners
sshd-ddos DoS/flooding 20 failures 1 min 2 hours High-frequency attacks
sshd-long-term Slow attacks 15 failures 1 hour 24 hours Patient attackers

Configuration: /etc/fail2ban/jail.d/sshd.conf

Design rationale:

  • Multiple jails with different time windows catch attackers using various strategies
  • Short windows (1-10 min) catch brute-force attempts
  • Long window (1 hour) catches slow, persistent attacks that spread attempts over time
  • Graduated ban times: temporary bans for quick attempts, 24-hour bans for persistent threats

HAProxy Protection (3-Layer System)

HAProxy protection monitors bad requests, DoS attempts, and scanning behavior:

Jail Purpose Max Retry Find Time Ban Time Target
haproxy-badreq Bad requests 10 requests 10 min 1 hour Malformed HTTP
haproxy-ddos DoS flooding 100 requests 1 min 2 hours High-volume attacks
haproxy-scanner Scanner detection 10 requests 5 min 24 hours Vulnerability scanning

Configuration: /etc/fail2ban/jail.d/haproxy.conf

All HAProxy jails use the same filter (haproxy-badreq) at /etc/fail2ban/filter.d/haproxy-badreq.conf, which detects <BADREQ> entries in HAProxy logs.

Useful Commands

Useful commands
# View all active jails
fail2ban-client status

# View specific jail status and banned IPs
fail2ban-client status sshd
fail2ban-client status haproxy-badreq

# Unban a specific IP
fail2ban-client unban <IP>

# Check ban database
sqlite3 /var/lib/fail2ban/fail2ban.sqlite3 "SELECT * FROM bans;"

# View recent bans
journalctl -u fail2ban -n 100 | grep Ban

# Reload configuration (preserves bans)
fail2ban-client reload

Firewall Integration

On non-Kubernetes nodes (using firewalld):

# View fail2ban rich rules
firewall-cmd --list-rich-rules

# View all firewalld zones
firewall-cmd --list-all-zones

# Check if an IP is blocked
firewall-cmd --query-rich-rule='rule family="ipv4" source address="<IP>" reject'

On Kubernetes nodes (using iptables):

# View fail2ban chains
iptables -L -n --line-numbers

# View specific fail2ban chain
iptables -L f2b-sshd -n -v

Docker Compatibility

The fail2ban configuration is designed to work seamlessly with Docker:

  • firewalld uses a docker-forwarding policy that allows Docker container traffic
  • Docker manages its own docker0 interface and bridge network
  • No manual zone assignment for Docker interfaces (avoids ZONE_CONFLICT errors)
  • On k3s nodes, firewalld is completely disabled to prevent conflicts with kube-router
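To inspect the Docker policy on a non-k3s node (assuming a firewalld version with policy support, as on RHEL/Alma 9):

# show the policy that permits Docker container traffic
firewall-cmd --info-policy=docker-forwarding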

Maintenance

To announce a planned maintenance, use forgejo-notification:

export FORGEJO_HOST="cf"
export FORGEJO_DATA_DIR="/opt/data/forgejo/custom"

forgejo-notification add \
  --title "General Maintenance" \
  --message "Regular server and database maintenance. Estimated duration: 30 minutes" \
  --start "2025-10-28 20:00 CET"

forgejo-notification list

When the maintenance starts, use assets/haproxy/maintenance-mode.sh (deployed to every server running HAProxy and added to $PATH) to enable maintenance mode:

  • haproxy-maintenance enable git 30
  • haproxy-maintenance add-bypass 192.168.1.100
  • haproxy-maintenance status
  • haproxy-maintenance disable git

See also haproxy-maintenance -h.

Networking

Note

Not in use right now

To optimize latency between regions, private WireGuard networks were created between the nodes running HAProxy. On all nodes besides the main node running git, the SSH port was changed and port 22 is watched by HAProxy. This way, HAProxy can redirect traffic to the main node running git over the internal network.

Internal networking

VSwitch setup

  1. Create VLAN interface on robot server:

    # this becomes the private ipv4 of the robot server
    nmcli connection add type vlan \
        con-name vswitch4023 \
        mtu 1400 \
        dev eno1 \
        id 4023 \
        ip4 10.10.5.3/24 \
        gw4 10.10.5.1 \
        ipv4.routes "10.10.0.0/16 10.10.5.1"
    
    nmcli connection up vswitch4023
    
  2. Route all 10.x requests from the robot servers through the vswitch:

    ip route add 10.10.0.0/16 via 10.10.5.1
    

Troubleshooting:

  • ip route get <ip>
  • traceroute -n <ip>

Multi Region latency

https://jawher.me/wireguard-ansible-systemd-ubuntu/

How to test: spin up a remote server next to the region proxy and measure time when connecting directly to the domain (without regional proxy running) and compare with the time when connecting through the proxy (= internal network).
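One way to take the measurements with curl (run from the remote test server against whichever endpoint is being compared):

# connection and total transfer timings
curl -o /dev/null -s -w 'connect: %{time_connect}s  ttfb: %{time_starttransfer}s  total: %{time_total}s\n' https://<endpoint>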

  • US (east)-DE: 0.211 s (32% - 63% speedup)
    • without WG: 0.313783s - 0.562196s (depending how congested the route is)
    • with WG: 0.211686s (quite stable)

Monitoring

Discourse

Installation

git clone https://github.com/discourse/discourse_docker.git /var/discourse
cd /var/discourse
chmod 700 containers
  1. Copy samples/standalone.yml and create containers/app.yml
  2. Edit containers/app.yml and comment out ports and default nginx templates
  3. Set domain name and configure mail
  4. Run /var/discourse/launcher rebuild app

Everything will be bundled in one container named app. Additional webserver config is required to point to unix@/var/discourse/shared/standalone/nginx.http.sock (the socket of the bundled Discourse nginx).

Alter the config in /var/discourse/containers/app.yml. Then run ./launcher rebuild app.

Initially started out with the Bitnami installation. However, sidekiq was not working properly, and there is little support for the Bitnami image but a lot for the official one.
Downside of the official setup: it is a big monolith and one cannot really pick individual components (e.g. PG version or Redis provider). On the other hand, the official image ships some default config tweaks which make the instance feel smoother.