IaC for CodeFloe

Table of Contents

OpenTofu

OpenTofu is used for everything related to infrastructure provisioning: servers, network, storage, DNS, etc.

Environments live in environments/. Each has its own state file stored in S3.
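As a rough sketch (assuming each directory under environments/ is a standalone root module with its own backend configuration), working on one environment looks like this:

cd environments/prod
tofu init    # pulls providers and attaches the S3-backed state for this environment
tofu plan    # preview infrastructure changes
tofu apply   # provision servers, network, storage, DNS, ...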

Ansible

  • Environment-specific inventories are stored in environments/.
  • Playbooks are stored in playbooks/<env>.
  • Roles are stored in roles/<env>.
  • Collections are stored in collections/<env>.

Playbooks can be executed locally via the Justfile's rules:

  • pb <playbook name> <env>: Executes a playbook locally.
  • pb-dry <playbook name> <env>: Executes a playbook in "check mode".
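Under the hood these recipes are thin wrappers around ansible-playbook. Roughly (a sketch only; the exact inventory and playbook paths are defined in the Justfile):

# just pb <playbook name> <env>
ansible-playbook -i environments/<env> playbooks/<env>/<playbook name>.yml

# just pb-dry <playbook name> <env> additionally runs in check mode with a diff
ansible-playbook -i environments/<env> playbooks/<env>/<playbook name>.yml --check --diff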

Architecture overview

  • Postgres DB (HA) via autobase
  • "Storage" (images, LFS, packages) outsourced to S3 (Hetzner)
  • Avatars stay on disk (as loading from CDN takes too long)
  • Backups (repos, avatars, ssh keys, etc.) backed up to S3 (Hetzner & Backblaze) via restic
  • Hetzner servers are used (dedicated Robot machines and Cloud VMs)

Hardware

Hardware table
| Name | Env | CPU | Mem | Disk | OS | GB6 SC | GB6 MC | Used for | Costs/m (€) |
|------|-----|-----|-----|------|----|--------|--------|----------|-------------|
| minerva | prod | Intel XEON E-2176G | 64 GB DDR4 ECC | 2x 960 GB SAMSUNG MZQLB960HAJR-00007 NVME | Alma10 | 1749 | 7352 | Git, DB | 36.7 |
| hades | prod | Intel XEON E-2276G | 64 GB DDR4 ECC | 2x 960 GB SAMSUNG MZQLB960HAJR-00007 NVME | Alma10 | 1749 | 7352 | Git, DB | 37.7 |
| demeter | prod | Intel XEON E-2176G | 64 GB DDR4 ECC | 2x 960 GB SAMSUNG MZQLB960HAJR-00007 NVME | Alma10 | 1749 | 7352 | Git, DB | 40.7 |
| artemis | prod | AMD Ryzen 7 PRO 8700GE | 64 GB DDR5 ECC | 2x 500 GB SAMSUNG MZVL2512HCJQ-00B0 NVME | Alma9 | 2676 | 11864 | CI/CD | 47.3 |
| gaia | prod | Apple M4 | 32 GB DDR5 | 1x 500 GB APPLE SSD AP0512Z | macOS 26 | 3781 | 14858 | CI/CD | - |
| misc | prod | ARMv8 | 8 GB DDR5 | 80 GB NVME SSD | Alma9 | 1079 | 3490 | CI/CD, status, Forum | 6.49 |
| cf-dev | dev | ARMv8 | 4 GB DDR5 | 40 GB NVME SSD | Alma9 | 1035 | 1869 | | 3.79 |
| misc-dev | dev | ARMv8 | 4 GB DDR5 | 40 GB NVME SSD | Alma9 | 1035 | 1869 | CI/CD, status, Forum | 3.79 |

GB6 SC/MC: Geekbench 6 single-core / multi-core score.

Data read/written on the disks when they were acquired:

  • hades:
    • NVME0: 159 TB (read), 1.3 PB (write)
    • NVME1: 63 TB (read), 1.3 PB (write)
  • minerva:
    • NVME0: 97 TB (read), 55 TB (write)
    • NVME1: 85 TB (read), 62 TB (write)
  • demeter:
    • NVME0: 95 TB (read), 51 TB (write)
    • NVME1: 139 TB (read), 47 TB (write)

Disk benchmark

Disk benchmark

Sequential Operations (1M block size)

Server Read BW (MB/s) Read Lat Avg (ms) Read Lat 99th (ms) Write BW (MB/s) Write Lat Avg (ms) Write Lat 99th (ms)
gaia 5125.30 0.20 0.49 2721.74 0.37 0.89
artemis 3488.88 0.29 0.37 1036.53 0.96 2.09
demeter 1943.37 0.51 0.73 1282.39 0.78 1.34
pgnode2-dev 1537.45 0.65 1.20 1101.16 0.90 1.35

Random Operations (4k block size)

Server Read IOPS Read Lat Avg (ms) Read Lat 99th (ms) Write IOPS Write Lat Avg (ms) Write Lat 99th (ms)
demeter 12130 0.08 0.31 74743 0.01 0.02
gaia 14448 0.07 0.08 7740 0.13 1.50
artemis 12765 0.08 0.09 43692 0.02 0.02
pgnode2-dev 6252 0.16 0.33 8257 0.12 0.33
#!/bin/bash
# Enhanced FIO Disk Benchmark Script with Markdown Output to .txt

TEST_FILE="/tmp/fio_test_file" # Change path if needed
TEST_SIZE="2G"                 # Size of test file
RUNTIME="30"                   # Duration of each test in seconds
BLOCK_SIZE_SEQ="1M"            # Block size for sequential tests
BLOCK_SIZE_RAND="4k"           # Block size for random tests
NUMJOBS="1"                    # Number of parallel jobs
RESULT_FILE="fio_results.txt"  # Markdown log file

# Ensure jq is installed
if ! command -v jq &> /dev/null; then
    echo "Error: jq is not installed. Install it with: sudo apt install jq"
    exit 1
fi

# Store results for Markdown table
declare -a TABLE_ROWS

run_test() {
    local name=$1
    local rw=$2
    local bs=$3
    local mode=$4 # read or write for JSON parsing

    fio --name="$name" --rw="$rw" --bs="$bs" --size="$TEST_SIZE" \
        --numjobs="$NUMJOBS" --time_based --runtime="$RUNTIME" \
        --group_reporting --filename="$TEST_FILE" --direct=1 \
        --output-format=json > "${name}.json"

    # Extract metrics from JSON and round to 2 decimals
    BW=$(jq -r ".jobs[0].$mode.bw_bytes/1048576" "${name}.json" | awk '{printf "%.2f", $1}')
    IOPS=$(jq -r ".jobs[0].$mode.iops" "${name}.json" | awk '{printf "%.2f", $1}')
    LAT_AVG=$(jq -r ".jobs[0].$mode.lat_ns.mean/1000000" "${name}.json" | awk '{printf "%.2f", $1}')
    LAT_99=$(jq -r ".jobs[0].$mode.clat_ns.percentile[\"99.000000\"]/1000000" "${name}.json" | awk '{printf "%.2f", $1}')

    # Save row for Markdown table
    TABLE_ROWS+=("| $name | $BW | $IOPS | $LAT_AVG | $LAT_99 |")
}

echo "=== Running FIO Disk Benchmark ==="
run_test "SeqRead" "read" "$BLOCK_SIZE_SEQ" "read"
run_test "SeqWrite" "write" "$BLOCK_SIZE_SEQ" "write"
run_test "RandRead" "randread" "$BLOCK_SIZE_RAND" "read"
run_test "RandWrite" "randwrite" "$BLOCK_SIZE_RAND" "write"

# Cleanup
rm -f "$TEST_FILE"

# Prepare Markdown table
{
    echo "### FIO Benchmark Results ($(date +'%Y-%m-%d'))"
    echo "| Test Type | BW (MB/s) | IOPS | Avg Lat (ms) | 99th Lat (ms) |"
    echo "|-----------|-----------|------|--------------|---------------|"
    for row in "${TABLE_ROWS[@]}"; do
        echo "$row"
    done
    echo
} | tee -a "$RESULT_FILE"

echo "Benchmark complete. Results appended to $RESULT_FILE"

Postgres Benchmark

Postgres benchmark
sudo -u postgres createdb pgbench_test
sudo -u postgres pgbench -i -s 10 pgbench_test
sudo -u postgres pgbench -c 30 -j 4 -T 120 pgbench_test # write
sudo -u postgres pgbench -S -c 30 -j 4 -T 120 pgbench_test # read
Node Write TPS Read TPS Read Latency (ms) Write Latency (ms)
demeter 19971 160344 0.187 0.256

Backups

Static assets are backed up via restic. If possible, backups are stored in the nbg1 region (and assets in the fsn1 region).

  • Each backup task has its own CRON systemd timer
  • Scripts are stored in /opt/restic/
  • "packages" backups are separate as the source lives in S3 already. A mirror of the bucket is synced every hour via rclone.

Restore

Restore instructions

Restic backups are configured per-host in host_vars/. To restore from a backup:

  1. Set up env

    export RESTIC_REPOSITORY="s3:https://nbg1.your-objectstorage.com/<repo>"
    export RESTIC_PASSWORD="<restic-password>"
    export AWS_ACCESS_KEY_ID="<s3-access-key>"
    export AWS_SECRET_ACCESS_KEY="<s3-secret-key>"
    export AWS_DEFAULT_REGION="nbg1"
    
  2. List available snapshots

    restic snapshots
    
  3. Restore a specific snapshot

    # Restore to original location
    restic restore <snapshot-id> --target /
    
    # Or restore to a different location
    restic restore <snapshot-id> --target /tmp/restore
    
    # Restore only specific files/paths
    restic restore <snapshot-id> --target /tmp/restore --include <path>
    

Useful restic commands

# Check repository integrity
restic check

# Show differences between snapshots
restic diff <snapshot1-id> <snapshot2-id>

Maintenance Protocol

(Copy to issue and apply step by step)

  • 1. Run just pb restic prod with restic_backup_now: true to create fresh backups of all important data
  • 2. Enable maintenance mode:
    • 2.1. Haproxy: run haproxy-maintenance enable git 30 (adjust time as needed)
    • 2.2. Gatus: Tweak the maintenance-windows setting in the host_vars config file and enable it by applying just pb gatus prod
  • 3. Reboot nodes (codefloe, misc, PG replicas, PG master)
  • 4. Additional tasks
  • 5. If finished early, disable maintenance mode manually: haproxy-maintenance disable git. Both Gatus and HAProxy maintenance windows expire automatically, but users can't interact with the service until the maintenance window is over.

Git

Custom icons
  • repo carbon:repo-source-code
  • org carbon:building
  • comment mdi-light:comment
  • PR carbon:pull-request
  • tag carbon:tag
  • settings carbon:settings
  • merge: carbon:pull-request
  • mirror octicon:mirror
  • bell carbon:notification
  • plus mdi:plus
  • trash ph:trash-light
  • lock carbon:lock
  • unlock carbon:unlock
  • pin: mdi-light:pin
  • pin-slash: mdi-light:pin-off
  • mute: mdi-light:volume-mute
  • unmute: mdi-light:volume-high
  • key: material-symbols-light:key-outline
  • copy: octicon:copy-24
  • git-merge: carbon:merge
  • smiley: ph:smiley-wink-light
  • repo-forked: carbon:fork
  • star: carbon:star
  • eye: lineicons:eye
  • pulse: ph:pulse-light
  • question: material-symbols-light:help-outline
  • tools: carbon:settings
  • issue-opened: octicon:issue-opened-24
  • issue-closed: octicon:issue-closed-24
  • code: material-symbols-light:code
  • database: material-symbols-light:database-outline
  • git-branch: carbon:branch
  • history: material-symbols-light:history
  • milestone: octicon:milestone-24
  • search: material-symbols-light:search
  • sign-out: carbon:logout
  • book: carbon:book
  • pencil: material-symbols-light:ink-pen-outline
  • light-bulb: octicon:light-bulb-24
  • info: carbon:information
  • report: carbon:warning
  • person: carbon:user
  • server: circum:server
  • project-symlink: si:projects-line
  • package: material-symbols-light:package-2-outline
  • calendar: mdi-light:calendar
  • people: carbon:group
  • container: octicon:container-24
  • download: material-symbols-light:download
  • cpu: carbon:chip
  • rss: mdi-light:rss
  • terminal: material-symbols-light:terminal
  • globe: material-symbols-light:globe
  • filter: material-symbols-light:filter-list
  • repo-push: octicon:repo-push-24
  • file-zip: material-symbols-light:folder-zip-outline
  • clock: mdi-light:clock
  • apps: material-symbols-light:apps
  • note: material-symbols-light:notes-rounded

Notes:

  • Icons must be changed in the source code directly as they are included during the UI build
  • Icon sizes can only be changed by adjusting the size param in the svg component, e.g. <svg-icon name="octicon-repo" :size="20" class="tw-ml-1 tw-mt-0.5"/>
  • The respective icon-name class must be added to the SVGs, e.g. class="svg octicon octicon-comment-discussion"

Changelog panel in home view

CodeFloe-specific mods

Date PR Purpose Merged into Forgejo Merge Date FJ refs
2025-07-24 #2 Improved commit-history view on mobile Issue
2025-07-24 #8 Version helper for semver version in footer from release branches
2025-08-11 #11 Support for file icon sets

DB

Postgres HA, self-managed on Hetzner (Cloud) VMs. While the NVMe disks of Hetzner cloud VMs are not the most performant on the market, they offer a good balance between performance and cost. Scaling up to 32 GB of memory is easily possible. Thanks to (HAProxy) load balancing and connection pooling, the DB shouldn't become a performance bottleneck for quite some time.

Backups are performed every day (diff) and weekly (full) through pgbackrest.

Setup

  • HAProxy load balancer as single point of entry. Forwards to connection poolers (pgbouncer) for primary and read replicas.
  • All read queries are load balanced across the read-replicas. Primary is used for writes. Forgejo will have support for splitting read/write queries starting in v12.
  • PGBouncer "transaction" mode (which would be a bit faster) does not work with Forgejo/Gitea. Forced to use "session" mode instead.

Backup

Backup details
  • Via CRON and pgbackrest (see pgbackrest for details) to S3 - /etc/cron.d/pgbackrest-codefloe
  • Full backup once a week (00 3 * * 0)
  • Diff backup daily (00 3 * * 1-6)
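The entries in /etc/cron.d/pgbackrest-codefloe look roughly like this (a sketch; the deployed file is generated by Ansible and may carry additional options):

# full backup, Sundays at 03:00
0 3 * * 0 postgres pgbackrest --stanza=codefloe --type=full backup
# diff backup, Monday to Saturday at 03:00
0 3 * * 1-6 postgres pgbackrest --stanza=codefloe --type=diff backup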
su postgres
pgbackrest info
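# point-in-time recovery via the autobase deploy_pgcluster playbook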
ansible-playbook deploy_pgcluster.yml -t point_in_time_recovery -e "disable_archive_command=false"

Manual:

su postgres
# restore
pgbackrest --stanza=codefloe --set=20250413-030002F restore --delta
# start PG
/usr/pgsql-17/bin/pg_ctl start \
  -D /var/lib/pgsql/17/data \
  -w -t 3600 \
  -o "--config-file=/var/lib/pgsql/17/data/postgresql.conf" \
  -o "-c restore_command='pgbackrest --stanza=codefloe archive-get %f %p'" \
  -o "-c archive_command=/bin/true"

Restore: https://autobase.tech/docs/management/restore

Major upgrade

  1. Check for compatibility with the new version: ansible-playbook -e "pg_old_version=16 pg_new_version=17" --tags 'pre-checks,upgrade-check' -i inventory -D pg_upgrade.yml
  2. Perform upgrade: ansible-playbook -e "pg_old_version=16 pg_new_version=17" -i inventory -D pg_upgrade.yml
  3. Update the postgresql_version variable to the new version

Connecting

export PGHOST=10.10.5.2
export PGPORT=5000 # primary
export PGPORT=5001 # replicas
export PGUSER=postgres
export PGDATABASE=postgres
export PGPASSWORD=
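To verify which endpoint was reached (primary on port 5000 vs. a read replica on port 5001):

# returns f on the primary, t on a replica
psql -tAc "SELECT pg_is_in_recovery();"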

Troubleshooting

etcd

If etcd is unhealthy, e.g. due to inconsistent certificates, the easiest fix is to wipe the cluster and start fresh:

  1. rm -rf /var/lib/etcd/* on each node

  2. Run the etcd_cluster autobase playbook via

    - name: Run Autobase etcd_cluster
      ansible.builtin.import_playbook: vitabaks.autobase.etcd_cluster
    

    This will recreate all certs and restore the etcd cluster. It will NOT wipe any patroni data. Patroni will continue to work as before.

patroni

If patroni becomes unhealthy, it might also be because the certs referenced in /etc/patroni/patroni.yml do not match all hostnames and nodes. To regenerate them, run the config_pgcluster autobase playbook with -e tls_cert_regenerate=true.
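For example (a sketch; the inventory path depends on the environment, and this assumes the playbook is exposed by the vitabaks.autobase collection like etcd_cluster above):

# regenerate TLS certs for all patroni nodes
ansible-playbook -i environments/prod vitabaks.autobase.config_pgcluster -e tls_cert_regenerate=true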

Ceph

WIP

User Management

Two-fold: UNIX user on the hosts and users for accessing hosted services (secret store, monitoring, cloud).

Access to hosts is declared transparently in the active_users dictionary in group_vars/all.yml through the hosts variable, including the month when access was granted.

Secret Store

OpenBao (WIP)

CI/CD

  • One amd64 and one arm64 runner are provided globally
  • Users can add their own runners (for both Crow and Actions)

Forgejo Runner Setup

The infrastructure runs Forgejo Actions runners for CI/CD workflows, supporting both amd64 and arm64 architectures.

Architecture

Runner Deployment:

  • Deployed via Ansible role devxy.cicd.forgejo_runner
  • Playbook: playbooks/playbook-forgejo-runner.yaml
  • Runs on hosts in the ci_agent inventory group
  • Runner version managed via Renovate (currently v11.3.1)

Container Runtime:

  • Uses Docker-in-Docker (DinD) for job isolation
  • Supports both IPv4 and IPv6 networking
  • Custom network subnets for container isolation

Cache Architecture:

  • Distributed cache system for Actions artifacts and dependencies
  • Cache host: artemis (10.10.5.5) runs the cache server on port 4001
  • Cache proxy: Each runner node runs a local proxy on port 4000
  • Runners access cache via Docker bridge gateway: http://{{ ansible_docker0.ipv4.address }}:4000
  • Cache directory: /opt/data/forgejo-actions-cache (on artemis)
  • Shared cache secret for authentication across all runners

Image Registry:

All container images are mirrored from data.forgejo.org/oci/ for reliability and reduced external dependencies.

Network Configuration

IPv4/IPv6 Dual Stack:

  • DinD network: 172.80.0.0/16 (IPv4), fd00:d0ca:2:1::/80 (IPv6)
  • Internal network: fd00:d0ca:2:2::/80 (IPv6)
  • Host networks: Runners can access Forgejo instances on internal IPs

Docker Configuration:

  • Docker bridge gateway dynamically resolved via ansible_docker0.ipv4.address
  • Default bridge: 172.17.0.1 (typically, but queried dynamically)
  • Custom address pools prevent IP conflicts across multiple runners
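To check the gateway address the runners actually use on a given node (a quick sketch; the cache proxy check assumes the default bridge address):

# what ansible_docker0.ipv4.address resolves to
ip -4 addr show docker0
# quick reachability check of the local cache proxy
curl -sI http://172.17.0.1:4000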

HAProxy

General

  • IPv4 and IPv6
  • HTTP/3 via QUIC-enabled wolfSSL

Debugging commands

# View stick table entries
echo "show table per_ip_rates" | socat stdio /var/lib/haproxy/stats

# Watch specific IP
echo "show table per_ip_rates data.http_req_rate" | socat stdio /var/lib/haproxy/stats | grep <IP>

# Clear an IP from rate limit table
echo "clear table per_ip_rates key <IP>" | socat stdio /var/lib/haproxy/stats

Installation

Alma9 ships HAProxy 2.4 (2023), hence it is installed from source. Building from source is needed anyway to provide HTTP/3 support using openssl/quictls.

zenetys/rpm-haproxy provides an easy way to build HAProxy from source, though it lacks some aarch64 libs. To bundle HAProxy with a custom SSL lib, it needs to be built from source anyway.
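A build sketch for the wolfSSL/QUIC combination (flags and paths depend on the HAProxy and wolfSSL versions in use; see the HAProxy INSTALL notes):

# build HAProxy with QUIC/HTTP3 against a locally installed wolfSSL (paths are illustrative)
make -j"$(nproc)" TARGET=linux-glibc USE_QUIC=1 USE_OPENSSL_WOLFSSL=1 \
    SSL_INC=/opt/wolfssl/include SSL_LIB=/opt/wolfssl/lib
sudo make install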

fail2ban

fail2ban provides multi-layered intrusion prevention for both SSH and HAProxy, using a firewall backend that adapts to the host's environment. Bans persist across reboots via an SQLite database.

Architecture

  • Firewall Backend:
    • firewalld with rich rules on non-Kubernetes nodes (modern RHEL 9 approach using nftables)
    • iptables on Kubernetes nodes (k3s) to avoid conflicts with kube-router network policies
    • Automatic backend detection via the which kubectl command
  • Log Backend: systemd journal for real-time log monitoring
  • Ban Persistence: SQLite database (/var/lib/fail2ban/fail2ban.sqlite3) with 24-hour purge age
  • Ports Protected: SSH on ports 22 and 2222, HTTP/HTTPS on ports 80 and 443

SSH Protection (4-Layer System)

SSH protection uses 4 complementary jails targeting different attack patterns:

Jail Purpose Max Retry Find Time Ban Time Target
sshd Basic brute-force 5 failures 10 min 1 hour Standard login attempts
sshd-aggressive Scanner detection 10 failures 5 min 24 hours Persistent scanners
sshd-ddos DoS/flooding 20 failures 1 min 2 hours High-frequency attacks
sshd-long-term Slow attacks 15 failures 1 hour 24 hours Patient attackers

Configuration: /etc/fail2ban/jail.d/sshd.conf

Design rationale:

  • Multiple jails with different time windows catch attackers using various strategies
  • Short windows (1-10 min) catch brute-force attempts
  • Long window (1 hour) catches slow, persistent attacks that spread attempts over time
  • Graduated ban times: temporary bans for quick attempts, 24-hour bans for persistent threats

HAProxy Protection (3-Layer System)

HAProxy protection monitors bad requests, DoS attempts, and scanning behavior:

Jail Purpose Max Retry Find Time Ban Time Target
haproxy-badreq Bad requests 10 requests 10 min 1 hour Malformed HTTP
haproxy-ddos DoS flooding 100 requests 1 min 2 hours High-volume attacks
haproxy-scanner Scanner detection 10 requests 5 min 24 hours Vulnerability scanning

Configuration: /etc/fail2ban/jail.d/haproxy.conf

All HAProxy jails use the same filter (haproxy-badreq) at /etc/fail2ban/filter.d/haproxy-badreq.conf, which detects <BADREQ> entries in HAProxy logs.

Useful Commands

Useful commands
# View all active jails
fail2ban-client status

# View specific jail status and banned IPs
fail2ban-client status sshd
fail2ban-client status haproxy-badreq

# Unban a specific IP
fail2ban-client unban <IP>

# Check ban database
sqlite3 /var/lib/fail2ban/fail2ban.sqlite3 "SELECT * FROM bans;"

# View recent bans
journalctl -u fail2ban -n 100 | grep Ban

# Reload configuration (preserves bans)
fail2ban-client reload

Firewall Integration

On non-Kubernetes nodes (using firewalld):

# View fail2ban rich rules
firewall-cmd --list-rich-rules

# View all firewalld zones
firewall-cmd --list-all-zones

# Check if an IP is blocked
firewall-cmd --query-rich-rule='rule family="ipv4" source address="<IP>" reject'

On Kubernetes nodes (using iptables):

# View fail2ban chains
iptables -L -n --line-numbers

# View specific fail2ban chain
iptables -L f2b-sshd -n -v

Docker Compatibility

The fail2ban configuration is designed to work seamlessly with Docker:

  • firewalld uses a docker-forwarding policy that allows Docker container traffic
  • Docker manages its own docker0 interface and bridge network
  • No manual zone assignment for Docker interfaces (avoids ZONE_CONFLICT errors)
  • On k3s nodes, firewalld is completely disabled to prevent conflicts with kube-router
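To inspect the Docker policy on a non-k3s node (assuming a firewalld version with policy support, as on RHEL/Alma 9):

# show the policy that permits Docker container traffic
firewall-cmd --info-policy=docker-forwarding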

Maintenance

To announce a planned maintenance, use forgejo-notification:

export FORGEJO_HOST="cf"
export FORGEJO_DATA_DIR="/opt/data/forgejo/custom"

forgejo-notification add \
  --title "General Maintenance" \
  --message "Regular server and database maintenance. Estimated duration: 30 minutes" \
  --start "2025-10-28 20:00 CET"

forgejo-notification list

When the maintenance starts, use assets/haproxy/maintenance-mode.sh (deployed to every server running HAProxy and added to $PATH) to enable maintenance mode:

  • haproxy-maintenance enable git 30
  • haproxy-maintenance add-bypass 192.168.1.100
  • haproxy-maintenance status
  • haproxy-maintenance disable git

See also haproxy-maintenance -h.

Networking

Note

Not in use right now

To optimize latency between regions, private WireGuard networks were created between the nodes running HAProxy. On all nodes besides the main node running git, the SSH port was changed and port 22 is watched by HAProxy. This way, HAProxy can redirect traffic to the main node running git over the internal network.

Internal networking

VSwitch setup

  1. Create VLAN interface on robot server:

    # this becomes the private ipv4 of the robot server
    nmcli connection add type vlan \
        con-name vswitch4023 \
        mtu 1400 \
        dev eno1 \
        id 4023 \
        ip4 10.10.5.3/24 \
        gw4 10.10.5.1 \
        ipv4.routes "10.10.0.0/16 10.10.5.1"
    
    nmcli connection up vswitch4023
    
  2. Route all 10.x requests from the robot servers through the vswitch:

    ip route add 10.10.0.0/16 via 10.10.5.1
    

Troubleshooting:

  • ip route get <ip>
  • traceroute -n <ip>

Multi Region latency

https://jawher.me/wireguard-ansible-systemd-ubuntu/

How to test: spin up a remote server next to the region proxy and measure time when connecting directly to the domain (without regional proxy running) and compare with the time when connecting through the proxy (= internal network).
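One way to take the measurements with curl (run from the remote test server against whichever endpoint is being compared):

# connection and total transfer timings
curl -o /dev/null -s -w 'connect: %{time_connect}s  ttfb: %{time_starttransfer}s  total: %{time_total}s\n' https://<endpoint>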

  • US (east)-DE: 0.211 s (32% - 63% speedup)
    • without WG: 0.313783s - 0.562196s (depending how congested the route is)
    • with WG: 0.211686s (quite stable)

Monitoring

Discourse

Installation

git clone https://github.com/discourse/discourse_docker.git /var/discourse
cd /var/discourse
chmod 700 containers
  1. Copy samples/standalone.yml and create containers/app.yml
  2. Edit containers/app.yml and comment out ports and default nginx templates
  3. Set domain name and configure mail
  4. Run /var/discourse/launcher rebuild app

Everything will be bundled in one container named app. Additional webserver config is required to point to unix@/var/discourse/shared/standalone/nginx.http.sock (the socket of the bundled Discourse nginx).

Alter the config in /var/discourse/containers/app.yml. Then run ./launcher rebuild app.

Initially started out with the Bitnami installation. However, sidekiq was not working properly, and there is little support for the Bitnami image but a lot for the official one.
Downside of the official setup: it is a big monolith and one cannot really pick individual components (e.g. PG version or Redis provider). On the other hand, the official image ships some default config tweaks which make the instance feel smoother.