Guardian is a monitoring and auto-remediation system for a small Linux fleet. It demonstrates Prometheus + Alertmanager + Grafana observability, Ansible-driven configuration, and a controlled webhook runbook loop.
- Control plane: Prometheus, Alertmanager, Grafana, remediation webhook, Caddy (TLS ingress) via `docker-compose`.
- Fleet model: workload hosts plus a dedicated drill host for safe failure injection.
- Alert loop: metric breach -> Alertmanager -> Slack + webhook -> whitelisted runbook -> recovery signal.
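The "whitelisted runbook" step is the safety gate in that loop: the webhook only acts on alerts it recognizes. A minimal sketch of that check, assuming a simple alert-name-to-runbook mapping (the alert names and runbook paths below are illustrative, not the service's real ones; the 400-on-unknown behaviour matches the `test_unknown_alert_returns_400` test named later in this README):

```python
# Hypothetical whitelist; the real mapping lives in webhook/ and its
# alert/runbook names are assumptions here.
RUNBOOKS = {
    "HighCpuUsage": "runbooks/restart-workload.sh",
    "ExporterDown": "runbooks/restart-exporter.sh",
}

def handle_alert(payload: dict) -> tuple[int, str]:
    """Return (HTTP status, body) for an Alertmanager webhook payload."""
    alerts = payload.get("alerts", [])
    if not alerts:
        return 400, "no alerts in payload"
    name = alerts[0].get("labels", {}).get("alertname")
    runbook = RUNBOOKS.get(name)
    if runbook is None:
        # Unknown alerts are rejected, never guessed at.
        return 400, f"unknown alert: {name}"
    return 200, f"executing {runbook}"
```

The point of the whitelist is that remediation is opt-in per alert: a new alert rule produces Slack noise but no automated action until an operator adds a runbook entry.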
Source-of-truth docs: `docs/PRD.md`, `docs/ARCHITECTURE.md`, `docs/DEPLOYMENT.md`.
```
ansible/       Fleet and control-plane provisioning playbooks/roles
exporter/      Custom Python exporter service and tests
webhook/       Remediation webhook service, runbooks, tests
prometheus/    Scrape config and alert rules
alertmanager/  Alert routing config
grafana/       Datasource/dashboard provisioning
runbooks/      Human-readable runbooks
scripts/       Drill and operator helper scripts
docs/          PRD, architecture, SLO, MTTR, postmortem template
```
- Copy environment template: `cp .env.example .env`
- Set `SLACK_WEBHOOK_URL` to your Slack incoming webhook. The Slack destination is configuration, not application state.
- Start stack: `docker compose up --build`
- Access:
  - Prometheus: http://localhost:9090
  - Alertmanager: http://localhost:9093
  - Grafana (direct): http://localhost:3000
  - Grafana via Caddy TLS: https://grafana.localtest.me
  - Public remediation endpoint via Caddy TLS: https://webhook.localtest.me/remediate
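Because the Slack destination is configuration, notifying Slack reduces to one environment read plus one HTTP POST. A stdlib-only sketch of that shape (the payload format is Slack's standard incoming-webhook `{"text": ...}` body; nothing here is Guardian-specific):

```python
import json
import os
import urllib.request

def notify_slack(text: str) -> int:
    """POST a message to the Slack incoming webhook named in the environment.

    Returns the HTTP status code. Raises KeyError if SLACK_WEBHOOK_URL is
    unset, which keeps the Slack destination as configuration, not code.
    """
    url = os.environ["SLACK_WEBHOOK_URL"]
    req = urllib.request.Request(
        url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Failing loudly on a missing `SLACK_WEBHOOK_URL` is deliberate: a misconfigured notifier that silently drops alerts is worse than one that refuses to start.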
- Exporter tests: `uv run --project exporter --extra dev pytest exporter/tests`
- Webhook tests: `uv run --project webhook --extra dev pytest webhook/tests`
- Run one exporter test: `uv run --project exporter --extra dev pytest exporter/tests/test_app.py::test_metrics_endpoint_emits_expected_metrics`
- Run one webhook test: `uv run --project webhook --extra dev pytest webhook/tests/test_webhook.py::test_unknown_alert_returns_400`
- Python lint: `uv run --with ruff ruff check exporter webhook`
- Ansible lint: `uv run --with ansible-core --with ansible-lint ansible-lint ansible/site.yml`
- Deploy syntax check: `uv run --with ansible-core ansible-playbook --syntax-check -i ansible/inventory/hosts.ini ansible/site.yml`
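The exporter test above asserts that `/metrics` emits expected metrics. What Prometheus scrapes is the line-oriented text exposition format, which is easy to check in a test; here is a minimal parser sketch for un-labelled samples (the metric name used below is made up, not one of the exporter's real metrics):

```python
def parse_metrics(body: str) -> dict[str, float]:
    """Parse un-labelled samples from Prometheus text exposition format.

    Skips # HELP / # TYPE comments and blank lines; labelled series
    (anything containing '{') are ignored to keep the sketch short.
    """
    samples: dict[str, float] = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        name, _, value = line.partition(" ")
        samples[name] = float(value)
    return samples

# Illustrative scrape body; guardian_cpu_seconds_total is a made-up name.
BODY = """\
# HELP guardian_cpu_seconds_total Total CPU seconds.
# TYPE guardian_cpu_seconds_total counter
guardian_cpu_seconds_total 12.5
"""
```

A test in this style can then assert both that an expected metric name is present and that its value is sane, which is roughly what `test_metrics_endpoint_emits_expected_metrics` suggests the real suite does.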
`webhook/uv.lock` and `exporter/uv.lock` are intentionally ignored. Guardian does not currently treat subproject `uv.lock` files as committed source of truth; local testing should use `uv run --project ...` and keep those lockfiles out of git.
The full operator guide is in docs/DEPLOYMENT.md. The short version is:
- Configure inventory in `ansible/inventory/hosts.ini`. Adding or removing monitored hosts is an inventory change followed by an Ansible run.
- Ensure each host has:
  - SSH access from the Ansible runner
  - Python installed
  - a deploy user with the required sudo rights
  - firewall rules allowing the control plane to scrape exporter ports
- Provide runtime secrets to the Ansible runner. Ansible writes them to `/opt/guardian/.env` on the control-plane VPS with `0600` permissions.
- Manual deploy is the default operating path. GitHub-hosted runners cannot currently reach the fleet over SSH, so CI deploy remains disabled unless you later introduce a self-hosted or otherwise network-reachable runner.
- Configure GitHub Actions repository settings if deploys will later run from a self-hosted or otherwise network-reachable runner:
  - Secrets: `ANSIBLE_SSH_PRIVATE_KEY`, `ANSIBLE_KNOWN_HOSTS`, `SLACK_WEBHOOK_URL`, `GRAFANA_ADMIN_PASSWORD`, `WEBHOOK_INTERNAL_TOKEN`, `GUARDIAN_HMAC_SECRET`
  - Variables: `SLACK_CHANNEL`, `GRAFANA_HOST`, `WEBHOOK_HOST`
- Validate playbooks: `uv run --with ansible-core ansible-playbook --syntax-check -i ansible/inventory/hosts.ini ansible/site.yml`
- Apply playbook: `uv run --with ansible-core ansible-playbook -i ansible/inventory/hosts.ini ansible/site.yml`
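The `GUARDIAN_HMAC_SECRET` secret implies the public `/remediate` endpoint authenticates callers with an HMAC over the request. This README does not document the exact scheme, so the following is only a sketch under assumed conventions (hex-encoded HMAC-SHA256 of the raw request body; the header name carrying it would be the webhook service's choice):

```python
import hashlib
import hmac
import os

def verify_signature(body: bytes, signature_hex: str) -> bool:
    """Check an assumed HMAC-SHA256 signature of the raw request body.

    GUARDIAN_HMAC_SECRET is real configuration from this README, but the
    signing scheme here is an assumption; adjust to match webhook/'s code.
    """
    secret = os.environ["GUARDIAN_HMAC_SECRET"].encode()
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position via timing.
    return hmac.compare_digest(expected, signature_hex)
```

Whatever the real scheme is, the comparison should go through `hmac.compare_digest` rather than `==` so signature checks run in constant time.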
- For the current live environment, the operator workflow is:

```
set -a
source <(ssh guardian-skyserver 'sudo cat /opt/guardian/.env')
set +a
uv run --with ansible-core ansible-playbook --private-key ~/.ssh/guardian_deploy_ed25519 -i ansible/inventory/hosts.ini ansible/site.yml
```
Trigger synthetic stress on a target host:

```
scripts/induce-cpu-spike.sh <ssh-host> [duration-seconds]
```

Record timings in `docs/MTTR.md` after each drill.
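Once a few drills are recorded, MTTR is just the mean of the detected-to-recovered intervals. A small sketch of that calculation over ISO-8601 timestamp pairs; the record format below is hypothetical, not something `docs/MTTR.md` prescribes:

```python
from datetime import datetime

def mttr_seconds(drills: list[tuple[str, str]]) -> float:
    """Mean time to recovery over (detected, recovered) ISO-8601 pairs."""
    durations = [
        (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds()
        for start, end in drills
    ]
    return sum(durations) / len(durations)

# Two hypothetical drills: 120 s and 240 s from detection to recovery.
DRILLS = [
    ("2024-05-01T10:00:00", "2024-05-01T10:02:00"),
    ("2024-05-01T11:00:00", "2024-05-01T11:04:00"),
]
```

Tracking the per-drill durations alongside the mean is worth it: a single slow remediation can hide inside an otherwise healthy average.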
The current live inventory is: `guardian-host`, `photon-host`, `drill-host`.
`collaborate-host` is intentionally parked until its scrape path is moved behind a stable route or tunnel.