This project is an MVP autonomous SRE assistant that:
- ingests incidents (logs/errors),
- matches them to Markdown runbook sections,
- suggests a fix command,
- asks for human Yes/No approval,
- runs the command locally on Yes,
- stores a full audit trail with state transitions.
Suggestion generation is Claude API only (no local fallback).
- Python 3.11+
- FastAPI
- SQLite (durable state + events)
- Qdrant (self-hosted, free) for vector retrieval
- BM25 in app for lexical retrieval
- Plain HTML/JS dashboard for demo approval flow
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# start free self-hosted Qdrant
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant

export CLAUDE_API_KEY="your_key_here"
# optional override, default is claude-3-5-sonnet-latest
# export CLAUDE_MODEL="claude-3-5-sonnet-latest"

# matcher config
export QDRANT_URL="http://localhost:6333"
# optional
# export QDRANT_COLLECTION="runbook_sections"
# export MATCHER_EMBEDDING_MODEL="sentence-transformers/all-MiniLM-L6-v2"
```
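The environment variables above can be collected into one config map with the documented defaults. This is a minimal sketch; the app's actual config loading may differ, and `load_config` is a hypothetical helper name.

```python
import os

def load_config() -> dict:
    """Collect runtime settings, falling back to the documented defaults."""
    return {
        # Required: no sensible default, so an empty value should fail fast downstream.
        "claude_api_key": os.environ.get("CLAUDE_API_KEY", ""),
        "claude_model": os.environ.get("CLAUDE_MODEL", "claude-3-5-sonnet-latest"),
        "qdrant_url": os.environ.get("QDRANT_URL", "http://localhost:6333"),
        "qdrant_collection": os.environ.get("QDRANT_COLLECTION", "runbook_sections"),
        "embedding_model": os.environ.get(
            "MATCHER_EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"
        ),
    }

cfg = load_config()
```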
```bash
uvicorn src.sre_agent.main:app --reload
```

Open:
- Dashboard: http://127.0.0.1:8000/
- API docs: http://127.0.0.1:8000/docs
- POST an incident to `/incidents`.
- The system parses runbooks under `demo/runbooks`.
- The hybrid matcher combines Qdrant semantic retrieval + in-app BM25 lexical scoring.
- The Claude API generates a suggested command from the matched section.
- The incident moves to `AWAITING_APPROVAL`.
- Approve with `POST /incidents/{id}/approve`.
- The executor runs the allowed command and updates the final state.
For your target architecture, send Databricks-collected logs/metrics into `/incidents` via a small adapter job/stream. The ingestion schema is intentionally simple for rapid integration.
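An adapter job can be as small as a function mapping one log row onto the ingestion payload. The payload field names below are an assumption for illustration (mirroring the poller's table columns documented later); check the actual `/incidents` schema in the API docs.

```python
import json

def row_to_incident(row: dict) -> dict:
    """Map one collected log row onto an /incidents POST payload.

    The exact /incidents schema is an assumption here; adjust field
    names to whatever the running API actually accepts.
    """
    return {
        "source": "databricks",
        "service": row["service"],
        "message": row["message"],
        "trace": row.get("trace", ""),
        "occurred_at": row["event_ts"],
    }

payload = row_to_incident({
    "event_ts": "2024-05-01T12:00:00Z",
    "source_id": "0001",
    "service": "checkout",
    "message": "java.lang.OutOfMemoryError: Java heap space",
    "trace": "worker crashed under memory pressure",
})
body = json.dumps(payload)  # POST this body to http://127.0.0.1:8000/incidents
```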
You can also run a built-in Databricks poller so this app pulls from your central log table directly.
Set:
```bash
export ENABLE_DATABRICKS_POLLER=true
export DATABRICKS_SERVER_HOSTNAME="adb-xxxx.azuredatabricks.net"
export DATABRICKS_HTTP_PATH="/sql/1.0/warehouses/xxxx"
export DATABRICKS_ACCESS_TOKEN="dapi..."
export DATABRICKS_SOURCE_TABLE="main.observability.sre_logs"
```

Optional:

```bash
export DATABRICKS_POLL_INTERVAL_SECONDS=15
export DATABRICKS_POLL_BATCH_SIZE=50
export DATABRICKS_CHECKPOINT_KEY=default
```

Expected Databricks table columns:
- `event_ts` (timestamp)
- `source_id` (string; monotonic per timestamp for deterministic checkpointing)
- `service` (string)
- `message` (string)
- `trace` (string)
The app persists poll checkpoints in local SQLite (the `databricks_checkpoint` table) and processes each pulled row through the same incident pipeline used by `/incidents`.
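Checkpoint persistence can be sketched with the standard-library `sqlite3` module. The `(key, event_ts, source_id)` layout below is an assumption based on the columns above; the app's real `databricks_checkpoint` schema may differ.

```python
import sqlite3

def init_checkpoints(conn: sqlite3.Connection) -> None:
    """Create the checkpoint table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS databricks_checkpoint (
               key TEXT PRIMARY KEY, event_ts TEXT, source_id TEXT)"""
    )

def save_checkpoint(conn, key: str, event_ts: str, source_id: str) -> None:
    # Upsert so each poll simply moves the watermark forward for its key.
    conn.execute(
        "INSERT INTO databricks_checkpoint(key, event_ts, source_id) VALUES (?, ?, ?) "
        "ON CONFLICT(key) DO UPDATE SET event_ts=excluded.event_ts, "
        "source_id=excluded.source_id",
        (key, event_ts, source_id),
    )

def load_checkpoint(conn, key: str):
    """Return (event_ts, source_id) for a key, or None on the first poll."""
    return conn.execute(
        "SELECT event_ts, source_id FROM databricks_checkpoint WHERE key=?", (key,)
    ).fetchone()

conn = sqlite3.connect(":memory:")  # the app uses a file-backed DB instead
init_checkpoints(conn)
save_checkpoint(conn, "default", "2024-05-01T12:00:00Z", "0001")
```

Rows with `(event_ts, source_id)` greater than the stored checkpoint are the ones fetched on the next poll, which is why `source_id` must be monotonic per timestamp.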
Use the demo container to emit an OOM-style production log and write it to your Databricks source table.
```bash
./demo/docker-incident/run_demo.sh
```

What it does:
- builds a local image `sre-incident-generator`
- runs a real memory-pressure worker with a strict container memory limit
- captures actual container exit metadata (`ExitCode`, `OOMKilled`) and real container logs
- inserts one row into `DATABRICKS_SOURCE_TABLE` with the captured trace
The poller then picks it up on the next interval and creates an incident in the dashboard.
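The exit-metadata capture can be sketched as parsing `docker inspect <container>` output, whose `State` object carries `ExitCode` and `OOMKilled`. The JSON below is a trimmed, hand-written sample, not real captured output.

```python
import json

def extract_exit_metadata(inspect_json: str) -> dict:
    """Pull ExitCode/OOMKilled from `docker inspect <container>` JSON output."""
    state = json.loads(inspect_json)[0]["State"]  # inspect returns a one-element list
    return {"exit_code": state["ExitCode"], "oom_killed": state["OOMKilled"]}

# Trimmed sample of what `docker inspect` reports for an OOM-killed container
# (137 = 128 + SIGKILL, the usual exit code when the kernel OOM-kills a process).
sample = '[{"State": {"Status": "exited", "ExitCode": 137, "OOMKilled": true}}]'
meta = extract_exit_metadata(sample)
```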
- Commands are checked against an allowlist.
- Shell metacharacters are blocked.
- Human approval is mandatory before execution.
- If Claude API is unavailable or returns invalid output, incident processing fails (intentional, no fallback mode).
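The allowlist and metacharacter checks can be sketched as below. The allowlist entries are illustrative assumptions; the app's real allowlist and blocked-character set live in its executor config.

```python
import shlex

ALLOWED_COMMANDS = {"kubectl", "systemctl", "docker"}  # illustrative, not the real list
BLOCKED_CHARS = set(";|&`$<>(){}")  # assumed shell-metacharacter blocklist

def is_command_allowed(command: str) -> bool:
    """Reject shell metacharacters first, then require an allowlisted binary."""
    if any(ch in BLOCKED_CHARS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes etc.
        return False
    return bool(tokens) and tokens[0] in ALLOWED_COMMANDS
```

Checking metacharacters before tokenizing means injection attempts like `kubectl get pods; rm -rf /` are rejected even though the first token is allowlisted.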