Autonomous SRE Agent (Hackathon MVP)

This project is an MVP autonomous SRE assistant that:

ingests incidents (logs/errors),
matches them to Markdown runbook sections,
suggests a fix command,
asks for human Yes/No approval,
runs the command locally on Yes,
stores a full audit trail with state transitions.
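
The audit trail with state transitions could be modeled roughly as below. This is a minimal sketch, not the project's actual schema: the table names (`incidents`, `incident_events`), the helper functions, and every state except AWAITING_APPROVAL are assumptions made for illustration.

```python
import sqlite3

# Hypothetical state graph: only AWAITING_APPROVAL is named in this README.
ALLOWED_TRANSITIONS = {
    "RECEIVED": {"AWAITING_APPROVAL", "FAILED"},
    "AWAITING_APPROVAL": {"EXECUTING", "REJECTED"},
    "EXECUTING": {"RESOLVED", "FAILED"},
}


def init_db(conn):
    """Create the (assumed) current-state table and append-only event log."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS incidents "
        "(id INTEGER PRIMARY KEY, state TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS incident_events ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, incident_id INTEGER NOT NULL, "
        "from_state TEXT, to_state TEXT NOT NULL, "
        "created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )


def create_incident(conn):
    """Insert a new incident and record its first event in one transaction."""
    with conn:
        cur = conn.execute("INSERT INTO incidents (state) VALUES ('RECEIVED')")
        conn.execute(
            "INSERT INTO incident_events (incident_id, from_state, to_state) "
            "VALUES (?, NULL, 'RECEIVED')",
            (cur.lastrowid,),
        )
    return cur.lastrowid


def transition(conn, incident_id, to_state):
    """Move an incident to a new state, rejecting moves not in the graph."""
    row = conn.execute(
        "SELECT state FROM incidents WHERE id = ?", (incident_id,)
    ).fetchone()
    if row is None:
        raise KeyError(incident_id)
    from_state = row[0]
    if to_state not in ALLOWED_TRANSITIONS.get(from_state, set()):
        raise ValueError(f"illegal transition {from_state} -> {to_state}")
    with conn:  # state update and audit event commit together
        conn.execute(
            "UPDATE incidents SET state = ? WHERE id = ?", (to_state, incident_id)
        )
        conn.execute(
            "INSERT INTO incident_events (incident_id, from_state, to_state) "
            "VALUES (?, ?, ?)",
            (incident_id, from_state, to_state),
        )


def history(conn, incident_id):
    """Return the ordered list of states the incident has passed through."""
    rows = conn.execute(
        "SELECT to_state FROM incident_events WHERE incident_id = ? ORDER BY id",
        (incident_id,),
    )
    return [r[0] for r in rows]
```

Writing the state change and the event row in the same transaction means the audit log cannot drift from the incident's current state.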

Suggestions are generated only through the Claude API; there is no local fallback.

Stack

Python 3.11+
FastAPI
SQLite (durable state + events)
Qdrant (self-hosted, free) for vector retrieval
In-app BM25 for lexical retrieval
Plain HTML/JS dashboard for demo approval flow

Quickstart

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# start free self-hosted Qdrant
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant

export CLAUDE_API_KEY="your_key_here"
# optional override, default is claude-3-5-sonnet-latest
# export CLAUDE_MODEL="claude-3-5-sonnet-latest"

# matcher config
export QDRANT_URL="http://localhost:6333"
# optional
# export QDRANT_COLLECTION="runbook_sections"
# export MATCHER_EMBEDDING_MODEL="sentence-transformers/all-MiniLM-L6-v2"

uvicorn src.sre_agent.main:app --reload

Open:

Dashboard: http://127.0.0.1:8000/
API docs: http://127.0.0.1:8000/docs

Main Flow

POST an incident to /incidents.
The system parses the runbooks under demo/runbooks.
The hybrid matcher combines Qdrant semantic retrieval with in-app BM25 lexical scoring.
The Claude API generates a suggested command from the matched section.
The incident moves to AWAITING_APPROVAL.
Approve with POST /incidents/{id}/approve.
The executor runs the allowed command and updates the final state.
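
The hybrid matching step can be illustrated with a small score-fusion sketch. It assumes the semantic scores have already been fetched (here they are plain floats standing in for Qdrant similarity results), and it fuses them with BM25 by min-max normalization and a weighted sum. The function names, the `alpha` weight, and the fusion rule are assumptions; the project's matcher may combine scores differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n if n else 0.0
    # Document frequency: number of docs containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def min_max(values):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(query, sections, semantic_scores, alpha=0.5):
    """Return section indices ordered by fused score, best first.

    alpha weights the semantic score; (1 - alpha) weights BM25.
    """
    lexical = min_max(bm25_scores(query, sections))
    semantic = min_max(semantic_scores)
    fused = [alpha * s + (1 - alpha) * l for s, l in zip(semantic, lexical)]
    return sorted(range(len(sections)), key=lambda i: fused[i], reverse=True)
```

Normalizing before fusing matters because BM25 scores are unbounded while cosine similarities are not, so a raw sum would let the lexical side dominate.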

Databricks Integration Point

For your target architecture, send Databricks-collected logs/metrics into /incidents via a small adapter job/stream. The ingestion schema is intentionally simple for rapid integration.
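
A push-style adapter could look roughly like the sketch below. The payload field names are assumptions borrowed from the poller's table columns; the real /incidents schema is shown in the API docs at /docs and may differ. The URL is the local uvicorn address from the Quickstart.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/incidents"  # local dev address from the Quickstart


def row_to_incident(row):
    """Map one log row to a hypothetical /incidents payload."""
    return {
        "source_id": row["source_id"],
        "service": row["service"],
        "message": row["message"],
        "trace": row.get("trace") or "",
        "event_ts": row["event_ts"],
    }


def build_request(row, url=API_URL):
    """Build (but do not send) the JSON POST request for one row."""
    body = json.dumps(row_to_incident(row)).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(row, url=API_URL, timeout=10):
    """Send one row to the app and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(row, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

In a Databricks job, `send` would be called once per collected row; batching and retries are left out to keep the sketch short.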

You can also run a built-in Databricks poller so this app pulls from your central log table directly.

Databricks pull mode (separate worker style)

Set:

export ENABLE_DATABRICKS_POLLER=true
export DATABRICKS_SERVER_HOSTNAME="adb-xxxx.azuredatabricks.net"
export DATABRICKS_HTTP_PATH="/sql/1.0/warehouses/xxxx"
export DATABRICKS_ACCESS_TOKEN="dapi..."
export DATABRICKS_SOURCE_TABLE="main.observability.sre_logs"

Optional:

export DATABRICKS_POLL_INTERVAL_SECONDS=15
export DATABRICKS_POLL_BATCH_SIZE=50
export DATABRICKS_CHECKPOINT_KEY=default

Expected Databricks table columns:

event_ts (timestamp)
source_id (string; monotonically increasing within a timestamp, so checkpointing is deterministic)
service (string)
message (string)
trace (string)

The app persists poll checkpoints in local SQLite (databricks_checkpoint table) and processes each pulled row through the same incident pipeline used by /incidents.
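
The checkpointing idea can be sketched as follows. The table name `databricks_checkpoint` comes from this README, but its columns, the helper functions, and the use of ISO-8601 strings for `event_ts` (so that string order matches time order) are assumptions. Rows are ordered by the pair (event_ts, source_id), which is why source_id needs to be monotonic within a timestamp.

```python
import sqlite3


def init_checkpoint(conn):
    """Create the checkpoint table (column layout is assumed)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS databricks_checkpoint ("
        "checkpoint_key TEXT PRIMARY KEY, "
        "event_ts TEXT NOT NULL, source_id TEXT NOT NULL)"
    )


def load_checkpoint(conn, key="default"):
    """Return the last processed (event_ts, source_id), or ('', '') if none."""
    row = conn.execute(
        "SELECT event_ts, source_id FROM databricks_checkpoint "
        "WHERE checkpoint_key = ?",
        (key,),
    ).fetchone()
    return tuple(row) if row else ("", "")


def select_new_rows(rows, checkpoint, batch_size=50):
    """Pick rows strictly after the checkpoint, oldest first, up to batch_size."""
    ordered = sorted(rows, key=lambda r: (r["event_ts"], r["source_id"]))
    fresh = [r for r in ordered if (r["event_ts"], r["source_id"]) > checkpoint]
    return fresh[:batch_size]


def save_checkpoint(conn, row, key="default"):
    """Persist the position of the last row that was fully processed."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO databricks_checkpoint "
            "(checkpoint_key, event_ts, source_id) VALUES (?, ?, ?)",
            (key, row["event_ts"], row["source_id"]),
        )
```

Comparing on the pair rather than on the timestamp alone avoids skipping or re-reading rows that share a timestamp with the last processed row.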

Docker-based incident simulation (pushes to Databricks)

Use the demo container to emit an OOM-style production log and write it to your Databricks source table.

./demo/docker-incident/run_demo.sh

What it does:

builds a local image sre-incident-generator
runs a real memory-pressure worker with a strict container memory limit
captures actual container exit metadata (ExitCode, OOMKilled) and real container logs
inserts one row into DATABRICKS_SOURCE_TABLE with the captured trace

The poller then picks it up on the next interval and creates an incident in the dashboard.

Safety Notes

Commands are checked against an allowlist.
Shell metacharacters are blocked.
Human approval is mandatory before execution.
If the Claude API is unavailable or returns invalid output, incident processing fails. This is intentional: there is no fallback mode.
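
The allowlist and metacharacter checks could be combined along these lines. This is a sketch under stated assumptions: the allowed binaries, the blocked character set, and the function name are illustrative and are not taken from the project's executor.

```python
import shlex

# Hypothetical allowlist; the project's real list may differ.
ALLOWED_BINARIES = {"kubectl", "systemctl", "docker"}

# Characters that enable chaining, substitution, redirection, or globbing.
FORBIDDEN_CHARS = set(";&|`$<>(){}\\\n\r*?!~")


def validate_command(command):
    """Return the argv list if the command passes both checks.

    Raises ValueError for shell metacharacters, empty input,
    malformed quoting, or a binary outside the allowlist.
    """
    bad = sorted(FORBIDDEN_CHARS.intersection(command))
    if bad:
        raise ValueError(f"shell metacharacters not allowed: {bad!r}")
    argv = shlex.split(command)  # raises ValueError on unbalanced quotes
    if not argv:
        raise ValueError("empty command")
    if argv[0] not in ALLOWED_BINARIES:
        raise ValueError(f"binary not in allowlist: {argv[0]!r}")
    return argv
```

The returned list is meant to be passed to `subprocess.run(argv, shell=False)`, so the command never goes through a shell even if a check were to miss something.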
