mod-1/buildathon26

Autonomous SRE Agent (Hackathon MVP)

This project is an MVP autonomous SRE assistant that:

  • ingests incidents (logs/errors),
  • matches them to Markdown runbook sections,
  • suggests a fix command,
  • asks for human Yes/No approval,
  • runs the command locally on Yes,
  • stores a full audit trail with state transitions.
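The approval flow and audit trail listed above can be pictured as a small state machine. The sketch below is only an illustration: AWAITING_APPROVAL is the one state name this README mentions (under Main Flow), while the other state names, the `Incident` class, and the transition table are hypothetical and not taken from the project code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical transition table. Only AWAITING_APPROVAL appears in this README;
# every other state name is an assumption made for the example.
ALLOWED_TRANSITIONS = {
    "RECEIVED": {"AWAITING_APPROVAL", "FAILED"},
    "AWAITING_APPROVAL": {"APPROVED", "REJECTED"},
    "APPROVED": {"RESOLVED", "FAILED"},
}


@dataclass
class Incident:
    incident_id: str
    state: str = "RECEIVED"
    audit: list = field(default_factory=list)

    def transition(self, new_state: str, actor: str) -> None:
        """Move to new_state if the table allows it and record an audit event."""
        if new_state not in ALLOWED_TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.audit.append(
            {
                "at": datetime.now(timezone.utc).isoformat(),
                "actor": actor,
                "from": self.state,
                "to": new_state,
            }
        )
        self.state = new_state
```

Recording the event before mutating the state keeps every accepted transition in the trail, and rejected transitions raise without touching either.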

Suggestions are generated exclusively through the Claude API; there is no local fallback.

Stack

  • Python 3.11+
  • FastAPI
  • SQLite (durable state + events)
  • Qdrant (self-hosted, free) for vector retrieval
  • BM25 in app for lexical retrieval
  • Plain HTML/JS dashboard for demo approval flow

Quickstart

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# start free self-hosted Qdrant
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant

export CLAUDE_API_KEY="your_key_here"
# optional override, default is claude-3-5-sonnet-latest
# export CLAUDE_MODEL="claude-3-5-sonnet-latest"

# matcher config
export QDRANT_URL="http://localhost:6333"
# optional
# export QDRANT_COLLECTION="runbook_sections"
# export MATCHER_EMBEDDING_MODEL="sentence-transformers/all-MiniLM-L6-v2"

uvicorn src.sre_agent.main:app --reload

Open:

  • Dashboard: http://127.0.0.1:8000/
  • API docs: http://127.0.0.1:8000/docs

Main Flow

  1. POST incident to /incidents.
  2. System parses runbooks under demo/runbooks.
  3. Hybrid matcher combines Qdrant semantic retrieval + in-app BM25 lexical scoring.
  4. Claude API generates suggested command from matched section.
  5. Incident moves to AWAITING_APPROVAL.
  6. Approve with POST /incidents/{id}/approve.
  7. Executor runs allowed command and updates final state.
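Step 3 can be illustrated with a small score-fusion sketch. It assumes the semantic scores have already been retrieved (for example from Qdrant) and are passed in as a plain list, computes Okapi BM25 in pure Python, min-max normalizes both signals, and blends them with a weight `alpha`. The function names, the whitespace tokenizer, the weighting scheme, and the default parameters are all assumptions; the project's matcher may combine the two signals differently.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of each document for the query (whitespace tokens)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = (sum(len(t) for t in tokenized) / n) or 1.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def min_max(values: list[float]) -> list[float]:
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(query: str, sections: list[str], semantic_scores: list[float],
                alpha: float = 0.5) -> list[int]:
    """Return section indices ordered best-first by the blended score."""
    lexical = min_max(bm25_scores(query, sections))
    semantic = min_max(semantic_scores)
    combined = [alpha * s + (1 - alpha) * l for s, l in zip(semantic, lexical)]
    return sorted(range(len(sections)), key=lambda i: combined[i], reverse=True)
```

Normalizing before blending matters because raw BM25 scores are unbounded while cosine similarities are not, so an unnormalized sum would let one signal dominate.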

Databricks Integration Point

To integrate with Databricks, send the logs and metrics it collects to /incidents through a small adapter job or stream. The ingestion schema is intentionally simple to keep that integration quick.
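A push-style adapter could look roughly like the sketch below, which uses only the standard library. The payload field names (`service`, `message`, `trace`) are an assumption borrowed from the Databricks table columns listed later in this README; check the real request schema at /docs before relying on them. The helper names are hypothetical.

```python
import json
import urllib.request


def build_incident_request(row: dict, base_url: str = "http://127.0.0.1:8000") -> urllib.request.Request:
    """Turn one log row into a POST /incidents request (assumed payload shape)."""
    payload = {
        "service": row["service"],
        "message": row["message"],
        "trace": row.get("trace", ""),
    }
    return urllib.request.Request(
        url=f"{base_url}/incidents",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping request construction separate from sending makes the adapter easy to unit test without a running server.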

You can also run a built-in Databricks poller so this app pulls from your central log table directly.

Databricks pull mode (separate worker style)

Set:

export ENABLE_DATABRICKS_POLLER=true
export DATABRICKS_SERVER_HOSTNAME="adb-xxxx.azuredatabricks.net"
export DATABRICKS_HTTP_PATH="/sql/1.0/warehouses/xxxx"
export DATABRICKS_ACCESS_TOKEN="dapi..."
export DATABRICKS_SOURCE_TABLE="main.observability.sre_logs"

Optional:

export DATABRICKS_POLL_INTERVAL_SECONDS=15
export DATABRICKS_POLL_BATCH_SIZE=50
export DATABRICKS_CHECKPOINT_KEY=default

Expected Databricks table columns:

  • event_ts (timestamp)
  • source_id (string; must increase monotonically among rows that share a timestamp so that checkpointing is deterministic)
  • service (string)
  • message (string)
  • trace (string)

The app persists poll checkpoints in local SQLite (databricks_checkpoint table) and processes each pulled row through the same incident pipeline used by /incidents.
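The role of the (event_ts, source_id) pair in checkpointing can be sketched as keyset pagination. In the example below, SQLite stands in for both the Databricks source table and the local state database so that it runs anywhere. The checkpoint column names, the `sre_logs` stand-in table, the function names, and the `?` parameter style are assumptions, and timestamps are assumed to be ISO-8601 strings so they sort lexicographically. Only the `databricks_checkpoint` table name comes from this README.

```python
import sqlite3


def init_checkpoint_table(conn: sqlite3.Connection) -> None:
    # Column names are hypothetical; only the table name is from the README.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS databricks_checkpoint ("
        "checkpoint_key TEXT PRIMARY KEY, "
        "last_event_ts TEXT NOT NULL, "
        "last_source_id TEXT NOT NULL)"
    )


def load_checkpoint(conn: sqlite3.Connection, key: str = "default") -> tuple:
    row = conn.execute(
        "SELECT last_event_ts, last_source_id FROM databricks_checkpoint "
        "WHERE checkpoint_key = ?",
        (key,),
    ).fetchone()
    # Empty strings sort before any real timestamp or id.
    return row if row else ("", "")


def save_checkpoint(conn: sqlite3.Connection, event_ts: str, source_id: str,
                    key: str = "default") -> None:
    conn.execute(
        "INSERT OR REPLACE INTO databricks_checkpoint "
        "(checkpoint_key, last_event_ts, last_source_id) VALUES (?, ?, ?)",
        (key, event_ts, source_id),
    )
    conn.commit()


def poll_once(source: sqlite3.Connection, state: sqlite3.Connection,
              batch_size: int = 50) -> list:
    """Fetch the next batch strictly after the checkpoint, then advance it."""
    last_ts, last_id = load_checkpoint(state)
    rows = source.execute(
        "SELECT event_ts, source_id, service, message, trace FROM sre_logs "
        "WHERE event_ts > ? OR (event_ts = ? AND source_id > ?) "
        "ORDER BY event_ts, source_id LIMIT ?",
        (last_ts, last_ts, last_id, batch_size),
    ).fetchall()
    if rows:
        save_checkpoint(state, rows[-1][0], rows[-1][1])
    return rows
```

Ordering by both columns is what makes the checkpoint deterministic: if a batch boundary falls between two rows with the same timestamp, the source_id tiebreaker ensures the second row is neither skipped nor read twice.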

Docker-based incident simulation (pushes to Databricks)

Use the demo container to emit an OOM-style production log and write it to your Databricks source table.

./demo/docker-incident/run_demo.sh

What it does:

  • builds a local image sre-incident-generator
  • runs a real memory-pressure worker with a strict container memory limit
  • captures actual container exit metadata (ExitCode, OOMKilled) and real container logs
  • inserts one row into DATABRICKS_SOURCE_TABLE with the captured trace

The poller then picks it up on the next interval and creates an incident in the dashboard.
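As a rough illustration of the capture step, the sketch below turns the JSON printed by `docker inspect <container>` plus the container logs into a row shaped like the expected table columns. The function name, the message wording, and the use of the container Id as source_id are assumptions; the actual run_demo.sh may build the row differently.

```python
import json
from datetime import datetime, timezone


def row_from_inspect(inspect_output: str, container_logs: str, service: str) -> dict:
    """Build a log row from `docker inspect` JSON and captured container logs."""
    container = json.loads(inspect_output)[0]  # docker inspect prints a JSON array
    state = container["State"]
    reason = "OOMKilled" if state["OOMKilled"] else "exited"
    return {
        "event_ts": datetime.now(timezone.utc).isoformat(),
        "source_id": container["Id"],  # assumption: container Id as the row id
        "service": service,
        "message": f"container {reason} with exit code {state['ExitCode']}",
        "trace": container_logs,
    }
```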

Safety Notes

  • Commands are checked against an allowlist.
  • Shell metacharacters are blocked.
  • Human approval is mandatory before execution.
  • If the Claude API is unavailable or returns invalid output, incident processing fails. This is intentional: there is no fallback mode.
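The first two checks could be implemented along the lines of the sketch below. The allowlist contents, the exact set of blocked characters, and the function name are assumptions for illustration; the project's own validator may be stricter or structured differently.

```python
import shlex

# Hypothetical allowlist; the project's actual list is not shown in this README.
ALLOWED_EXECUTABLES = {"systemctl", "kubectl", "docker"}

# Characters that enable chaining, substitution, or redirection in a shell.
SHELL_METACHARACTERS = set(";&|`$<>\\\n")


def validate_command(command: str) -> list[str]:
    """Return the argv list for a safe command, or raise ValueError."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        raise ValueError("command contains shell metacharacters")
    argv = shlex.split(command)  # also raises ValueError on unbalanced quotes
    if not argv or argv[0] not in ALLOWED_EXECUTABLES:
        raise ValueError("executable is not on the allowlist")
    return argv
```

Passing the returned list to `subprocess.run(argv, shell=False)` avoids handing the string to a shell at all, so the metacharacter check acts as a second layer of defense and not the only one.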
