Inspiration
- Ops teams drown in alerts but still need human approvals. An on-call engineer wakes up late at night, searches the runbook for the matching incident and the appropriate response, then applies changes directly to the production environment while taking care not to make a mistake. We wanted an autonomous helper that speaks runbooks, keeps humans in the loop, and can escalate through real collaboration channels like Slack.
- Building around Nia (context optimization) plus Slack/K8s workflows let us showcase an SRE copilot that feels more like a command center than a web form.
What it does
- Ingests incidents (manual POST, Databricks poller, or the supplied demo generator), matches them to runbooks via hybrid BM25+Qdrant retrieval, optimizes the context with Nia, and asks Claude for a remediation command.
- Surfaces incidents on a retro command-center dashboard, funnels approvals to Slack with interactive buttons, executes locally or inside specific K8s pods, then logs every transition and generates follow-up proposal diffs.
- Ships a full /guide docs experience plus nav links so users can explore flows without reading the repo.
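The hybrid retrieval step combines a keyword ranking (BM25) with a vector ranking (Qdrant). How the two are merged is not described above, so the following is only a minimal sketch that assumes reciprocal rank fusion, one common way to merge ranked lists. The function name, the runbook IDs, and the constant k=60 are illustrative assumptions, not the project's actual code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of runbook IDs into one ranking.

    Each list contributes 1 / (k + rank) per document; k dampens the
    influence of top ranks (60 is a conventional default, assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings, as if returned by a BM25 index and a vector search.
bm25_ranking = ["disk-full", "oom-kill", "cert-expiry"]
vector_ranking = ["oom-kill", "pod-crashloop", "disk-full"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Rank-based fusion avoids having to normalize BM25 scores against cosine similarities, which live on different scales.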
How we built it
- FastAPI backend with SQLite for durable incident/event/proposal state; Qdrant + rank-bm25 + sentence-transformers for the matcher; Anthropic Claude for suggestions; Nia context optimizer built per SPEC.
- Static dashboard/docs written in vanilla HTML/CSS/JS but fully redesigned with ambient scanlines, LED status badges, live clock, toasts, and multi-section documentation.
- Slack approval pipeline implemented with slack-sdk WebClient + webhook router, plus new executors for SSH and K8s pods; Databricks poller streams incidents into the same flow.
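As a rough illustration of the Slack approval step, the sketch below builds a Block Kit payload with Approve and Reject buttons. The helper name, the action_id values, and the incident ID are hypothetical; only the Block Kit field names follow Slack's documented format.

```python
def build_approval_blocks(incident_id, command):
    """Return Block Kit blocks asking a human to approve a remediation command."""
    return [
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*Incident {incident_id}* proposes:\n`{command}`",
            },
        },
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Approve"},
                    "style": "primary",
                    "action_id": "approve_incident",  # hypothetical ID
                    "value": incident_id,
                },
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Reject"},
                    "style": "danger",
                    "action_id": "reject_incident",  # hypothetical ID
                    "value": incident_id,
                },
            ],
        },
    ]


blocks = build_approval_blocks("INC-42", "kubectl rollout restart deployment/api")
# With slack-sdk, a payload like this could be sent via
# WebClient(token=...).chat_postMessage(channel=..., text=..., blocks=blocks).
```

Carrying the incident ID in each button's value lets the webhook handler look up the incident when the interaction payload arrives.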
Challenges we ran into
- Reconciling diverged branches (frontend vs arnav) that rewrote core modules differently (Nia features vs Slack/K8s). Merge conflicts spanned db schema, main workflow, and requirements, so we had to carefully unify both feature sets without losing behavior.
- Keeping severity/pod metadata consistent through models, DB migrations, ingestion, and UI, while ensuring legacy DBs auto-migrate.
- Validating commands safely for local/SSH/K8s execution and handling missing external services (Slack tokens, K8s clusters) during testing.
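One way to approach the command validation mentioned above is an allowlist check before anything reaches an executor. This is a minimal sketch under assumed rules: the allowlist contents, the forbidden character set, and the function name are all hypothetical, and a real deployment would likely need stricter per-subcommand checks.

```python
import shlex

# Hypothetical allowlist of binaries an executor may run.
ALLOWED_BINARIES = {"kubectl", "systemctl", "df", "journalctl"}
# Characters that would enable chaining, piping, or substitution in a shell.
FORBIDDEN_CHARS = set(";|&`$><\n")


def validate_command(command):
    """Return the parsed argv list if the command passes, else raise ValueError."""
    if any(ch in FORBIDDEN_CHARS for ch in command):
        raise ValueError("shell metacharacters are not allowed")
    try:
        argv = shlex.split(command)
    except ValueError as exc:  # e.g. unbalanced quotes
        raise ValueError(f"unparseable command: {exc}") from exc
    if not argv:
        raise ValueError("empty command")
    if argv[0] not in ALLOWED_BINARIES:
        raise ValueError(f"binary not on allowlist: {argv[0]}")
    return argv


argv = validate_command("kubectl rollout restart deployment/api")
print(argv)
```

Returning an argv list rather than a string means the executor can pass it to a subprocess call without invoking a shell.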
Accomplishments that we're proud of
- Retro-futuristic dashboard + comprehensive docs site that make the agent feel like a command ops console.
- Full Slack approval lifecycle with optional K8s targeting, Databricks ingestion, and automated runbook proposal generation all living together.
- Hardened persistence (severity + pod metadata), robust executor abstractions, and end-to-end audit logging for every optimization/approval step.
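The auto-migration of legacy databases for severity and pod metadata can be sketched with the standard sqlite3 module. The table name, column definitions, and helper below are assumptions for illustration; the project's real schema is not shown here.

```python
import sqlite3


def ensure_columns(conn, table, columns):
    """Add any missing columns to `table`; return the names that were added.

    `table` and `columns` are interpolated into SQL, so they must come from
    trusted code, never from user input.
    """
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, ddl in columns.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {ddl}")
            added.append(name)
    conn.commit()
    return added


# Simulate a legacy database that predates the severity/pod fields.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (id INTEGER PRIMARY KEY, title TEXT)")

new_columns = {"severity": "TEXT DEFAULT 'unknown'", "pod": "TEXT"}  # assumed schema
first_run = ensure_columns(conn, "incidents", new_columns)
second_run = ensure_columns(conn, "incidents", new_columns)
```

Because the helper checks the existing columns first, it can run on every startup without failing on databases that are already up to date.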
What we learned
- Maintaining feature-rich autonomous agents demands disciplined migrations and shared schemas; otherwise each branch invents conflicting realities.
- UX polish (scanlines, toasts, nav links, /guide) matters even for backend-heavy hackathon projects; it helps judges understand complex flows fast.
- Blending human approvals (Slack) with autonomous remediation requires careful guardrails in executor validation and state machine transitions.
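The state machine guardrail can be illustrated with an explicit transition table that rejects any jump not listed, so nothing executes without passing through approval. The state names and the helper are hypothetical; they sketch the idea rather than reproduce the project's actual states.

```python
# Hypothetical incident states and the transitions allowed between them.
ALLOWED_TRANSITIONS = {
    "open": {"pending_approval"},
    "pending_approval": {"approved", "rejected"},
    "approved": {"executing"},
    "executing": {"resolved", "failed"},
    "rejected": set(),
    "resolved": set(),
    "failed": set(),
}


def transition(incident, new_state, audit_log):
    """Move `incident` to `new_state` if allowed, recording it in `audit_log`."""
    current = incident["state"]
    if new_state not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new_state}")
    incident["state"] = new_state
    audit_log.append((incident["id"], current, new_state))


incident = {"id": "INC-42", "state": "open"}
audit_log = []
for state in ["pending_approval", "approved", "executing", "resolved"]:
    transition(incident, state, audit_log)

try:
    # Skipping approval must fail: "open" cannot jump straight to "executing".
    transition({"id": "INC-43", "state": "open"}, "executing", audit_log)
    skipped_approval = True
except ValueError:
    skipped_approval = False
```

Appending to the audit log inside the same function that changes the state keeps the log and the state from drifting apart.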
What's next for SRE:agent
- Wire up real Slack/K8s/Databricks credentials in staging to smoke-test the entire pipeline, including the demo scenario.
- Expand tests (unit + integration) to cover matchers, context optimizer fallbacks, and executor validation, possibly with mocked external services.
- Add multi-tenant support (projects/environments) and richer runbook editing right from the dashboard, using the proposal pipeline to auto-commit approved updates.
Built With
- claude
- css
- fastapi
- html
- javascript
- nia
- python
- qdrant
- rank-bm25
- sentence-transformers
- slack-sdk
- sqlite