SRE-GPT — Devpost Submission


Inspiration

Modern engineering teams are drowning in alerts. Incident response can easily take more than 30 minutes, and most of that time goes to manual diagnosis, Slack threads, and mistakes made under pressure.

We asked: what if the on-call engineer never had to wake up at 3am again?

SRE-GPT was born from that question. We wanted to build an agent that doesn't just notify — it acts. An agent that reasons about what went wrong, attempts to fix it, and only escalates when truly necessary.


What it does

SRE-GPT is a fully autonomous Site Reliability Agent that:

  1. Monitors a production API (FastAPI + PostgreSQL) in near real time, measuring latency, error rate, and availability every 60 seconds
  2. Detects anomalies when metrics cross configurable thresholds
  3. Notifies Dynatrace instantly via MCP — sending structured incident events to the observability platform
  4. Reasons about the root cause using Gemini 3.1 Flash Lite
  5. Self-heals through a cascaded remediation strategy:
    • First attempts auto-repair (resetting faults, waking up the service)
    • If that fails, triggers a real rollback to the previous stable deployment via the Render REST API
  6. Reports — generates a full AI-written post-mortem with root cause analysis, timeline, and corrective actions
  7. Visualizes everything on a live glassmorphism dashboard with incident timeline, terminal logs, and sparkline charts
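
The remediation cascade in step 5 could be structured roughly as follows. This is a minimal sketch, not the project's actual code: `run_cascade`, `CascadeResult`, and the three callables passed in are hypothetical names chosen for illustration. The idea is to try the cheapest fix first, re-check health after each step, and fall through to escalation only when nothing worked.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CascadeResult:
    outcome: str  # "repaired", "rolled_back", or "escalated"
    steps: List[str] = field(default_factory=list)


def run_cascade(
    auto_repair: Callable[[], None],
    rollback: Callable[[], None],
    is_healthy: Callable[[], bool],
) -> CascadeResult:
    """Try the cheapest fix first and stop at the first step that restores health."""
    result = CascadeResult(outcome="escalated")
    for name, action, outcome in (
        ("auto_repair", auto_repair, "repaired"),
        ("rollback", rollback, "rolled_back"),
    ):
        try:
            action()
        except Exception as exc:  # a failing step must not stop the cascade
            result.steps.append(f"{name}: error ({exc})")
            continue
        if is_healthy():
            result.steps.append(f"{name}: ok")
            result.outcome = outcome
            return result
        result.steps.append(f"{name}: still unhealthy")
    return result
```

The recorded `steps` list is the kind of data a post-mortem timeline could be built from.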

How we built it

Target API: A real FastAPI application with PostgreSQL (hosted on Render) exposing CRUD endpoints for a Todo service. It includes fault injection endpoints (/simulate/latency, /simulate/errors) for realistic demo scenarios.

Agent: A Python polling loop running every 60 seconds that:

  • Measures real HTTP latency and error rates directly against the production API
  • Integrates with Dynatrace via the official MCP server (@dynatrace-oss/dynatrace-mcp-server)
  • Uses the Google GenAI SDK to query Gemini 3.1 Flash Lite for root cause analysis
  • Executes the remediation cascade autonomously
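
One polling cycle of the kind described above might look like the sketch below. It uses only the standard library; the threshold values, the 90-second timeout, and the names `http_probe`, `summarize`, `breaches`, and `poll_once` are assumptions for illustration, not the project's real configuration or code.

```python
import time
import urllib.request
from typing import Callable, Dict, List, Tuple

# Hypothetical thresholds; the project describes them only as configurable.
THRESHOLDS = {"latency_ms": 2000.0, "error_rate": 0.2}

Sample = Tuple[bool, float]  # (request succeeded, latency in milliseconds)


def http_probe(url: str, timeout: float = 90.0) -> Sample:
    """Issue one GET request and time it."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 400
    except OSError:  # URLError, HTTPError and socket timeouts are OSError subclasses
        ok = False
    return ok, (time.monotonic() - start) * 1000.0


def summarize(samples: List[Sample]) -> Dict[str, float]:
    """Aggregate probe results into the three monitored metrics."""
    if not samples:
        raise ValueError("need at least one sample")
    failures = sum(1 for ok, _ in samples if not ok)
    return {
        "latency_ms": sum(ms for _, ms in samples) / len(samples),
        "error_rate": failures / len(samples),
        "availability": 1.0 - failures / len(samples),
    }


def breaches(metrics: Dict[str, float], thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the names of the metrics that exceed their threshold."""
    return [name for name, limit in thresholds.items() if metrics[name] > limit]


def poll_once(urls: List[str], probe: Callable[[str], Sample] = http_probe) -> Dict[str, float]:
    """Probe every URL once and summarize the results."""
    return summarize([probe(u) for u in urls])
```

A loop would call `poll_once`, pass the result to `breaches`, and sleep 60 seconds before the next cycle. Taking `probe` as a parameter keeps the aggregation testable without network access.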

MCP Dynatrace: The agent uses the Dynatrace MCP server running locally over HTTP/SSE. When an incident is detected, the agent calls send_event to push a structured CUSTOM_ALERT event directly into the Dynatrace platform — making the incident visible in the observability layer.
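
On the wire, an MCP tool invocation is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a request for `send_event`. The argument names (`eventType`, `title`, `properties`) are an assumption modeled on the Dynatrace events ingest format; the real tool schema should be read from the server's tool listing. `build_send_event_call` is a hypothetical helper name.

```python
import json
from typing import Any, Dict


def build_send_event_call(request_id: int, title: str, metrics: Dict[str, float]) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 'tools/call' request for the send_event tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "send_event",
            # Assumed argument schema; check the server's tool listing for the real one.
            "arguments": {
                "eventType": "CUSTOM_ALERT",
                "title": title,
                "properties": {key: str(value) for key, value in metrics.items()},
            },
        },
    }


if __name__ == "__main__":
    call = build_send_event_call(1, "High latency on Todo API", {"latency_ms": 2500.0})
    print(json.dumps(call, indent=2))
```

In practice an MCP client library would handle session setup and transport; the sketch only shows the shape of the message.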

Rollback: Real redeployment via the Render REST API — the agent fetches the deployment history, identifies the last stable revision, and triggers a new deploy.
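
Picking the rollback target from the deployment history can be sketched as a pure function. The record fields (`id`, `status`, `finishedAt`) and the set of statuses treated as stable are assumptions for illustration and may not match the actual Render API response; `pick_rollback_target` is a hypothetical name.

```python
from typing import Dict, List, Optional

# Assumption: a deploy that is or once was live counts as stable.
STABLE_STATUSES = {"live", "deactivated"}


def pick_rollback_target(
    deploys: List[Dict[str, str]], current_id: str
) -> Optional[Dict[str, str]]:
    """Return the most recent stable deploy other than the current one, or None."""
    candidates = [
        d for d in deploys
        if d["id"] != current_id and d["status"] in STABLE_STATUSES
    ]
    # ISO 8601 timestamps in the same format and time zone sort correctly as strings.
    return max(candidates, key=lambda d: d["finishedAt"], default=None)
```

Returning `None` when no stable deploy exists gives the caller a clear signal to escalate instead of redeploying blindly.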

Dashboard: A single-file HTML/CSS/JS dashboard (no build step, no framework) that polls status.json every 15 seconds and renders live metrics, incident timeline, terminal-style action log, and Gemini AI post-mortems.
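
Because the dashboard reads status.json while the agent keeps rewriting it, the write should be atomic so the browser never fetches a half-written file. The sketch below shows one way to do that on the agent side; `write_status` and the example status fields are hypothetical, not the project's actual schema.

```python
import json
import os
import tempfile
from typing import Any, Dict


def write_status(path: str, status: Dict[str, Any]) -> None:
    """Write the status file atomically: temp file first, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(status, fh)
        os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


if __name__ == "__main__":
    # Hypothetical status fields for illustration.
    write_status("status.json", {"state": "healthy", "latency_ms": 120})
```

Creating the temporary file in the target directory keeps the rename on a single filesystem, which is what makes `os.replace` atomic.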


Challenges we ran into

Dynatrace API complexity: The new Dynatrace platform uses a different authentication architecture than the classic API. We had to navigate between API tokens, Platform tokens, and the MCP server's SSE protocol to find a working integration path.

Windows compatibility: Most SRE tooling assumes Linux. Getting the MCP server, Python agent, and deployment pipeline working smoothly on Windows required working around PowerShell escaping quirks and AppLocker restrictions.

Real vs. simulated metrics: Early versions mocked Dynatrace data. We refactored to measure real HTTP latency and error rates directly against the production API — making the monitoring genuinely meaningful.

Render free tier cold starts: The free tier puts the service to sleep after a period of inactivity, causing cold starts of 30 to 60 seconds that initially triggered false-positive anomaly detections. We solved this by increasing request timeouts and tuning the detection thresholds.


Accomplishments that we're proud of

  • Full autonomous loop: The agent detects, repairs, and reports without any human input — end to end
  • Real MCP integration: Dynatrace events are genuinely sent to the observability platform via the official MCP protocol, not simulated
  • Cascaded remediation: The agent tries to fix before rolling back — a more intelligent approach than a naive rollback-first strategy
  • Production-grade dashboard: A glassmorphism UI with live charts, timeline, terminal logs, cooldown bar, and EN/FR language toggle — all in a single HTML file
  • Gemini post-mortems: The AI-generated reports are surprisingly detailed and technically accurate

What we learned

  • MCP (Model Context Protocol) is genuinely powerful for connecting AI agents to observability platforms — the 20+ Dynatrace tools available via MCP open up possibilities far beyond what we implemented
  • Autonomous agents need graceful degradation: every integration can fail, and the agent must keep working even when Dynatrace or Gemini is unavailable
  • The hardest part of building agents isn't the AI — it's the plumbing: authentication, timeouts, error handling, and state management
  • A cascaded remediation strategy (repair → rollback → alert) is significantly more useful than a simple threshold-trigger-rollback approach
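
The graceful degradation point can be illustrated with a small wrapper around optional integrations. This is a sketch under assumptions: `degrade`, the logger name, and the fallback text are hypothetical, and the real agent may handle failures differently.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("sre_agent")  # hypothetical logger name


def degrade(call: Callable[[], T], fallback: T, label: str) -> T:
    """Run an optional integration; on any failure, log it and return the fallback."""
    try:
        return call()
    except Exception as exc:
        log.warning("%s unavailable, continuing without it: %s", label, exc)
        return fallback
```

For example, `degrade(ask_gemini, "Root cause analysis unavailable.", "gemini")` (with `ask_gemini` standing in for the real call) would let the remediation cascade continue even if the model request fails.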

What's next for SRE-GPT

  • Multi-service monitoring: Scale the agent to watch multiple APIs simultaneously, with service dependency mapping
  • Davis CoPilot integration: Unlock full Dynatrace AI analysis once Platform Token support is available in trial environments
  • Predictive detection: Use Dynatrace's Davis Analyzers (forecast + anomaly) to detect incidents before they breach SLOs — not just after
  • Slack/email alerting: Send human escalation notifications via Dynatrace's send_slack_message and send_email MCP tools when rollback fails
  • GitHub Actions integration: Automatically create a GitHub issue with the post-mortem when a rollback is triggered
  • Multi-cloud: Extend rollback support beyond Render to AWS Lambda, Google Cloud Run, and Railway
