AutoDuty
An AI-powered automated incident remediation system — your autonomous SRE that detects, investigates, and fixes production bugs in real time.
Inspiration
On-call engineering is one of the most stressful parts of software development. When a production bug hits at 3 AM, someone has to wake up, trace through logs, find the root cause, write a fix, and deploy it — all under pressure. We asked ourselves: what if an AI agent could handle that entire workflow autonomously? AutoDuty was born from the desire to eliminate the toil of incident response and let engineers sleep through the night while an AI SRE watches over their applications.
What it does
AutoDuty monitors a live web application for runtime errors and, when one is detected, kicks off a fully autonomous investigation and remediation pipeline:
- Detects — API route handlers are wrapped with an error-reporting layer (withAutoduty()) that catches exceptions and 5xx responses, automatically sending error reports (tracebacks, source code, logs) to the AutoDuty backend.
- Investigates — An AI agent clones the application's GitHub repository and explores the codebase using tools like file reading, grep, and directory listing to diagnose the root cause.
- Reproduces — The agent writes and executes test scripts inside isolated cloud sandboxes (via Modal) to confirm the bug exists, including headless browser automation with Playwright for UI-level issues.
- Fixes — The agent applies targeted code changes across one or more files to resolve the issue.
- Verifies — The fix is validated by re-running the reproduction scripts in the sandbox to confirm the bug is resolved.
- Ships — With one click of human approval, AutoDuty creates a GitHub Pull Request with a detailed description, diffs, root-cause analysis, and sandbox verification results.
A real-time admin dashboard lets you watch the entire process unfold live — agent thoughts, tool calls, sandbox output, browser screenshots, and unified diffs — all streamed via Server-Sent Events.
How we built it
Frontend — Built with Next.js (App Router), React, TypeScript, and Tailwind CSS. The frontend serves double duty: it hosts "NovaBuy," a demo e-commerce storefront with intentionally planted bugs, and the admin dashboard for monitoring and managing incidents. Framer Motion powers the UI animations, and SSE connections provide real-time streaming of agent activity.
Backend — A FastAPI server orchestrates the entire pipeline. Pydantic AI provides the agent framework with structured outputs and tool calling. The agent has access to 7 tools: read_file, write_file, search_and_replace, grep, list_directory, run_sandbox, and run_browser.
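The tool belt boils down to a name-to-function dispatch table. This is not the Pydantic AI API itself, just a plain-Python sketch of how three of the seven tools can be registered and routed; the real system wires them into the agent framework's tool-calling interface.

```python
from pathlib import Path

# Registry mapping tool names to implementations; the agent calls tools by string name.
TOOLS = {}

def tool(fn):
    """Register a function under its own name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def read_file(path: str) -> str:
    return Path(path).read_text()

@tool
def list_directory(path: str) -> list[str]:
    return sorted(p.name for p in Path(path).iterdir())

@tool
def grep(pattern: str, text: str) -> list[str]:
    return [line for line in text.splitlines() if pattern in line]

def dispatch(name: str, **kwargs):
    """Route an agent tool call to the registered implementation."""
    return TOOLS[name](**kwargs)
```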
Sandboxed Execution — Modal provides isolated cloud containers for running test scripts (Node.js 20 + tsx) and browser automation (Playwright + Chromium), ensuring the agent's code execution is safe and reproducible.
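Modal's SDK handles the real container lifecycle. As a local stand-in, the same contract (run a script with a hard timeout, capture exit code and output) can be sketched with subprocess; the function name and return shape here are assumptions, and this local version runs Python rather than the sandbox's Node.js 20 + tsx.

```python
import subprocess
import sys

def run_sandboxed(script: str, timeout_s: int = 60) -> dict:
    """Run a test script with a hard timeout and capture its output,
    mirroring the contract of a sandboxed run_sandbox tool."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return {"exit_code": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        # Budget limit hit: report failure instead of hanging the agent loop.
        return {"exit_code": -1, "stdout": "", "stderr": f"timed out after {timeout_s}s"}
```

The timeout path matters as much as the happy path: a reproduction script that hangs must come back as a structured failure the agent can react to.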
AI Models — The system supports hot-swapping between Anthropic Claude, Google Gemini, and OpenAI GPT-4o at runtime, giving flexibility to choose the best model for the job.
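Hot-swapping reduces to a registry of provider-qualified model IDs plus a mutable "active" slot that the agent reads on each run. A minimal sketch, with illustrative model IDs:

```python
# Provider-qualified model IDs (illustrative, not the exact strings used in AutoDuty).
MODELS = {
    "claude": "anthropic:claude-3-5-sonnet",
    "gemini": "google:gemini-1.5-pro",
    "gpt-4o": "openai:gpt-4o",
}

_active = "claude"

def set_model(name: str) -> None:
    """Switch the active model at runtime; reject unknown names."""
    if name not in MODELS:
        raise ValueError(f"unknown model {name!r}; choose from {sorted(MODELS)}")
    global _active
    _active = name

def active_model() -> str:
    """Return the model ID the agent should use on its next run."""
    return MODELS[_active]
```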
GitHub Integration — PyGithub handles repository cloning, branch creation, multi-file commits via the Git tree API, and pull request generation.
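The atomic multi-file commit works by building a single Git tree containing every changed blob. As a hedged sketch, the payload shape (before handing entries to PyGithub's tree-creation call) might be built like this; the helper itself is hypothetical, while the path/mode/type/content fields follow the Git tree API:

```python
def build_tree_elements(changed_files: dict[str, str]) -> list[dict]:
    """Map {path: new_content} to Git tree entries, the shape the Git tree
    API expects for committing several files in one atomic commit."""
    return [
        # mode 100644 = regular non-executable file blob
        {"path": path, "mode": "100644", "type": "blob", "content": content}
        for path, content in sorted(changed_files.items())
    ]
```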
Challenges we ran into
- Reliable sandboxed execution — Getting Modal containers to spin up quickly with all the right dependencies (Node.js, tsx, Playwright, Chromium) while keeping them fully isolated was a significant engineering challenge. We had to carefully design container images and handle edge cases like timeouts and resource limits.
- Real-time streaming architecture — Building a responsive live feed of agent activity required a custom async pub/sub event bus with per-incident queues, SSE endpoints, and careful frontend state management to render thoughts, tool calls, terminal output, and browser screenshots as they happen.
- Multi-file fix coordination — Many real-world bugs span multiple files. Tracking edits across the cloned repository, generating accurate unified diffs, and committing everything atomically to a GitHub PR required careful bookkeeping in the RepoContext layer.
- Agent reliability — Getting the AI agent to consistently follow a structured 4-phase workflow (Explore & Diagnose, Reproduce, Fix, Verify) without going off the rails required extensive prompt engineering and retry logic with feedback from failed attempts.
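The streaming challenge above comes down to per-incident fan-out: every subscriber gets its own queue, so one slow SSE client never blocks the agent or other dashboard viewers. A minimal asyncio sketch of that pub/sub shape (class and method names are assumptions):

```python
import asyncio
from collections import defaultdict

class EventBus:
    """Per-incident fan-out: each subscriber gets its own queue."""

    def __init__(self) -> None:
        self._subs: dict[str, list[asyncio.Queue]] = defaultdict(list)

    def subscribe(self, incident_id: str) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self._subs[incident_id].append(q)
        return q

    async def publish(self, incident_id: str, event: dict) -> None:
        # Deliver a copy of the event to every queue for this incident.
        for q in self._subs[incident_id]:
            await q.put(event)

async def demo() -> dict:
    bus = EventBus()
    q = bus.subscribe("inc-1")
    await bus.publish("inc-1", {"type": "thought", "text": "reading cart.ts"})
    return await q.get()
```

An SSE endpoint then just drains a subscriber's queue and writes each event to the response stream.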
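The 4-phase workflow with feedback-on-retry can be sketched as a small driver loop. This is an illustration of the control flow, not AutoDuty's actual orchestrator; run_phase stands in for one agent invocation.

```python
PHASES = ["explore", "reproduce", "fix", "verify"]

def run_workflow(run_phase, max_attempts: int = 3) -> dict:
    """Drive the agent through the fixed 4-phase plan. run_phase(phase, feedback)
    returns (ok, detail); on failure, detail is fed back into the next attempt."""
    history = []
    for phase in PHASES:
        feedback = None
        for attempt in range(1, max_attempts + 1):
            ok, detail = run_phase(phase, feedback)
            history.append((phase, attempt, ok))
            if ok:
                break
            feedback = detail  # tell the agent what went wrong last time
        else:
            # Attempts exhausted: stop rather than ship an unverified fix.
            return {"status": "failed", "phase": phase, "history": history}
    return {"status": "verified", "history": history}
```

Making each phase gate the next (no fix without a reproduction, no PR without verification) is what keeps the agent from going off the rails.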
Accomplishments that we're proud of
- End-to-end autonomy — From the moment an error fires to a verified fix ready for human approval, the entire pipeline runs without human intervention. The agent can investigate, reproduce, fix, and verify bugs across a real codebase.
- Live observability — The real-time dashboard gives full transparency into the agent's reasoning process. You can watch it think, read files, run code, take browser screenshots, and produce diffs — all in real time.
- Sandboxed code execution — The agent can safely write and execute arbitrary TypeScript test scripts and Playwright browser automation in isolated cloud containers, with budget limits to prevent runaway execution.
- One-click PR creation — The entire fix, complete with root-cause analysis, diffs, and verification results, can be shipped as a GitHub Pull Request with a single button click.
- Multi-model flexibility — Supporting Claude, Gemini, and GPT-4o with runtime switching means the system isn't locked into any single provider.
What we learned
- Structured agent workflows matter — Giving the AI agent a clear multi-phase plan (explore, reproduce, fix, verify) dramatically improved reliability compared to letting it free-roam.
- Sandboxing is essential — Letting an AI agent execute arbitrary code demands robust isolation. Modal's container infrastructure made this feasible, but the design choices around what to include in the container images were critical.
- Real-time UX builds trust — Showing every step of the agent's reasoning process in a live dashboard made the system feel transparent and trustworthy rather than like a black box.
- Error context is everything — The quality of the agent's investigation depends heavily on the richness of the error report. Including the full traceback, source code, and logs in the initial report gave the agent a massive head start.
What's next for AutoDuty
- Production-grade monitoring integrations — Connect to real observability platforms like PagerDuty, Datadog, and Sentry to ingest incidents from production systems instead of just the demo app.
- Persistent incident storage — Move from in-memory storage to a proper database for incident history, analytics, and audit trails.
- Automated deployment — Close the loop by not just creating PRs, but optionally auto-merging and deploying verified fixes through CI/CD pipelines.
- Multi-language support — Extend the sandbox and agent tooling to support Python, Go, and other backend languages beyond TypeScript/Node.js.
- Team collaboration features — Add Slack/Teams notifications, incident assignment, approval workflows, and escalation policies for team-based incident response.
- Learning from past incidents — Build a knowledge base of past fixes so the agent can recognize recurring patterns and resolve similar issues faster over time.