Building AgentForge: A Self-Healing API Reliability Agent
Inspiration
Production incidents follow a depressingly familiar playbook: a metric spikes at 2am, a pager goes off, a sleepy engineer logs in, reads the same dashboard they read last month, and makes the same call they made last month. The diagnosis is rarely novel — it is usually a pattern that has been seen before. That repetition felt like exactly the kind of problem an agent should own. The question we wanted to answer was: can you build a system that not only responds to incidents autonomously, but actually gets better at it the more incidents it handles?
What We Built
AgentForge is an autonomous API reliability agent. It continuously polls live endpoints, detects anomalies using a z-score over a 60-second rolling window, sends the symptoms to an LLM for root-cause diagnosis, and then executes a corrective action — reroute traffic, fire an alert, or hold and watch. Every resolved incident is written to a case store. The next time a similar symptom pattern appears, those past cases are injected directly into the LLM prompt so it has real historical context, not just a generic system instruction. A separate auto-tuner adjusts the anomaly detection threshold in real time: true positive outcomes tighten sensitivity, false positives loosen it.
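The detection step described above can be sketched as a rolling z-score check. This is a minimal, self-contained illustration, not the project's actual detector: the class name, the 5-sample warm-up, and the default threshold are assumptions; only the 60-second window and the z-score idea come from the writeup.

```python
from collections import deque
import math

class AnomalyDetector:
    """Illustrative rolling z-score detector (hypothetical class name).

    Keeps the last 60 seconds of latency samples and flags a sample
    whose z-score against the window exceeds threshold_k.
    """

    def __init__(self, window_seconds=60, threshold_k=3.0):
        self.window = deque()              # (timestamp, latency_ms) pairs
        self.window_seconds = window_seconds
        self.threshold_k = threshold_k     # the float the auto-tuner adjusts

    def observe(self, ts, latency_ms):
        # Keep only samples inside the rolling window.
        self.window.append((ts, latency_ms))
        while self.window and ts - self.window[0][0] > self.window_seconds:
            self.window.popleft()

        values = [v for _, v in self.window]
        if len(values) < 5:                # assumed warm-up: too little history
            return False
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        std = math.sqrt(var) or 1e-9       # guard against zero variance
        return abs(latency_ms - mean) / std > self.threshold_k
```

Because `threshold_k` is a single instance attribute, the auto-tuner described later only has to nudge one number to change the whole detector's sensitivity.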
The full stack was:
- FastAPI for the mock chaos server (toggleable latency spikes, error bursts, and degraded responses)
- Railtracks for the agent orchestration loop
- Nexla for real-time event ingestion and normalization, with a fallback HTTP poller
- DigitalOcean (Gradient SDK, Llama 3.3 70B) for LLM inference
- Lovable for the dashboard frontend
How We Built It
The core loop is about 30 lines of code — and that simplicity was intentional. The on_event handler in agent/loop.py does exactly four things: detect, retrieve context, diagnose, act. Everything else is a module that can be swapped independently. We enforced strict sequencing: chaos server first, then the agent loop, then ingestion, then the dashboard — each layer verified before the next was wired in.
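The four steps can be sketched as follows. The collaborator objects (detector, case store, LLM client, actuator) and their method names are assumptions standing in for the real modules; only on_event, get_similar, and the detect → retrieve → diagnose → act sequence come from the writeup.

```python
def on_event(event, detector, case_store, llm, actuator):
    """Hypothetical sketch of the on_event loop in agent/loop.py."""
    # 1. Detect: ignore samples the anomaly detector considers normal.
    if not detector.is_anomalous(event):
        return None
    # 2. Retrieve context: similar past incidents from the case store.
    past_cases = case_store.get_similar(event.symptoms, k=3)
    # 3. Diagnose: ask the LLM, with the past cases injected into the prompt.
    diagnosis = llm.diagnose(event.symptoms, past_cases)
    # 4. Act: reroute, alert, or hold-and-watch, then record the outcome
    #    so the next similar incident has this case as context.
    outcome = actuator.execute(diagnosis.action)
    case_store.record(event.symptoms, diagnosis, outcome)
    return outcome
```

Keeping the loop this small is what makes each collaborator independently swappable: the loop only depends on four narrow interfaces.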
The self-improvement mechanism was the most deliberate design decision. Rather than fine-tuning a model (expensive, slow, opaque), we chose prompt-time injection of structured case records. The get_similar function in memory/case_store.py scores past incidents by symptom overlap and returns the top-k matches. The auto-tuner in memory/auto_tune.py maintains a single float — threshold_k — and nudges it up or down based on action outcomes. Simple, interpretable, and demonstrably effective within a short demo window.
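Both halves of the mechanism fit in a few lines. The sketch below is a simplified stand-in, assuming symptoms are string tags, Jaccard overlap as the similarity score, and illustrative step size and bounds for the tuner; the function names mirror the source, but their signatures are guesses.

```python
def get_similar(cases, symptoms, k=3):
    """Rank past cases by symptom overlap and return the top-k.

    Simplified stand-in for memory/case_store.get_similar; assumes each
    case is a dict with a "symptoms" list of string tags.
    """
    def overlap(case):
        a, b = set(case["symptoms"]), set(symptoms)
        return len(a & b) / len(a | b) if a | b else 0.0  # Jaccard score
    return sorted(cases, key=overlap, reverse=True)[:k]

def tune_threshold(threshold_k, outcome, step=0.1, lo=1.5, hi=5.0):
    """Nudge the single tunable float after each incident outcome.

    Sketch of memory/auto_tune.py: a true positive lowers the z-score
    threshold (more sensitive), a false positive raises it (less
    sensitive). Step size and clamp bounds are assumptions.
    """
    if outcome == "true_positive":
        threshold_k -= step
    elif outcome == "false_positive":
        threshold_k += step
    return min(hi, max(lo, threshold_k))
```

The clamp keeps the tuner from running away in a demo full of one outcome type — the threshold can drift, but never out of a sane range.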
Challenges
Time. Five hours is not much runway. We front-loaded the agent core and left the dashboard to Lovable, which let us keep engineering focus on the parts that actually needed custom logic.
LLM reliability. The model occasionally returns plain English instead of the required JSON. We added a retry loop with a fallback default rather than trying to parse natural language — pragmatic under a time constraint, and it kept the demo stable.
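The retry-with-fallback pattern is simple enough to show in full. This is a minimal sketch, not the project's actual client code: the function name, retry count, and fallback action are assumptions; the shape (retry on parse failure, then return a safe default) is what the writeup describes.

```python
import json

def diagnose_with_retry(call_llm, prompt, retries=2, fallback=None):
    """Retry when the model returns prose instead of JSON (hypothetical helper).

    call_llm is any callable that takes a prompt and returns a string.
    After the retries are exhausted, return a safe default action rather
    than attempting to parse natural language.
    """
    if fallback is None:
        fallback = {"action": "hold", "reason": "parse_failure"}
    for _ in range(retries + 1):
        raw = call_llm(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue                      # non-JSON reply: ask again
    return fallback
```

Defaulting to "hold" is the conservative choice: a missed reroute is recoverable on the next poll, while a spurious reroute is not.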
Demonstrating learning live. The self-improvement story only lands if you can see it. Wiring the response-time delta chart on the dashboard to the case store, and scripting the exact demo sequence (baseline → first incident → recovery → second incident), was as much a communication challenge as an engineering one.
What We Learned
The biggest takeaway was that "self-improvement" does not require complex machinery. A case store, a similarity score, and a single tunable float were enough to produce a measurable and visible improvement in a live demo. The architecture also reinforced a broader principle: the right amount of complexity is exactly what the task requires. The agent loop being 30 lines is not a shortcut — it is the point.
Built With
- digitalocean
- fastapi
- gradientsdk
- lovable
- python