Building AgentForge: A Self-Healing API Reliability Agent
Inspiration
Production incidents follow a depressingly familiar playbook: a metric spikes at 2am, a pager goes off, a sleepy engineer logs in, reads the same dashboard they read last month, and makes the same call they made last month. The diagnosis is rarely novel — it is usually a pattern that has been seen before. That repetition felt like exactly the kind of problem an agent should own. The question we wanted to answer was: can you build a system that not only responds to incidents autonomously, but actually gets better at it the more incidents it handles?
What We Built
AgentForge is an autonomous API reliability agent. It continuously polls live endpoints, detects anomalies using a z-score over a 60-second rolling window, sends the symptoms to an LLM for root-cause diagnosis, and then executes a corrective action — reroute traffic, fire an alert, or hold and watch. Every resolved incident is written to a case store. The next time a similar symptom pattern appears, those past cases are injected directly into the LLM prompt so it has real historical context, not just a generic system instruction. A separate auto-tuner adjusts the anomaly detection threshold in real time: true positive outcomes tighten sensitivity, false positives loosen it.
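The detection step described above can be sketched as a rolling z-score check. This is a minimal, self-contained illustration, not the project's actual detector: the class name, the 5-sample warm-up, and the default threshold are assumptions; only the 60-second window and the z-score idea come from the writeup.

```python
from collections import deque
import math

class AnomalyDetector:
    """Illustrative rolling z-score detector (hypothetical class name).

    Keeps the last 60 seconds of latency samples and flags a sample
    whose z-score against the window exceeds threshold_k.
    """

    def __init__(self, window_seconds=60, threshold_k=3.0):
        self.window = deque()              # (timestamp, latency_ms) pairs
        self.window_seconds = window_seconds
        self.threshold_k = threshold_k     # the float the auto-tuner adjusts

    def observe(self, ts, latency_ms):
        # Keep only samples inside the rolling window.
        self.window.append((ts, latency_ms))
        while self.window and ts - self.window[0][0] > self.window_seconds:
            self.window.popleft()

        values = [v for _, v in self.window]
        if len(values) < 5:                # assumed warm-up: too little history
            return False
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        std = math.sqrt(var) or 1e-9       # guard against zero variance
        return abs(latency_ms - mean) / std > self.threshold_k
```

Because `threshold_k` is a single instance attribute, the auto-tuner described later only has to nudge one number to change the whole detector's sensitivity.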
The full stack was:
- FastAPI for the mock chaos server (toggleable latency spikes, error bursts, and degraded responses)
- Railtracks for the agent orchestration loop
- Nexla for real-time event ingestion and normalization, with a fallback HTTP poller
- DigitalOcean (Gradient SDK, Llama 3.3 70B) for LLM inference
- Lovable for the dashboard frontend
How We Built It
The core loop is about 30 lines of code — and that simplicity was intentional. The on_event handler in agent/loop.py does exactly four things: detect, retrieve context, diagnose, act. Everything else is a module that can be swapped independently. We enforced strict sequencing: chaos server first, then the agent loop, then ingestion, then the dashboard — each layer verified before the next was wired in.
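The four steps can be sketched as follows. The collaborator objects (detector, case store, LLM client, actuator) and their method names are assumptions standing in for the real modules; only on_event, get_similar, and the detect → retrieve → diagnose → act sequence come from the writeup.

```python
def on_event(event, detector, case_store, llm, actuator):
    """Hypothetical sketch of the on_event loop in agent/loop.py."""
    # 1. Detect: ignore samples the anomaly detector considers normal.
    if not detector.is_anomalous(event):
        return None
    # 2. Retrieve context: similar past incidents from the case store.
    past_cases = case_store.get_similar(event.symptoms, k=3)
    # 3. Diagnose: ask the LLM, with the past cases injected into the prompt.
    diagnosis = llm.diagnose(event.symptoms, past_cases)
    # 4. Act: reroute, alert, or hold-and-watch, then record the outcome
    #    so the next similar incident has this case as context.
    outcome = actuator.execute(diagnosis.action)
    case_store.record(event.symptoms, diagnosis, outcome)
    return outcome
```

Keeping the loop this small is what makes each collaborator independently swappable: the loop only depends on four narrow interfaces.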
The self-improvement mechanism was the most deliberate design decision. Rather than fine-tuning a model (expensive, slow, opaque), we chose prompt-time injection of structured case records. The get_similar function in memory/case_store.py scores past incidents by symptom overlap and returns the top-k matches. The auto-tuner in memory/auto_tune.py maintains a single float — threshold_k — and nudges it up or down based on action outcomes. Simple, interpretable, and demonstrably effective within a short demo window.
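Both halves of the mechanism fit in a few lines. The sketch below is a simplified stand-in, assuming symptoms are string tags, Jaccard overlap as the similarity score, and illustrative step size and bounds for the tuner; the function names mirror the source, but their signatures are guesses.

```python
def get_similar(cases, symptoms, k=3):
    """Rank past cases by symptom overlap and return the top-k.

    Simplified stand-in for memory/case_store.get_similar; assumes each
    case is a dict with a "symptoms" list of string tags.
    """
    def overlap(case):
        a, b = set(case["symptoms"]), set(symptoms)
        return len(a & b) / len(a | b) if a | b else 0.0  # Jaccard score
    return sorted(cases, key=overlap, reverse=True)[:k]

def tune_threshold(threshold_k, outcome, step=0.1, lo=1.5, hi=5.0):
    """Nudge the single tunable float after each incident outcome.

    Sketch of memory/auto_tune.py: a true positive lowers the z-score
    threshold (more sensitive), a false positive raises it (less
    sensitive). Step size and clamp bounds are assumptions.
    """
    if outcome == "true_positive":
        threshold_k -= step
    elif outcome == "false_positive":
        threshold_k += step
    return min(hi, max(lo, threshold_k))
```

The clamp keeps the tuner from running away in a demo full of one outcome type — the threshold can drift, but never out of a sane range.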
Challenges
Time. Five hours is not much runway. We front-loaded the agent core and left the dashboard to Lovable, which let us keep engineering focus on the parts that actually needed custom logic.
LLM reliability. The model occasionally returns plain English instead of the required JSON. We added a retry loop with a fallback default rather than trying to parse natural language — pragmatic under a time constraint, and it kept the demo stable.
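The retry-with-fallback pattern is simple enough to show in full. This is a minimal sketch, not the project's actual client code: the function name, retry count, and fallback action are assumptions; the shape (retry on parse failure, then return a safe default) is what the writeup describes.

```python
import json

def diagnose_with_retry(call_llm, prompt, retries=2, fallback=None):
    """Retry when the model returns prose instead of JSON (hypothetical helper).

    call_llm is any callable that takes a prompt and returns a string.
    After the retries are exhausted, return a safe default action rather
    than attempting to parse natural language.
    """
    if fallback is None:
        fallback = {"action": "hold", "reason": "parse_failure"}
    for _ in range(retries + 1):
        raw = call_llm(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue                      # non-JSON reply: ask again
    return fallback
```

Defaulting to "hold" is the conservative choice: a missed reroute is recoverable on the next poll, while a spurious reroute is not.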
Demonstrating learning live. The self-improvement story only lands if you can see it. Wiring the response-time delta chart on the dashboard to the case store, and scripting the exact demo sequence (baseline → first incident → recovery → second incident), was as much a communication challenge as an engineering one.
What We Learned
The biggest takeaway was that "self-improvement" does not require complex machinery. A case store, a similarity score, and a single tunable float were enough to produce a measurable and visible improvement in a live demo. The architecture also reinforced a broader principle: the right amount of complexity is exactly what the task requires. The agent loop being 30 lines is not a shortcut — it is the point.
Built With
- digitalocean
- fastapi
- gradientsdk
- lovable
- python