Ever wanted to boost your agentic model's inference-time performance by 200%+ with one single line?
Inspiration
Coding agents like opencode and Devin are powerful, but they keep making the same mistakes. Forgot the database migration? Again. Drifted the frontend types from the backend? Again. Skipped the tests? Again. Every single agent run starts from a clean slate. There's no memory of what went wrong last time, no feedback loop, no learning.
We've all sat there watching an agent repeat the exact mistake we rejected five minutes ago. It felt like Groundhog Day, except the groundhog is burning your API credits. We started asking: what if the agent could actually remember? What if every time you rejected a tool call or said "no, you forgot the migration," that rejection became a guardrail for every future run?
That's where the aviation analogy clicked for us. After every plane crash, investigators pull the black box, file the evidence, update the safety checklists, and brief future pilots before takeoff. AI agents don't have any of that. No recorder, no investigation, no safety briefing. We wanted to build the NTSB for coding agents: a blackbox recorder that turns every failure into institutional memory.
What it does
Autopsy is a diagnostic and prevention layer for AI coding agents. It's not an app you log into. It's a single command you run in your project directory, and from that point on, your coding agent gets smarter every time it fails.
curl -fsSL https://install.autopsy.surf/install.sh | bash
That's it. One line. From there, Autopsy:
- Records every event from an opencode session in real time: tool calls, file edits, chat messages, permission requests, rejections. All of it flows through a batched event pipeline into Postgres, without ever slowing down the agent.
- Investigates rejected runs using a deterministic rule-based classifier (did you touch a schema file but skip the migration directory? did you change backend types but forget to regenerate frontend types?) plus an optional Gemma LLM enhancer that reads the actual diffs and rejection reasons to produce richer failure analysis.
- Builds a failure knowledge graph in Postgres + pgvector. Not a separate graph database, just two tables: `graph_nodes` and `graph_edges` (a rough schema sketch follows this list). Nodes represent runs, tasks, files, components, symptoms, failure modes, and fix patterns. Edges carry confidence scores, temporal metadata, and evidence linking every connection back to the run that proved it. The graph grows automatically with every failure.
- Preflights every new task. When the agent starts working on something new, Autopsy embeds the task description, searches for similar past failures via approximate nearest neighbor lookup, then walks 3 hops through the failure graph (Run to Symptom to FailureMode to FixPattern) to retrieve the most relevant warnings. Those warnings get injected directly into the agent's system prompt before it writes a single line of code. The agent literally sees "past similar tasks failed because the agent forgot the migration and frontend type regeneration. Make sure to touch `migrations/` and `*/generated/`" in its context window.
- Runs postflight code checks after edits. When the agent goes quiet, Autopsy automatically runs lint, typecheck, and tests against the working tree. If anything fails, that failure is treated as a real rejection and fed back into the graph, so future preflights learn from automated check failures too, not just human rejections.
- Detects frustration without explicit rejection. If the user says "that's wrong, undo it" or "you broke everything," the agent has a custom tool (`autopsy_register_rejection`) that it can call to file the failure into the graph without ending the session. The agent keeps working and can attempt a fix, but the lesson is already recorded.
- Visualizes everything on a local dashboard: a timeline of every event in a run, the autopsy report (failure mode, symptoms, fix pattern), the full failure graph rendered as a 3D force-directed visualization, and badges showing where Autopsy caught something before it happened.
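For the graph itself, a minimal sketch of those two tables as SQLAlchemy 2.x models looks roughly like this. The column names and constraints are illustrative assumptions, not the actual schema:

```python
# Illustrative sketch of the two-table failure graph (column names are assumptions).
from datetime import datetime

from pgvector.sqlalchemy import Vector
from sqlalchemy import DateTime, Float, ForeignKey, String, Text, UniqueConstraint, func
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class GraphNode(Base):
    __tablename__ = "graph_nodes"

    id: Mapped[int] = mapped_column(primary_key=True)
    node_type: Mapped[str] = mapped_column(String(32))   # run, task, file, component, symptom, failure_mode, fix_pattern, ...
    key: Mapped[str] = mapped_column(Text)                # natural key: file path, failure-mode slug, run id, ...
    embedding: Mapped[list[float] | None] = mapped_column(Vector(768), nullable=True)
    created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), server_default=func.now())

    # One row per (type, key) pair, so re-running the analyzer never duplicates nodes.
    __table_args__ = (UniqueConstraint("node_type", "key"),)


class GraphEdge(Base):
    __tablename__ = "graph_edges"

    id: Mapped[int] = mapped_column(primary_key=True)
    src_id: Mapped[int] = mapped_column(ForeignKey("graph_nodes.id"))
    dst_id: Mapped[int] = mapped_column(ForeignKey("graph_nodes.id"))
    edge_type: Mapped[str] = mapped_column(String(32))    # e.g. "exhibited", "caused_by", "fixed_by"
    confidence: Mapped[float] = mapped_column(Float)       # how sure the classifier was about this link
    evidence_run_id: Mapped[str] = mapped_column(Text)     # the run that proved the connection
    observed_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), server_default=func.now())  # feeds temporal decay

    __table_args__ = (UniqueConstraint("src_id", "dst_id", "edge_type"),)
```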
The closed loop is the whole point. Agent fails. Autopsy records. Autopsy investigates. Graph grows. Next similar task starts. Autopsy warns. Agent does it right the first time.
How we built it
- Service (Python 3.12, FastAPI, SQLAlchemy 2.x async, Postgres 16 + pgvector): The investigation lab. Handles event ingestion, run assembly, failure classification, graph construction, vector embeddings, and the preflight retrieval pipeline. All async, all idempotent. About 165 tests covering routes, the finalizer pipeline, traversal, classifier, extractor, and embeddings.
- Plugin (TypeScript, runs inside opencode): The black box recorder. Hooks into opencode's event bus, tool execution, permission system, and system prompt transform. Events are batched (200ms or 32 events, whichever comes first) and sent fire-and-forget so we never add latency to the LLM stream. Preflight calls have an 800ms hard timeout with fail-open behavior: if the service is down, the agent just keeps working normally.
- Dashboard (Next.js 16, React 19, Tailwind, App Router): Mission control. Server-rendered run list, SSE-driven live timelines that update as events arrive, a per-run autopsy panel, and a Three.js 3D force-graph explorer for the failure knowledge graph with filtering by failure mode, component, and change pattern.
- Embeddings: Google `gemini-embedding-001` with Matryoshka truncation to 768 dimensions (free tier, 1500 req/min). The leading dimensions stay semantically meaningful even at the reduced size. We also embed per-file patches for hybrid retrieval, so even if the task wording is different, structurally similar diffs still surface relevant past failures.
- Full pipeline: plugin captures events, batches them to the service, assembler writes run rows and artifacts, outcome triggers the finalizer chain (classifier with 5 deterministic rules plus optional Gemma, entity extractor, graph writer with 9 node types and 8 edge types, embedding writer for task/failure/fix/patch/error vectors), and preflight reads it all back via ANN search plus a recursive CTE that walks the graph with per-edge temporal decay and counter-evidence dampening (see the retrieval sketch right after this list).
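A rough sketch of the ANN half of that preflight retrieval, reusing the model sketch above. The function name and exact query are illustrative; the 3-hop walk itself lives in a recursive CTE in the real service:

```python
# Rough sketch of the preflight ANN lookup over past task embeddings (illustrative).
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession


async def similar_past_tasks(
    session: AsyncSession, query_vec: list[float], k: int = 5
) -> list[GraphNode]:
    """Nearest past task nodes by cosine distance over 768-dim task embeddings."""
    result = await session.execute(
        select(GraphNode)
        .where(GraphNode.node_type == "task")
        .order_by(GraphNode.embedding.cosine_distance(query_vec))  # pgvector ANN
        .limit(k)
    )
    # From each hit, the service then walks Run -> Symptom -> FailureMode -> FixPattern
    # (up to 3 hops, with temporal decay and counter-evidence dampening applied per edge)
    # and turns the surviving fix patterns into system prompt warnings.
    return list(result.scalars())
```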
Everything runs locally. One docker-compose for Postgres, one FastAPI process, one Next.js dev server. No cloud services required (Gemini API key is optional and free).
Challenges we ran into
Building the full closed loop in one weekend was the hardest part. This isn't a single feature; it's an entire pipeline: event capture, batched ingestion, run assembly, failure classification, entity extraction, graph construction, vector embedding, semantic retrieval, graph traversal, system prompt injection, and a dashboard to see it all. Every piece had to work for the demo to make sense, because the demo IS the loop.
Keeping the plugin non-blocking while still capturing enough state for meaningful analysis was a constant tension. The agent is talking to an LLM in real time, and we can't add perceptible latency. We ended up with fire-and-forget batching for events, tight timeouts with fail-open for preflight, and debounced postflight checks that only run during genuine idle periods (not between tool calls in an agentic loop).
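The plugin is TypeScript, but the fail-open shape of the preflight call is simple enough to sketch in Python with httpx; the service URL and JSON payload here are assumptions, not the actual plugin code:

```python
# Illustration of the fail-open preflight call: hard timeout, and any failure means
# "no warnings" rather than a blocked agent. (URL and JSON shape are assumptions.)
import httpx


async def fetch_preflight_warnings(task_text: str) -> list[str]:
    try:
        async with httpx.AsyncClient(timeout=0.8) as client:  # 800ms hard timeout
            resp = await client.post(
                "http://localhost:8000/v1/preflight",
                json={"task": task_text},
            )
            resp.raise_for_status()
            return resp.json().get("warnings", [])
    except Exception:
        # Fail open: if Autopsy is down or slow, the agent just keeps working.
        return []
```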
Designing the scoring math so it actually works in practice. Temporal decay means old failures lose weight exponentially, so a bug you fixed three months ago doesn't keep triggering warnings forever. Counter-evidence dampening means that if you've successfully completed similar tasks five times since the one failure, the risk score drops accordingly. Getting the half-life and dampening coefficients to feel right took iteration.
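In code, the idea is roughly the following; the formula shape and coefficients are placeholders, since tuning the real half-life and dampening values was exactly the iteration described above:

```python
# Placeholder version of the risk score for a retrieved failure edge.
import math


def edge_risk(
    similarity: float,        # cosine similarity between the new task and the past failure
    age_days: float,          # how long ago the failure happened
    successes_since: int,     # similar tasks completed cleanly since that failure
    half_life_days: float = 30.0,
    dampening: float = 0.5,
) -> float:
    temporal_decay = math.exp(-math.log(2) * age_days / half_life_days)
    counter_evidence = 1.0 / (1.0 + dampening * successes_since)
    return similarity * temporal_decay * counter_evidence
```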
Capturing rejection reasons turned out to be surprisingly fragile. opencode's `permission.replied` bus event doesn't include the user's free-text feedback. We had to build three fallback mechanisms: querying opencode's local HTTP server for the permission details, the custom `autopsy_register_rejection` tool for implicit frustration, and a manual "Why?" form on the dashboard.
Accomplishments that we're proud of
The closed-loop demo actually works. You can watch an agent fail a task, see Autopsy record and classify it, then start a similar new task and see the warning appear in the agent's system prompt before it generates a single token. The agent does the right thing on the first try. That's not a mock or a slide. It's a real agent runtime with real preflight injection.
165 service tests passing, covering the full pipeline from event ingestion through graph traversal. Idempotent writes everywhere (events, nodes, edges, embeddings), so the plugin can retry and the analyzer can re-run without doubling anything.
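Idempotent here mostly means every write can be replayed. A sketch of the pattern, using Postgres ON CONFLICT DO NOTHING against the node model sketched earlier (illustrative, not the actual code):

```python
# Retry-safe node write: the unique (node_type, key) constraint plus
# ON CONFLICT DO NOTHING means re-running the analyzer never doubles rows.
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.ext.asyncio import AsyncSession


async def upsert_node(session: AsyncSession, node_type: str, key: str) -> None:
    stmt = (
        insert(GraphNode)
        .values(node_type=node_type, key=key)
        .on_conflict_do_nothing(index_elements=["node_type", "key"])
    )
    await session.execute(stmt)
```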
Working integration with a live agent runtime, not a synthetic demo environment. The plugin runs inside opencode, hooks into real bus events, and injects real system prompt addenda. If you uninstall Autopsy, opencode works exactly the same as before. It's purely additive.
One-line install. `curl | bash` brings up the full stack, installs the plugin, writes the env vars, and you're running. `~/.autopsy/stop.sh` tears it all down.
What we learned
Building for AI agents is fundamentally different from building for humans. The "user" of your system prompt injection is a language model, so the warning text has to be calibrated for how LLMs read instructions, not how people read documentation. We also learned that the most valuable signal for failure analysis isn't the code diff; it's the rejection reason. A user saying "you forgot the migration" is worth more than any amount of static analysis on the patch.
We also came to appreciate how much of the problem is retrieval precision, not generation quality. The hard part isn't writing a good warning message. The hard part is knowing WHEN to warn and WHAT to warn about. That's why the graph structure and the scoring math matter more than the LLM call, and why we kept the LLM out of the preflight critical path entirely.
What's next
- Quantitative measurement harness. We want to answer "Autopsy cut repeated agent mistakes by X%" with real numbers, not just a demo loop.
- More agent runtimes. The service API is agent-agnostic. Any agent that can POST events and call `/v1/preflight` can use Autopsy. We want adapters for Devin, Cursor, Claude Code, and the generic OpenAI Agent SDK.
- Cross-team failure graphs. Right now the graph is per-project. But if one team's agent learns that "adding a Prisma field requires a migration," every team's agent should know that. Shared failure memory across an organization.
- Ranked retrieval improvements. The current ANN + 3-hop CTE returns top-K by raw cosine distance with temporal decay. Weighting by project scope, task-type affinity, and file-overlap would noticeably improve precision on noisier graphs.
Built With
- claude
- cloudflare
- codex
- devin
- docker
- fastapi
- gemma
- godaddy
- next.js
- opencode
- pgvector
- python
- react
- sqlalchemy
- tailwind
- typescript


