The Recursive Developer
How to build systems that improve themselves, and what happens to you when they do.
MCP: The Nervous System You’re Probably Not Building Yet
If you’ve been following the agentic AI space for the last year, you’ve heard MCP described as “USB-C for AI,” a universal protocol that lets any AI model connect to any data source or tool through a standard interface. Anthropic launched it in November 2024, donated it to the Linux Foundation in December 2025, and by early 2026 it has co-founding support from Anthropic, Block, and OpenAI, with Google, Microsoft, AWS, Bloomberg, and Cloudflare as platinum members. The ecosystem has blown past 10,000 community-built servers and 97 million monthly SDK downloads.
But most developers are sleeping on what MCP actually means for them personally.
The enterprise pitch is about solving the “N×M integration problem,” the unsustainable reality of building custom connectors between every AI model and every internal tool. That’s a CTO’s problem. The developer’s version: every MCP server you build is a capability that every agent in your stack inherits instantly.
Build an MCP server that wraps your PostgreSQL database. Now Claude Code can query it. So can your Cursor agent. So can any custom agent you spin up tomorrow. You built the integration once. Every agent gets it for free. Build another one for your CI pipeline. Another for your documentation system. Another for your error tracking. Each server is a permanent addition to the nervous system that your agents operate within.
This is how tool registries get built from the bottom up. Not as a corporate initiative with a project plan and a Jira epic. As a developer building the tools they want their agents to have, one server at a time.
The practical mechanics are straightforward. An MCP server is a separate program that exposes capabilities (tools the agent can call, resources it can read) through a standardized interface. The agent discovers what’s available by asking the server. The server responds with schemas and metadata. The agent picks the tool it needs, calls it, gets a result. It’s the same tool-use loop from Post 2, but with a universal adapter layer underneath.
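The discover-then-call loop can be sketched in a few lines. Everything here is illustrative: the tool name, schema, and handler are invented, and a real MCP server would expose this registry over stdio or HTTP via an MCP SDK rather than as in-process dictionaries.

```python
# Hypothetical in-memory stand-in for an MCP server's tool registry.
# A real server speaks JSON-RPC over stdio/HTTP; the shape is the point.
TOOLS = {
    "query_db": {
        "description": "Run a read-only SQL query",
        "input_schema": {"type": "object",
                         "properties": {"sql": {"type": "string"}}},
        "handler": lambda args: f"rows for: {args['sql']}",
    },
}

def list_tools():
    # Discovery: the agent asks what's available and gets schemas back.
    return [{"name": n,
             "description": t["description"],
             "input_schema": t["input_schema"]} for n, t in TOOLS.items()]

def call_tool(name, arguments):
    # Invocation: the agent picks a tool and calls it with schema-shaped args.
    return TOOLS[name]["handler"](arguments)
```

The agent never hardcodes what the server offers; it learns the menu at runtime, which is exactly why every new server is instantly usable by every agent.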
The part that matters for production: once you configure an MCP server for a repository, agents can use those tools autonomously without asking your permission. This isn’t a bug. It’s a design choice. And it means you need to think about MCP server configuration the same way you think about IAM policies. What am I granting access to, and what’s the blast radius if the agent does something dumb?
The security model that’s emerging in production follows a gateway pattern. The agent holds only a gateway API key. The gateway holds the downstream service secrets. This gives you a single point of governance for authentication, rate limiting, and audit logging. The MCP specification explicitly flags the “Confused Deputy Problem,” where a model can be tricked by prompt injection into calling tools it shouldn’t, using valid credentials that belong to the user, not the attacker. It’s the same separation-of-concerns instinct that drives every good infrastructure decision: don’t give the agent the keys to the kingdom. Give it a supervised hallway to the rooms it needs, and design the hallway so that even a confused agent can’t reach rooms it shouldn’t.
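A minimal sketch of that supervised hallway, with every key, service name, and secret invented for illustration: the agent authenticates with one gateway key, the gateway holds the downstream secrets, and a per-key allowlist bounds the blast radius even when the agent is confused.

```python
import hmac
import time

GATEWAY_KEY = "gw-key-123"                      # the only secret the agent holds
DOWNSTREAM_SECRETS = {"postgres": "pg-secret",  # never leave the gateway
                      "github": "gh-token"}
ALLOWED = {"gw-key-123": {"postgres"}}          # per-key service allowlist
AUDIT_LOG = []                                  # single point of audit logging

def call_downstream(agent_key, service, request):
    if not hmac.compare_digest(agent_key, GATEWAY_KEY):
        raise PermissionError("unknown gateway key")
    if service not in ALLOWED.get(agent_key, set()):
        # A confused deputy with a valid key still can't reach this room.
        raise PermissionError(f"{service} not in this key's allowlist")
    AUDIT_LOG.append((time.time(), agent_key, service, request))
    secret = DOWNSTREAM_SECRETS[service]
    return f"{service} response to {request!r} (auth with {secret[:2]}...)"
```

Rate limiting and credential rotation bolt onto the same choke point, which is the whole argument for the pattern: one place to govern, one place to log.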
The developers who are getting the most out of this right now aren’t the ones building the most sophisticated agents. They’re the ones building the richest tool environments for their agents to operate in. The agent’s intelligence is rented from a model provider. The tool environment is owned infrastructure. That’s where the compounding advantage lives.
The Self-Extending Agent: When the Tool Environment Builds Itself
The purest expression of this principle is Armin Ronacher’s Pi, a coding agent with the shortest system prompt of any agent he’s aware of and exactly four tools: Read, Write, Edit, Bash. That’s it. No plugin marketplace. No extension registry. If you want Pi to do something it can’t do yet, you don’t download an extension. You ask the agent to extend itself.
This is the recursive developer thesis in miniature. Ronacher’s own extensions (a task manager, a code review workflow, a multi-agent communication layer) were all written by the agent to his specifications. The agent writes the code, hot-reloads it into the running session, tests it, and loops until the extension works. “None of this was written by me,” Ronacher notes. “It was created by the agent to my specifications.”
What makes Pi architecturally interesting is that sessions are trees, not linear histories. You can branch into a side-quest (fix a broken tool, investigate a dependency, prototype an approach) then rewind to the main branch and get a summary of what happened. Side-quest context doesn’t pollute the main conversation. The branching model solves a problem every agentic developer has hit: the agent goes down a rabbit hole and now your entire session context is contaminated with irrelevant exploration.
Pi also avoids a cost trap that MCP creates by design. MCP tools load into the system context, and every tool registration consumes tokens just by existing. Pi’s extension system maintains state outside the model context entirely, so capabilities scale without burning through the context budget. That’s not an implementation detail. It’s an architectural philosophy: the tool environment should grow without the per-session overhead growing with it.
The OpenClaw project takes this further, treating the workspace itself as an operating system for agents. Agent identity is defined through plain files: SOUL.md for purpose and behavior, TOOLS.md for capabilities, IDENTITY.md for personalization, HEARTBEAT.md for execution configuration. All diffable. All version-controllable. All modifiable by the agent itself. The workspace isn’t just where the agent works. It’s what the agent is.
Geoffrey Huntley makes the architectural point even more bluntly: “Cursor, Windsurf, Claude Code, and Amp are just a small number of lines of code running in a loop of LLM tokens.” Every coding agent reduces to about 300 lines of code wrapping five primitives (read, list, bash, edit, search) in a token-consuming loop. The agent is commodity. The tool environment is the differentiator. The developer who builds the richest set of tools for that loop to call has the most capable agent, regardless of which model sits behind it.
The Configuration Stack: Infrastructure as Code for Cognition
In Post 2, I introduced the idea that context engineering is configuration, not conversation. This is what it looks like when you take it seriously.
The most effective practitioners are building a three-layer configuration stack that governs how agents operate in their codebase. Peter Steinberger put it bluntly: “I don’t design codebases to be easy to navigate for me, I engineer them so agents can work in it efficiently.” Each layer serves a different purpose, and getting the boundaries right is the difference between an agent that consistently ships quality code and one that reinvents your architecture every session.
Layer 1: CLAUDE.md — The Constitution
CLAUDE.md is the project-level document that gets injected into every agent session automatically. It’s your constitution: the rules that never change, the conventions that must always be followed, the context that every task requires regardless of scope.
What goes in:
Package and tooling mandates. “We use yarn, not npm.” “We use pytest, not unittest.” “All API routes go through the gateway service.” These are the decisions that, if the agent gets wrong, create hours of cleanup.
Architectural boundaries. “The data layer never imports from the UI layer.” “All database access goes through the repository pattern.” “No direct HTTP calls from business logic.” The Codex team enforced theirs as Types → Config → Repo → Service → Runtime → UI with custom linters that catch violations mechanically.
Code style that can’t be linted. Naming conventions. Error handling patterns. How you structure tests. The stuff that’s too subtle for eslint but too important to leave to the model’s defaults.
What not to do. “Never use any types.” “Never commit directly to main.” “Never modify migration files after they’ve been applied.” The anti-patterns that you’ve been burned by before.
What stays out: anything task-specific, anything that changes frequently, anything that’s better served by a more targeted document. CLAUDE.md should be stable. If you’re editing it every session, you’re using it wrong.
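Pulling the examples above together, a constitution in this spirit might look like the following. The contents are illustrative, not a template to copy:

```markdown
# CLAUDE.md

## Tooling
- Use `yarn`, not `npm`. Use `pytest`, not `unittest`.
- All API routes go through the gateway service.

## Boundaries
- The data layer never imports from the UI layer.
- All database access goes through the repository pattern.

## Never
- Never use `any` types. Never commit directly to `main`.
- Never modify migration files after they've been applied.
```

Short, stable, scannable. Everything else belongs in a playbook.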
The payoff for getting this right is measurable. An Arize AI study showed that optimizing CLAUDE.md instructions improved SWE-Bench Lite task accuracy by 10.87%, with zero changes to the model, no fine-tuning, no custom infrastructure. Just better configuration.
The OpenAI Codex team kept their equivalent (AGENTS.md) under 100 lines, functioning as a table of contents that points to deeper sources of truth. That’s the right instinct. The constitution should be scannable, not encyclopedic. An agent that has to process 2,000 lines of CLAUDE.md before it starts working is burning context on overhead.
Layer 2: AGENTS.md / Task-Specific Context — The Playbooks
Below the constitution sits task-specific configuration. This pattern has gained enough traction that AGENTS.md is now an open standard under the Linux Foundation’s Agentic AI Foundation, co-launched by Google, OpenAI, Factory, Sourcegraph, and Cursor, with over 60,000 repositories adopting it. These are the playbooks for recurring types of work: “how to add a new API endpoint,” “how to write a migration,” “how to set up a new service.” Each one is a self-contained reference that the agent can pull in when the task demands it.
The Codex team’s insight here is that each playbook should be written so that “a single, stateless agent, or a human novice, can read it from top to bottom and produce a working, observable result.” That’s not a documentation standard. That’s a compilation target. You’re writing instructions that a machine can execute. Every ambiguity is a potential failure mode.
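Treating the playbook as a compilation target looks like this in practice. The file paths and commands are invented; the discipline is the point: every step names its file, its command, and its observable result.

```markdown
## Playbook: add a new API endpoint

1. Define request/response types in `types/` first; run `yarn typecheck`.
2. Register the route in `gateway/routes.ts`. Never call services directly
   from the route handler.
3. Write the handler test before the handler; `yarn test` must fail, then pass.
4. Verify: `curl localhost:3000/api/<route>` returns 200 with the new schema.
```

If a stateless agent can't execute a step without guessing, the step is a bug in the playbook, not in the agent.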
Layer 3: ExecPlans — The Work Orders
For complex, multi-step work, the top practitioners treat plans as first-class artifacts checked into the repo. The Codex team’s ExecPlans include mandatory sections for progress tracking, surprises and discoveries, decision logs, and retrospectives. They’re living documents that the agent updates as it works, creating an audit trail that any other agent (or human) can pick up later.
This is where the recursive potential starts to show. The ExecPlan isn’t just a plan. It’s a record of how the plan was executed, what went wrong, and what was learned. Feed that back into your CLAUDE.md or playbooks, and you’ve closed a learning loop. The system got better because it tracked its own mistakes.
The Enforcement Layer: Hooks
Configuration alone isn’t enough. Configuration says “please do this.” Hooks say “you must do this, and if you don’t, the operation fails.”
Claude Code hooks let you intercept tool calls and enforce behavior deterministically. Format on save. Lint before commit. Block destructive commands unless explicitly authorized. Log every MCP tool invocation. These aren’t suggestions the agent might follow. They’re mechanical constraints the agent cannot bypass.
The combination of declarative configuration (CLAUDE.md) and imperative enforcement (hooks) is what gives this stack its production reliability. The configuration shapes intent. The hooks guarantee compliance. Together, they’re infrastructure as code for cognition, version-controlled, testable, and reproducible across every agent session.
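As a concrete sketch, hooks live in Claude Code's settings file and match against tool calls. The shape below follows the hooks configuration schema; the specific scripts are placeholders for whatever enforcement you actually want:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "./scripts/block-destructive.sh" }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npx prettier --write ." }
        ]
      }
    ]
  }
}
```

A PreToolUse hook that exits nonzero blocks the call; a PostToolUse hook runs after every matching edit. That's the deterministic layer configuration alone can't give you.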
The Flywheel: When the Harness Improves Itself
Everything I’ve described so far is static. You build the harness; the agent operates within it. The harness is good because you made it good. It stays good because you maintain it.
The flywheel is what happens when the agent starts maintaining the harness itself. And improving it.
Mitchell Hashimoto (creator of Terraform, Vagrant, and Ghostty) articulated the principle that makes this work: “Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again.” Look at Ghostty’s AGENTS.md file: every line corresponds to a specific past agent failure that is now mechanically prevented. It’s load-bearing infrastructure that accumulates preventive constraints over time. In isolation, that’s just good engineering. At scale, it’s a compounding advantage. Every mistake becomes a new linter rule, a new test, a new entry in CLAUDE.md, a new hook. The harness absorbs the failure and immunizes itself against the entire class of errors.
This isn’t just practitioner intuition. A 2025 ICLR workshop paper on SICA (Self-Improving Coding Agent) showed agents that directly edit their own scripts, proposing modifications, re-evaluating performance, keeping changes that improve metrics, achieving 17-53% performance improvements through iterative self-editing loops. The principle works mechanistically, not just philosophically.
Garbage Collection Agents
The Codex team’s garbage collection story is worth telling in full because it illustrates how the flywheel starts. They were spending 20% of their engineering time every Friday manually cleaning up “AI slop,” accumulated code quality issues from agent-generated code. One day a week, devoted entirely to entropy management. That didn’t scale.
Their solution: encode “golden principles” directly into the repository and build recurring cleanup agents that scan for deviations, update quality grades, and open targeted refactoring PRs on a regular cadence. The manual Friday cleanup became an automated continuous process. The codebase stays cleaner because the system is designed to self-correct.
The Improvement Loop
Kief Morris at Thoughtworks described the full flywheel in detail. In practice it works like this:
Agents execute within the harness, following the specs, tests, and constraints you’ve defined.
Agents evaluate their own performance using test results, build times, production metrics, error rates, and user feedback as signals.
Agents propose improvements to the harness: new tests to cover gaps, updated documentation, tighter linter rules, better prompts.
Low-risk improvements get auto-applied. A new test that covers a discovered edge case? Auto-merge. An updated docstring that matches the current implementation? Auto-merge.
High-risk improvements go to a human backlog. A proposed architectural change? A new dependency? A modification to the deployment pipeline? That goes to a human for review and prioritization.
The critical design decision is the risk threshold. Too low, and nothing gets auto-applied, so the flywheel stalls. Too high, and the system modifies itself in ways that create more problems than it solves. The teams that get this right treat the threshold as a tunable parameter, starting conservative and opening up as trust builds.
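The triage gate itself is a small piece of code. This is a sketch under invented assumptions: the risk taxonomy and scores are placeholders, and a real system would score proposals from richer signals than a static lookup.

```python
AUTO_APPLY_THRESHOLD = 0.3       # start conservative, raise as trust builds

RISK_SCORES = {                  # hypothetical static risk taxonomy
    "new_test": 0.1,
    "docstring_update": 0.1,
    "linter_rule": 0.4,
    "new_dependency": 0.8,
    "deploy_change": 0.9,
}

def triage(proposal_kind):
    # Unknown proposal kinds default to maximum risk: human review.
    risk = RISK_SCORES.get(proposal_kind, 1.0)
    return "auto-apply" if risk <= AUTO_APPLY_THRESHOLD else "human-backlog"
```

Note that the threshold is a named constant, not a buried literal. Tuning it is a one-line diff you can track, which is what "treat it as a tunable parameter" means in practice.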
What This Looks Like Monday Morning
If you’re sitting at your desk wondering what to actually build, here’s the concrete version.
Brian Lovin articulated the core principle: “If the agent ever asks you to do something manually, you should 1) stop 2) think really really hard about how to give the agent the tools it needs so it can do the thing by itself.” Every manual intervention is a signal that the harness is incomplete. The fix isn’t doing the work for the agent. It’s extending the laboratory so the agent can do it next time.
His phase-based methodology (Instrument, Diagnose, Iterate, Report) is a flywheel in miniature. First, build the measurement harness: benchmarks, timing utilities, Chrome DevTools MCP for performance traces. Then let the agent diagnose what the data says. Then iterate on hypotheses one at a time, re-running benchmarks between each change. Finally, generate a report with before/after comparisons. The agent runs the whole cycle autonomously because the instruments were already in place.
Specific things to build:
Write a hook that logs every tool call your agent makes. After a week, analyze the logs. Which tools get called most? Which ones fail? Which ones produce results the agent throws away? That’s your first improvement signal.
Add a scheduled task that runs your test suite and reports on coverage gaps. Not a CI pipeline. A task that the agent runs proactively, identifies untested code paths, and generates test stubs for them.
Build a “documentation freshness” checker. An agent that diffs your README, API docs, and inline comments against the actual code. When they drift beyond a threshold, it opens a PR with updates.
Create a “pattern violation” detector. Feed your CLAUDE.md conventions to an agent and have it scan recent PRs for deviations. Use it as a feedback loop, not a gate. It tells you which conventions agents violate most, so you can either improve the convention or improve the CLAUDE.md description of it.
Propagate patterns across projects. Peter Steinberger runs agents across his entire portfolio of Go projects simultaneously. When he figures out a new pattern, he tells the agent to find all his recent projects and implement the change everywhere, updating changelogs as it goes. The flywheel doesn’t have to be contained to one repo.
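The first item on that list, analyzing a week of tool-call logs, is a few lines of code once the hook is writing JSONL. The log format here is hypothetical (one `{"tool": ..., "ok": ...}` object per line); adapt it to whatever your hook actually emits.

```python
import json
from collections import Counter

def analyze_tool_log(lines):
    """Summarize tool-call frequency and failures from JSONL log lines."""
    calls, failures = Counter(), Counter()
    for line in lines:
        entry = json.loads(line)
        calls[entry["tool"]] += 1
        if not entry.get("ok", True):
            failures[entry["tool"]] += 1
    return calls.most_common(), failures.most_common()

# A toy week of logs:
log = [
    '{"tool": "bash", "ok": true}',
    '{"tool": "grep", "ok": false}',
    '{"tool": "bash", "ok": true}',
]
```

The most-called and most-failed tools are your first improvement signal: a tool that fails often needs a better schema or better docs, and a tool that's never called is context budget wasted.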
None of these are exotic. All of them compound.
The Cost Meta-Game
Every cycle of the flywheel costs tokens. Every garbage collection run, every documentation freshness check, every pattern violation scan consumes compute. Which means cost engineering isn’t just about running your primary tasks cheaply. It’s about making the improvement loop itself economically sustainable.
This is where model routing becomes a competitive advantage, not just a savings tactic.
The price spread across available models has widened to over 1,000x. Budget-tier models run $0.02-0.25 per million input tokens. Mid-range models sit at $1-5. Frontier reasoning models charge $20+. That’s not a market with gradients. It’s a geological cross-section with distinct strata. And the developers who win are the ones who match each task to the right stratum.
The routing principle is simple: use the cheapest model that can do the job. But price isn’t the only axis. Huntley’s model quadrant maps models along two dimensions: high safety versus low safety, and oracle (deep reasoning, summarization) versus agentic (biases toward action, tool-calling). Sonnet is what he calls “a robotic squirrel that just wants to do tool calls, it doesn’t spend too much time thinking; it biases towards action.” Oracles get wired in as tools for verification and reasoning checks. The routing decision isn’t just “what’s cheapest?” It’s “what behavioral profile does this task need?”
In practice: your garbage collection agent that checks for dead code? It doesn’t need Opus. Haiku can identify unused imports just fine. Your documentation drift checker? Sonnet handles it. You save Opus for the moments that actually require deep reasoning, the architectural decisions, the complex debugging sessions, the novel code generation where getting it wrong costs more than getting it right slowly.
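The router for that split can start as a lookup table. Everything below is a hedged sketch: the tier names, per-token prices, and task taxonomy are placeholders standing in for your actual workload analysis.

```python
TIERS = {
    "budget":   {"model": "haiku-class",  "usd_per_mtok": 0.25},
    "mid":      {"model": "sonnet-class", "usd_per_mtok": 3.00},
    "frontier": {"model": "opus-class",   "usd_per_mtok": 20.00},
}

TASK_TIER = {                      # invented taxonomy: match task to stratum
    "dead_code_scan": "budget",
    "doc_drift_check": "mid",
    "architecture_review": "frontier",
    "complex_debugging": "frontier",
}

def route(task):
    tier = TASK_TIER.get(task, "mid")   # unknown work defaults to mid-range
    return TIERS[tier]["model"]
```

A static table is the version you ship on Monday; the learned routers discussed next are what it grows into once you have call logs to train on.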
RouteLLM, published at ICLR 2025, formalized this with a matrix factorization router that achieved 95% of GPT-4’s quality while routing only 26% of calls to the expensive model, roughly 48% cheaper than naive random routing. With training data augmentation, they got the same quality using only 14% expensive-model calls. This isn’t theoretical. These are published numbers with reproducible results.
The full cost engineering stack for a self-improving system combines four techniques:
Model routing: 70% cheap / 20% mid / 10% frontier for major savings on mixed workloads.
Prompt caching: stable prefixes (system prompts, CLAUDE.md, tool definitions) cached at 90% token discount. The “Don’t Break the Cache” paper showed 41-80% total cost reduction across providers.
Batch APIs: 50% discount for work that doesn’t need real-time results. Garbage collection and documentation checks are perfect candidates.
Semantic caching: identical or near-identical queries served from cache. 30-50% savings on high-repetition workloads.
Combined, these can achieve 90% reduction from naive pricing. That means a flywheel that would cost $10,000/month at list prices runs at $1,000. At that price point, continuous self-improvement isn’t a luxury. It’s a line item.
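The arithmetic behind that claim, as a back-of-envelope. The individual factors below are rough assumptions chosen to match the article's figures, and in practice the techniques overlap rather than compose cleanly:

```python
# Illustrative only: assumed savings factors, multiplied naively.
list_price = 10_000.0                 # $/month at naive list pricing

cost = list_price
cost *= (1 - 0.60)   # model routing: assume ~60% saved on the mixed workload
cost *= (1 - 0.50)   # prompt caching on the stable prefix: assume ~50% saved
cost *= (1 - 0.50)   # batch API on offline jobs: 50% discount on that slice

# 10,000 * 0.40 * 0.50 * 0.50 = 1,000: the ~90% reduction figure
```

Even if each individual factor is off, the multiplicative structure is the lesson: stacking three mediocre discounts beats chasing one heroic one.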
What Happens When You Scale This Up
Everything I’ve described so far fits in one developer’s head. One harness, one flywheel, one codebase. What happens when you run hundreds of agents for weeks?
Cursor answered this by building a web browser from scratch, over a million lines of code across a thousand files, written entirely by agents running close to a week, consuming trillions of tokens. Their architecture split agents into three roles: planners that continuously explore the codebase and spawn tasks (including sub-planners, making planning itself parallel and recursive), workers that pick up tasks and focus exclusively on completing them, and a judge that evaluates whether to continue after each iteration.
The first thing they learned is that flat hierarchies don’t work. When they gave all agents equal standing, agents became risk-averse. They avoided hard problems and made small, safe changes. Nobody took responsibility for end-to-end implementation. The same diffusion-of-responsibility problem that plagues human teams, reproduced in silicon.
The second finding: lock-based coordination killed throughput. Twenty agents slowed to the effective output of two or three, with most time spent waiting for locks. The fix was optimistic concurrency control, borrowed directly from database systems. Agents read state freely, but writes fail if the state changed since the last read. Conflicts trigger retries instead of blocking. Throughput recovered immediately.
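The optimistic pattern is compact enough to sketch in full. This is a generic version-stamped store, not Cursor's implementation: reads are free, a write fails if the version moved since the read, and conflicts retry instead of blocking.

```python
class VersionedStore:
    """Shared state with optimistic concurrency control via a version stamp."""

    def __init__(self):
        self.value, self.version = {}, 0

    def read(self):
        # Reads never block; the caller gets a snapshot plus its version.
        return dict(self.value), self.version

    def write(self, new_value, read_version):
        if read_version != self.version:   # someone wrote since our read
            return False                   # caller retries instead of waiting
        self.value, self.version = new_value, self.version + 1
        return True

def update_with_retry(store, mutate, max_retries=10):
    # Conflict resolution is a re-read and re-apply, not a lock queue.
    for _ in range(max_retries):
        snapshot, version = store.read()
        if store.write(mutate(snapshot), version):
            return True
    return False
```

Twenty agents hitting this store mostly don't conflict, and the ones that do lose only the cost of a retry, not the cost of everyone else waiting.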
And the third finding, the one that matters most for this article’s thesis: they tried adding an integrator agent to merge worker outputs, and it made things worse. “Workers were already capable of handling conflicts themselves.” Removing the extra coordination layer improved the system. Simpler harness, better results.
The punchline lands in a quote that should be pinned above every agentic architect’s desk: “A surprising amount of the system’s behavior comes down to how we prompt the agents. Getting them to coordinate well, avoid pathological behaviors, and maintain focus over long periods required extensive experimentation.” The architecture mattered. The infrastructure mattered. But the prompts mattered more. The configuration stack (the CLAUDE.md, the playbooks, the system prompts) isn’t a nice-to-have bolted on after the engineering is done. It’s the engineering.
The Harness Experiments Nobody’s Talking About
Cursor isn’t alone. A cohort of projects is running divergent experiments on the same question this article asks (how do you build the system that improves itself?) and their answers are instructively different.
Nous Research’s Hermes Agent takes a softer approach to self-improvement than Pi’s code-based extensions. Instead of writing executable tools, it creates Skill Documents, searchable markdown files that capture how it solved hard problems, which patterns worked, and what tools it used. These skills follow the open agentskills.io standard and load automatically when similar tasks arise. It’s procedural memory via markdown: safer than self-written code, but less capable of fundamentally extending the agent’s abilities. The agent “remembers” how to approach problems but can’t give itself new mechanical capabilities.
Amp (Sourcegraph) implements the oracle pattern that Huntley’s quadrant predicts: a fast worker (Claude Sonnet) handles tool use and code generation, while a read-only oracle (OpenAI o3) provides deep analysis and architectural review without writing code. The oracle doesn’t act. It advises. The worker acts on the advice. Two-model calls per complex task means higher latency, but the separation prevents the reasoning model from going full “robotic squirrel” when what you need is judgment, not action.
Block’s Goose is betting on local-first sovereignty, with built-in inference, open model downloads, no external runtimes required, and an experimental peer-to-peer compute sharing model where users share idle capacity with explicit trust controls. It’s the opposite of the API-dependent flywheel I’ve been describing, optimized for the developer who doesn’t want their agent’s capabilities to depend on a provider’s pricing page.
The through-line across all of these: the real innovation is happening in the harness, not the model. Pi bets on self-extension. Hermes bets on procedural memory. Cursor bets on hierarchy. Amp bets on model specialization. Goose bets on sovereignty. They’re all experimenting with different answers to the same design question. The answers that survive will define what “development environment” means for the next decade.
Where This Ends (Or Doesn’t)
Let me be honest about the trajectory, because the hacker audience I’m writing for would see through anything less.
METR (Model Evaluation and Threat Research) has been tracking this rigorously. The length of tasks AI agents can complete at 50% reliability doubled approximately every seven months between 2019 and 2024. Starting in 2024, that pace accelerated to roughly every four months. As of February 2026, Claude Opus 4.6 achieves 50% success rate on tasks requiring approximately 12 hours of human effort. The Codex team ran agents for six hours straight on iterative refinement. Replit’s Agent 3 works autonomously for over three hours with self-verification.
The trajectory is quantified, not speculative. At the current pace, week-long autonomous tasks arrive by late 2026. METR projects that month-long projects become feasible by decade’s end. Each expansion changes what “development” means.
When the window was five minutes, agents were autocomplete. You stayed in the loop on every decision.
When it hit thirty minutes, agents became junior developers. You defined the task, they did the implementation, you reviewed the result.
When it reaches hours, agents are team members. You set goals, they plan and execute, you review at milestones.
When it reaches days, and I believe it will, probably in 2027, agents are departments. You set strategy. They deliver projects.
At each stage, the same pattern holds: the developer’s job moves one level up the abstraction stack. You stop writing code and start writing tests. You stop writing tests and start writing specs. You stop writing specs and start designing harnesses. You stop designing harnesses and start designing the systems that improve harnesses.
This is what I mean by “the recursive developer.” The recursion isn’t just technical, agents calling agents calling agents. It’s professional. Each layer of capability you build makes you responsible for the next layer up. The developer who masters harness engineering in 2026 will be designing self-improving systems in 2027 and governing autonomous engineering teams in 2028.
The Competency Question, Revisited
In Post 2, I raised the concern that developers who outsource implementation stop building the skills to evaluate output. That concern doesn’t go away at this level. It gets sharper.
If you’re building a system that improves itself, you need to understand deeply enough to evaluate whether the improvements are actually improvements. A garbage collection agent that deletes “unused” code that’s actually called via reflection. A documentation updater that “corrects” a doc to match a bug rather than the intended behavior. An auto-applied linter rule that enforces a convention that made sense three refactors ago.
The recursive developer needs more judgment, not less. The abstraction goes up but the accountability doesn’t delegate.
This is why I keep coming back to the harness. The harness is the artifact of your judgment. The tests encode what you know about correctness. The CLAUDE.md encodes what you know about the system. The hooks encode what you know about risk. When the agent operates within that harness, it’s operating within the boundaries of your expertise. When the flywheel improves the harness, it’s extending those boundaries, but only if you’re reviewing the extensions with the same rigor you’d apply to any code review.