Deep targeted agentic coding
Spec-guided, massively parallel, massively collaborative
Welcome to the first edition of KLDiv! Currently on my mind is (1) how much coding agents are changing everything programming related, (2) the fascinating advances in RL over LLMs and the hardware infrastructure supporting that, and (3) the economic forces shaping and being shaped by AI in general. So drop me a line if you want to geek out on such topics. This first post is a slightly unusual take on (1).
Picture this. You log into a GitHub-like site. On the trending page there are entries of coding projects with attached “improvement specifications”. Each specification tells you what needs to be improved and how to measure improvement. You browse, and after a while you decide that trying to improve some arcane variant of L-BFGS used in forecasting packages is a worthy goal to spend some of your remaining cloud infra provider credits on. You check: someone else already did some auto code optimization runs for that specification seven months ago, yielding marginal improvements; maybe fresh runs will find better solutions. You fire up a CLI command (or tell claude/codex/etc to do so) and a swarm of agents is spun up, massively in parallel, with the specification’s evals as an objective, orchestrated in various complex ways to tackle the problem. A few hours (days?) later, the code optimization run yields a much more meaningful improvement (capabilities did double in those seven months, after all). You hit ‘publish’, spin up a PR, and call it a day.
This might be one of the possible worlds we are heading into: where we match compute with coding tasks depending on their utility in a massively parallel, massively collaborative way, all of this happening through precise specifications and guides on what to improve and how to measure improvement. The credit assignment and emerging market dynamics of such a world are interesting to ponder. How do we collectively decide where to invest compute? What are the rewards for finding good solutions? How do we orchestrate all of this?
This is deep targeted agentic coding (deep coding for short?): a different beast from vibe coding and the other interactive/parallel flavors of AI-assisted coding, but one that complements them in various ways.
Many styles of code generation
Today's mainstream AI-assisted development comes in two flavors:
Interactive coding assistants
Tools like Cursor, Claude Code, Codex CLI, and GitHub Copilot excel at in-the-moment code generation. They're reactive, responding to immediate developer needs with single-shot solutions. While powerful for rapid prototyping and day-to-day coding, they lack the ability to systematically explore solution spaces or optimize beyond the immediate context. Their ability to incorporate external signals is generally limited, and while modern coding models can run for hours at a stretch, their capacity to explore multiple paths is bounded by how many steps they can backtrack within their chains of thought.
Delegated coding orchestrators
Running multiple instances of the above coding agents in parallel has also gained popularity. Tools like Conductor and Codex take this route and are a good fit for tackling multiple well-scoped, orthogonal features in parallel. The edge of this approach is breadth: get as many things done in parallel as possible before review or a further round of interactive iterations to refine the results. It shares the same weaknesses as the interactive assistants, however.
Deep targeted campaigns
On the other extreme, systems like AlphaEvolve and direct RL-over-LLM methods represent hand-crafted optimization pipelines. Researchers spend the bulk of their time building custom evaluation harnesses, defining reward/verification/fitness functions, and implementing evolutionary algorithms for specific problems. These tools have generally been relegated to the realm of research. But it doesn't have to be that way! Many coding use cases can be tied to some downstream evaluation function that can be captured with some care: latency, efficiency, cost, accuracy, etc.
While there is currently massive investment in tools for the first two classes of AI-assisted coding, the third class has been comparatively neglected and the space is instead filled with bespoke and siloed pipelines.
The power of parallelization
There’s a pattern in a subset of recent breakthroughs of generative AI applied to coding tasks. Math.inc's Gauss system deployed many copies of its agent working in parallel to generate a Lean formalization of the Prime Number Theorem. AlphaEvolve demonstrated how evolutionary approaches with multiple parallel branches could discover novel algorithmic improvements. Scientific coding agents have improved single-cell sequencing analysis pipelines, forecasting strategies, and geospatial methods, among other things, through parallel orchestration. Parallel refinement strategies show order-of-magnitude improvements when LLMs are given the computational space to explore solution spaces in parallel; they are basically an army of very intelligent monkeys.
Even as the models get more capable and distill longer and better agentic traces, parallelization will always be there to multiply their effect on targeted problems as needed.
Additionally, the field is now primed with many backend offerings ready to serve such massively parallel coding endeavors. From Modal to Morph Labs to Prime Intellect, it has probably never been easier to spin up such workflows at scale with very little overhead. Given the surge of open source models, this can also be done in low-cost settings for targeted coding that warrants breadth on a budget.
Thus, we are in a curious position where we aren’t held back by fundamental technical limitations like model capability or cumbersome backends. Rather, the bottleneck is our ability to orchestrate exploration effectively across a multitude of targeted tasks. This requires putting some thought into ways to publish and contribute to tricky areas or problems that require deep resource allocation, and into how this interfaces with us and the agents we spawn.
Four possible pillars of deep coding
Here’s what I think a framework to streamline deep coding could look like. In a nutshell, it’s grounded in specifications that point to where to optimize, with linked evaluations that tell you what to optimize, which are then used by parallel orchestrators that encode how to optimize, all within a market where the aforementioned specifications are wired with incentives that tell the agentic swarms why to optimize. Note that these artifacts (the specs, the evals, the orchestrators) need not be written by humans! (Well, the evals at least should be thoughtfully designed with humans in the loop.) Their function is to present an interface between code optimization (codeopt?) runs by multiple parties that is trackable, comparable, and auditable. It’s a language for organizing deep coding efforts at large.
1. Specs
The first artifact is the specification (spec). These should be fairly straightforward, describing what needs to be improved, where to find it, and what evals need to be run. A spec could look something like this:
```yaml
version: 0.1
name: myspec
description: improve the sorting methods
evals: [myeval]
pins:
  - id: my_method
    language: python
    files: ["src/methods.py"]
    symbol: "my_method"
    ast_query: null
```
The tricky things to define here are the “pins”, itemized locators of what to optimize. Each pin should provide:
Identity: Stable across code revisions, allowing tracking across evolutionary branches
Locator: How to find the code (file globs, symbol names, regex delimiters, or tree-sitter queries)
Interface constraints: Immutable contracts that proposed edits must respect to prevent breaking changes
Note, again, that the spec carries no optimization hints. It doesn't know or care whether you're using evolutionary algorithms, reinforcement learning, or simulated annealing. It simply declares what can change and what must stay constant. This allows the same spec to be reused across different optimization strategies, evaluation criteria, and even, possibly, different codebases with similar structures.
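None of this prescribes how an orchestrator consumes a spec, but a minimal resolver for symbol-based pins could look like the sketch below. It handles only Python function/class pins, and the identity hashing scheme is my own illustration, not something the spec format mandates:

```python
import ast
import hashlib

def resolve_pin(source: str, symbol: str):
    """Locate a pinned symbol in a source file, returning its AST node.

    A minimal locator for symbol-based pins; real pins might also use
    file globs, regex delimiters, or tree-sitter queries.
    """
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)) and node.name == symbol:
            return node
    raise LookupError(f"pin symbol {symbol!r} not found")

def pin_identity(source: str, symbol: str) -> str:
    """Stable pin identity: hash a normalized AST dump of the symbol.

    The id survives file moves, reformatting, and comment changes, but
    shifts when the code itself changes, which is what lets a pin be
    tracked across evolutionary branches.
    """
    node = resolve_pin(source, symbol)
    normalized = ast.dump(node, annotate_fields=False)
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]
```

Tree-sitter queries in the `ast_query` field would generalize the same idea beyond Python.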
2. Evals
The second pillar is scalable evaluation. Codeopt runs generate thousands of candidate solutions that must be tested, benchmarked, and compared. The evaluation layer needs to be both rigorous and flexible.
An evaluation definition becomes another declarative artifact, separate from but linked to specs. It necessarily carries some data generator: either a script that literally generates the data or a dataset itself. Additionally, it carries a set of metrics that can be aggregated downstream if needed, minimal acceptance criteria, and possibly a maximum compute budget. It could look something like this:
```yaml
version: 0.1
name: walltime
inputs:
  generator: "./bench/gen_inputs.py --sizes 1024,1024,1024 --reps 10"
metrics:
  - id: latency_ms
    kind: timer
    command: "./bench/bench_my_kernel --reps 100"
    parse: "regex:(?<=p50_ms=)([0-9.]+)"
  - id: correctness
    kind: checker
    command: "./bench/check_correctness --tolerance 1e-5"
    parse: "exit_code==0"
aggregate:
  objective: "min(latency_ms) subject_to correctness==true"
  tie_breakers: ["mean(latency_ms)"]
accept:
  rule: "latency_ms <= 0.95 * baseline.latency_ms and correctness==true"
budgets:
  candidate_timeout_s: 120
  total_wall_clock_h: 2
```
This structure has some desired properties:
Multi-metric optimization: While we can judge codeopt runs through aggregated metrics, the LLM agents are more than capable of considering sub-metrics as well, which act as a form of regularization and guard against overfitting to a particular fitness function. This is a crucial aspect highlighted in systems like AlphaEvolve.
Minimum acceptance criteria: These can further guide desired properties of the resulting code, as final validations before even attempting to publish a codeopt run result.
Budgets: Similar to minimum acceptance criteria, but referring to computational limits, such as whether a candidate should time out after some predetermined amount of compute.
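A minimal runner for such an eval definition might shell out to each metric command, parse the result, and apply the acceptance rule. A sketch under heavy assumptions (the commands are the hypothetical ones from the example, and only the two parse forms shown there are handled):

```python
import re
import subprocess

def run_metric(metric: dict, timeout_s: float):
    """Run one metric command and parse its value per the eval definition."""
    proc = subprocess.run(
        metric["command"], shell=True, capture_output=True,
        text=True, timeout=timeout_s,
    )
    parse = metric["parse"]
    if parse == "exit_code==0":          # checker-style metric
        return proc.returncode == 0
    if parse.startswith("regex:"):       # timer-style metric
        match = re.search(parse[len("regex:"):], proc.stdout)
        if match is None:
            raise ValueError(f"metric {metric['id']}: pattern not found")
        return float(match.group(0))
    raise ValueError(f"unsupported parse rule: {parse}")

def evaluate_candidate(metrics: list, baseline_latency_ms: float,
                       candidate_timeout_s: float = 120.0) -> dict:
    """Collect all metrics, then apply the accept rule from the example."""
    values = {m["id"]: run_metric(m, candidate_timeout_s) for m in metrics}
    values["accepted"] = (
        values["correctness"] is True
        and values["latency_ms"] <= 0.95 * baseline_latency_ms
    )
    return values
```

The per-candidate timeout is where the budgets block bites: a candidate that hangs is killed and simply never becomes a result.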
Evals are non-prescriptive about where they have to be run (other than perhaps system requirements of the codebase itself), but it is reasonable to expect they will be run massively in parallel on some on-demand compute backend like a Ray cluster or a compute/cloud vendor like modal/morph/prime intellect/sf compute.
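The fan-out itself is embarrassingly parallel. Locally it is just a worker pool, and the same map/collect shape transfers to remote backends. A toy sketch with a stand-in scoring function in place of the real eval harness:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def score_candidate(candidate: str):
    """Stand-in for shipping one candidate patch to the eval harness."""
    return candidate, float(len(candidate))  # pretend score

def evaluate_in_parallel(candidates, workers=8):
    """Fan candidate evaluations out over a worker pool and collect scores."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(score_candidate, c) for c in candidates]
        for fut in as_completed(futures):
            cand, value = fut.result()
            results[cand] = value
    return results
```

Swapping the thread pool for Ray tasks or serverless function calls changes the backend, not the shape of the loop.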
3. Optimization algorithms
The third pillar is the layer that tackles optimizing the aforementioned evals. Optimization algorithms for generating code that maximizes some evaluation criteria generally come in two flavors: those that directly finetune an LLM using reward signals and those that leverage multi-agent orchestration with an outer optimization loop.
There has been an overwhelming amount of attention given to the first approach thanks to successes like DeepSeek R1 and reasoning models in general. However, these methods are likely overkill for all but the most demanding task-specific code generation, since they require three heavy infrastructure pieces: an efficient inference engine for RL rollouts, a parallelized setup to update the weights, and logging for careful monitoring of training dynamics.
The second approach, orchestration at scale, has been comparatively much less explored, with genetic algorithms and relatively simple actor-critic loops forming the bulk of the approaches tried. I believe there is much more to explore here, as a whole literature of black-box optimization could in principle be applied to the outer loop with the LLM agent as a core component. It would be interesting to see adaptations of methods like simulated annealing, replica exchange, the cross-entropy method, and the like, especially in instances that demand diverse exploration of vast solution spaces. The main benefit of this approach is that one only needs to scale the inference engine for it to work: no complicated model copies or parallelization infra for tuning weights. Nevertheless, the appetite for such investigations appears to be low, since the current meta is that a generalist model can almost always distill complicated orchestration rollouts, amortizing compute along the way. Still, this seems to be the most accessible, plug-and-play method for deep coding campaigns.
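To make the outer-loop idea concrete, here is a toy simulated-annealing loop with the LLM agent abstracted behind a `propose` callback. Both callbacks are stand-ins: in a real run `propose` would ask a coding agent for an edited candidate and `score` would invoke the eval harness.

```python
import math
import random

def anneal(initial, propose, score, steps=200, t0=1.0, cooling=0.98, seed=0):
    """Simulated annealing over candidates.

    Worse candidates are sometimes accepted early on (high temperature),
    which keeps exploration diverse; as the temperature cools, the loop
    concentrates on refining the best region found so far.
    """
    rng = random.Random(seed)
    current, current_s = initial, score(initial)
    best, best_s = current, current_s
    t = t0
    for _ in range(steps):
        cand = propose(current)
        cand_s = score(cand)
        delta = cand_s - current_s  # higher score is better
        # Metropolis acceptance: always take improvements, sometimes
        # take regressions with probability exp(delta / t).
        if delta >= 0 or rng.random() < math.exp(delta / max(t, 1e-9)):
            current, current_s = cand, cand_s
            if cand_s > best_s:
                best, best_s = cand, cand_s
        t *= cooling
    return best
```

Replica exchange or the cross-entropy method would keep this same inner LLM-propose/eval-score loop and only change the outer acceptance/selection rule.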
4. Contribution workspaces
The fourth and final pillar is a workspace where the various codeopt runs for a spec live. This could be as simple as a directory structure containing patches/diffs hashed in some manner (alongside the run configuration and the effort/budget used) so they can be uploaded, published, and shared. Ideally, workspaces and codeopt candidates would fulfill the following properties:
Hashed reproducibility: Every candidate is identified by the hash of its patch plus some other metadata based on the run, the spec, and the eval, making results universally addressable and verifiable. A published result can be replayed from the recorded patch, environment, seed, and eval definition to reproduce metrics if needed.
Lineage tracking: Especially useful if candidates used existing solutions in a workspace as context for the next codeopt run.
Credit attribution: In a collaborative setting, contributions are tracked at the patch level. If Miku’s optimization improves on Rin's, which built on Teto’s baseline, the lineage preserves everyone's contribution. This creates natural incentive alignment for sharing intermediate results. (Imagine if you attach bounties to specs!)
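Hashed reproducibility and lineage can be as simple as content-addressing each run record. A sketch (the field names here are my own illustration, not a fixed schema):

```python
import hashlib
import json

def candidate_id(patch: str, spec_name: str, eval_name: str,
                 parents: list, seed: int) -> str:
    """Content-address a codeopt candidate.

    The id is a hash over the patch plus run metadata, so any party can
    verify that a published result refers to exactly this diff, spec,
    eval, and lineage.
    """
    record = {
        "patch": patch,
        "spec": spec_name,
        "eval": eval_name,
        "parents": sorted(parents),  # ids of candidates this one built upon
        "seed": seed,
    }
    blob = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]
```

Because each candidate embeds its parents' ids, attribution chains (Miku on Rin on Teto) fall out of the ids themselves rather than needing a separate ledger.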
With this structure, published runs become starting points for new explorations. A possible consequence is cross-pollination, where a team working on database query optimization can import successful patterns from a team optimizing compiler passes, as long as the structural patterns align. Or even better, such public workspaces could be readily available for next-gen coding models to distill.
Deep coding naturally yields coding commons
I’ve laid out the structure above to maximize the potential for collaboration. That’s an aspect I feel is lacking in current AI coding tools, which is a missed opportunity. We are at one of those rare inflection points where we get to rewire the way we code, so why not take advantage of it and foster community building!
One can imagine, for example, a public registry that tracks not just individual contributions but improvement chains. Organizations could post optimization challenges with bounties. A company struggling with database query performance could publish their spec and eval, letting the global community compete for solutions with a really low barrier to contribution: just fire off some codeopt runs. It is a mechanism for turning coding explicitly into specification markets.
An additional interesting thing about deep coding is that eval metrics and criteria are attached to it and all attempts at solving a problem are recorded. Successful optimizations can be readily analyzed to extract reusable patterns and to learn useful, generalizable things: transformations that improve cache locality, specific refactorings that reduce memory allocation, API usage patterns that minimize latency, code abstractions that improve statistical models, even prompt optimizations and fused kernel strategies for LLM self-improvement. Everything that is code can be improved and learned from along a massive Pareto frontier.
Parting thoughts
Before AI coding assistants/developers, there was really only one way to code: inside a text editor or IDE. There are many ways to code now: interacting directly with the bot in the IDE, tab completion, or delegating entire swaths of projects in sequence or in parallel. I don’t think any particular mode will win; I use them all to some extent in my day-to-day work.
And I believe we’re not done exploring assisted coding settings yet. Deep coding is an additional modality of how things could further evolve. I’m giving the implementation of the above framework a go here — still a work in progress that I hope to update on regularly!


