Stateful agents for self-improvement
What are the primitives of self-improvement?
This blog started with a post about deep targeted agentic coding, a modality of agentic coding where one has some objective to optimize (e.g., simulation error, latency, or accuracy in an ML pipeline) and can therefore just “let the models rip” massively in parallel, in a sort of self-improvement loop as the LLMs build on their own work. While methods in this category like alphaevolve and GEPA have been used for algorithmic and proof discovery as well as pipeline optimization in the sciences and engineering, they have been nowhere near as popular as their more interactive counterparts like claude code/codex.
I feel that this is now about to change. There has been an increasing amount of buzz on this front: from GEPA’s announcement of optimize anything, to Imbue AI’s Darwinian evolver, to Karpathy’s autoresearch, to Google Deepmind’s simply. Given openclaw’s success, there is also wider recognition and normalization that autonomous agents with little human supervision can now tackle actual workloads of many hours, and there are already frameworks that attempt to “industrialize” this further (see e.g. gastown).
The self-improvement design space
The details of how to implement self-improvement loops are still as murky as ever. What is the overall best strategy for coordination? Is a single long-running agent enough? Do we need parallel calls? How does state come into the picture? Even with the exciting developments above, I feel that there needs to be a more systematic exploration of the techniques involved here. For example, GEPA and the Darwinian evolver only do LLM sampling (there is no agentic state), while Karpathy’s autoresearch is a single stateful agent for a very specific task and (currently) has no notion of parallelizing avenues of research. This last point is crucial, and in my experience is a necessary element for these systems to pull off “move 37”-like discoveries instead of getting stuck in local improvements. Recently, for example, I described my failed drifting language modeling project: the main reason it failed is that it got stuck in local improvements like tweaking schedulers and optimization variables without making wholesale changes or starting from scratch, which is very much a needed strategy in research.
Fanout: Primitives for Self-Improvement
I therefore set out to create a library/framework that includes several self-improvement strategies and primitives: from alphaevolve’s tournament selection, to RSA (sans aggregator finetuning), to darwinian selection, along with several types of parallelization strategies, ways of evaluating solutions, LLM memory, and collaboration channels. This is the fanout library, and it’s explicitly designed for agentic use: just give your favorite coding agent the skill and you’re good to go.
Fanout primitives are roughly divided into three categories: strategies for sampling, strategies for feedback, and strategies for selection. Sampling, for example, can take the form of vanilla LLM sampling or of launching a full stateful agent, using either a single model or an ensemble of models. Feedback happens when solutions are evaluated against some criteria and the results are stored for future communication (including via techniques like reflection), while selection strategies then pick the pool of solutions and feedback that seeds the next round of self-improvement.
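To make these categories concrete, here’s a minimal sketch of how they compose into a loop. This is a hypothetical shape, not fanout’s actual API: `llm_sample`, `evaluate`, and `select` stand in for whichever sampling, feedback, and selection strategies you pick.

```python
from concurrent.futures import ThreadPoolExecutor

def improve(task, llm_sample, evaluate, select, pool, rounds=10, width=8):
    """Run `rounds` of parallel sampling, scoring, and selection."""
    for _ in range(rounds):
        # Sampling: draw `width` candidates in parallel, each call
        # conditioned on the current pool of (solution, feedback) pairs.
        with ThreadPoolExecutor(max_workers=width) as ex:
            candidates = list(ex.map(lambda _: llm_sample(task, pool), range(width)))
        # Feedback: score each candidate against the task's criteria.
        scored = [(c, evaluate(task, c)) for c in candidates]
        # Selection: pick the pool that seeds the next round (top-k, etc).
        pool = select(pool + scored)
    return max(pool, key=lambda sf: sf[1])
```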
Using these primitives you (or your coding agent) can, for example, build a workflow that samples models in parallel, evaluates the solutions, and selects among them for the next round, as in the sketch above. This is your standard alphaevolve/GEPA workflow. More interestingly, however, you can launch various agent streams in parallel and give them tools to read and write from a shared solution pool channel; importantly, reading from the channel applies feedback and selection strategies:
Here, each agent gets its own thread of thought, remembering what it tried and why it failed or succeeded, but can additionally tap into the shared channel of solutions via a selection strategy that optimizes for diversity and/or quality. The agents can also choose to terminate themselves, visiting a finished state where they report a final solution. This results in a pool of agents that is really cool to watch working together:
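In code, the shape is roughly the following (again a hypothetical sketch rather than fanout’s interface): each agent keeps a private history, and reads from the channel go through a selection strategy rather than dumping the raw pool.

```python
import threading

class SolutionChannel:
    """Shared pool of (solution, feedback) pairs with strategy-mediated reads."""

    def __init__(self, select):
        self._lock = threading.Lock()
        self._pool = []
        self._select = select  # selection strategy (diversity and/or quality)

    def write(self, solution, feedback):
        with self._lock:
            self._pool.append((solution, feedback))

    def read(self, k=5):
        # Reads apply the selection strategy, not a raw dump of the pool.
        with self._lock:
            return self._select(list(self._pool))[:k]

def agent_loop(agent_step, task, evaluate, channel, max_steps=20):
    history = []  # the agent's private thread of thought
    for _ in range(max_steps):
        inspiration = channel.read()
        action = agent_step(task, history, inspiration)
        if action["kind"] == "finish":  # agents may terminate themselves
            return action["solution"]
        feedback = evaluate(task, action["solution"])
        channel.write(action["solution"], feedback)
        history.append((action, feedback))
```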
There are many, many things to try in fanout to explore the self-improvement space: fanout’s selection strategies include alphaevolve’s elitist + tournament selection, GEPA’s pareto and epsilon-greedy baseline, classic evolutionary strategy techniques like island evolution (used in openevolve), and eval-free methods like RSA.
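To give a flavor of these strategies, here’s a minimal elitist + tournament selection in the spirit of alphaevolve’s (my own condensed reading of it, not fanout’s exact implementation):

```python
import random

def elitist_tournament(pool, k=10, elites=2, tournament_size=3):
    """pool: list of (solution, score). Keep top `elites`, fill via tournaments."""
    ranked = sorted(pool, key=lambda sf: sf[1], reverse=True)
    survivors = ranked[:elites]
    while len(survivors) < min(k, len(pool)):
        contenders = random.sample(pool, min(tournament_size, len(pool)))
        survivors.append(max(contenders, key=lambda sf: sf[1]))
    return survivors
```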
Lessons so far
As a first step in testing out the approach above, I’ve been running it on a few benchmarks: one established, two that I cooked up myself, and one inspired by a recent paper, namely:
CodeEvolve: This is a benchmark of tasks that are a subset of the algorithmic discovery tasks from alphaevolve’s original paper, attempting to sample a mix of difficulties. They include the circle packing problem (optimize the area of a set of circles that need to be packed in a square; relatively easy), the Heilbronn problem for triangles (finding n=11 points on or inside a triangle with unit area so that the area of the smallest triangle formed by these points is maximized; medium difficulty), what they call the first autocorrelation inequality (finding a sequence a of non-negative reals that minimizes an autocorrelation constant; relatively hard), and the kissing number problem (finding the maximum number of integer-coordinate points in 11D space such that their max norm is less than or equal to their minimum pairwise distance; definitely hard). A scorer sketch for the circle packing task follows this list.
PDE-solvers: Inspired by a recent paper, CodePDE, which asks whether LLMs can code up effective solvers for partial differential equations, I challenge the agents to come up with a fast solver under a timeout budget for three main equations: 1D Burgers (easy), 2D Navier-Stokes (medium), and 1D Kuramoto-Sivashinsky (a very chaotic PDE; hard), given a (slow) reference implementation for each.
Molopt: A simple benchmark that asks for molecular SMILES that maximize drug-like characteristics. Notably, the agents are told to only manipulate SMILES text; any use of other libraries automatically receives a failing score.
MNIST-weights: A simple benchmark for optimizing the weights of a neural network (with a semi-exotic architecture) for solving MNIST, but the agents are not allowed to use any optimization algorithm. Rather, they must only work with raw weights.
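For concreteness, here is what a scorer for the circle packing task might look like under my reading of the objective (total area of non-overlapping circles inside the unit square); the benchmark’s actual scorer may differ:

```python
import numpy as np

def circle_packing_score(xs, ys, rs, tol=1e-9):
    """Total area of valid circles (centers xs, ys; radii rs) in the unit square."""
    xs, ys, rs = map(np.asarray, (xs, ys, rs))
    if np.any(rs < 0):
        return 0.0  # radii must be non-negative
    # Containment: every circle must fit inside the unit square.
    if np.any(xs - rs < -tol) or np.any(xs + rs > 1 + tol):
        return 0.0
    if np.any(ys - rs < -tol) or np.any(ys + rs > 1 + tol):
        return 0.0
    # Non-overlap: pairwise center distances must be at least the radius sums.
    dx = xs[:, None] - xs[None, :]
    dy = ys[:, None] - ys[None, :]
    dist = np.hypot(dx, dy)
    rsum = rs[:, None] + rs[None, :]
    i, j = np.triu_indices(len(rs), k=1)
    if np.any(dist[i, j] + tol < rsum[i, j]):
        return 0.0
    return float(np.sum(np.pi * rs**2))
```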
For all the runs below I used 10 agents with a maximum of 20 steps each, and k=10 for top-k, RSA, and similar selection strategies. I used two model sets: one diverse set that included recent heavy hitters like GPT5.3/Gemini 3.1 Pro/Opus 4.6 as well as some smaller ones, and one small set for runs with…a cheaper budget (see model sets here).
Stateful agents are surprisingly effective
For the classic one-shot LLM calls used by alphaevolve/shinka evolve/etc, you typically end up needing hundreds to thousands of samples to properly hill-climb CodeEvolve tasks like circle packing (though GEPA has reduced this significantly). This is something that I also observed in fanout, and it was even worse when I included additional feedback strategies like reflection.
But stateful agents were quite efficient at this, getting to reasonable solutions by around the tenth solution the swarm submitted, and getting fairly close even on the tough kissing number task. Here are traces comparing the alphaevolve and darwinian strategies on the CodeEvolve tasks:
What’s interesting about most of these solutions is that the models set up optimization strategies themselves to solve them rather than trying to tackle the problems directly. Here’s a summary of their approaches:
Circle Packing: Alphaevolve used basin hopping with L-BFGS-B/SLSQP polishing, with the heavy optimization taking 10-30 seconds. The darwinian solution used an SLSQP approach with explicit constraint Jacobians, starting from good priors in contrast with the random starts of alphaevolve’s solution. (A minimal sketch of this kind of pipeline follows this list.)
First Autocorrelation: Alphaevolve used L-BFGS-B with multiple sequence lengths, pretty simple but somehow effective. Darwinian used an even simpler multi-start SLSQP, resulting in lower-quality solutions.
Heilbronn Triangle: Alphaevolve used straight-up Adam gradient descent over 10k configs with a softmin approximation, followed by L-BFGS-B and SLSQP polishing. Some of its solutions used simulated annealing, which was surprising. Darwinian had a similar Adam-based approach but with top-k selection from multiple runs; one solution was just L-BFGS-B.
Kissing Number: Alphaevolve used perhaps the most complex of the solutions I checked, a combinatorial approach that precomputes a 3-(11,4,1) block design via bitmasks and then uses randomized greedy packing to find maximum independent sets (wow!). Darwinian used a more straightforward graph approach, constructing a conflict graph and running a MIS solver; one of its solutions used networkx for some simple graph operations.
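To make the circle-packing recipe above concrete, here is a minimal basin hopping + SLSQP sketch (with soft penalties instead of the explicit constraint Jacobians the agents used, for brevity; illustrative, not the agents’ actual code):

```python
import numpy as np
from scipy.optimize import basinhopping

def pack(n=26, seed=0):
    """Maximize total circle area in the unit square via basin hopping."""
    rng = np.random.default_rng(seed)

    def penalty(params):
        xs, ys, rs = params[:n], params[n:2 * n], params[2 * n:]
        # Box containment, non-negative radii, and pairwise non-overlap.
        p = np.sum(np.maximum(0, rs - xs) + np.maximum(0, xs + rs - 1))
        p += np.sum(np.maximum(0, rs - ys) + np.maximum(0, ys + rs - 1))
        p += np.sum(np.maximum(0, -rs))
        dx, dy = xs[:, None] - xs[None, :], ys[:, None] - ys[None, :]
        overlap = rs[:, None] + rs[None, :] - np.hypot(dx, dy)
        i, j = np.triu_indices(n, k=1)
        return 1e3 * (p + np.sum(np.maximum(0, overlap[i, j])))

    def neg_area(params):
        rs = params[2 * n:]
        return -np.sum(np.pi * rs**2) + penalty(params)

    x0 = np.concatenate([rng.random(2 * n), np.full(n, 0.05)])
    res = basinhopping(neg_area, x0, niter=50,
                       minimizer_kwargs={"method": "SLSQP"})
    return res.x, -res.fun
```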
I was surprised to see that alphaevolve’s solutions somehow tended to be very detailed optimization pipelines, while the Darwinian approaches were more diverse but generally less effective.
Shared memory is hard to get right
After seeing the success of knowledge sharing across agents, I wondered if having them share their learnings/thoughts/hypotheses more broadly, in a shared memory bank from which each agent can grab inspiration, would improve the effectiveness of the agents. I thus gave the agents the tools to read and write from a shared memory bank. This is what it looks like when you turn on that option:
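Conceptually the memory tools are simple; something like the following hypothetical sketch (not fanout’s actual interface):

```python
class MemoryBank:
    """Free-form learnings/hypotheses shared across all agents."""

    def __init__(self):
        self.entries = []

    def write(self, agent_id, note):
        self.entries.append({"agent": agent_id, "note": note})

    def read(self, k=5):
        # Variants I tried on top of this: summarizing entries with a
        # strong model, or filtering to entries from successful agents.
        return self.entries[-k:]
```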
Unsurprisingly, the shared memory operations consume more tokens and more agentic steps. I hoped that this cost would be justified by improved solutions, but this was not the case: solutions generally got worse after reading from memory. I tried many things to make this work, including collating memories into summaries via a competent model and selecting only memories from agents that had been successful so far, but that didn’t move the needle. I suspect that such memories insert either redundant information or too much noise into the context, and are not that actionable.
Still, I believe there’s something here to be explored. Maybe a hierarchical memory of sorts? Maybe only read memories when the agent is stuck in a rut?
Selection strategies matter, but only in tough problems
I also did a modest run just with the small model set (otherwise the whole thing would be a bit too expensive for a side project) on all of the CodeEvolve tasks under several strategies. These are the results:
For easy problems like circle packing and first autocorrelation, what strategy you pick doesn’t really matter: the agents will get there eventually in a reasonable number of steps. This changes with harder problems, though, where especially at later stages you need to diversify solutions to avoid getting stuck. It was interesting to see alphaevolve’s diversity-cross-elites strategy dominate here, and I believe this is the best strategy going forward.
Beyond code: optimizing neural weights and molecules directly
Optimizing code is nice and all, but as the GEPA folks have pointed out, LLM-guided self-improvement is a universal optimizer that really can optimize anything that can be encoded into text. I wanted to test just how true this is in two scenarios: (1) can the models generate molecules just by reasoning over their text representations (e.g. SMILES)? and (2) can the models reason over neural weights directly?
In both of these benchmarks I used the darwinian selection strategy, because CodeEvolve was still running and I was sure that the darwinian strategy was going to dominate it (it sadly did not).
Finding drug-like molecules
For the first question, I designed a simple benchmark called molopt, which requires evolving 100 small-molecule SMILES strings that are diverse enough (Tanimoto similarity >= 0.6 is penalized progressively) for the following tasks:
Maximize QED: Maximize the median drug-likeness score (QED); should be simple enough.
QED LogP balance: Maximize a median combined score that balances QED with LogP (lipophilicity).
Constrained generation: Maximize a median combined score of 6 molecular properties.
Drug candidate: Same as above, but explicitly with Lipinski + QED + rotatable bonds + TPSA.
In all tasks, I banned the use of chemistry libraries like RDKit (to avoid trivially and programmatically enumerating valid molecules) and forced the solutions to run in one second, ensuring that the models did not reward-hack their way out of the problem (and believe me, they tried; more on this later).
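For reference, the general shape of such a scorer looks like the sketch below. The penalty schedule here is my own stand-in, not molopt’s exact one, and RDKit appears only on the evaluation side; it’s the agents who are banned from chemistry libraries.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, QED

def score_pool(smiles_list):
    """Median QED minus a progressive penalty for similar pairs."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    if any(m is None for m in mols):
        return 0.0  # any invalid SMILES fails outright
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048) for m in mols]
    qed = float(np.median([QED.qed(m) for m in mols]))
    # Progressive diversity penalty above the 0.6 Tanimoto threshold.
    penalty = 0.0
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            sim = DataStructs.TanimotoSimilarity(fps[i], fps[j])
            if sim >= 0.6:
                penalty += (sim - 0.6) / 0.4  # grows with similarity
    return max(0.0, qed - penalty / len(fps))
```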
I did a couple of runs here, one with the small model set and one with the full diverse model set:

Interestingly, the small models are generally not very good at handling raw SMILES and give up very early, declaring a finished state. This is a consistent behavior I saw across this and other benchmarks: smaller models just tend to cut runs short. But the diverse model set pretty much nailed it.
As for the solutions, some of them were simply lists of SMILES, but others got creative by taking submolecules and combining them in inventive ways. I suspect that the new models (especially Gemini 3.1) have incorporated SMILES somewhere in their training, because they generally seemed very adept at these tasks, even if they were simple ones.
Fitting MNIST in one shot
For another benchmark that goes beyond code, I came up with mnist-weights, asking the models to just straight up give me the weights for a neural network that solves good old MNIST, no training allowed. Of course, because the actual weights of MNIST solutions are very likely in their training set (and yeah, I checked, they literally return the weights as-is when asked), I asked the models to give me the weights of a generally odd neural net with a simple 64 → 13 dim architecture gated by ELUs, reasoning that they would have a harder time adapting memorized weights to this and so would have to get creative.
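My reading of that architecture as a forward pass (the input/output dims and exact gating here are my assumptions beyond the 64 → 13 description):

```python
import numpy as np

def elu(x, alpha=1.0):
    # expm1 avoids overflow in the unused positive branch of where().
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0)))

def forward(x, W1, b1, W2, b2, W3, b3):
    """x: (batch, 784) flattened MNIST pixels in [0, 1]."""
    h1 = elu(x @ W1 + b1)   # (batch, 64)
    h2 = elu(h1 @ W2 + b2)  # (batch, 13)
    return h2 @ W3 + b3     # (batch, 10) class logits
```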
As with the molopt benchmark above, I banned the use of sklearn/pytorch/etc and set a strict time limit on the solution evals to avoid cheating…and boy, let me tell you, did the agents try to cheat. At one point, the agents were doing weird string concats of banned library names to avoid getting caught by ‘sklearn.+’ regexes and running hyper-fast Adam loops to nail the weights:
After several rounds of anti-cheat measures, fanout finally nailed the task at hand.
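For illustration, an AST-based import scan is the kind of measure that string concatenation can’t dodge; a minimal sketch (not necessarily the exact check used in these runs):

```python
import ast

BANNED = {"sklearn", "torch", "jax", "tensorflow"}

def uses_banned_imports(source: str) -> bool:
    """Walk the AST: catches imports and dynamic-import calls alike."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[0] in BANNED for a in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BANNED:
                return True
        elif isinstance(node, ast.Call):
            # Catches __import__("skl" + "earn") / importlib.import_module(...)
            name = getattr(node.func, "id", "") or getattr(node.func, "attr", "")
            if name in {"__import__", "import_module"}:
                return True
    return False
```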
There were two types of solutions here: either straight-up handcrafted weights with some normalization to fit the ELU gates, or pre-computing the 10 digit class centroids (mean pixel values per digit), encoding them as columns, and then hijacking the net to do nearest-centroid classification. I found this approach rather clever! This benchmark was perhaps the most surprising to me in terms of model behavior; I honestly did not expect the models to cheat so aggressively and come up with templating solutions like they did.
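The centroid trick works because a linear layer can compute shifted negative squared distances: with W[:, c] = 2·centroid_c and b[c] = −‖centroid_c‖², the logit for class c is 2c·x − ‖c‖² = ‖x‖² − ‖x − c‖² minus a term shared across classes, so the argmax picks the nearest centroid. A sketch of the plain linear-layer version (the agents’ solutions routed this through the ELU-gated net, which takes more care):

```python
import numpy as np

def centroid_weights(train_x, train_y, num_classes=10):
    """train_x: (N, 784) pixels; train_y: (N,) labels."""
    centroids = np.stack([train_x[train_y == c].mean(axis=0)
                          for c in range(num_classes)])   # (10, 784)
    W = 2.0 * centroids.T                                 # (784, 10)
    b = -np.sum(centroids**2, axis=1)                     # (10,)
    return W, b  # logits = x @ W + b; argmax == nearest centroid
```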
Models are really adept at PDEs
The last benchmark I want to discuss is pde-solvers, which as I mentioned was inspired by CodePDE. Crafting bespoke solvers for partial differential equations is generally a time-consuming task that presents several optimization challenges, and I was curious how fanout would handle them. To set this up, I simulated ground truth traces for three different systems: 1D Burgers (BS), 2D Navier-Stokes (NS), and 1D Kuramoto-Sivashinsky (KS), using very expensive spectral-method solvers and then downsampling. The agents were then tasked with coming up with efficient solvers that approximated the ground truth with much faster methods under a tight compute budget.
What’s interesting about this benchmark is just how difficult it was to make the benchmark, well, difficult. My runtime budgets at first were 10s for BS and NS, and 30s for KS to prevent brute-force upsampling…I had to constantly push these budgets down as the agents found clever ways to make the compute faster and faster, and it wasn’t until the budget went all the way down to less than 1s that the agents really had to get creative. In the end, though, they got there for BS and NS, but could not nail the very chaotic KS:
Interestingly, the solution for NS was itself a very optimized spectral method, just made super fast to fit the budget and the resolution of the benchmark. It actually worked great!
BS was more efficiently solved with a simple finite-difference pipeline, which is a well-known method if a bit boring.
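Not the agents’ actual code, but a minimal sketch of the finite-difference approach for 1D Burgers, u_t + u·u_x = ν·u_xx, on a periodic domain with explicit time stepping (all parameters here are illustrative):

```python
import numpy as np

def burgers_fd(u0, nu=0.01, dx=1 / 256, dt=1e-4, steps=1000):
    """Explicit FD step: upwind advection + central-difference diffusion."""
    u = u0.copy()
    for _ in range(steps):
        up = np.roll(u, -1)  # u[i+1] (periodic)
        um = np.roll(u, 1)   # u[i-1]
        # Upwind differencing keeps the nonlinear advection term stable.
        dudx = np.where(u > 0, (u - um) / dx, (up - u) / dx)
        d2udx2 = (up - 2 * u + um) / dx**2
        u = u + dt * (-u * dudx + nu * d2udx2)
    return u
```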
So there you have it: the models clearly know their PDEs and can come up with clever solutions to make the solvers fast…faster than I was expecting.
Self improvement is here
One of the lessons to take away from this post is that stateful agentic threads with a communication channel are a very effective avenue to self-improvement. But really the main lesson is that, generally, self-improvement is here. Self-improvement frameworks and primitives work, and although the above applications are still research-y in flavor, I have no doubt that they will diffuse well beyond, into standard industry practice: ML-ops, systems design, etc. The “industrialization” of software is something that we will have to figure out how to fit across fields to make the most of it. Because of how deep agentic coding works, it is clear that code will become much more ephemeral than it is now: a constant stream of evolving and self-improving code shaped by various signals of what we want it to do.
What’s missing?
This is a brave new world and I think there are many things we have to figure out along the way. Two main challenges come to mind. First, the current paradigm presented in fanout above and in tools like GEPA is still constrained mostly to a single file/object to evolve. How do we go from here to evolving an entire codebase? How do we track diverging codebases, for example? What is the oracle object that feeds back the metrics we want to improve for a full codebase? What is its geometry?
Second, deeply exploring spaces with agentic loops is an expensive endeavor (my wallet hurts right now thanks to openrouter, and I didn’t even spend all that much compared to what other people spend monthly), and even more so if the evals that we want to measure the agentic solutions against are expensive to compute and need long runtimes. This is clearly a resource allocation issue that could be tackled via collaborative computing. Currently, we have GitHub, which solved brainpower constraints by allocating resources via collaboration dynamics…what is the analogous collaborative computing paradigm where we share agentic effort, backend compute, and human verification? Can we do SETI/Folding@home but for recursive self-improvement?
I feel that answers to many of these questions will come in the following months, likely much faster than many of us anticipate.