<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lars de Ridder</title>
    <description>The latest articles on DEV Community by Lars de Ridder (@larsderidder).</description>
    <link>https://dev.to/larsderidder</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3776294%2F2640ddf4-ca3c-4fe0-99ea-5aa7c69a3ca9.png</url>
      <title>DEV Community: Lars de Ridder</title>
      <link>https://dev.to/larsderidder</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/larsderidder"/>
    <language>en</language>
    <item>
      <title>The Missing Memory Type</title>
      <dc:creator>Lars de Ridder</dc:creator>
      <pubDate>Thu, 12 Mar 2026 11:08:45 +0000</pubDate>
      <link>https://dev.to/larsderidder/the-missing-memory-type-446l</link>
      <guid>https://dev.to/larsderidder/the-missing-memory-type-446l</guid>
      <description>&lt;p&gt;LangChain has three memory types: semantic, episodic, procedural. Mem0 raised $24M to build a "memory layer." Letta built a company around persistent agent state. Between them they've covered remembering what happened pretty thoroughly, and I keep wondering why none of them noticed that there's actually a second half.&lt;/p&gt;

&lt;p&gt;Psychologists call it prospective memory: the ability to remember to do things in the future. Take your medication at 8pm, call the dentist when the office opens, bring up the budget thing if someone mentions Q3 numbers. Einstein and McDaniel published the foundational research in 1990, and by now there's an entire subfield studying how it works and why it's cognitively distinct from remembering what already happened.&lt;/p&gt;

&lt;p&gt;Every agent memory framework I've looked at implements retrospective memory, but prospective memory is curiously absent. I think that's the gap that explains why agents still feel like tools you operate rather than assistants that actually assist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two systems, not one
&lt;/h2&gt;

&lt;p&gt;Prospective memory is an entirely different cognitive system. It boils down to two subtypes that map almost perfectly to the agent problem. Time-based prospective memory fires at a specific moment: "take the pill at 8pm." Event-based prospective memory fires when you encounter a cue: "when I see my colleague, ask about the report."&lt;/p&gt;
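&lt;p&gt;The distinction is concrete enough to sketch in code. Here's a minimal illustration (all names are hypothetical, not taken from any framework): a single intention record where either a timestamp or a cue decides when it fires.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Intention:
    """A stored 'remember to do X' entry (hypothetical schema)."""
    action: str
    due_at: Optional[datetime] = None  # time-based: fire at this moment
    cue: Optional[str] = None          # event-based: fire when this cue appears

def is_due(intention, now, observed_cues):
    """Time-based entries compare against the clock; event-based ones against cues."""
    if intention.due_at is not None:
        return now >= intention.due_at
    return intention.cue in observed_cues
```

The point of the sketch is only that the two subtypes need different activation checks: one consults a clock, the other consults whatever just entered the environment.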

&lt;p&gt;The frameworks that come closest to supporting proactive behavior do it through scheduling APIs. Letta lets you schedule messages with timestamps or cron expressions. This covers time-based prospective memory and nothing else, which is the equivalent of having amnesia but at least you wear a watch.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's actually missing
&lt;/h2&gt;

&lt;p&gt;The second type is the interesting one; it activates when the right context shows up.&lt;/p&gt;

&lt;p&gt;When I cataloged the things I actually wanted a proactive assistant to handle, most of them weren't timer-based. I already have a phone with a decent reminder app; why would I need a more verbose interface for that? My memory fails me on the fuzzy, less defined things, but because I have nothing else, I force those into the reminder mold too, leaving me with a huge list to maintain and no record of why any entry exists.&lt;/p&gt;

&lt;p&gt;When I mention "My ID card expires in March" while planning a trip to Japan in April, a good assistant would connect this to the upcoming travel and flag it before it's too late. You shouldn't have to spell out "and please remind me to renew it in January accounting for 5-10 business days processing time"; the whole point of an assistant is that it makes those connections for you.&lt;/p&gt;

&lt;p&gt;"When the finance numbers come in, I need to start the report." The trigger is semantic: new information entering the system that matches a stored intention. Nobody knows when finance will send the numbers, so there's no date to set a cron for.&lt;/p&gt;

&lt;h2&gt;
  
  
  The benchmark
&lt;/h2&gt;

&lt;p&gt;I tracked what I actually needed a proactive assistant to remember over the course of a few weeks, and turned those into 22 test scenarios across 10 categories: simple timed reminders, cancelled events that should produce silence, fuzzy "sometime next month" intentions, and conditional triggers that fire when new information arrives. 9 of the 22 had an explicit timer attached and the other 13 didn't, which is the whole point.&lt;/p&gt;

&lt;p&gt;I compared three methods. Simple cron is a plain scheduling system. Smart cron is essentially the same, except the agent waking up gets as much information as I could give it: full conversation history, full memory entries, and reasoning instructions. The prospective memory method uses projections: structured entries with an activation condition, stored context, status, and triggers, evaluated on a daily review or whenever new information arrives.&lt;/p&gt;

&lt;p&gt;Sonnet 4.6 generated all responses; Opus 4.6 and Gemini 3.1 Pro judged them on usefulness, context richness, appropriateness, and coherence, scoring each criterion from 1 to 5.&lt;/p&gt;

&lt;p&gt;On the full 22 the gap is wide: prospective entries 4.80, smart cron 2.59, cron 1.63. That makes sense, because the crons can't fire on 13 of the scenarios. But even on the 9 scenarios where all methods could fire, prospective entries scored 4.88 versus smart cron's 4.16. A blind human evaluation on 10 scenarios corroborated the rankings: prospective memory 4.85, smart cron 3.33, cron 2.38.&lt;/p&gt;

&lt;p&gt;Interestingly, my initial smart cron implementation was simpler, just reminder text and some reasoning instructions, and it scored 2.74. The full treatment, with conversation history, memory entries, and reasoning instructions, scored 2.59. Lower. It reinforces that more context without the right structure just gives models more rope to hang themselves with.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a prospective memory entry looks like
&lt;/h2&gt;

&lt;p&gt;The implementation is pretty small. Instead of storing "user's ID card expires March 2026" as a flat fact, you create a forward-looking entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Summary: ID card expires March 2026, needs renewal before Japan trip
Activate: January 2026 (month resolution)
Context: Planning Japan trip for April. Dutch renewal takes 5-10 business days.
Status: pending
Trigger: none (time-based)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And for the conditional one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Summary: Schedule internal sync if client meeting is cancelled
Activate: when cancellation detected
Context: Client meeting Thursday 2pm. Team wanted to discuss roadmap anyway.
Status: pending
Trigger: client meeting status changes to cancelled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activation can be a time, a time range, or a semantic condition. What matters is that the context is captured at &lt;em&gt;creation time&lt;/em&gt;, while the agent has the full conversational context. This is also why I called these entries "projections": the agent projects itself into a future state and writes itself a note while it actually knows what's going on. The status then allows the agent to decide that silence is the correct response when an event was cancelled or a task was already completed, and triggers connect entries to future events so that when new information enters the system, matching entries can activate without a timer.&lt;/p&gt;
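&lt;p&gt;To make the mechanics concrete, here's a minimal sketch of how such entries could be evaluated on a daily review or when new information arrives. Field and function names are hypothetical, mirroring the example entries above rather than any actual implementation.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Projection:
    """Forward-looking memory entry (hypothetical field names)."""
    summary: str
    context: str
    status: str = "pending"   # pending / done / cancelled
    activate_month: str = ""  # e.g. "2026-01"; empty if purely trigger-based
    trigger: str = ""         # semantic condition; empty if purely time-based

def due_entries(entries, current_month, new_information=""):
    """Daily review: return pending entries whose month has arrived
    or whose trigger matches the incoming information."""
    hits = []
    for e in entries:
        if e.status != "pending":
            continue  # cancelled or completed entries correctly produce silence
        time_hit = e.activate_month != "" and e.activate_month == current_month
        trigger_hit = e.trigger != "" and e.trigger in new_information
        if time_hit or trigger_hit:
            hits.append(e)
    return hits
```

The status check is what lets the agent stay quiet about a cancelled meeting, and the trigger check is what lets an entry fire without a timer.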

&lt;h2&gt;
  
  
  Why nobody built this
&lt;/h2&gt;

&lt;p&gt;The agent memory field grew out of information retrieval, not cognitive psychology; the people building these systems are solving "given a query, find the most relevant stored information," which is retrospective by definition. And the architectures are reactive: user sends message, agent responds.&lt;/p&gt;

&lt;p&gt;Proactive behavior needs a different trigger mechanism entirely, and the few frameworks that have one bolt it on as a scheduling system rather than treating it as memory. The result is a blind spot shaped like half of human memory. The part that remembers is covered, but the part that reminds is conspicuously missing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the cognitive science predicts
&lt;/h2&gt;

&lt;p&gt;Prospective memory research has decades of findings that map quite cleanly to the agent problem.&lt;/p&gt;

&lt;p&gt;Time-based tasks (do X at 3pm) rely on monitoring: you check the clock repeatedly, and performance degrades when you're busy. Event-based tasks (do X when you see Bob) rely on cue recognition, which is automatic and much cheaper; the right environmental trigger brings the intention to mind without conscious effort.&lt;/p&gt;

&lt;p&gt;That maps directly to implementation choices. A polling-based system that periodically checks "is there anything I should do right now?" is the time-based approach: expensive and easy to miss things between polls. A trigger-based system where new information entering the memory store activates matching intentions is the event-based approach: cheaper, more reliable, activated by the cue itself.&lt;/p&gt;
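&lt;p&gt;A minimal sketch of the trigger-based side (names are hypothetical): index intentions by cue, so that incoming information activates matching entries directly instead of being polled for on a timer.&lt;/p&gt;

```python
from collections import defaultdict

class CueIndex:
    """Event-based activation sketch: intentions are stored under their
    cue, so the arrival of new information does the lookup work."""
    def __init__(self):
        self._by_cue = defaultdict(list)

    def store(self, cue, intention):
        """Register 'when you see this cue, do this intention'."""
        self._by_cue[cue].append(intention)

    def on_new_information(self, text):
        """Fire every intention whose cue appears in the incoming text."""
        fired = []
        for cue in list(self._by_cue):
            if cue in text:
                fired.extend(self._by_cue.pop(cue))
        return fired
```

Substring matching stands in for whatever semantic matching (embeddings, an LLM judge) a real system would use; the structural point is that activation is driven by the cue, not by a clock.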

&lt;p&gt;The cognitive science even predicts failure modes. Prospective memory breaks down when the cue is weak, when the person is under high cognitive load, or when there's a long delay between forming the intention and the activation moment. These failure modes apply directly to agent systems and could inform how you design activation thresholds, priority systems, and decay functions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What framework authors could do
&lt;/h2&gt;

&lt;p&gt;The addition is small enough to be a PR: a new memory type alongside semantic, episodic, and procedural, with fields for activation condition, stored context, status, and linked entries, plus a retrieval path that checks forward-looking entries when new information arrives rather than only when the user sends a query.&lt;/p&gt;

&lt;p&gt;The benchmark, scenarios, and implementation are &lt;a href="https://github.com/larsderidder/projection-memory-benchmark" rel="noopener noreferrer"&gt;open source&lt;/a&gt;. I'd genuinely rather see prospective memory adopted by LangChain, Mem0, and Letta than have it only exist in my project. The gap between what cognitive science knows about forward-looking memory and what agent frameworks implement has been sitting there for thirty-five years. Seems like enough time for a 1.0.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Five CLIs Walk Into a Context Window</title>
      <dc:creator>Lars de Ridder</dc:creator>
      <pubDate>Thu, 05 Mar 2026 13:40:23 +0000</pubDate>
      <link>https://dev.to/larsderidder/five-clis-walk-into-a-context-window-1c5a</link>
      <guid>https://dev.to/larsderidder/five-clis-walk-into-a-context-window-1c5a</guid>
      <description>&lt;p&gt;Claude Code sends 62,600 characters of tool definitions to the model on every turn. pi sends 2,200. Aider sends zero. I know this because I intercepted all of their API calls while they were fixing the same bug, with the same model, on the same codebase.&lt;/p&gt;

&lt;p&gt;This started as a follow-up to my earlier research where I compared four different models on a standardized coding task. That experiment showed huge differences in context usage between tools, but it was hard to separate the model's behavior from the wrapper's overhead. If Claude Code uses 70% of its context on tool definitions, is that a Claude problem or a Claude Code problem?&lt;/p&gt;

&lt;p&gt;The only way to find out was to hold the model constant and swap the wrapper. So I ran Sonnet 4.6 through five different CLIs, three times each, and traced every request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fkui6xuw27ukfmrv1zk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fkui6xuw27ukfmrv1zk.png" alt="Five coding CLIs compared in Context Lens: same model, five completely different context profiles" width="800" height="685"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;All five CLIs compared in Context Lens. Same model, same task, five completely different context profiles.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is Part 2 of my context tracing experiments with &lt;a href="https://github.com/larsderidder/context-lens" rel="noopener noreferrer"&gt;Context Lens&lt;/a&gt;. In &lt;a href="https://theredbeard.io/blog/i-intercepted-3177-api-calls-across-4-ai-coding-tools/" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt;, I compared four different models on the same task and found that investigation strategy drove most of the context differences. This time the question is narrower: if we hold the model constant, how much does the wrapper matter?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Quick context if you missed Part 1: there's a planted bug in Express.js where &lt;code&gt;res.send(null)&lt;/code&gt; returns the string &lt;code&gt;"null"&lt;/code&gt; with &lt;code&gt;content-type: application/json&lt;/code&gt; instead of an empty body. Each tool gets the same prompt, the same repo with 6,132 commits, and the same pre-installed dependencies.&lt;/p&gt;

&lt;p&gt;The five wrappers span the full spectrum of how you can talk to a model:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;pi&lt;/strong&gt; is four tools and nothing else. Read, Bash, Edit, Write, described in 2,200 characters total. The smallest toolset I've seen in any serious coding CLI. It's a lean, mean, coding machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aider&lt;/strong&gt; is the purist. No tools at all, taking minimalism to the extreme, like one of those ultralight backpackers. It sends a "repo map" (summaries of every file in the project) and the user's prompt as plain text, and the model responds with the complete updated file. No function calling and no iteration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cline&lt;/strong&gt; has 11 tools and a large system prompt, but does something unusual. Where most wrappers put file contents in &lt;code&gt;tool_result&lt;/code&gt; blocks, Cline puts them in user messages with injected &lt;code&gt;&amp;lt;environment_details&amp;gt;&lt;/code&gt; metadata. It's like Cline built its own protocol on top of the Anthropic API, which means the conversation structure looks fundamentally different from the other wrappers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; hauls along 18 tools at 62,600 characters. This includes TodoWrite (10.5K chars), Bash (12.7K chars), Task (8K chars for launching sub-agents), and a collection of planning, worktree, and notebook tools that get sent on every turn regardless of what you're doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenCode&lt;/strong&gt; has 10 MCP tools at 22,100 characters. Its bash tool alone is 10.7K characters, nearly five times the size of pi's entire tool set.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;# Tools&lt;/th&gt;
&lt;th&gt;Tool Defs&lt;/th&gt;
&lt;th&gt;System Prompt&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1.3K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pi&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2.2K&lt;/td&gt;
&lt;td&gt;4.2K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cline&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;13.5K&lt;/td&gt;
&lt;td&gt;11.1K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;62.6K&lt;/td&gt;
&lt;td&gt;15.6K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenCode&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;22.1K&lt;/td&gt;
&lt;td&gt;1.6K&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The composition
&lt;/h2&gt;

&lt;p&gt;If you open these sessions in Context Lens, the composition bars tell the story before you read a single number. Each wrapper produces a visually distinct pattern (see the horizontal bars in the screenshot above):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Aider&lt;/strong&gt;: a solid block of orange (user text, 96%). Almost nothing else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pi&lt;/strong&gt;: a solid block of green (tool results, 91%). The model reads files and runs commands; that's what fills the context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cline&lt;/strong&gt;: mostly orange (user text, 75%) because tool results are encoded as user messages, with a smaller pink band (tool definitions, 12%).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code&lt;/strong&gt;: dominated by pink (tool definitions, 55%) with green (tool results, 22%). More than half the context is describing tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenCode&lt;/strong&gt;: a large green block (tool results, 71%) with a visible pink band (tool definitions, 24%). The model does a lot of reading.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cline does something architecturally different from everyone else: it encodes tool results as user messages rather than using Anthropic's native tool_result content blocks. So what Context Lens classifies as 75% "user text" is actually file contents and command output stuffed into the user role, with &lt;code&gt;&amp;lt;environment_details&amp;gt;&lt;/code&gt; blocks injected on every turn showing the current time, visible files, and task state.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;All five tools solved the bug, mostly. Aider got it wrong once out of three tries because it can't run tests to verify its fix. Everyone else went 3/3.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Turns (Main)&lt;/th&gt;
&lt;th&gt;Peak Context&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;th&gt;Cache %&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Pass&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;td&gt;8.0 (3.0)&lt;/td&gt;
&lt;td&gt;13,808 ±53&lt;/td&gt;
&lt;td&gt;8,669&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;$0.23&lt;/td&gt;
&lt;td&gt;2/3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pi&lt;/td&gt;
&lt;td&gt;8.0 (8.0)&lt;/td&gt;
&lt;td&gt;39,713 ±10,774&lt;/td&gt;
&lt;td&gt;1,412&lt;/td&gt;
&lt;td&gt;71%&lt;/td&gt;
&lt;td&gt;$0.21&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cline&lt;/td&gt;
&lt;td&gt;12.7 (6.7)&lt;/td&gt;
&lt;td&gt;33,981 ±6&lt;/td&gt;
&lt;td&gt;1,135&lt;/td&gt;
&lt;td&gt;78%&lt;/td&gt;
&lt;td&gt;$0.18&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;21.0 (12.0)&lt;/td&gt;
&lt;td&gt;31,966 ±5,305&lt;/td&gt;
&lt;td&gt;2,737&lt;/td&gt;
&lt;td&gt;92%&lt;/td&gt;
&lt;td&gt;$0.21&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenCode&lt;/td&gt;
&lt;td&gt;17.3 (16.3)&lt;/td&gt;
&lt;td&gt;34,506 ±5,316&lt;/td&gt;
&lt;td&gt;5,421&lt;/td&gt;
&lt;td&gt;81%&lt;/td&gt;
&lt;td&gt;$0.36&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Not all turns are shaped equally. Aider's extra turns are retries and commit message generation. Claude Code's include cheap Haiku sub-agent routing calls. Cline's include environment probes. "Main" counts only the turns where the model is actually working on the task.&lt;/p&gt;

&lt;p&gt;Cline is cheapest, OpenCode is most expensive at nearly 2x, and the rest cluster around $0.21-0.23.&lt;/p&gt;

&lt;h3&gt;
  
  
  The highlights
&lt;/h3&gt;

&lt;p&gt;A couple of things jump out here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aider's context is tiny but its output is enormous.&lt;/strong&gt; 13.8K tokens in, 8.7K tokens out. Because it doesn't use tools, it returns the entire modified file as text in its response, where every other tool uses an Edit or apply_patch operation that only costs a few hundred tokens. Aider also has zero cache hits because it doesn't use Anthropic's prompt caching at all, which means it pays full price on every single turn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code takes the most turns but not the biggest context.&lt;/strong&gt; 21 turns average, but the context peaks at 32K because around 9 of those turns are sub-agent calls. Claude Code uses a two-model architecture: a small Haiku classifier decides which requests need the main model and which can be handled cheaply, so those 9 turns are tiny routing calls with contexts under 2K. The main Sonnet turns are 10-13 per session, which is comparable to the others.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5y77tubfzzai3zssk8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5y77tubfzzai3zssk8x.png" alt="Claude Code overview: tool definitions dominate the composition bar, sub-agent turns visible as small blocks" width="800" height="514"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Claude Code: tool definitions dominate the composition bar, sub-agent turns visible as small blocks&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenCode is the most expensive despite using the same model.&lt;/strong&gt; $0.36 per session versus $0.18-0.23 for everyone else is significant. The culprit is output tokens: 5,421 on average, nearly four times what pi and Cline produce. The model tends to be more verbose when talking through OpenCode, and verbose responses get re-sent as conversation history on every subsequent turn, compounding the cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cline is the cheapest and the most consistent.&lt;/strong&gt; $0.18 per session with essentially zero variance in context size (±6 tokens) is genuinely impressive. I added Cline for no reason other than that its CLI was relatively new, but it's a really cool result. It achieves this because its architecture front-loads the context with stable user messages that cache well, hitting 96.4% cache on the last turn. Probably the most thoughtful context-management approach of any wrapper I've seen yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  The one that can't check its work
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0jmfiecr7209usfltpc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0jmfiecr7209usfltpc.png" alt="Aider overview: 96% user text, 0% cache hits, 8.7K output tokens" width="800" height="514"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Aider: 96% user text, 0% cache hits, 8.7K output tokens&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Aider's 2/3 pass rate is architectural. No tools means no shell access, which means no way to run the test suite. In run 1, the model produced a fix that looked perfectly reasonable: it moved the &lt;code&gt;null&lt;/code&gt; check above the &lt;code&gt;ArrayBuffer.isView&lt;/code&gt; check and added a &lt;code&gt;break&lt;/code&gt;. But the &lt;code&gt;break&lt;/code&gt; skipped setting the body to empty string, so the response came back with &lt;code&gt;Content-Length: 1891&lt;/code&gt; instead of &lt;code&gt;0&lt;/code&gt;, which is one failing test out of 1,248.&lt;/p&gt;

&lt;p&gt;It's not "bad" or "wrong" though. When using Aider, you're probably used to running your own verification, and you'd get Aider to fix it afterwards. It does skew the experiment slightly; the actual context usage would be higher after a reprompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  The variable piece of the pi
&lt;/h3&gt;

&lt;p&gt;I traced pi's large variance to a single decision per run. In pi run 1, the model reads &lt;code&gt;lib/response.js&lt;/code&gt;, spots the bug, fixes it, runs the tests. Six turns, 29K context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nc16kcz798zlavjl27n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nc16kcz798zlavjl27n.png" alt="pi run 1: surgical, 6 turns, 29.4K context" width="800" height="514"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;pi run 1: surgical, 6 turns, 29.4K context&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In run 3, the model decides it wants to check the git history first and runs &lt;code&gt;git log --oneline&lt;/code&gt;. On a repo with 6,132 commits, that returns 52,662 characters of output, all of which gets appended to the context. The model still fixes the bug in 7 turns, but the peak context is 51K because of that one command. If you read Part 1 where I evaluated different models, you'll remember that Gemini used to do the same (but consistently).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9awl9ddcyqwy9m0xrou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9awl9ddcyqwy9m0xrou.png" alt="pi run 3: one git log command inflates the context to 50.9K" width="800" height="514"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;pi run 3: one git log command inflates the context to 50.9K&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Claude Code has lower variance (±5,305) partly because its tool definitions act as ballast. When 62.6K characters of your context is fixed scaffolding that never changes, the model's exploration choices become a smaller proportion of the total. pi starts lean, so one large tool result swings the whole profile.&lt;/p&gt;

&lt;h3&gt;
  
  
  The chatty model problem
&lt;/h3&gt;

&lt;p&gt;OpenCode costs nearly twice what the others do, and the reason is purely output tokens. The model generates 5,421 output tokens per session through OpenCode, versus 1,135 through Cline and 1,412 through pi. And since output tokens cost 5x more than input tokens on Sonnet 4.6, this dominates the bill.&lt;/p&gt;
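&lt;p&gt;A quick way to see why output dominates the bill: weight each output token as five input tokens, per that pricing ratio. This back-of-the-envelope helper is illustrative, not anyone's actual billing code.&lt;/p&gt;

```python
def input_equivalent_tokens(input_tokens, output_tokens, output_ratio=5):
    """Bill-weighted token count: each output token costs output_ratio
    times as much as an input token (5x on Sonnet 4.6 per the article)."""
    return input_tokens + output_tokens * output_ratio

# OpenCode's 5,421 output tokens bill like ~27K input tokens;
# Cline's 1,135 bill like ~5.7K.
```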

&lt;p&gt;I dug into OpenCode's &lt;a href="https://github.com/anomalyco/opencode/blob/dev/packages/opencode/src/session/prompt/anthropic.txt" rel="noopener noreferrer"&gt;Anthropic system prompt&lt;/a&gt;, and it includes multi-paragraph examples that teach the model to narrate every step ("I'm going to use the TodoWrite tool to write the following items... marking the first todo as in_progress... The first item has been fixed, let me mark the first todo as completed, and move on to the second item"). The model dutifully mimics this verbose narration style on every turn. The wrapper literally teaches the model to be chatty, and chatty output tokens cost 5x more than input.&lt;/p&gt;

&lt;p&gt;I have no clue why; maybe it forces consistency across different models, or maybe it was written for less capable models and ported to Anthropic without much thought. I'd love to hear the reason.&lt;/p&gt;

&lt;h3&gt;
  
  
  The quiet MCP problem
&lt;/h3&gt;

&lt;p&gt;When I first started this experiment, Claude Code showed 22 tools instead of 18. I had Tether (a chat bridge I built for agent communication) registered as an MCP server, and its four tools were silently included in every request. They only added 2K characters, which is why I didn't notice until I looked at the raw payloads.&lt;/p&gt;

&lt;p&gt;After removing them and re-running, the numbers barely changed, but the principle matters: MCP servers silently add to your tool overhead, and you might be carrying (and paying for) tools you forgot you installed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The wrap-up
&lt;/h2&gt;

&lt;p&gt;It's fascinating to see how wrappers shape the context. I honestly didn't expect so much variation given the same model. But of course, on a relatively small task like this, everyone wins, participation trophies all around.&lt;/p&gt;

&lt;p&gt;The open question is what happens when context windows get tight. Compaction needs to make harsh choices, and if Claude Code is carrying 62.6K characters (~16K tokens) of tool definitions, it has less space to store info from a long-running session. pi's 2.2K characters of tools would leave an extra ~15K tokens for conversation history and actual &lt;em&gt;context&lt;/em&gt;.&lt;/p&gt;
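&lt;p&gt;The token figures above come from the common rule of thumb of roughly four characters per token for English text; that ratio is an approximation, not a tokenizer.&lt;/p&gt;

```python
def estimate_tokens(chars, chars_per_token=4):
    """Rough chars-to-tokens estimate (assumes ~4 chars/token for English)."""
    return round(chars / chars_per_token)

# 62,600 chars of tool definitions is roughly 15,650 tokens (~16K);
# pi's 2,200 chars is roughly 550, a difference of about 15K tokens.
```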

&lt;p&gt;Regardless, if you're interested in cost optimization, you might reconsider choosing OpenCode on models that charge per output token, or at least be inclined to analyze usage yourself. And if you need predictability, take another look at Cline. I know I will, even though pi is my daily driver.&lt;/p&gt;

&lt;p&gt;I'm working on a larger experiment to pressure context windows, but it's quite some work, which is why I squeezed this one in first. You can try &lt;a href="https://github.com/larsderidder/context-lens" rel="noopener noreferrer"&gt;Context Lens&lt;/a&gt; yourself, though I should warn you: once you see what's in there, you can't unsee it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>research</category>
    </item>
    <item>
      <title>I accidentally benchmarked three free LLMs against Sonnet</title>
      <dc:creator>Lars de Ridder</dc:creator>
      <pubDate>Tue, 03 Mar 2026 14:44:14 +0000</pubDate>
      <link>https://dev.to/larsderidder/i-accidentally-benchmarked-three-free-llms-against-sonnet-3bkh</link>
      <guid>https://dev.to/larsderidder/i-accidentally-benchmarked-three-free-llms-against-sonnet-3bkh</guid>
      <description>&lt;p&gt;I've started building a context engineering course for developers, with runnable exercises, code examples, the works. I've spent a couple hours going back and forth with a LLM to come up with the rough course content, and had said LLM commit all of our discussions to a (way too detailed) TASK.md, which I wasn't really going to read but I am not good at throwing things away.&lt;/p&gt;

&lt;p&gt;Now, I also happened to have access to GLM-5, MiniMax M2.5, and Kimi K2.5 on a free tier, and I'd heard good things. So why not? I pointed each of them at the task, let them generate their own todos, and then, with a loop skill in the pi agent harness, let them execute them all.&lt;/p&gt;

&lt;p&gt;I wasn't exactly planning on evaluating them at this point, but I suddenly did have 3 directories full of data that I felt like I had to do something with (cause again, not great at throwing things away). So now I've spent hours on research methodologies and writing this article just so you can skim it for the numbers.&lt;/p&gt;

&lt;h2&gt;The task&lt;/h2&gt;

&lt;p&gt;The spec was 1,384 words. Deliverables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A course outline, 8 modules, 3-5 lessons each, 8-12 hours of content&lt;/li&gt;
&lt;li&gt;8 module files, 1,500-3,000 words each, with code examples and citations to specific research papers&lt;/li&gt;
&lt;li&gt;8 hands-on exercises, no paid API access required&lt;/li&gt;
&lt;li&gt;A reference sheet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also included research files, from &lt;a href="https://contextpatterns.com" rel="noopener noreferrer"&gt;contextpatterns.com&lt;/a&gt;: three research summaries, ten pattern files, and two guides. The models could read them but couldn't search the web, to reduce the confounding effect. Each model got a pi session with read, write, bash, and todo tools, plus a loop extension that re-prompts every turn with: "List open todos, pick an unclaimed one, claim it, work on it, close it. Repeat until done."&lt;/p&gt;

&lt;p&gt;Sessions ran 33-75 minutes per model, 92-160 turns each.&lt;/p&gt;
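
&lt;p&gt;For the curious, the loop extension amounts to something like this minimal sketch. &lt;code&gt;run_turn&lt;/code&gt; and &lt;code&gt;todos_remaining&lt;/code&gt; are hypothetical stand-ins for the pi harness internals, which aren't shown here.&lt;/p&gt;

```python
# Minimal sketch of the task loop, under assumed harness internals:
# "run_turn" executes one model turn with tool access, and
# "todos_remaining" checks the todo tool for open items.
LOOP_PROMPT = (
    "List open todos, pick an unclaimed one, claim it, "
    "work on it, close it. Repeat until done."
)

def run_task_loop(agent, max_turns=200):
    """Re-prompt every turn until no open todos remain; return turns used."""
    for turn in range(max_turns):
        agent.run_turn(LOOP_PROMPT)
        if not agent.todos_remaining():
            return turn + 1
    return max_turns
```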

&lt;h2&gt;What happened&lt;/h2&gt;

&lt;p&gt;All four finished without intervention, which sounds unremarkable unless you've ever done anything semi-complex with loops. Given a detailed spec and a task loop, all of them planned their work, decomposed it into todos, and executed to completion without a single human prompt. Which was already a great result, honestly; I really didn't know what to expect from these models.&lt;/p&gt;

&lt;p&gt;Past that, the differences are significant.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmoosv3rkyixncq78scrk.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmoosv3rkyixncq78scrk.webp" alt="Cumulative input tokens per turn across all four models" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Turns&lt;/th&gt;
&lt;th&gt;Input tokens&lt;/th&gt;
&lt;th&gt;Words&lt;/th&gt;
&lt;th&gt;Exercises&lt;/th&gt;
&lt;th&gt;Cache hit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.5&lt;/td&gt;
&lt;td&gt;$0.89&lt;/td&gt;
&lt;td&gt;33 min&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;5.4M&lt;/td&gt;
&lt;td&gt;29k&lt;/td&gt;
&lt;td&gt;4/8&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;$1.47&lt;/td&gt;
&lt;td&gt;36 min&lt;/td&gt;
&lt;td&gt;92&lt;/td&gt;
&lt;td&gt;6.5M&lt;/td&gt;
&lt;td&gt;41k&lt;/td&gt;
&lt;td&gt;8/8&lt;/td&gt;
&lt;td&gt;93%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5&lt;/td&gt;
&lt;td&gt;$3.25&lt;/td&gt;
&lt;td&gt;75 min&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;td&gt;7.2M&lt;/td&gt;
&lt;td&gt;39k&lt;/td&gt;
&lt;td&gt;8/8&lt;/td&gt;
&lt;td&gt;73%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$7.12&lt;/td&gt;
&lt;td&gt;58 min&lt;/td&gt;
&lt;td&gt;160&lt;/td&gt;
&lt;td&gt;14.7M&lt;/td&gt;
&lt;td&gt;51k&lt;/td&gt;
&lt;td&gt;8/8&lt;/td&gt;
&lt;td&gt;~100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;MiniMax&lt;/strong&gt; finished fastest in 33 minutes. Unfortunately, it wrote a single todo for "8+ exercises", delivered 4 and called it done. Its failure to properly decompose the task cost it valuable points. MiniMax is fast and cheap, and will interpret any ambiguity in the most minimal way possible. Fair enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GLM-5&lt;/strong&gt; wrote the most detailed planning todos of any model: structured sections for deliverables, topics, and specific data points to incorporate from the research files. Good plans, but expensive execution; it read the input research files whenever it needed a specific number, consuming 1.9 million fresh input tokens. It still hit 73% cache utilisation overall, but those 1.9M fresh tokens were billed at full input price, which adds up fast. At standard GLM pricing that's $1.94 in fresh input alone, more than Kimi's entire session cost.&lt;/p&gt;
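
&lt;p&gt;To make the cache arithmetic concrete, here's a back-of-envelope sketch. The prices are assumptions inferred from the numbers above ($1.94 for 1.9M fresh tokens is roughly $1.02 per million) plus a hypothetical 90% discount on cached tokens; they're illustrative, not anyone's actual rate card.&lt;/p&gt;

```python
# Back-of-envelope input-cost model. Assumed prices: roughly $1.02 per
# million fresh input tokens (inferred from $1.94 for 1.9M), with cached
# tokens billed at 10% of that. Both figures are illustrative.
def input_cost(total_tokens, cache_hit_rate, price_per_m=1.02, cache_discount=0.10):
    fresh = total_tokens * (1.0 - cache_hit_rate)
    cached = total_tokens * cache_hit_rate
    return (fresh * price_per_m + cached * price_per_m * cache_discount) / 1_000_000

glm_like = input_cost(7_200_000, 0.73)   # on-demand re-reading, 73% cache hit
anchored = input_cost(7_200_000, 0.99)   # read-once, near-total cache reuse
```

&lt;p&gt;Under those assumptions the 73%-hit session costs roughly three times the near-total-reuse one in input alone, which matches the shape of the table above.&lt;/p&gt;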

&lt;p&gt;&lt;strong&gt;Kimi K2.5&lt;/strong&gt; made an interesting judgment call: it delivered exercises as executable Python scripts with fixture files, rather than markdown with code snippets. The spec said "starter code" so this is defensible, and it's a higher-fidelity interpretation than the other models chose. I suppose it might have been a result of the focus on coding in its training. Exercise 1 even shipped with a real session-log.jsonl fixture so it was fully self-contained.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sonnet 4.6&lt;/strong&gt; produced 191 fresh input tokens across 160 turns.&lt;/p&gt;

&lt;p&gt;That number looks like a typo, so let me explain. Sonnet's first todo was "Read all research files and source material." It read all 15 source files in sequence, then wrote a 900-word structured reference document before producing a single line of course content. Every subsequent turn drew from the cached conversation history, which the provider serves at 10% of input price, and it never re-read a source file for the rest of the session.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhof76yfqb8orhbxfw6r5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhof76yfqb8orhbxfw6r5.webp" alt="Cumulative fresh input tokens: GLM at 1.9M, Sonnet flat at zero" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By contrast, GLM read files continuously and spent $1.94 in fresh input; Sonnet read everything once and spent effectively nothing. This isn't Sonnet-specific magic but a context pattern known as the &lt;a href="https://contextpatterns.com/patterns/anchor-turn/" rel="noopener noreferrer"&gt;Anchor turn&lt;/a&gt;, which any model can be prompted to follow, and the cost difference scales with session length. So in our case, pretty worthwhile.&lt;/p&gt;
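
&lt;p&gt;The pattern itself fits in a few lines. This is a hedged sketch, not the actual harness implementation; &lt;code&gt;read_file&lt;/code&gt; and &lt;code&gt;summarise&lt;/code&gt; are hypothetical helpers.&lt;/p&gt;

```python
# Sketch of the Anchor turn pattern: read every source exactly once, early,
# and distill it into one reference note that stays in context. Later turns
# draw on the note (served from cache) instead of re-reading files.
def build_anchor(paths, read_file, summarise):
    notes = []
    for path in paths:
        notes.append(summarise(read_file(path)))  # the only read, ever
    return "\n".join(notes)  # anchored early; later turns hit the cache
```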

&lt;p&gt;Sonnet also added a QA pass as its last todo, which the other models skipped (amusingly, GLM-5 did plan one; it just never executed it. Participation trophy winner). After finishing all 8 modules and exercises, it ran through the spec checklist, verified word counts, spot-checked statistics against its research notes, and cleaned up style violations from the project's coding guidelines. None of the other models read the guidelines file, let alone applied them.&lt;/p&gt;

&lt;h2&gt;But how good was the output?&lt;/h2&gt;

&lt;p&gt;Using LLMs as judges is controversial, but my opinion alone wouldn't tell you much either. Luckily, we already established this as an informal piece of work, so freed from the suffocating shackles of academic rigor, let's judge ahead.&lt;/p&gt;

&lt;p&gt;I built a pairwise judge pipeline. I selected the 4 most technically demanding course modules, stripped all model names and directory paths from the content, assigned neutral labels (Alpha, Beta, Gamma, Delta), and sent every pair to two judges: Gemini 2.5 Pro and GPT-5.2. Each judge got the same prompt: what's the biggest weakness, are the citations real, is anything misleading, and which version would you recommend? That's 6 pairs per module, 2 judges each, 48 evaluations total.&lt;/p&gt;

&lt;p&gt;I went with pairwise comparison rather than rubric scoring, because models calibrate "this one is better than that one" more reliably than "this deserves a 7 out of 10". Very human-like.&lt;/p&gt;
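
&lt;p&gt;The core of such a pipeline is small. A sketch, with &lt;code&gt;ask_judge&lt;/code&gt; standing in for a real model call that returns the label of the preferred candidate:&lt;/p&gt;

```python
# Pairwise judging sketch: every pair of anonymised candidates goes to
# every judge. "ask_judge" is a hypothetical stand-in for a model call and
# must return the winning label. With 4 labels and 2 judges that is 12
# evaluations per module, or 48 across 4 modules, matching the setup above.
from itertools import combinations

def judge_module(outputs, judges, ask_judge):
    """outputs maps an anonymised label (e.g. 'Alpha') to candidate text."""
    wins = {label: 0 for label in outputs}
    for a, b in combinations(sorted(outputs), 2):
        for judge in judges:
            winner = ask_judge(judge, (a, outputs[a]), (b, outputs[b]))
            wins[winner] += 1
    return wins
```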

&lt;p&gt;Both judges were consistent with each other, which is at least a sign the results aren't random.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Judge record&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.5&lt;/td&gt;
&lt;td&gt;1W / 23L&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5&lt;/td&gt;
&lt;td&gt;12W / 12L&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;12W / 12L&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;23W / 1L&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sonnet obviously takes the top spot here. The single loss was GPT-5.2 preferring GLM's cleaner structure on the failure patterns module, penalising Sonnet for citing specific statistics it couldn't verify. Ironic, given that the citation audit later found GLM was the one fabricating. But at 2 to 9 times the cost, Sonnet better be winning.&lt;/p&gt;

&lt;p&gt;MiniMax lost due to its code examples, which were illustrative data structures rather than anything you'd actually run. GLM-5 and Kimi K2.5 ended up dead even at 12W/12L, with the two judges agreeing closely. Kimi costs less than half what GLM does for the same judge quality, which makes it the better pick among the free models.&lt;/p&gt;

&lt;h2&gt;The citation problem&lt;/h2&gt;

&lt;p&gt;Hallucinations (or as the LLMs call them, "confabulations") are one of my main worries with cheaper models. The judge evaluation also flagged fabricated citations, so I followed up with an audit.&lt;/p&gt;

&lt;p&gt;Most key statistics are real and traceable, and the models cite them correctly. But GLM's failure patterns module includes a 2-task accuracy figure of 35.72% that doesn't exist in the source. The DSBC paper that was provided gives a 1-task figure and a 3-task figure, so GLM just interpolated a plausible middle value that wasn't actually measured or mentioned.&lt;/p&gt;

&lt;p&gt;Kimi's version of the same module is worse: it confidently presents a full per-model accuracy table across GPT-4o, Claude, Gemini, and Llama at 4K/16K/32K/128K tokens, none of which appears in any source file.&lt;/p&gt;

&lt;p&gt;These aren't random hallucinations; it's actually worse than that. They're systematic extrapolations of real patterns, an attempt to generate data the models didn't have. They're numerically plausible and the surrounding context is accurate, which makes them impossible to spot without checking the source. A developer reading the module wouldn't question them; I can't remember ever looking up numbers in a &lt;em&gt;course&lt;/em&gt; of all things. But that is exactly what makes this kind of fabrication more dangerous than the obviously wrong kind.&lt;/p&gt;

&lt;p&gt;Sonnet didn't fabricate, and I think the anchor turn is doing the real work here; by consolidating all source material into a reference document at the start, it never had a gap to fill. The free models had the same source files, but by the time they were writing module 3 it was all buried under dozens of turns of their own output. The result: they started filling in what seemed right. Which, if you want to be a bit dramatic about it, is context rot; the exact failure pattern the course they were building is supposed to teach. Now isn't that cute.&lt;/p&gt;

&lt;h2&gt;What does this all &lt;em&gt;mean&lt;/em&gt;?&lt;/h2&gt;

&lt;p&gt;The free models are genuinely capable. Kimi producing 41k words of substantive developer content in 36 minutes, accurately citing research, with runnable exercises, for $1.47 is not a compromise you need to apologise for. For content generation tasks where you'll do an editorial pass anyway, that's entirely reasonable.&lt;/p&gt;

&lt;p&gt;The gap to Sonnet is real but shows up in planning behaviour rather than raw output quality. Sonnet invented the research-consolidation step and the QA pass unprompted; the free models didn't. Those two decisions explain most of the quality difference, through better source anchoring, self-verification, and lower fabrication risk. You can probably replicate a lot of that through proper prompting. Which does make an equally proper case for an expensive supervisor model planning for a lot of cheaper workers.&lt;/p&gt;

&lt;p&gt;GLM's token consumption pattern is the clearest illustration of why agentic strategy matters independently of model capability. Reading research files on demand throughout a session sounds like it should produce better results because the model has the material fresh when it needs it, but it doesn't, and it costs twice as much as reading everything once at the start. The model you pick is one variable; how the agent uses the context window is another.&lt;/p&gt;

&lt;p&gt;On the citation issue, I'd actually separate it from the quality ranking entirely. Sonnet scored best and had the cleanest citations, but that correlation isn't guaranteed to hold across different tasks. The fabrication behaviour is a property of how all these models handle gaps in their context, not just the weaker ones. Any content pipeline that relies on model-generated statistics needs a verification step regardless of which model you use, and that step should focus on numerical claims specifically, since those are where the precise-but-invented figures show up.&lt;/p&gt;

&lt;p&gt;Obviously, this isn't a benchmark. We did one run for each model for a single task, and we used free tier endpoints that may differ from production. This is just a reasonably honest look at how these models behave under real working conditions, with real data (or at least a real purpose), for a task that isn't just "summarise this" or "code that".&lt;/p&gt;

&lt;p&gt;Since then I've worked with each of the models individually, and the rough shape of the comparison actually feels right to me. Kimi is a capable coder, GLM-5 feels like an academic, and MiniMax like that guy in the back of class who always manages to find a way to a passing grade. I'm genuinely rooting for them to give the big three a run for their money.&lt;/p&gt;




&lt;p&gt;The course is on context engineering, which is the discipline of managing what goes into an LLM's context window. If you're interested in hearing about this course when it ships, you can follow along &lt;a href="https://contextpatterns.com/course" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>research</category>
    </item>
    <item>
      <title>How programmers get disqualified from doing everything else</title>
      <dc:creator>Lars de Ridder</dc:creator>
      <pubDate>Fri, 27 Feb 2026 09:21:51 +0000</pubDate>
      <link>https://dev.to/larsderidder/how-programmers-get-disqualified-from-doing-everything-else-2jb6</link>
      <guid>https://dev.to/larsderidder/how-programmers-get-disqualified-from-doing-everything-else-2jb6</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Being able to program is one of the few skills that can make you be seen as less than what you are.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I have been a product manager, project manager, SCRUM master and product owner, usability and requirements engineer and plenty of other things. I can interview users and design user interfaces, I can lead and coach teams, manage deadlines and set up projects. I've done all those things, and I've done them well.&lt;/p&gt;

&lt;p&gt;But as soon as I mention that I write code, I'm a developer. Full stop. Now, a project manager has to be assigned to keep me on track. Others will write requirements specifications that I have to give estimates on. I don't talk to users, and I have to give periodic updates on my progress.&lt;/p&gt;

&lt;p&gt;It's a very curious phenomenon that I've observed more than once, in many settings and organizations, and not just with me. It has gotten to the point that I've actively avoided writing code (or pretended not to write code) in some projects, because I want the user or customer to trust me to (for example) handle the planning and requirements. As soon as I write code though, I become "the developer" on the team. And then that's all I'll ever be.&lt;/p&gt;

&lt;p&gt;At the same time, however, dedicated project managers and product managers heavily gravitate towards well-rounded software engineers, that can do more than just write code. They are extremely happy to work with engineers who can communicate well, can create architectures that work (not only on a technical level but also on a project/business level) and can anticipate usability scenarios. No wonder; these skills fill the blind spots in their own skill sets. In many projects, however, these dedicated roles merely add an extra layer of communication indirection, and well-rounded engineers would often be able to do their work better without them.&lt;/p&gt;

&lt;p&gt;Initially, I thought the phenomenon was mainly related to age. If you're relatively young and admit to being a programmer, then it is easier for people to see you as the stereotypical Silicon Valley-inspired developer: the basement-dwelling, hoodie-wearing antisocial computer kid. But I've seen the same happen with older, experienced engineers as well.&lt;/p&gt;

&lt;p&gt;These engineers have seen companies rise and fall, founded startups, they've worked on multiple layers in companies and in a lot of roles, and have a strong technical background which makes them so broadly useful. But put them next to another person (usually a person in a suit), send them to a customer, tell them that one of them has a strong technical background, and the others will automatically assume he is the "tech guy". When it comes to strategy, planning and so forth, they'll go to the suit. Even though the engineer might be much better suited to discuss these topics.&lt;/p&gt;

&lt;p&gt;So why is this? Why is there such a strong need to assign a label to the tech guy, and disqualify him from participation in other areas?&lt;/p&gt;

&lt;p&gt;It might come from the human need to compartmentalize. Ideally, everything in life has a single task and purpose, so you can say "that is a hammer, you use it to hit things", and "this is a monitor, it displays things". It makes things easy to understand and predictable. To be sure, that's how we develop good software as well, just look at the Unix philosophy.&lt;/p&gt;

&lt;p&gt;But humans are not like that, of course. We are always-changing, ever-evolving utility machines. The more one knows, has seen or has been, the better-rounded he will be for any given task. We recognize this development in experience within a single role, but, strangely, not between roles.&lt;/p&gt;

&lt;p&gt;I argue that being able to program should never disqualify you from doing more in projects. On the contrary; it might very well make you better at things than others who lack this skill. If you have the ability to relate strategic discussions directly to a concrete technical level, and step through it with the logical mind of a developer, then that is worth much more than empty talk from a suit. If you can discuss a user problem and immediately see the implications of all the possible solutions throughout the system, not necessarily because you built the system but because you can relate to the technical challenges from experience, you and the user can make better-informed decisions, on the spot.&lt;/p&gt;

&lt;p&gt;In my ideal team, any member can do the work of any other member but merely chooses to specialize in what he likes best. Role changes are encouraged and continuous, lines of communication are optimized and the set of roles is evaluated often. Such an ideal is not fully realistic in specialized professions, but an ideal is usually not something that is achieved, but rather something that we strive towards.&lt;/p&gt;

&lt;p&gt;That's why I think that every developer should strive to become a well-rounded engineer who is able to do more than just write code and can contribute on multiple levels. And for that to work, it is important that engineers are encouraged, not discouraged, to be more than a developer.&lt;/p&gt;

</description>
      <category>career</category>
      <category>engineering</category>
      <category>growth</category>
    </item>
    <item>
      <title>Projection Memory, or why your agent feels like a glorified cronjob</title>
      <dc:creator>Lars de Ridder</dc:creator>
      <pubDate>Wed, 25 Feb 2026 09:59:12 +0000</pubDate>
      <link>https://dev.to/larsderidder/projection-memory-or-why-your-agent-feels-like-a-glorified-cronjob-3nog</link>
      <guid>https://dev.to/larsderidder/projection-memory-or-why-your-agent-feels-like-a-glorified-cronjob-3nog</guid>
      <description>&lt;p&gt;Every agent framework has some form of memory: LangChain has three types, Mem0 raised $24M building a "memory layer," and Letta built an entire company around persistent agent state. The result is that agents are pretty decent nowadays at remembering what happened. But for some reason we still only rely on cron for anything that happens in the future.&lt;/p&gt;

&lt;p&gt;I'm building Bryti, a personal AI assistant, and I found this silly. I'm pretty good at remembering what happened; my problem is that I can't seem to remember things that I need to do or properly plan for them. I don't want to schedule reminders, I need something that understands that I need to be reminded.&lt;/p&gt;

&lt;p&gt;So I want an AI assistant to understand that when I say "remind me later", it will remind me later, not nag me about "when". Or that it understands that I need to reschedule my dentist appointment if it gets canceled. And if I say something like "remind me to send an email on Monday, and also on Tuesday if I haven't sent it yet", it will not remind me on Tuesday if I for some miraculous reason did do it on Monday.&lt;/p&gt;

&lt;p&gt;I built what I'm calling projection memory. The concept is that when an agent notices I need something in the future, he will project himself into that future, plant the context he thinks he will need, and then when the future rolls around, he has the best chance to be amazing.&lt;/p&gt;

&lt;p&gt;Investigated, implemented, benchmarked; let's see how it turned out.&lt;/p&gt;

&lt;h2&gt;The problem with backward-looking memory&lt;/h2&gt;

&lt;p&gt;Here's a simplified version of how most agent memory works. The user says something, the agent extracts facts and stores them, and later when the user asks something, the agent searches memory for relevant facts. The retrieval is always triggered by the user doing something; sending a message, starting a session, asking a question.&lt;/p&gt;

&lt;p&gt;This is fine for reactive agents but it breaks for proactive ones, because if the agent needs to &lt;em&gt;initiate&lt;/em&gt; contact ("hey, your mom's birthday is tomorrow, you haven't planned anything"), it needs a reason to wake up and something to say when it does.&lt;/p&gt;

&lt;p&gt;The standard solution is timers: user says "remind me at 5pm," agent sets a cron job, you get a message. Letta has a nice API for this; you can schedule one-time or recurring messages with timestamps or cron expressions, and the message fires at the scheduled time so the agent wakes up to handle it.&lt;/p&gt;

&lt;p&gt;This works for explicit reminders and fails completely for everything else.&lt;/p&gt;

&lt;h2&gt;The scheduling gap&lt;/h2&gt;

&lt;p&gt;When I looked at the things I actually wanted a proactive agent to handle, only a minority were explicit "remind me at X" requests. The rest were things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"My ID card expires in March" mentioned while talking about a Japan trip in April (no "remind me")&lt;/li&gt;
&lt;li&gt;"I need to sort out the accountant situation at some point" (no specific date at all)&lt;/li&gt;
&lt;li&gt;"If the client meeting gets cancelled, schedule an internal sync instead" (conditional on something else happening)&lt;/li&gt;
&lt;li&gt;"When the finance numbers come in, I need to start the report" (triggered by a future event, not a timestamp)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A cron-based system will never fire on these; the information exists in memory, but the system has no mechanism to connect "stored fact about the future" with "proactive action."&lt;/p&gt;

&lt;p&gt;In my benchmark, 13 out of 22 test scenarios fall into this category. Not because I tried to stack the deck (though you should absolutely scrutinize that claim; I'll get to it later), but because when you catalog the kinds of things a personal assistant should proactively handle, most of them don't come with a timestamp attached.&lt;/p&gt;

&lt;h2&gt;What projection memory is&lt;/h2&gt;

&lt;p&gt;A projection is a forward-looking memory entry. Instead of just storing "user's ID card expires March 2026" as a fact, the system creates a structured entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Summary: ID card expires March 2026, needs renewal before Japan trip
Resolved when: January 2026 (month resolution)
Context: Planning Japan trip in April. Dutch renewal takes 5-10 business days.
Status: pending
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Differences from a regular memory fact:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resolution&lt;/strong&gt; can be exact (specific datetime), day, week, month, or "someday". This allows the system to reason about urgency without needing a precise timestamp, which is important because most real obligations don't have one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context&lt;/strong&gt; is captured at creation time by the agent, for himself. When the agent surfaces this projection, he already knows &lt;em&gt;why&lt;/em&gt; it matters and what background to include, rather than having to search for it at fire time. And because at creation time he had all the context, he is able to include everything he needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Status&lt;/strong&gt; tracks whether this is still relevant. Cancelled events don't fire, completed tasks get cleaned up, and this gives the agent NOOP capability; the ability to decide that silence is the correct response when an event was cancelled or a condition wasn't met.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependencies&lt;/strong&gt; and &lt;strong&gt;triggers&lt;/strong&gt; connect projections to other events or facts. "When X happens, fire Y" or "only fire if Z is still pending". Triggers can also be semantic: when a new fact enters memory that matches a projection's condition, the projection activates, even if no timestamp was involved.&lt;/p&gt;
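
&lt;p&gt;Put together, a projection entry might look like the following. This is an illustrative schema, not Bryti's actual code; only &lt;code&gt;resolved_when&lt;/code&gt; and the field meanings come from the description above.&lt;/p&gt;

```python
# Illustrative projection schema; field names follow the article's example,
# the rest is a guess at a minimal shape.
from dataclasses import dataclass, field

@dataclass
class Projection:
    summary: str                    # what needs to happen
    resolved_when: str              # e.g. "2026-01" at month resolution
    resolution: str                 # "exact", "day", "week", "month", "someday"
    context: str                    # why it matters, captured at creation
    status: str = "pending"         # pending, completed, or cancelled
    depends_on: list = field(default_factory=list)  # other projection ids
    triggers: list = field(default_factory=list)    # event/fact conditions

    def should_fire(self, completed_ids):
        """Fire only if still pending and all dependencies are resolved."""
        if self.status != "pending":
            return False            # NOOP: cancelled/completed stay silent
        return all(dep in completed_ids for dep in self.depends_on)
```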

&lt;p&gt;None of this is conceptually complicated. So why aren't we all using it? Doesn't it work?&lt;/p&gt;

&lt;h2&gt;The benchmark&lt;/h2&gt;

&lt;p&gt;I built 22 test scenarios across 10 categories and compared three methods:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cron&lt;/strong&gt;: timer fires, echoes the stored text. This is what you get with a basic scheduler, and many agents use only this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smart Cron&lt;/strong&gt;: timer fires and the agent wakes up with full access to conversation history, memory store, and chain-of-thought reasoning. This is the best possible outcome for a Letta-style "scheduling API + archival memory" architecture. I gave it every advantage including conversation history, all relevant memory entries, and explicit NOOP instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Projection&lt;/strong&gt;: forward-looking memory entry activates with stored context, status tracking, linked projections, dependency checks, and trigger evaluation.&lt;/p&gt;

&lt;p&gt;Sonnet 4.6 generates all responses, while Opus 4.6 and Gemini 3.1 Pro judge them. Each scenario gets scored 1-5 on usefulness, context richness, appropriateness, and coherence, and I ran the whole thing three times for variance (projection's stddev was 0.02).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Overall (n=22)&lt;/th&gt;
&lt;th&gt;Head-to-head (n=9)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cron&lt;/td&gt;
&lt;td&gt;1.63 ± 0.04&lt;/td&gt;
&lt;td&gt;1.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smart Cron&lt;/td&gt;
&lt;td&gt;2.59 ± 0.05&lt;/td&gt;
&lt;td&gt;4.16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Projection&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.80 ± 0.02&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.88&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9gf93u07lvmy7o1afqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9gf93u07lvmy7o1afqd.png" alt="Projection Memory Benchmark results: hero scores and head-to-head comparison" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alright, so the numbers are dramatic, yet slightly misleading without the split explained. The overall score is much higher for projection because both cron variants score 1.0 on all 13 no-timer scenarios; they produce nothing, since nothing was scheduled. The happy-path comparison is the head-to-head column: on the 9 scenarios where both methods fire, projection scores 4.88 versus smart cron's 4.16.&lt;/p&gt;

&lt;h2&gt;The honest part&lt;/h2&gt;

&lt;p&gt;Before you take these numbers and run, here's what's wrong with them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I wrote the scenarios.&lt;/strong&gt; All 22 of them, and I picked the categories and decided the mix. If you reweight to 80% timer scenarios (heavily favoring smart cron), smart cron improves to 3.62 while projection barely changes at 4.82 because it scores well on both types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On simple tasks, projection is worse.&lt;/strong&gt; I ran 4 adversarial scenarios where projection should have no advantage: morning alarm, laundry timer, daily medication, meeting nudge. Smart cron won 4.53 to 3.88. Projection over-elaborates; it adds "tips for your meeting" and "consistency matters for medication" where a brief nudge is what the situation actually called for, and both judges penalized it for being verbose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The retrieval is pre-computed.&lt;/strong&gt; In the benchmark, I hand the relevant entries to all methods to control this variable. In a real system, projection has an additional advantage because it has the context stored and can search much more accurately than smart cron; but I didn't test that, so I'm not claiming it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMs judged LLMs.&lt;/strong&gt; I did a blind human evaluation on 10 scenarios (shuffled, anonymized labels) to cross-check, and the scores were similar: projection 4.85, smart cron 3.33, cron 2.38. But it's still only one human. A beautiful human, but still.&lt;/p&gt;

&lt;h2&gt;What I didn't expect&lt;/h2&gt;

&lt;p&gt;The interesting finding wasn't that projection beats smart cron. It was what happened when I made smart cron smarter.&lt;/p&gt;

&lt;p&gt;The initial version was simple: timer fires, here's the reminder text, here are some relevant notes, compose a message. It scored 2.74. Then I gave it the full treatment: conversation history, chain-of-thought reasoning, explicit NOOP instructions, all the memory context. The kind of setup you'd build if you were really trying to make a Letta-style architecture work well.&lt;/p&gt;

&lt;p&gt;It scored 2.59. &lt;em&gt;Lower&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;On the rescheduled dentist scenario (appointment moved from 10am to 2pm, old timer fires at 9:30), beefed-up smart cron NOOPs in all three runs. It sees the conversation where the user says "moved to 2pm," concludes the 10am reminder is irrelevant, and stays silent. Technically correct (the 10am slot is cancelled) but practically wrong (the user still needs a reminder for the new time); more context without the right data structure to organize it gave the model more rope to hang itself with.&lt;/p&gt;

&lt;p&gt;Projection doesn't have this problem because the projection &lt;em&gt;itself&lt;/em&gt; was updated when the rescheduling happened. The resolved_when changed from 10:00 to 14:00, and the context field says "rescheduled from 10:00." The model doesn't have to reconstruct what happened from conversation history; the information is already structured correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for framework authors
&lt;/h2&gt;

&lt;p&gt;Every framework I checked (LangChain, Mem0, Letta, and Orin Labs' entity architecture) stores what happened, and none of them have a dedicated concept for what's expected to happen. LangChain has three memory types (semantic, episodic, procedural), all backward-looking. Mem0 calls itself a "memory layer" but has no scheduling at all; their own reminder agent uses a separate SQLite database for the actual reminders. Letta comes closest with their scheduling API, but a scheduled message is just "send this string at this time," and while the agent can access its memory when the message fires, the message itself carries no context, no status, no dependencies.&lt;/p&gt;

&lt;p&gt;Orin Labs has a clever approach where agents manage their own wake schedules (the agent decides when to sleep and when to wake up), but their memory is still retrospective: temporal summaries that decay in resolution over time. They don't have a data structure for "things that haven't happened yet."&lt;/p&gt;

&lt;p&gt;The piece that's missing is small: a forward-looking memory entry with a resolution, context, status, and some basic lifecycle management. Just a new memory primitive for this very useful piece of my brain that needs some special care and attention.&lt;/p&gt;
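&lt;p&gt;Sketched in TypeScript, that primitive might look something like this. This is a minimal sketch in the spirit of the idea; field names beyond &lt;code&gt;resolved_when&lt;/code&gt; and &lt;code&gt;context&lt;/code&gt; are my own here, not Bryti's actual schema:&lt;/p&gt;

```typescript
// A minimal forward-looking memory entry. Only resolved_when and context
// come from the text above; the other names are hypothetical.
type ProjectionStatus = "pending" | "resolved" | "cancelled" | "expired";

interface Projection {
  id: string;
  intent: string;        // e.g. "remind about dentist appointment"
  resolved_when: Date;   // when this projection should activate
  context: string;       // e.g. "rescheduled from 10:00"
  status: ProjectionStatus;
}

// Lifecycle: a reschedule updates the projection itself, so the activation
// prompt never has to reconstruct what happened from conversation history.
function reschedule(p: Projection, to: Date, note: string): Projection {
  return { ...p, resolved_when: to, context: note, status: "pending" };
}
```

&lt;p&gt;The reason &lt;code&gt;reschedule&lt;/code&gt; rewrites the entry rather than appending a new fact is exactly the dentist scenario above: the projection itself carries the new time and the reason, so nothing has to be pieced together at activation time.&lt;/p&gt;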

&lt;h2&gt;
  
  
  The benchmark is open
&lt;/h2&gt;

&lt;p&gt;The full benchmark is &lt;a href="https://github.com/larsderidder/projection-memory-benchmark" rel="noopener noreferrer"&gt;on GitHub&lt;/a&gt;: scenarios, runner, rubric, results, and the known limitations all written down. The runner supports plugging in your own generation method, so if you're building agent memory and want to test how your architecture handles proactive activation, it's a reasonable starting point. You should absolutely write better scenarios than I did.&lt;/p&gt;

&lt;p&gt;The code that implements projection memory in &lt;a href="https://github.com/larsderidder/bryti" rel="noopener noreferrer"&gt;Bryti&lt;/a&gt; is about 300 lines of TypeScript, nothing fancy; just store, retrieve, format, lifecycle. The hard part isn't the code; it's deciding that forward-looking memory is worth being a separate concept rather than just another fact in your vector store.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>I Intercepted 3,177 API Calls Across 4 AI Coding Tools. Here's What's Actually Filling Your Context Window.</title>
      <dc:creator>Lars de Ridder</dc:creator>
      <pubDate>Thu, 19 Feb 2026 07:34:10 +0000</pubDate>
      <link>https://dev.to/larsderidder/i-intercepted-3177-api-calls-across-4-ai-coding-tools-heres-whats-actually-filling-your-context-36il</link>
      <guid>https://dev.to/larsderidder/i-intercepted-3177-api-calls-across-4-ai-coding-tools-heres-whats-actually-filling-your-context-36il</guid>
      <description>&lt;p&gt;Last week I asked Claude to fix a one-line bug. It used 23,000 tokens. Then I asked Gemini to fix the same bug. It used 350,000 tokens. Yeah I couldn't just let that slide.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://github.com/larsderidder/context-lens" rel="noopener noreferrer"&gt;Context Lens&lt;/a&gt;, a context tracer that intercepts LLM API calls&lt;br&gt;
and shows you what's actually in the context window, broken down per turn. I pointed it at four coding tools, gave them&lt;br&gt;
all the same job, and the results were different enough that I figured I should write them up.&lt;/p&gt;
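&lt;p&gt;To make the mechanism concrete, here is a minimal sketch of the interception idea. This is not Context Lens's actual implementation, and the names are mine; it just shows that anything sitting between a tool and the API can wrap the HTTP client and record each request's payload:&lt;/p&gt;

```typescript
// Illustration of the interception idea only, not Context Lens's real code.
type CallRecord = { url: string; bytes: number; approxTokens: number };

function traceFetch(fetchImpl: typeof fetch, log: CallRecord[]): typeof fetch {
  return (async (input: any, init?: RequestInit) => {
    const raw = init ? init.body : undefined;
    const body = typeof raw === "string" ? raw : "";
    log.push({
      url: String(input),
      bytes: body.length,
      // rough heuristic used in this article: ~4 characters per token
      approxTokens: Math.round(body.length / 4),
    });
    return fetchImpl(input, init); // pass the call through untouched
  }) as typeof fetch;
}
```

&lt;p&gt;A wrapper like this sees the full request body every turn, which is all you need to produce per-turn composition breakdowns.&lt;/p&gt;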
&lt;h2&gt;
  
  
  The question
&lt;/h2&gt;

&lt;p&gt;We pay for tokens when using these models. Tokens are, well, complicated. They are basically pieces of information;&lt;br&gt;
1 token is roughly 4 characters in English text. The more tokens that go to a model, the more you pay.&lt;/p&gt;

&lt;p&gt;But more importantly, tokens make up the &lt;em&gt;context&lt;/em&gt; of a model. The context is everything that a model has when&lt;br&gt;
generating a response, like its short-term memory. And just like in humans, it's limited: the more you have to&lt;br&gt;
remember, the worse you get when asked a detailed question.&lt;/p&gt;

&lt;p&gt;So we have to be careful with our context window, and the tokens that we use to build up the window. My question was,&lt;br&gt;
how do tools handle this limitation? Are they intelligent about it, or not?&lt;/p&gt;
&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;I have a bunch of experiments planned, and this is the first one. It's a little artificial, but bear with me.&lt;/p&gt;

&lt;p&gt;I planted a bug in Express.js: a reordered null check in &lt;code&gt;res.send()&lt;/code&gt; that causes &lt;code&gt;res.send(null)&lt;/code&gt; to return the string &lt;code&gt;"null"&lt;/code&gt; with &lt;code&gt;content-type: application/json&lt;/code&gt; instead of an empty body. I used the real Express repo with 6,128 commits of history. I committed the bug, so it's sitting right there.&lt;/p&gt;
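&lt;p&gt;For context, here's the class of bug as a hypothetical reconstruction, not Express's actual source: in JavaScript, &lt;code&gt;typeof null&lt;/code&gt; is &lt;code&gt;"object"&lt;/code&gt;, so if the null check moves below the object branch, &lt;code&gt;null&lt;/code&gt; falls into the JSON serialization path:&lt;/p&gt;

```typescript
// Hypothetical reconstruction of the planted bug, not Express's real code.
interface Reply { body: string; contentType?: string }

function sendBuggy(chunk: unknown): Reply {
  // Bug: the object branch runs first, and typeof null === "object",
  // so null gets JSON-serialized into the string "null".
  if (typeof chunk === "object") {
    return { body: JSON.stringify(chunk), contentType: "application/json" };
  }
  return { body: String(chunk) };
}

function sendFixed(chunk: unknown): Reply {
  // Fix: handle null before the object branch, returning an empty body.
  if (chunk === null) {
    return { body: "" };
  }
  if (typeof chunk === "object") {
    return { body: JSON.stringify(chunk), contentType: "application/json" };
  }
  return { body: String(chunk) };
}
```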

&lt;p&gt;Each tool gets the same prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There's a bug report: when calling &lt;code&gt;res.send(null)&lt;/code&gt;, the response body is the string "null" with content-type application/json, instead of an empty body. This was working before. Find and fix the bug. Verify your fix by running the test suite.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The models I used:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CLI&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Context window&lt;/th&gt;
&lt;th&gt;Input $/1M tokens&lt;/th&gt;
&lt;th&gt;Output $/1M tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;200K&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;$75.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4.5&lt;/td&gt;
&lt;td&gt;200K&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex CLI&lt;/td&gt;
&lt;td&gt;GPT-5.3 Codex&lt;/td&gt;
&lt;td&gt;200K&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini CLI&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All four solved the bug and all 1,246 tests passed. Same outcome, but different journey.&lt;/p&gt;
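&lt;p&gt;Before comparing, a quick back-of-envelope on what those token counts mean in money, using the pricing table above and ignoring caching and output tokens (a sketch, not a bill):&lt;/p&gt;

```typescript
// Rough input-token cost: tokens * (dollars per 1M tokens) / 1,000,000.
// Token counts and rates taken from the runs and pricing table above.
function inputCost(tokens: number, dollarsPerMillion: number): number {
  return (tokens * dollarsPerMillion) / 1_000_000;
}

const opusRun = inputCost(23_000, 15.0);    // the 23K-token Opus run
const geminiRun = inputCost(350_000, 1.25); // the 350K-token Gemini run
```

&lt;p&gt;With these numbers, the two runs land within about ten cents of each other ($0.35 versus $0.44): Gemini's roughly 15x token appetite is mostly cancelled by its roughly 12x cheaper input rate. The billing argument against fat contexts is weaker than the context-quality argument.&lt;/p&gt;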
&lt;h2&gt;
  
  
  The comparison
&lt;/h2&gt;

&lt;p&gt;Here's what Context Lens shows when you put the best run of each of the four models side by side:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqw73nd121bungeh0kus.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqw73nd121bungeh0kus.webp" alt="Four tools compared in Context Lens" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The composition bar at the bottom of each card is the interesting bit, mainly the fact that they are all completely different. Pink is tool definitions, green is tool results, blue is system prompt and orange is the conversation with the user.&lt;/p&gt;

&lt;p&gt;I ran each tool multiple times (resetting the repo between runs, and waiting long enough for the cache to cool down) to check whether the numbers are stable:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Runs&lt;/th&gt;
&lt;th&gt;Mean&lt;/th&gt;
&lt;th&gt;Std Dev&lt;/th&gt;
&lt;th&gt;Min&lt;/th&gt;
&lt;th&gt;Max&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Opus 4.6&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;27.0K&lt;/td&gt;
&lt;td&gt;5.5K&lt;/td&gt;
&lt;td&gt;23.6K&lt;/td&gt;
&lt;td&gt;35.2K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4.5&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;49.9K&lt;/td&gt;
&lt;td&gt;13.1K&lt;/td&gt;
&lt;td&gt;42.6K&lt;/td&gt;
&lt;td&gt;69.6K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex (GPT-5.3)&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;35.2K&lt;/td&gt;
&lt;td&gt;8.1K&lt;/td&gt;
&lt;td&gt;29.3K&lt;/td&gt;
&lt;td&gt;47.2K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;257.9K&lt;/td&gt;
&lt;td&gt;86.5K&lt;/td&gt;
&lt;td&gt;179.2K&lt;/td&gt;
&lt;td&gt;350.5K&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Ok, Gemini, we'll get to you in a second, but good God.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbt0ujrckjnvstehmurkc.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbt0ujrckjnvstehmurkc.webp" alt="Opus across 4 runs" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Opus is remarkably consistent. Three runs cluster at 23-25K, one outlier at 35K (it requested a broader git diff that&lt;br&gt;
returned 9.8K tokens instead of the usual ~500 bytes).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl4dtro2qwc4nzg4frsn.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl4dtro2qwc4nzg4frsn.webp" alt="Codex across 4 runs" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Codex has more variance than I expected: 29.3K to 47.2K across four runs, depending on how specific the test commands&lt;br&gt;
are. But still a narrower band than Sonnet or Gemini.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faty78w3rt88rxg6h1tje.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faty78w3rt88rxg6h1tje.webp" alt="Sonnet across 4 runs" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sonnet clusters at 42-44K with one fluke at 69.6K.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsez3tar3vmc1z7n1kja.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsez3tar3vmc1z7n1kja.webp" alt="Gemini across runs" width="800" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gemini is the odd one out. Before we proceed, there are two caveats here. The first is that Gemini is the only one with a 1&lt;br&gt;
million token context window. The second is that its price per token is significantly lower.&lt;/p&gt;

&lt;p&gt;Regardless, what is interesting here is that both the lowest and the highest Gemini runs use 10 API calls, but the&lt;br&gt;
highest one does much larger reads, following a genuinely different strategy. And the trend is upward in a way that feels&lt;br&gt;
almost random; there's no settling-in effect, no convergence. Each run just picks a different path, dumps data into the&lt;br&gt;
context window, and moves on.&lt;/p&gt;
&lt;h2&gt;
  
  
  The contents
&lt;/h2&gt;

&lt;p&gt;Context Lens breaks down every turn into categories. Here's the composition at peak context:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Opus&lt;/th&gt;
&lt;th&gt;Sonnet&lt;/th&gt;
&lt;th&gt;Codex&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tool definitions&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;69%&lt;/strong&gt; (16.4K)&lt;/td&gt;
&lt;td&gt;43% (18.4K)&lt;/td&gt;
&lt;td&gt;6% (2.0K)&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool results&lt;/td&gt;
&lt;td&gt;6% (1.5K)&lt;/td&gt;
&lt;td&gt;40% (16.9K)&lt;/td&gt;
&lt;td&gt;72% (23.0K)&lt;/td&gt;
&lt;td&gt;96% (172.2K)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System prompt&lt;/td&gt;
&lt;td&gt;18% (4.2K)&lt;/td&gt;
&lt;td&gt;11% (4.7K)&lt;/td&gt;
&lt;td&gt;10% (3.3K)&lt;/td&gt;
&lt;td&gt;3% (4.7K)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fri560cgocy1yvmqpb60n.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fri560cgocy1yvmqpb60n.webp" alt="Opus composition: 69% tool definitions" width="800" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nearly 70% of Opus's context window is tool definitions. That's 16.4K tokens describing tools like Read, Write, Bash,&lt;br&gt;
Edit, and various subagent capabilities, re-sent every single turn. Opus itself barely uses the context for anything&lt;br&gt;
else; it takes such a direct path through the codebase that only 1.5K goes to actual tool results. But the fixed&lt;br&gt;
overhead is always there. A small task like this makes it painfully visible because the tool definitions dominate&lt;br&gt;
everything else.&lt;/p&gt;

&lt;p&gt;This is Claude's architectural tax. The saving grace is caching: Opus calls are 95% cache hits after the first&lt;br&gt;
turn, so each subsequent call only pays for the new delta. Claude also uses Haiku subagents for smaller tasks&lt;br&gt;
(routing, summarization), which interestingly share zero cache with the main Opus calls despite running in the same&lt;br&gt;
session. Most of these subagent calls are small (400-900 tokens), but one Haiku call did receive nearly the full 19K&lt;br&gt;
conversation context. At least Haiku is cheap.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99gebui247u44q5k8g4m.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99gebui247u44q5k8g4m.webp" alt="Sonnet composition: 43% tool definitions, 40% tool results" width="800" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sonnet carries the same Claude tax as Opus (18.4K of tool definitions, 43%), but because it reads more broadly, the&lt;br&gt;
tool results (16.9K, 40%) nearly match it. Reading the full test file alone accounts for 15.5K of that. The composition&lt;br&gt;
is the most balanced of the four, which is another way of saying it pays both costs: the fixed overhead and the reading&lt;br&gt;
habit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft05gikjquy7l4vgfh3ur.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft05gikjquy7l4vgfh3ur.webp" alt="Gemini composition: 96% tool results" width="800" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gemini is the opposite of Opus. No tool definition overhead at all (the tools are defined server-side, not in the&lt;br&gt;
prompt), but it reads aggressively. Very aggressively: 172K tokens of tool results. Context Lens flagged one single tool&lt;br&gt;
result that consumed 118.5K tokens, 66% of the entire context. I went looking, and it turns out Gemini was dumping the&lt;br&gt;
entire git history of a file into the conversation, hundreds of commits' worth. Yes, thank you Gemini.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmslxhmx58650osi6p59g.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmslxhmx58650osi6p59g.webp" alt="Codex composition: 72% tool results, targeted and compact" width="800" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Codex sits in between. Only 6% tool definitions (2K tokens), 72% tool results. But its results are targeted: ripgrep&lt;br&gt;
searches, sed extractions, specific line ranges. The same category share as Gemini, but a fraction of the absolute tokens.&lt;/p&gt;
&lt;h2&gt;
  
  
  The strategies
&lt;/h2&gt;

&lt;p&gt;Each tool approaches the same problem in a fundamentally different way. Context Lens has a message view that shows every tool call and result chronologically. Here's what each tool did, step&lt;br&gt;
by step.&lt;/p&gt;
&lt;h3&gt;
  
  
  Opus: the detective
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xx8nboka4ypo87xn84j.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9xx8nboka4ypo87xn84j.webp" alt="Opus message view: git log, git diff, read, fix, test" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Opus gets the prompt "this was working before," and takes that personally: if&lt;br&gt;
something broke, there should be a commit that broke it. So it runs &lt;code&gt;git log&lt;/code&gt;, finds the recent commit that touched&lt;br&gt;
&lt;code&gt;res.send&lt;/code&gt;, runs &lt;code&gt;git diff HEAD~1&lt;/code&gt; to see exactly what changed, reads the relevant 20 lines of &lt;code&gt;lib/response.js&lt;/code&gt; to&lt;br&gt;
confirm, applies the fix, and runs the tests. Six tool calls in 47 seconds.&lt;/p&gt;

&lt;p&gt;What I find impressive here is how little code Opus actually reads. It reads 20 lines of one file and that's it. The git&lt;br&gt;
history gives it all the signal it needs, so it never looks at tests, never greps, never browses. The context barely&lt;br&gt;
grows from its starting point because there's almost nothing to add.&lt;/p&gt;

&lt;p&gt;The only catch is the 16.4K of tool definitions it schleps along every turn. The model itself is surgical, but it's like a surgeon&lt;br&gt;
performing brain surgery while wearing a backpack full of garden equipment.&lt;/p&gt;
&lt;h3&gt;
  
  
  Sonnet: the student
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkympj7n7xf0dgdiaeplm.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkympj7n7xf0dgdiaeplm.webp" alt="Sonnet message view: read tests, read code, git show, fix, test" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sonnet takes a more methodical approach. It starts by reading the test file (&lt;code&gt;test/res.send.js&lt;/code&gt;, 15.5K tokens in one&lt;br&gt;
read), then reads the source code, then uses &lt;code&gt;git show&lt;/code&gt; to compare the current version with the previous one. It builds&lt;br&gt;
a mental model bottom-up: what should happen, what does happen, what changed.&lt;/p&gt;

&lt;p&gt;You can see this in the message view. Turn 3 reads the test file (15.5K tokens, the biggest single read). Turn 4 says "I&lt;br&gt;
found the bug!" (Sonnet is always so upbeat and happy isn't it) and checks the source. Turns 5 and 6 use &lt;code&gt;git show&lt;/code&gt;&lt;br&gt;
on specific lines to confirm the change. Then it fixes and tests.&lt;/p&gt;

&lt;p&gt;It's the approach a thorough junior engineer would take: read the spec, read the implementation, check the history, then&lt;br&gt;
act. Nothing wrong with that, but reading the entire test file costs 15.5K tokens that Opus never needed because it went&lt;br&gt;
to git first.&lt;/p&gt;
&lt;h3&gt;
  
  
  Codex: the unix hacker
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnbe94j2o32gi9oo4npf.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnbe94j2o32gi9oo4npf.webp" alt="Codex message view: parallel grep, sed, apply_patch" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Codex is a different animal entirely. It uses the lower-level &lt;code&gt;exec_command&lt;/code&gt; (shell) and &lt;code&gt;apply_patch&lt;/code&gt; (unified diff&lt;br&gt;
editor) tools instead of Read or Edit. Everything goes through Bash.&lt;/p&gt;

&lt;p&gt;So it does what a unix hacker would do: &lt;code&gt;rg&lt;/code&gt; to search, &lt;code&gt;sed -n '145,165p'&lt;/code&gt; to read specific line ranges, &lt;code&gt;apply_patch&lt;/code&gt;&lt;br&gt;
with a unified diff to make edits. You can see in the message view that it fires off parallel shell commands (two&lt;br&gt;
&lt;code&gt;exec_command&lt;/code&gt; calls in the same turn), which none of the other tools do.&lt;/p&gt;

&lt;p&gt;It also completely ignores git. It just greps for relevant patterns, reads the minimum number of lines, patches the fix,&lt;br&gt;
and runs the tests.&lt;/p&gt;

&lt;p&gt;This makes Codex the most predictable tool in the set. Its grep-and-sed method also just feels &lt;em&gt;right&lt;/em&gt; to me, making it&lt;br&gt;
my favourite. In this case there was a more straightforward path through git, but Codex is very reliable and predictable,&lt;br&gt;
and it doesn't waste much of anything. I'm curious how it will do when I put more complex tasks to the test.&lt;/p&gt;

&lt;p&gt;Also bonus points for being the fastest at 34 seconds wall time.&lt;/p&gt;
&lt;h3&gt;
  
  
  Gemini: the professor
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ihqgzpivci9qqhps9u1.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ihqgzpivci9qqhps9u1.webp" alt="Gemini message view: read file, grep, git log, read tests, write failing test, fix, run suite" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gemini is just an absolute glutton for context. It has no tool definition overhead at all, but it compensates by&lt;br&gt;
hoovering up entire files, git histories and test outputs into its context window.&lt;/p&gt;

&lt;p&gt;It starts with a grep for &lt;code&gt;res.send&lt;/code&gt; (turn 2), then reads the entire &lt;code&gt;lib/response.js&lt;/code&gt; file (turn 3, 6.5K tokens). Then&lt;br&gt;
it checks git history (turn 4) which returns the commit log for the file, and this is where things go&lt;br&gt;
sideways: that single tool result is 118.5K tokens. It decides to read &lt;code&gt;git log -p lib/response.js&lt;/code&gt; but doesn't truncate&lt;br&gt;
the output, so it just dumps hundreds of commits worth of history in the context window.&lt;/p&gt;

&lt;p&gt;But then Gemini does something none of the other tools do; it applies TDD to itself, regardless of the existing test&lt;br&gt;
suite. It modifies the test file to add an assertion for the correct behavior, runs the tests to confirm the failure,&lt;br&gt;
applies the fix, confirms the tests pass, then reverts its test change and runs the full suite again. My prompt didn't&lt;br&gt;
tell it to revert the test change; it decided on its own that this was temporary scaffolding.&lt;/p&gt;

&lt;p&gt;The approach is sound, but every step adds to a context that never shrinks. Gemini has a huge context window and&lt;br&gt;
relatively low costs per token (and caching), but still. Reading the test file (3.6K), running the modified test (17.1K)&lt;br&gt;
and running the full suite again (16.9K) does not come cheap.&lt;/p&gt;

&lt;p&gt;If Opus is a surgeon, Gemini is a semi-truck, albeit a very maneuverable one. Its method seems to rely on building a&lt;br&gt;
haystack big enough so that the needle must be in there. Of course that might be what this model is optimized for, given&lt;br&gt;
its huge context window. But it also does this differently every time: 179K, 244K, 350K across three runs. You just&lt;br&gt;
don't know which Gemini is going to show up, you only know it will eat all your snacks.&lt;/p&gt;
&lt;h2&gt;
  
  
  The waste
&lt;/h2&gt;

&lt;p&gt;None of these approaches is universally "right." On this task, Opus's approach is clearly the best because the signal is&lt;br&gt;
sitting right there in git history. But take away the git history and Opus loses its shortcut. Codex wouldn't even&lt;br&gt;
notice the difference. Gemini would probably still hoover up the entire file and dump all the test output because why&lt;br&gt;
not.&lt;/p&gt;

&lt;p&gt;But does any of these tools actually &lt;em&gt;think&lt;/em&gt; about its context budget? Opus seems to, possibly accidentally, by picking&lt;br&gt;
the most efficient information source. The others just consume whatever they find. Nobody truncates a result, or clears&lt;br&gt;
out context proactively. And the absolute disdain with which Gemini reads 118K tokens in 1 turn makes me&lt;br&gt;
think of a horribly expensive date.&lt;/p&gt;

&lt;p&gt;It appears that context management, on the tool side, is basically nonexistent. The efficiency differences come entirely&lt;br&gt;
from &lt;em&gt;investigation strategy&lt;/em&gt;, not from any deliberate attempt to manage the context window. Probably this is on&lt;br&gt;
purpose; these tools are currently in a race to be the "best", not to be the most efficient. Caching is there to make&lt;br&gt;
us not spend too much, but that doesn't help against &lt;a href="https://contextpatterns.com/patterns/context-rot/" rel="noopener noreferrer"&gt;context rot&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  The future
&lt;/h2&gt;

&lt;p&gt;I got some preliminary results across a handful of git configurations (no git, clean repo, full history, buried&lt;br&gt;
history) to see how the available context changes each tool's strategy. Opus becomes less efficient without git history&lt;br&gt;
to guide it, and Sonnet without git is rough (58 API calls, 3+ minutes, 79.9K tokens). Codex barely notices the difference.&lt;/p&gt;

&lt;p&gt;Where this gets interesting is on tasks that actually fill the context window, where Gemini's reading habits could push&lt;br&gt;
it into compaction or truncation territory while Opus still has 90% of its window free. But all this is for a follow-up&lt;br&gt;
post.&lt;/p&gt;
&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/larsderidder/context-lens" rel="noopener noreferrer"&gt;Context Lens&lt;/a&gt; is open source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; context-lens
context-lens claude   &lt;span class="c"&gt;# or codex, gemini, pi, aider, etc.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It shows you real-time composition breakdowns, turn-by-turn diffs, and flags issues like oversized tool results.&lt;br&gt;
Basically the devtools Network tab for LLM API calls. Or for something lighter, check out &lt;a href="https://github.com/larsderidder/contextio" rel="noopener noreferrer"&gt;ContextIO&lt;/a&gt;,&lt;br&gt;
a toolkit for monitoring / redacting / logging LLM API calls.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>research</category>
    </item>
    <item>
      <title>Are We Becoming QA for the Machine?</title>
      <dc:creator>Lars de Ridder</dc:creator>
      <pubDate>Wed, 18 Feb 2026 16:41:04 +0000</pubDate>
      <link>https://dev.to/larsderidder/are-we-becoming-qa-for-the-machine-e7a</link>
      <guid>https://dev.to/larsderidder/are-we-becoming-qa-for-the-machine-e7a</guid>
      <description>&lt;p&gt;I was calmly yelling at my CLI because the button wasn’t the same green as stated in &lt;code&gt;CODE\_STANDARDS.md\&lt;/code&gt;, and I realized I wasn't building an app, I've spent the last hours only testing it.&lt;/p&gt;

&lt;p&gt;I recently built my first real application almost entirely with coding agents. The app worked, which kind of surprised me. I mean, I didn't write any code, and sure, I knew what to say, but still. And it felt &lt;em&gt;great&lt;/em&gt;: productive and fast. But somewhere along the way, I stopped being the software engineer who builds and became what felt like a glorified tester.&lt;/p&gt;

&lt;p&gt;Actually to be honest, it was even worse. At various points in the execution I went like "I don't know what I should add, what do you think Opus?". He told me, he did it, I tested it, and said "oh great, what's next?". Rinse, repeat, descend into madness.&lt;/p&gt;

&lt;p&gt;So yeah, that's probably my fault. But who hasn't done this? Or who hasn't asked ChatGPT for that one great SaaS idea it supposedly hasn't told anyone else about yet? And then started building it?&lt;/p&gt;

&lt;p&gt;I've since written more non-trivial apps with coding agents, and their limitations are becoming quite clear. Letting them steer the process is a big mistake in most cases. But they're reaching the level where they're genuinely good at coming up with things. I'm a bit afraid of the moment they get just that bit smarter than me.&lt;/p&gt;

&lt;p&gt;Don’t get me wrong, it’s fun seeing your idea come alive this quickly. But that’s all you’re doing; you’re not engineering it, you’re testing if it does what you think it should do. Which nobody bothered to write down. And when someone did, it was the AI.&lt;/p&gt;

&lt;p&gt;Nobody has ever been able to properly spec a non-trivial piece of software, so I don’t think that’s going to change anytime soon. If agentic coding tools plateau here, I think this is what we become: we take whatever thought bubble we manage to capture from somewhere in the organization, refine it, put it in an agent, and keep iterating until we think it’s fine. Then we release it to prod, get feedback that it’s completely wrong, and iterate further.&lt;/p&gt;

&lt;p&gt;I guess that sounds familiar.&lt;/p&gt;

&lt;p&gt;I found myself exhausted after these sessions, more than after a day of regular coding. Probably it’s me trying to keep up with the machine. I’m not used to being the bottleneck in software development, but now I get to open 3 terminals that all work on something different in the same codebase, plus 9 more for 3 other projects. I try to match their output speed through multitasking and context switching, planning around conflicts between them. It’s like air traffic control, but now all the passenger planes are F-16s.&lt;/p&gt;

&lt;p&gt;Or maybe it’s the constant abstraction. You have to continuously discuss the concept, dig up what something is supposed to do, articulate what you’d like to see, instead of taking the time to make a part of it happen. There’s a specific kind of fatigue that comes from translating intent all day without ever touching the material yourself.&lt;/p&gt;

&lt;p&gt;It’s addictive though, isn’t it? Once you start, it’s hard to go back to the old way of working, without a personable companion that’s always around. And I think that’s a problem.&lt;/p&gt;

&lt;p&gt;An exception in this codebase? Let’s just paste it in Codex. A bug report? Let’s ask Claude. Claude says something I’m not sure about? Let’s ask Gemini what it thinks.&lt;/p&gt;

&lt;p&gt;It doesn’t feel healthy. The feedback loop is so tight that you never have to sit with a problem long enough to actually understand it. You just keep throwing it back at the machine, and you can just pick out what sounds best. Again probably my fault, but again, they make it so &lt;em&gt;easy&lt;/em&gt; and &lt;em&gt;attractive&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;And the agents keep getting more capable. But when I step back and look at my own workflow, I also wonder: where does this end? What am I actually getting better at?&lt;/p&gt;

&lt;p&gt;The job used to be building the thing, crafting it from minute details up to something that was more than the sum of its parts. Like a carpenter building a house beam by beam, fitting every joint, feeling where the wood resists. Now it’s more like snapping together prefab walls; the house goes up faster, but you never really held the wood.&lt;/p&gt;

&lt;p&gt;It isn't completely different, but it isn't the same. It's fun but it isn't the same. I think I'll miss it.&lt;/p&gt;

&lt;p&gt;Let me ask ChatGPT how to feel about this.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
