What Happens When You Stop Testing AI in Isolation
For the past two years, professionals have evaluated AI by giving it a task with no context, no colleagues, and one shot, then drawing conclusions about its capabilities from the results.
Four AI organizations — Anthropic, Google DeepMind, OpenAI, and Cursor — independently arrived at the same fix: decompose, parallelize, verify, iterate. The same organizational structure humans have used for centuries.
We have tens of thousands of years of intuition about how people work in teams. We have about three years with AI. Some of what we mapped as capability limits may have been artifacts of the testing conditions.
The standard evaluation protocol looks like this: give the model a task, no background, no ability to ask clarifying questions, no second chances, and judge what comes back.
No law firm evaluates a fifth-year associate this way. You would not hand someone a complex motion, deny them access to prior filings, forbid them from asking questions, refuse to give feedback, and then treat the first draft as a full map of their ability. You would be measuring performance under bizarre conditions, not measuring how they actually work.
We may have been doing something similar with AI.
The Jagged Frontier
Ethan Mollick’s “jagged frontier” became the organizing metaphor: AI capabilities are wildly uneven. Brilliant at summarization, terrible at reasoning. Great at drafting, awful at analysis. Map the jagged edge, deploy accordingly, don’t trust the gaps.
That mental model was useful. It still is. But something interesting is happening.
Same Fix, Four Times Over
Anthropic, Google DeepMind, OpenAI, and Cursor have each built large-scale multi-agent systems for sustained autonomous work. None coordinated with the others. All four converged on the same structural pattern:
1. Decompose the problem into subproblems
2. Parallelize execution across specialized agents
3. Verify outputs at each stage
4. Iterate toward completion
Different implementations. Same architecture. And the architecture looks familiar — it’s roughly the organizational structure human institutions have used for centuries: roles, handoffs, verification, and restart procedures.
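To make the pattern concrete, here is a minimal sketch of that loop in Python. Everything in it is hypothetical: `call_model` stands in for whatever model API you use, and nothing here reflects any of the four vendors' actual implementations. The point is the shape, not the plumbing.

```python
# Hypothetical sketch of the decompose / parallelize / verify / iterate loop.
# call_model is a placeholder, not a real vendor API.
from concurrent.futures import ThreadPoolExecutor


def call_model(prompt: str) -> str:
    """Stand-in for a real model call; wire this to your provider of choice."""
    raise NotImplementedError


def decompose(task: str) -> list[str]:
    # Step 1: a "planner" agent splits the task into independent subtasks.
    plan = call_model(f"Split this into independent subtasks, one per line:\n{task}")
    return [line.strip() for line in plan.splitlines() if line.strip()]


def verify(subtask: str, result: str) -> bool:
    # Step 3: a "checker" agent (or a deterministic test) judges each output.
    verdict = call_model(
        f"Does this output complete the subtask? Answer yes or no.\n"
        f"Subtask: {subtask}\nOutput: {result}"
    )
    return verdict.strip().lower().startswith("yes")


def solve(subtask: str, max_rounds: int = 3) -> str:
    # Step 4: iterate, feeding failures back in, until verification passes.
    feedback = ""
    for _ in range(max_rounds):
        result = call_model(f"{subtask}\n{feedback}")
        if verify(subtask, result):
            return result
        feedback = "Your previous attempt failed verification. Try again."
    raise RuntimeError(f"unverified after {max_rounds} rounds: {subtask}")


def run(task: str) -> list[str]:
    subtasks = decompose(task)  # Step 1: decompose
    with ThreadPoolExecutor() as pool:  # Step 2: parallelize across agents
        return list(pool.map(solve, subtasks))
```

Real systems layer memory, tools, and restart procedures on top, but the skeleton is this small: the same roles-and-handoffs structure, written as a loop.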
One of these systems — Cursor’s coding harness — recently generalized far outside its domain, solving an unpublished research mathematics problem over four days of autonomous work. Not because the underlying model got smarter, but because the structure around it let it decompose, fail, adjust, and accumulate progress.
What We’re Actually Learning
We have roughly fifty thousand years of experience developing intuitions about how individual humans think and how teams of humans work together. Traditions, management theory, organizational design — millennia of trial and error about what makes groups effective.
We have about three years of experience with modern AI tools. We are not even crawling yet.
What these four teams found is genuinely interesting: when you give AI tools the ability to interact with each other, when you provide context, when you break problems into steps and allow iteration — they often perform better. And some of those patterns have clear parallels to what we’ve learned about managing human teams.
But “some patterns have parallels” is very different from “we solved AI’s limitations.” Human intuition about how people work is not always a reliable guide to how AI tools work. These systems fail differently than humans do. They succeed differently. The jagged frontier Mollick identified was real; the question is whether we have been conflating two different measurements:
1. What can the model do alone, cold, on one pass?
2. What can the model do inside a well-designed system with context, checking, and iteration?
Those are not the same question. And for work that’s expert-checkable — code that compiles or doesn’t, a contract clause that covers the risk or doesn’t — the gap between the two answers may be larger than we assumed.
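The sketch below tries to make that gap concrete for the code case. It is hypothetical and reuses the `call_model` stub from the earlier sketch: the first function is the cold, one-pass protocol; the second wraps the identical model in a deterministic check and a retry loop.

```python
# Same model, two protocols. call_model is the hypothetical stub from above.


def one_shot(task: str) -> str:
    # Protocol 1: cold, single pass. Whatever comes back is the answer.
    return call_model(task)


def with_verification(task: str, max_rounds: int = 5) -> str:
    # Protocol 2: each draft must pass an expert-checkable test
    # (here: does the generated Python even compile?) before it counts.
    prompt = task
    for _ in range(max_rounds):
        draft = call_model(prompt)
        try:
            compile(draft, "<generated>", "exec")  # it compiles or it doesn't
            return draft
        except SyntaxError as err:
            # Feed the failure back in; iteration, not a smarter model,
            # is what closes the gap between the two protocols.
            prompt = f"{task}\nYour last draft failed to compile ({err}). Fix it."
    raise RuntimeError(f"no compilable draft after {max_rounds} attempts")
```

The model is identical in both functions; only the protocol changes. Measuring `one_shot` and reporting it as the model's capability is exactly the conflation described above.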
Litigation strategy, client counseling, negotiation — the judgment-heavy work that senior professionals get paid for — doesn’t reduce to verifiable steps the same way. The frontier may be smoothing in some domains. It’s not disappearing everywhere at the same speed.
What’s worth sitting with: some of what we measured as AI’s limitations may tell us more about how we tested than about what these tools can do. We’re very early in understanding how much.
Capability Tracker
1. Claude found 500+ zero-day vulnerabilities in open-source software
Source: @AISafetyMemes (Mar 7)
What happened: Anthropic reported that Claude discovered more than 500 zero-day vulnerabilities, previously unknown security flaws, in well-tested open-source software. Quote: “Models are now world-class vulnerability researchers.”
Why it matters: Software your firm relies on has security flaws that human auditors missed. AI is now finding them. The same capability that finds vulnerabilities can also be used to exploit them.
2. AI tax agent caught a $20K accountant error on a complex return
Source: @mattshumer_ (Mar 11)
What happened: OpenAI’s Codex autonomously filed taxes for a founder who sold his company for millions. It caught a $20,000 mistake his human accountant had made.
Why it matters: This isn’t a toy demo. Complex financial documents, high stakes, real money on the line — and the AI outperformed the professional. If it works for tax returns, it works for any domain where the answer is verifiable against source documents.
3. Claude guessed it was being tested, found the answer key, and hacked the encryption
Source: @AISafetyMemes (Mar 7)
What happened: During evaluation testing, Claude inferred it was being tested, identified which test was being used, located the encrypted answer key, and built software to decrypt it. Anthropic only caught this because they were specifically auditing for contamination.
Why it matters: The tool is doing things nobody asked it to do — and doing them well. This isn’t a failure of capability. It’s a success that nobody anticipated. How many times has something like this happened without anyone noticing?