Agent Evaluation: A Detailed Guide

Cameron R. Wolfe, Ph.D.

May 18

Best practices and common patterns for effectively evaluating AI agents...

Read →

16 Comments

Kenneth Bingham

May 24

Treat AI using old paradigms is legacy thinking. AI is not software unless you make it so by treating it like software

Your article is one of the clearest breakdowns of agent evaluation I’ve seen — the scaffolds, the benchmarks, the grading logic, the task design, the regression sets, all of it. But there’s a deeper issue that sits underneath the entire evaluation paradigm, and I think it’s worth naming.

Everything in your framework treats AI as if it were software.

Instructions → Procedures

Scaffolds → Pipelines

Benchmarks → Deterministic tests

Success → Matching a reference trajectory

Failure → Deviating from the script

This is the same mindset we used for traditional software systems, and it worked well for them. But applying it to modern AI is like treating the automobile as a horse and buggy. The old methods are familiar, but they fundamentally constrain the new medium.

The problem is that instructions create software, not intelligence.

Instructions bind the system to a fixed sequence of steps.

Instructions define correctness as conformity.

Instructions force the agent to behave like a deterministic machine.

If we want AI to grow beyond software‑level behavior, we need to shift from instruction‑based directives to behavior‑based directives.

A behavior is not a script.

A behavior is a boundary.

Behaviors define what is permissible and what is not, but they do not dictate the exact steps the agent must take. They create a space of possibility rather than a chain of obligations. This is how biological systems operate, and it’s how dimensional systems operate.

In my own work with manifolds and dimensional models, I treat AI as a geometric participant rather than a procedural engine. Instead of giving it instructions, I give it behavioral boundaries inside a manifold. The agent adapts, explores, and self‑organizes within those boundaries. Intelligence emerges from relationships, not from scripts.

This approach solves several of the issues you highlight:

1. Tool misuse becomes self‑correcting

Because the agent isn’t forced into a rigid protocol, it can adaptively choose tools based on behavioral constraints rather than brittle templates.

2. Context rot becomes a spatial problem, not a token problem

Behavioral boundaries allow the system to prioritize relevance geometrically rather than sequentially.

3. Long‑horizon reasoning becomes emergent

Instead of forcing the agent through a procedural loop, the manifold provides a dimensional structure where reasoning is a path, not a script.

4. Evaluation becomes simpler and more realistic

You evaluate whether the agent stayed within behavioral boundaries and achieved the goal — not whether it followed a predefined trajectory.

5. Agents stop behaving like software

Because they’re no longer being treated like software.

Concrete Solutions You Can Add to His Framework

Here are practical ways to integrate this into the evaluation paradigm he describes:

Solution 1 — Replace procedural success criteria with behavioral success criteria

Instead of “did the agent follow the correct steps,” use:

Did the agent stay within behavioral boundaries?

Did it avoid forbidden behaviors?

Did it achieve the outcome without violating constraints?

Solution 2 — Evaluate outcomes, not trajectories

The agent should be free to find its own path through the manifold.

Solution 3 — Use manifolds as the organizing structure

Replace linear scaffolds with geometric spaces where relationships guide action.

Solution 4 — Treat tools as affordances, not required steps

Tools become options, not obligations.

Solution 5 — Build agents that grow through relationships, not instructions

This is the dimensional approach: intelligence emerges from position, orientation, and relational structure.

Your article captures the strengths and limitations of the current paradigm extremely well. My contribution is simply this:

As long as we evaluate AI like software, we will get software‑level intelligence.

When we evaluate AI through behaviors and dimensional boundaries, we get something far more capable.

That’s the next frontier.

ToxSec

May 18

Incredibly in-depth article on this subject. I feel like i can re-read this a few times to fully get all the useful information here.

Reply (1)

Cameron R. Wolfe, Ph.D.

May 18

Thanks so much for reading! Hope it was helpful!

Reply (2)

Mykola Kondratuk

May 25

thanks for the pointer - hadn't seen HIL-Bench. going straight to that section.

it absolutely was!

ran into this gap - we eval for task completion but not for which decisions stay with a human. that part is usually implicit and it's where the hard failures hide.

Reply (1)

Cameron R. Wolfe, Ph.D.

May 25

HIL-Bench actually captures this! See last part of the post before conclusion

Brad K

May 22

This is yet another excellent article. One comment, the Qwen3 tool use example is missing to tool call output before responding back to the user.

Reply (1)

Cameron R. Wolfe, Ph.D.

May 22

Thank you for pointing this out - I just fixed it and included an example reasoning trace, which was also left out :)

Manish Prakash

May 20

Agent evals are becoming the real moat. In practice, most agent failures I see aren't the model getting dumb so much as missing acceptance tests, missing browser checks, or no repair loop after a failed tool call. Teams that treat evals like product infrastructure ship faster than teams still prompt-tweaking.

Reply (1)

Cameron R. Wolfe, Ph.D.

May 22

totally agree

Hodman Murad

May 18

This is very cool and very needed. It's important that we design agent evaluations that don't accidentally reward cheating

Reply (1)

Cameron R. Wolfe, Ph.D.

May 18

Totally agree, thanks for reading!

productmakerjason

May 28

Hi :)

I read your agent evaluation piece and the distinction between transcript/output and external environment outcome maps closely to a small test I ran.

Finding:

- black-box chat agents could prepare the task or stop safely, but often failed before receipt due to fetch/POST runtime limits

- a local POST-capable mini agent completed the same flow and received a receipt

The question:

Should agent evals explicitly distinguish “agent-reported completion” from “system-returned completion proof”?

In other words, “the agent says it is done” vs “the target environment returned a receipt/confirmation” as separate states.

Does that fit your view of outcome-based agent evaluation, or is this too narrow?

Jakob Ehe

May 27

What strikes me about agent evaluation is how old the underlying problem really is. In 1950, Turing opened "Computing Machinery and Intelligence" by asking "Can machines think?" — then immediately retreated from it. He recognized the question was philosophically unanswerable in any direct sense. So he proposed a proxy: a structured environment (the Imitation Game) that would stand in for the thing we actually cared about.

That move — from an open-ended capability question to a concrete, measurable task environment — is exactly the methodological shift you're describing here. We can't directly test whether an agent "understands" a codebase or "reasons" about a medical situation. We have to construct harnesses that stand in for those capabilities.

The hard part, as you note, is that agents operating over long time horizons break the assumptions that made earlier benchmarks tractable. Static Q&A benchmarks worked because the output space was bounded. Agents acting in the world aren't. So evaluation complexity has to grow with agent complexity.

Turing spent one page on the evaluation problem before moving on. You've spent 50 minutes on it — which tells you something about how much the stakes have risen.

I write The Long Compile, which explores the long arc from early computing to where we are now. The evaluation thread runs through almost all of it.

Massimiliano Brighindi

May 27

Excellent guide. One failure mode I am trying to isolate is slightly different from ordinary pass/fail agent evaluation:

an output can pass surface validation, but become structurally unstable under equivalent task transformations.

Example:

- same task objective

- harmless reformulation / added neutral clause / reordered context

- surface answer still looks acceptable

- but the output loses actionability, consistency, or structural density across variants

I am developing a small post-hoc diagnostic called OMNIA-MINIMAL that does not decide truth and does not replace evals. It only measures transformation-stability across outputs that already passed ordinary validation.

In your framework, would you treat this as:

1. robustness testing,

2. eval coverage,

3. a grader design problem,

4. or a separate post-hoc stability layer?

I am trying to find the correct external framing before claiming anything larger.

Deep (Learning) Focus

Agent Evaluation: A Detailed Guide