verdict

You ship agents. Do you know if they work?

Production-grade evaluation framework for LLM agents.
Score task completion, reasoning quality, and tool use accuracy — in a single run.

The Problem

You write an agent. Your unit tests pass. You ship it. Users report garbage.

Agents pass local tests but fail in production. Mocks are not your agent. Eval cases are.
Impossible to know which dimension is failing. Did it get the answer wrong, or did it reason its way there badly? Did it hallucinate a tool call? These are different bugs requiring different fixes.
No standard metric for agent quality. "It worked in the demo" is not a number. Without a number, you cannot regress, compare, or improve.

verdict gives you a number. Three numbers, actually.

Installation

pip install agent-evals

Quickstart

from agent_evals import AgentEvaluator, AgentResult, EvalCase, ReasoningStep, ToolCall

# Wrap your agent — any callable that returns AgentResult
def my_agent(prompt: str) -> AgentResult:
    # Replace this with your real agent
    return AgentResult(
        output="The capital of France is Paris.",
        reasoning_trace=[
            ReasoningStep(
                step_index=0,
                thought="This is a geography fact I can answer directly.",
                action="recall_knowledge",
                observation="Paris is France's capital — established fact.",
            )
        ],
        tool_calls=[],
        steps_taken=1,
        estimated_cost_usd=0.0001,
        model="gpt-4o-mini",
    )

evaluator = AgentEvaluator(agent_fn=my_agent)

report = evaluator.evaluate([
    EvalCase(
        id="geo-1",
        prompt="What is the capital of France?",
        expected_output="Paris",
    ),
    EvalCase(
        id="calc-1",
        prompt="Calculate compound interest: $1,000 at 5% annual for 10 years.",
        expected_output="1628.89",
        expected_tools=["calculator"],
    ),
    EvalCase(
        id="reason-1",
        prompt="A bat and ball cost $1.10. The bat costs $1 more than the ball. How much does the ball cost?",
        expected_output="0.05",
        metadata={"difficulty": "cognitive_bias", "category": "reasoning"},
    ),
])

print(report.summary())

Output:

EvalReport [run-abc123]  2025-01-15T14:23:11Z
  Model          : gpt-4o-mini
  Cases          : 3  (passed: 2, threshold: 0.7)
  Task completion: 0.867
  Reasoning      : 0.761
  Tool use       : 0.890
  Overall        : 0.839
  Total cost     : $0.0014

The Three Dimensions

Every EvalCase produces three independent scores in [0.0, 1.0].

1. Task Completion — Did the agent answer correctly?

EvalCase(
    id="factual-1",
    prompt="What is the boiling point of water at sea level?",
    expected_output="100",          # substring match, case-insensitive
)

# Or provide a custom checker:
EvalCase(
    id="json-output",
    prompt="Return a JSON object with keys name and age.",
    expected_output=lambda s: '"name"' in s and '"age"' in s,
)

Scoring: exact substring match → 1.0 · no match → 0.0 · callable → callable result

2. Reasoning Quality — Was the agent's thinking coherent and efficient?

AgentResult(
    output="$1,628.89",
    reasoning_trace=[
        ReasoningStep(
            step_index=0,
            thought="Apply A = P(1 + r)^t with P=1000, r=0.05, t=10",
            action="call_tool:calculator",
            observation="Calculator returned 1628.8946...",
        ),
        ReasoningStep(
            step_index=1,
            thought="Round to 2 decimal places for currency.",
            action="format_response",
            observation="Response formatted.",
        ),
    ],
    ...
)

Scoring factors: number of reasoning steps taken vs steps needed · logical coherence of thought→action→observation chains · loop detection (repeated actions penalised)

3. Tool Use Accuracy — Did the agent call the right tools, correctly?

EvalCase(
    id="search-and-summarise",
    prompt="Find the current price of AAPL and summarise the last 5 news headlines.",
    expected_tools=["web_search", "summarise"],   # order-insensitive
)

# In your AgentResult:
AgentResult(
    ...
    tool_calls=[
        ToolCall(name="web_search", arguments={"query": "AAPL stock price"}, result="$182.40", latency_ms=340),
        ToolCall(name="summarise",  arguments={"text": "..."}, result="...", latency_ms=120),
    ],
)

Scoring: F1 of (expected tools called) × precision/recall · each failed tool call (ToolCall.error set) deducts 0.1, capped at 0.5 penalty

EvalReport API

report = evaluator.evaluate(cases)

# Aggregate metrics
report.mean_task_completion   # float [0, 1]
report.mean_reasoning         # float [0, 1]
report.mean_tool_use          # float [0, 1]
report.mean_overall           # float [0, 1]
report.total_cases            # int
report.passed_cases           # int — cases where overall_score >= pass_threshold
report.total_cost_usd         # float

# Per-case drill-down
for result in report.results:
    print(result.case_id, result.task_completion_score, result.tool_use_score)
    if result.stopped_by_guard:
        print(f"  ⚠ guard tripped: {result.stopped_by_guard}")
    if result.error:
        print(f"  ✗ error: {result.error}")

# Export
import json
with open("eval_report.json", "w") as f:
    json.dump(report.to_dict(), f, indent=2)

# Human-readable summary
print(report.summary())

Production Guards

Guards protect your eval runs from runaway agents, infinite loops, and cost blowouts. They fire between every agent step.

from agent_evals import AgentEvaluator, default_guards
from agent_evals import MaxStepsGuard, CostCeilingGuard, LoopDetectionGuard, CompositeGuard

# Sensible defaults for production
evaluator = AgentEvaluator(
    agent_fn=my_agent,
    guard=default_guards(max_steps=50, cost_ceiling=1.0),
)

# Or compose your own:
evaluator = AgentEvaluator(
    agent_fn=my_agent,
    guard=CompositeGuard([
        MaxStepsGuard(max_steps=30),          # Hard cap on steps
        CostCeilingGuard(ceiling_usd=0.25),   # Per-case cost limit
        LoopDetectionGuard(max_repeats=3),    # Kill repeated actions
    ]),
)

When a guard trips, the case is marked stopped_by_guard in CaseResult — your eval run continues with the remaining cases.

Evaluation Pipeline

flowchart LR
    A[EvalCase list] --> B[AgentEvaluator]
    B --> C{For each case}
    C --> D[Call agent_fn]
    D --> E[Guard check per step]
    E -->|Guard tripped| F[CaseResult — stopped_by_guard]
    E -->|All steps pass| G[Score 3 dimensions]
    G --> H[task_completion_score]
    G --> I[reasoning_score]
    G --> J[tool_use_score]
    H & I & J --> K[overall_score = mean]
    K --> L[EvalReport]
    F --> L

Score Aggregation

flowchart TD
    TC[Task Completion\nsubstring / callable match]
    RQ[Reasoning Quality\nstep efficiency + coherence]
    TU[Tool Use Accuracy\nF1 precision + recall]
    OS[overall_score\nmean of 3 dimensions]
    TC --> OS
    RQ --> OS
    TU --> OS
    OS --> PASS{>= pass_threshold\ndefault: 0.7}
    PASS -->|yes| PASSED[CaseResult: passed]
    PASS -->|no| FAILED[CaseResult: failed]

Integration: sentinel + herald

verdict is part of a cohesive agent infrastructure stack:

Package	Role
verdict	Evaluate agents — measure quality across 3 dimensions
sentinel	Guard production runs — `MaxStepsGuard`, `CostCeilingGuard`, `LoopDetectionGuard` at runtime
herald	Structured logging and alerting — emit eval results to dashboards, Slack, PagerDuty

# Full stack: evaluate → guard → observe
from agent_evals import AgentEvaluator, default_guards, EvalCase

# Run evals in CI
evaluator = AgentEvaluator(agent_fn=my_agent, guard=default_guards())
report = evaluator.evaluate(regression_suite)

# Fail CI if quality drops
assert report.mean_overall >= 0.75, f"Quality regression: {report.mean_overall:.3f}"
assert report.mean_tool_use >= 0.80, f"Tool use regression: {report.mean_tool_use:.3f}"

# Export for herald
import json
with open("artifacts/eval_report.json", "w") as f:
    json.dump(report.to_dict(), f, indent=2)

Data Models at a Glance

@dataclass
class EvalCase:
    id: str
    prompt: str
    expected_output: str | Callable | None = None
    expected_tools: list[str] = field(default_factory=list)
    max_tokens: int = 2048
    metadata: dict = field(default_factory=dict)

@dataclass
class AgentResult:
    output: str
    reasoning_trace: list[ReasoningStep] = ...
    tool_calls: list[ToolCall] = ...
    steps_taken: int = 0
    estimated_cost_usd: float = 0.0
    model: str = ""

@dataclass
class CaseResult:
    case_id: str
    task_completion_score: float   # [0, 1]
    reasoning_score: float         # [0, 1]
    tool_use_score: float          # [0, 1]
    overall_score: float           # mean of above
    steps_taken: int
    estimated_cost_usd: float
    latency_ms: float
    stopped_by_guard: str | None
    error: str | None

Philosophy

Dharma is not just action — it is correct action.

verdict measures whether your agent acted correctly.

An agent that produces the right answer through wrong reasoning is a liability. An agent that calls tools accurately but reasons in circles will fail at scale. True quality is three-dimensional: completion, reasoning, precision.

Shipping agents without evaluation is not speed — it is technical debt you cannot measure.

Contributing

See CONTRIBUTING.md. Run the test suite:

pip install -e ".[dev]"
pytest tests/ -v

_{Built by Darshankumar Joshi · MIT License}

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
agent_evals		agent_evals
assets		assets
benchmarks		benchmarks
docs		docs
examples		examples
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

verdict

The Problem

Installation

Quickstart

The Three Dimensions

1. Task Completion — Did the agent answer correctly?

2. Reasoning Quality — Was the agent's thinking coherent and efficient?

3. Tool Use Accuracy — Did the agent call the right tools, correctly?

EvalReport API

Production Guards

Evaluation Pipeline

Score Aggregation

Integration: sentinel + herald

Data Models at a Glance

Philosophy

Contributing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

verdict

The Problem

Installation

Quickstart

The Three Dimensions

1. Task Completion — Did the agent answer correctly?

2. Reasoning Quality — Was the agent's thinking coherent and efficient?

3. Tool Use Accuracy — Did the agent call the right tools, correctly?

EvalReport API

Production Guards

Evaluation Pipeline

Score Aggregation

Integration: sentinel + herald

Data Models at a Glance

Philosophy

Contributing

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages