Problem
Waza already has tool_constraint, tool_calls, and action_sequence graders, plus a session_digest.tool_calls field. The gaps are at the edges:
- Argument matching is shallow. Current graders assert tool names and counts well; deep argument matching (JSON-schema, regex on specific fields, equality with tolerance) is hand-rolled per task.
- Cross-engine event normalization is uneven. Different engines surface tool calls in different shapes, so graders sometimes have to special-case individual engines.
- No aggregate "tool-use" metrics. No equivalent of "tool selection accuracy" or "tool input correctness" across a suite — only per-task pass/fail.
Proposal
Three focused enhancements to existing graders and the session digest:
A. Richer argument matching in tool_calls / tool_constraint
- type: tool_calls
  expect:
    - name: search_docs
      args:
        query: { regex: "^(authentication|auth)\\b" }
        limit: { equals: 10 }
Supported matchers: equals, regex, json_schema, contains, range.
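To make the matcher semantics concrete, here is a minimal Go sketch of how per-argument matchers might be evaluated. The `Matcher` type, its fields, and `MatchArgs` are hypothetical names for illustration, not existing Waza code. `json_schema` is left out because it would need a schema-validation library, and the numeric handling of `equals` (treating 10 and 10.0 as equal) is an assumption.

```go
package main

import (
	"fmt"
	"reflect"
	"regexp"
	"strings"
)

// Matcher is a hypothetical in-memory form of one argument matcher.
// Exactly one field is expected to be set; json_schema is omitted here.
type Matcher struct {
	Equals   any
	Regex    string
	Contains string
	Range    *[2]float64 // inclusive [min, max]
}

// toFloat converts the numeric types a decoded JSON or YAML value may have.
func toFloat(v any) (float64, bool) {
	switch n := v.(type) {
	case int:
		return float64(n), true
	case int64:
		return float64(n), true
	case float64:
		return n, true
	}
	return 0, false
}

// Match reports whether a single argument value satisfies the matcher.
func (m Matcher) Match(got any) (bool, error) {
	switch {
	case m.Regex != "":
		s, ok := got.(string)
		if !ok {
			return false, nil
		}
		return regexp.MatchString(m.Regex, s)
	case m.Contains != "":
		s, ok := got.(string)
		return ok && strings.Contains(s, m.Contains), nil
	case m.Range != nil:
		f, ok := toFloat(got)
		return ok && f >= m.Range[0] && f <= m.Range[1], nil
	default:
		// equals: compare numerically when both sides are numbers (assumption),
		// otherwise fall back to deep equality.
		a, aok := toFloat(m.Equals)
		b, bok := toFloat(got)
		if aok && bok {
			return a == b, nil
		}
		return reflect.DeepEqual(m.Equals, got), nil
	}
}

// MatchArgs checks every expected argument; a missing argument is a mismatch.
func MatchArgs(expect map[string]Matcher, args map[string]any) (bool, error) {
	for name, m := range expect {
		got, present := args[name]
		if !present {
			return false, nil
		}
		ok, err := m.Match(got)
		if err != nil || !ok {
			return false, err
		}
	}
	return true, nil
}

func main() {
	// Mirrors the YAML example: query must start with "auth", limit must equal 10.
	expect := map[string]Matcher{
		"query": {Regex: `^(authentication|auth)\b`},
		"limit": {Equals: 10},
	}
	args := map[string]any{"query": "auth token refresh", "limit": float64(10)}
	ok, err := MatchArgs(expect, args)
	fmt.Println(ok, err) // prints: true <nil>
}
```

Returning the regex compile error separately from a plain mismatch lets a grader report a bad matcher definition differently from a failing task.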
B. Normalized tool events in results.json
Add a stable, engine-agnostic tool_events[] array per task (alongside the existing session_digest). Each event: { turn, tool_name, args, result, success, duration_ms }. Existing fields preserved for backward compatibility.
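As a rough illustration of the event shape, the Go sketch below defines a struct with the six proposed fields and maps one engine-specific record onto it. The Go types, the JSON tags, the `rawCall` shape, and `normalize` are all assumptions made for the example; the proposal only names the fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ToolEvent mirrors the proposed tool_events[] entry.
// Field types and JSON tags are assumptions, not a settled schema.
type ToolEvent struct {
	Turn       int            `json:"turn"`
	ToolName   string         `json:"tool_name"`
	Args       map[string]any `json:"args"`
	Result     any            `json:"result"`
	Success    bool           `json:"success"`
	DurationMs int64          `json:"duration_ms"`
}

// rawCall is a hypothetical engine-specific record, used only to show
// the mapping step. Real engines will differ.
type rawCall struct {
	Name      string // tool name as reported by the engine
	Arguments string // arguments as a JSON-encoded string
	Error     string // empty when the call succeeded
}

// normalize converts one engine-specific record into a ToolEvent.
func normalize(turn int, rc rawCall) (ToolEvent, error) {
	ev := ToolEvent{Turn: turn, ToolName: rc.Name, Success: rc.Error == ""}
	if rc.Arguments != "" {
		if err := json.Unmarshal([]byte(rc.Arguments), &ev.Args); err != nil {
			return ToolEvent{}, fmt.Errorf("decode args for %s: %w", rc.Name, err)
		}
	}
	return ev, nil
}

func main() {
	ev, err := normalize(1, rawCall{
		Name:      "search_docs",
		Arguments: `{"query":"auth","limit":10}`,
	})
	if err != nil {
		panic(err)
	}
	out, err := json.Marshal(ev)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With one adapter like `normalize` per engine, graders would only ever read the normalized shape and would not need engine-specific branches.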
C. Aggregate metrics in waza compare / dashboard
Surface per-suite metrics: tool selection accuracy (% tasks calling the expected tool first), tool input correctness (% tool calls with all required args matched), call count distribution.
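The metric definitions above could be computed along these lines. This is a sketch only: `TaskResult`, `SuiteMetrics`, and `Aggregate` are hypothetical names, and the per-task inputs are assumed to have been derived from the normalized events and grader results beforehand.

```go
package main

import "fmt"

// TaskResult is a hypothetical per-task summary, not an existing Waza type.
type TaskResult struct {
	ExpectedFirstTool string
	CalledTools       []string // tool names in call order
	ArgsMatched       []bool   // per call: did all required args match?
}

// SuiteMetrics holds the three aggregate metrics named in the proposal.
type SuiteMetrics struct {
	SelectionAccuracy float64     // fraction of tasks calling the expected tool first
	InputCorrectness  float64     // fraction of calls with all required args matched
	CallCounts        map[int]int // number of calls -> number of tasks
}

// Aggregate folds per-task results into suite-level metrics.
func Aggregate(tasks []TaskResult) SuiteMetrics {
	m := SuiteMetrics{CallCounts: map[int]int{}}
	var firstOK, calls, callsOK int
	for _, t := range tasks {
		if len(t.CalledTools) > 0 && t.CalledTools[0] == t.ExpectedFirstTool {
			firstOK++
		}
		m.CallCounts[len(t.CalledTools)]++
		for _, ok := range t.ArgsMatched {
			calls++
			if ok {
				callsOK++
			}
		}
	}
	// Guard against division by zero for empty suites or suites with no calls.
	if len(tasks) > 0 {
		m.SelectionAccuracy = float64(firstOK) / float64(len(tasks))
	}
	if calls > 0 {
		m.InputCorrectness = float64(callsOK) / float64(calls)
	}
	return m
}

func main() {
	tasks := []TaskResult{
		{"search_docs", []string{"search_docs", "read_file"}, []bool{true, false}},
		{"search_docs", []string{"read_file"}, []bool{true}},
	}
	m := Aggregate(tasks)
	fmt.Printf("selection=%.2f input=%.2f counts=%v\n",
		m.SelectionAccuracy, m.InputCorrectness, m.CallCounts)
	// prints: selection=0.50 input=0.67 counts=map[1:1 2:1]
}
```

Input correctness is computed per call rather than per task here, following the "% tool calls" wording; a per-task variant would be an equally plausible reading.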
Why this matters for agentic-first
Tool use is the agentic behavior. Tightening argument matchers, normalizing events across engines, and exposing aggregate tool metrics give authors the loop they need to debug how the agent works, not just whether the final answer was right.
Acceptance criteria
- tool_calls and tool_constraint accept structured argument matchers (equals, regex, json_schema, contains, range).
- results.json gains a normalized tool_events[] array per task; existing session_digest.tool_calls preserved.
- waza compare outputs aggregate tool-use metrics (selection accuracy, input correctness).
- site/ showing migration patterns.
Non-goals (filed separately)
Related
internal/graders/{tool_calls,tool_constraint,action_sequence}_grader.go, session_digest