feat: per-task tool metrics with structured arg matchers (closes #366) #388
Merged
Pull request overview
This PR implements Wave 2 of #366 by adding structured tool-argument matching for tool-related graders, emitting a normalized per-run tool_events[] record into results.json (schemaVersion 1.1), and surfacing aggregate tool-use metrics in waza compare.
Changes:
- Add `internal/graders/argmatcher` (equals/regex/contains/range/json_schema) and wire matchers into `tool_calls` (`expect`) and `tool_constraint` (`args` on tool specs).
- Build deterministic `runs[].tool_events[]` from Copilot SDK session events and emit it in `results.json` (schemaVersion bumped to 1.1).
- Extend `waza compare` to compute and print tool-use aggregate metrics (and include them in JSON output).
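To make the matcher idea concrete, here is a minimal sketch of evaluating a single-key matcher mapping against one tool argument. It assumes the single-key form and the `gte`/`lte`/`gt`/`lt` range bounds mentioned in the review follow-up commit; `matchArg` and its types are hypothetical and are not the `argmatcher` package's API. `json_schema` is left out because it would need a schema library.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
	"regexp"
	"strings"
)

// matchArg evaluates one single-key matcher mapping against an argument value.
// Hypothetical helper for illustration only; not the real argmatcher API.
func matchArg(matcher map[string]any, value any) (bool, error) {
	if len(matcher) != 1 {
		return false, fmt.Errorf("matcher must have exactly one key, got %d", len(matcher))
	}
	for kind, spec := range matcher {
		switch kind {
		case "equals":
			return reflect.DeepEqual(spec, value), nil
		case "contains":
			sub, okSpec := spec.(string)
			s, okVal := value.(string)
			return okSpec && okVal && strings.Contains(s, sub), nil
		case "regex":
			pattern, ok := spec.(string)
			if !ok {
				return false, errors.New("regex matcher needs a string pattern")
			}
			re, err := regexp.Compile(pattern)
			if err != nil {
				// Surfaced as a construct-time error rather than a failed match.
				return false, fmt.Errorf("invalid regex: %w", err)
			}
			s, isString := value.(string)
			return isString && re.MatchString(s), nil
		case "range":
			bounds, ok := spec.(map[string]float64)
			if !ok {
				return false, errors.New("range matcher needs numeric bounds")
			}
			n, isNum := value.(float64)
			if !isNum {
				return false, nil
			}
			return inRange(n, bounds)
		default:
			return false, fmt.Errorf("unsupported matcher kind %q", kind)
		}
	}
	return false, errors.New("empty matcher")
}

// inRange checks n against every bound present (assumed keys: gte, lte, gt, lt).
func inRange(n float64, bounds map[string]float64) (bool, error) {
	ok := true
	for op, b := range bounds {
		switch op {
		case "gte":
			ok = ok && n >= b
		case "lte":
			ok = ok && n <= b
		case "gt":
			ok = ok && n > b
		case "lt":
			ok = ok && n < b
		default:
			return false, fmt.Errorf("unknown range bound %q", op)
		}
	}
	return ok, nil
}

func main() {
	ok, err := matchArg(map[string]any{"regex": `^src/.*\.go$`}, "src/auth.go")
	fmt.Println(ok, err)
	ok, err = matchArg(map[string]any{"range": map[string]float64{"gte": 1, "lte": 5}}, 7.0)
	fmt.Println(ok, err)
}
```

Compiling the regex inside the match call keeps the sketch short; a real grader would more likely compile once when the spec is loaded, which is what the construct-error tests in this PR suggest.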
| File | Description |
|---|---|
| site/src/content/docs/reference/schema-changes.md | Documents the schema 1.1 results.json addition and updates the schema changelog. |
| site/src/content/docs/reference/cli.mdx | Documents the new waza compare TOOL USE metrics section (schema 1.1+). |
| site/src/content/docs/guides/graders.mdx | Documents tool_calls.expect and tool_constraint.args using structured argument matchers. |
| README.md | Notes results.json schema 1.1 and the new tool_events[] field. |
| internal/orchestration/tool_events.go | Adds deterministic builder for normalized []ToolEvent from session events. |
| internal/orchestration/tool_events_test.go | Tests tool-events builder correlation/turn/sequence/error behaviors. |
| internal/orchestration/runner.go | Plumbs ToolEvents into RunResult during execution. |
| internal/models/tool_event.go | Defines the ToolEvent normalized schema and JSON shape. |
| internal/models/schema_version.go | Bumps CurrentSchemaVersion to 1.1. |
| internal/models/outcome.go | Adds RunResult.ToolEvents []ToolEvent to results output. |
| internal/models/outcome_schema_test.go | Updates schema default test and adds tool-events JSON round-trip coverage. |
| internal/models/grader_params.go | Extends grader parameter models to accept matcher-based args (Args) and tool_calls.expect. |
| internal/graders/tool_constraint_grader.go | Compiles and evaluates arg matchers for tool_constraint tool specs. |
| internal/graders/tool_constraint_grader_test.go | Adds tests for tool_constraint arg matcher behavior and construct errors. |
| internal/graders/tool_calls_grader.go | Implements expect evaluation with tool-name regex + arg matcher evaluation. |
| internal/graders/tool_calls_grader_test.go | Adds tests for expect scenarios (missing tool, matcher pass/fail, invalid matcher). |
| internal/graders/tool_args.go | Adds shared argument normalization and matcher-evaluation helpers. |
| internal/graders/argmatcher/matcher.go | Implements matcher decoding/validation + match execution for 5 matcher kinds. |
| internal/graders/argmatcher/matcher_test.go | Unit tests for matcher decoding, matching semantics, and error cases. |
| cmd/waza/cmd_migrate_test.go | Updates migrate tests for new current schemaVersion and messaging. |
| cmd/waza/cmd_compare.go | Adds tool-use metrics aggregation, JSON report fields, and TOOL USE table output. |
| cmd/waza/cmd_compare_test.go | Adds tests for tool metrics aggregation, histogram, selection accuracy, and gating. |
Review details
- Files reviewed: 22/22 changed files
- Comments generated: 8
- Review effort level: Low
spboyer pushed a commit that referenced this pull request Jun 28, 2026
- graders.mdx: matchers are single-key mappings (no kind field); graders evaluate session_digest.tool_calls (not tool_events[]); range matcher uses gte/lte/gt/lt (not [min, max]).
- tool_events.go: stringifyResult comment matches JSON-only behavior.
- cmd_compare.go: histogram bucketed per-run (not truncated per-task avg); added 'Tasks w/ tools' row; renamed 'Tasks w/' to 'Runs w/' labels; use tagged switch on runCalls.
- schema-changes.md / README.md / schema.mdx: missing schemaVersion is interpreted as the current schema version (1.1), not 1.0.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
c755cc9 to 251ddde
- Add argmatcher package (equals/regex/contains/range/json_schema)
- Extend tool_calls grader with expect: [{tool, args}] block
- Extend tool_constraint grader with args: matchers on expect_tools
- Add normalized tool_events[] to RunResult (turn, sequence, tool_call_id,
tool_name, args, result, success, duration_ms, error) populated from
session events — replay-friendly for Wave 3 (#367), OTel-aligned
- Bump results.json schemaVersion to 1.1 (MINOR additive per #368/#382)
- waza compare prints aggregate TOOL USE section (total calls, success
rate, avg/task, histogram, selection accuracy) when tool data present
- Unit tests for matchers, builder, both graders, schema round-trip,
compare metrics + histogram
- Docs: graders.mdx (expect/args), schema-changes.md (1.1 entry),
cli.mdx (compare TOOL USE), README.md
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
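As an illustration of how such a builder could work, the sketch below pairs start and complete events by `tool_call_id` and emits records with the field names listed in the commit message. The `sessionEvent` type, the event kinds, the run-wide 1-based `sequence`, and the choice to leave an unpaired start as `success: false` are all assumptions made for this example, not the behavior of `internal/orchestration/tool_events.go`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ToolEvent mirrors the field list from the commit message; the JSON tags
// follow the sample record in the PR description.
type ToolEvent struct {
	Turn       int            `json:"turn"`
	Sequence   int            `json:"sequence"`
	ToolCallID string         `json:"tool_call_id"`
	ToolName   string         `json:"tool_name"`
	Args       map[string]any `json:"args,omitempty"`
	Result     string         `json:"result,omitempty"`
	Success    bool           `json:"success"`
	DurationMs int64          `json:"duration_ms"`
	Error      string         `json:"error,omitempty"`
}

// sessionEvent is a hypothetical, simplified stand-in for SDK session events.
type sessionEvent struct {
	Kind       string // assumed kinds: "turn_start", "tool_start", "tool_complete"
	ToolCallID string
	ToolName   string
	Args       map[string]any
	Result     string
	Error      string
	Success    bool
	AtMs       int64 // event timestamp in milliseconds
}

// buildToolEvents walks events in order, so the same input always yields the same output.
func buildToolEvents(in []sessionEvent) []ToolEvent {
	var out []ToolEvent
	position := map[string]int{}  // tool_call_id -> index in out
	startedAt := map[string]int64{}
	turn := 0
	for _, ev := range in {
		switch ev.Kind {
		case "turn_start":
			turn++
		case "tool_start":
			if ev.ToolCallID == "" {
				continue // malformed start: nothing to correlate on
			}
			position[ev.ToolCallID] = len(out)
			startedAt[ev.ToolCallID] = ev.AtMs
			out = append(out, ToolEvent{
				Turn:       turn,
				Sequence:   len(out) + 1, // assumption: run-wide, 1-based
				ToolCallID: ev.ToolCallID,
				ToolName:   ev.ToolName,
				Args:       ev.Args,
			})
		case "tool_complete":
			i, ok := position[ev.ToolCallID]
			if !ok {
				continue // completion for an unknown call is ignored
			}
			out[i].Result = ev.Result
			out[i].Success = ev.Success
			out[i].Error = ev.Error
			out[i].DurationMs = ev.AtMs - startedAt[ev.ToolCallID]
		}
	}
	return out
}

func main() {
	events := buildToolEvents([]sessionEvent{
		{Kind: "turn_start"},
		{Kind: "tool_start", ToolCallID: "call_abc", ToolName: "edit", AtMs: 100},
		{Kind: "tool_complete", ToolCallID: "call_abc", Result: "edited", Success: true, AtMs: 242},
	})
	encoded, _ := json.MarshalIndent(events, "", "  ")
	fmt.Println(string(encoded))
}
```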
The MCP mocks test in #387 used an empty schemaVersion and expected the 1.1 error path. Because LoadEvalSpec normalizes empty schemaVersion to the current version (1.1), the test passed validation instead of failing. Make the test explicit by setting schemaVersion: '1.0' to actually trigger the gate, then bump to '1.1' in the second half.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
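The interaction described in this commit message can be sketched as follows: an empty version is normalized to the current one before the feature gate runs, so only an explicit `1.0` trips the gate. The function names here (`normalizeSchemaVersion`, `supportsSchema`) are hypothetical and do not reflect the signature of `LoadEvalSpec`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// currentSchemaVersion matches the version this PR bumps to.
const currentSchemaVersion = "1.1"

// normalizeSchemaVersion treats a missing version as the current one.
func normalizeSchemaVersion(v string) string {
	if strings.TrimSpace(v) == "" {
		return currentSchemaVersion
	}
	return v
}

// parseVersion splits "MAJOR.MINOR" into integers.
func parseVersion(v string) (major, minor int, err error) {
	majorStr, minorStr, ok := strings.Cut(v, ".")
	if !ok {
		return 0, 0, fmt.Errorf("schemaVersion %q is not MAJOR.MINOR", v)
	}
	if major, err = strconv.Atoi(majorStr); err != nil {
		return 0, 0, err
	}
	if minor, err = strconv.Atoi(minorStr); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// supportsSchema reports whether a spec's version satisfies a feature's minimum.
func supportsSchema(specVersion, required string) (bool, error) {
	gotMajor, gotMinor, err := parseVersion(normalizeSchemaVersion(specVersion))
	if err != nil {
		return false, err
	}
	reqMajor, reqMinor, err := parseVersion(required)
	if err != nil {
		return false, err
	}
	return gotMajor > reqMajor || (gotMajor == reqMajor && gotMinor >= reqMinor), nil
}

func main() {
	for _, v := range []string{"", "1.0", "1.1"} {
		ok, err := supportsSchema(v, "1.1")
		fmt.Printf("schemaVersion=%q allowed=%v err=%v\n", v, ok, err)
	}
}
```

Under these assumptions a test that wants to exercise the error path has to set the older version explicitly, which is the fix the commit describes.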
42d233a to 42143d2
- persist compiled matcher in validateToolSpecs (map value semantics)
- capture engine-specific tool args via ToolCallArgs.Extra (mapstructure ',remain')
- bucket call_count_histogram per-task across trials (not per-run)
- rename TOOL USE table label 'Runs w/' -> 'Tasks w/' to match metric
- sync README tool_events[] field list with ToolEvent struct
- add IsCompiled() accessor + tests for persisted compile, extra args, and per-task histogram with trials_per_task > 1
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
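One way to read "per-task across trials" is sketched below: each task contributes a single histogram entry derived from all of its trials. The commit message does not say how trials are combined, so the rounded mean used here is purely an assumption, as are the function names.

```go
package main

import (
	"fmt"
	"math"
)

// bucketFor maps a call count to the 0/1/2/3+ buckets named in the PR.
func bucketFor(calls int) string {
	switch {
	case calls <= 0:
		return "0"
	case calls == 1:
		return "1"
	case calls == 2:
		return "2"
	default:
		return "3+"
	}
}

// perTaskHistogram takes tool-call counts per trial, keyed by task, and
// returns one bucket entry per task.
// Assumption: trials are combined with a rounded mean; the real rule may differ.
func perTaskHistogram(callsByTask map[string][]int) map[string]int {
	hist := map[string]int{"0": 0, "1": 0, "2": 0, "3+": 0}
	for _, trials := range callsByTask {
		if len(trials) == 0 {
			continue // no runs recorded for this task
		}
		sum := 0
		for _, c := range trials {
			sum += c
		}
		mean := float64(sum) / float64(len(trials))
		hist[bucketFor(int(math.Round(mean)))]++
	}
	return hist
}

func main() {
	hist := perTaskHistogram(map[string][]int{
		"task-a": {0, 0},
		"task-b": {1, 2},
		"task-c": {5, 4, 3},
	})
	fmt.Println(hist)
}
```

With `trials_per_task > 1`, this yields three histogram entries for three tasks rather than one entry per run.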
Implements #366 — Wave 2.
Summary
Three additive capabilities, all gated by schemaVersion 1.1 (MINOR bump per #368/#382):
A. Structured argument matchers for `tool_calls` / `tool_constraint`
New `internal/graders/argmatcher` package with five matcher kinds:
- `equals`
- `regex`
- `contains`
- `range` (`[min, max]`)
- `json_schema`
Grader changes:
- `tool_calls` gains an `expect: [{tool, args}]` block: assert which tool ran and how it was called.
- `tool_constraint` gains an `args:` map on each `expect_tools` entry.
- `required_tools` / `forbidden_tools` semantics unchanged.
B. Normalized `tool_events[]` in `results.json`
Each `RunResult` now carries an additive `tool_events: []ToolEvent` array, built deterministically from session events:
```json
{
  "turn": 2,
  "sequence": 3,
  "tool_call_id": "call_abc",
  "tool_name": "edit",
  "args": { "path": "src/auth.go", "file_text": "…" },
  "result": "edited",
  "success": true,
  "duration_ms": 142
}
```
- OTel-aligned (`gen_ai.tool.name`, `gen_ai.tool.call.id`, etc.; see `internal/telemetry`).
- `session_digest.tool_calls` preserved.
C. Aggregate tool-use metrics in `waza compare`
When inputs contain tool data, `waza compare` prints a new `TOOL USE` section:
- `total_calls`, `tasks_with_tools`, `avg_calls_per_task`
- `success_rate` (fraction of calls with `success: true`)
- `selection_accuracy` (fraction of tasks where the `tool_calls` grader passed; denominator excludes tasks with no `tool_calls` grader)
- `call_count_histogram` (buckets `0/1/2/3+`)
Section is suppressed for legacy 1.0 results.
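The two ratio metrics above can be sketched as follows. This is a minimal illustration under assumed types (`taskSummary`, `toolUse`), not the aggregation code in `cmd/waza/cmd_compare.go`; a nil grader result stands for "no `tool_calls` grader configured", which keeps that task out of the `selection_accuracy` denominator.

```go
package main

import "fmt"

// taskSummary is a hypothetical per-task input for this sketch.
type taskSummary struct {
	CallSuccess     []bool // one entry per tool call, true when success is true
	ToolCallsPassed *bool  // nil when the task has no tool_calls grader
}

// toolUse holds a subset of the metrics listed in the PR description.
type toolUse struct {
	TotalCalls        int
	TasksWithTools    int
	SuccessRate       float64
	SelectionAccuracy float64
}

// aggregate computes the metrics; ratios stay 0 when their denominator is 0.
func aggregate(tasks []taskSummary) toolUse {
	var m toolUse
	succeeded, graded, passed := 0, 0, 0
	for _, t := range tasks {
		if len(t.CallSuccess) > 0 {
			m.TasksWithTools++
		}
		for _, ok := range t.CallSuccess {
			m.TotalCalls++
			if ok {
				succeeded++
			}
		}
		if t.ToolCallsPassed != nil {
			graded++
			if *t.ToolCallsPassed {
				passed++
			}
		}
	}
	if m.TotalCalls > 0 {
		m.SuccessRate = float64(succeeded) / float64(m.TotalCalls)
	}
	if graded > 0 {
		m.SelectionAccuracy = float64(passed) / float64(graded)
	}
	return m
}

func main() {
	yes, no := true, false
	m := aggregate([]taskSummary{
		{CallSuccess: []bool{true, true, false}, ToolCallsPassed: &yes},
		{CallSuccess: []bool{true}, ToolCallsPassed: &no},
		{}, // no tool calls and no tool_calls grader
	})
	fmt.Printf("%+v\n", m)
}
```

`avg_calls_per_task` is omitted because the description does not say which task count it divides by.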
Test plan
- `internal/graders/argmatcher`: all 5 matcher kinds + invalid-regex / invalid-schema construct errors
- `internal/orchestration/tool_events_test.go`: 8 builder cases (empty, pair-by-ID, turn increments, missing complete, failure path, unknown complete, malformed start, nil args)
- `internal/graders/tool_calls_grader_test.go`: 8 new `expect` cases (equals/regex/contains/range/json_schema/missing tool/extra tool/strict ordering)
- `internal/graders/tool_constraint_grader_test.go`: 4 new `args` cases incl. invalid regex
- `internal/models/outcome_schema_test.go`: ToolEvents round-trip at schemaVersion 1.1
- `cmd/waza/cmd_compare_test.go`: 5 new metric tests (no data gate, totals/success rate, histogram, selection accuracy, hasAnyToolData)
- `go build ./... && go test ./... && go vet ./... && golangci-lint run` all clean
Docs
- `site/src/content/docs/guides/graders.mdx`: `expect:` + matcher kinds reference under tool_calls, `args:` under tool_constraint
- `site/src/content/docs/reference/schema-changes.md`: 1.1 changelog entry, table updated
- `site/src/content/docs/reference/cli.mdx`: `waza compare` TOOL USE section documented
- `README.md`: schemaVersion 1.1 mention
Wave coordination
- `tool_events[]` ordering is stable & deterministic so Wave 3 (#367, "feat: Trace replay and deterministic snapshot for agent runs") snapshot/replay can rely on it.
- `internal/models/tool_event.go` to stay consistent with #362 ("feat: OpenTelemetry trace export for agent runs (agentic-first observability)") / `internal/telemetry`.
Closes #366