feat: per-task tool metrics with structured arg matchers (closes #366) by spboyer · Pull Request #388 · microsoft/waza

spboyer · 2026-06-28T12:19:36Z

Implements #366 — Wave 2.

Summary

Three additive capabilities, all gated by schemaVersion 1.1 (MINOR bump per #368/#382):

A. Structured argument matchers for `tool_calls` / `tool_constraint`

New internal/graders/argmatcher package with five matcher kinds:

| Kind | Field | Matches when… |
| --- | --- | --- |
| `equals` | `equals` | argument deeply equals the supplied value |
| `regex` | `regex` | stringified arg matches an RE2 pattern |
| `contains` | `contains` | stringified arg contains the substring |
| `range` | `range` | numeric arg falls within the bounds given by `gte` / `lte` / `gt` / `lt` |
| `json_schema` | `json_schema` | arg validates against the JSON Schema |

tool_calls gains an expect: [{tool, args}] block — assert which tool ran and how it was called.
tool_constraint gains an args: map on each expect_tools entry.
Backward compatible: existing required_tools / forbidden_tools semantics unchanged.
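To make the matcher semantics concrete, here is a minimal sketch of four of the five kinds using only the Go standard library. It is not the argmatcher package itself: the `Matcher` interface, the struct names, `NewRegex`, and `MatchArgs` are hypothetical names chosen for illustration, stringification via `fmt.Sprint` is an assumption, and `json_schema` is left out because it needs a schema validator.

```go
package main

import (
	"fmt"
	"reflect"
	"regexp"
	"strings"
)

// Matcher is a hypothetical interface; the real argmatcher API may differ.
type Matcher interface {
	Match(arg any) bool
}

type equalsMatcher struct{ want any }

// Match uses deep equality so nested maps and slices compare by value.
func (m equalsMatcher) Match(arg any) bool { return reflect.DeepEqual(arg, m.want) }

type regexMatcher struct{ re *regexp.Regexp }

// Match stringifies the argument (an assumption) before applying the pattern.
func (m regexMatcher) Match(arg any) bool { return m.re.MatchString(fmt.Sprint(arg)) }

type containsMatcher struct{ sub string }

func (m containsMatcher) Match(arg any) bool { return strings.Contains(fmt.Sprint(arg), m.sub) }

// rangeMatcher holds optional inclusive bounds; a nil bound means unbounded.
type rangeMatcher struct{ gte, lte *float64 }

func (m rangeMatcher) Match(arg any) bool {
	n, ok := toFloat(arg)
	if !ok {
		return false // non-numeric arguments never match a range
	}
	if m.gte != nil && n < *m.gte {
		return false
	}
	if m.lte != nil && n > *m.lte {
		return false
	}
	return true
}

func toFloat(v any) (float64, bool) {
	switch n := v.(type) {
	case float64:
		return n, true
	case int:
		return float64(n), true
	case int64:
		return float64(n), true
	}
	return 0, false
}

// NewRegex compiles at construct time so an invalid pattern fails early.
// Go's regexp package implements RE2 syntax.
func NewRegex(pattern string) (Matcher, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid regex %q: %w", pattern, err)
	}
	return regexMatcher{re: re}, nil
}

// MatchArgs reports whether every named matcher accepts its argument.
// A missing argument counts as a mismatch.
func MatchArgs(matchers map[string]Matcher, args map[string]any) bool {
	for name, m := range matchers {
		v, ok := args[name]
		if !ok || !m.Match(v) {
			return false
		}
	}
	return true
}

func main() {
	pathRe, err := NewRegex(`^src/.*\.go$`)
	if err != nil {
		panic(err)
	}
	lo, hi := 1.0, 10.0
	matchers := map[string]Matcher{
		"path":      pathRe,
		"file_text": containsMatcher{sub: "package"},
		"retries":   rangeMatcher{gte: &lo, lte: &hi},
	}
	args := map[string]any{"path": "src/auth.go", "file_text": "package auth", "retries": 3}
	fmt.Println(MatchArgs(matchers, args))
}
```

Compiling the pattern in the constructor mirrors the "invalid-regex construct errors" called out in the test plan: a bad pattern is reported when the grader is built rather than when a run is graded.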

B. Normalized `tool_events[]` in `results.json`

Each RunResult now carries an additive tool_events: []ToolEvent array, built deterministically from session events:

{
  "turn": 2,
  "sequence": 3,
  "tool_call_id": "call_abc",
  "tool_name": "edit",
  "args": { "path": "src/auth.go", "file_text": "…" },
  "result": "edited",
  "success": true,
  "duration_ms": 142
}

OTel GenAI alignment (gen_ai.tool.name, gen_ai.tool.call.id, etc. — see internal/telemetry).
Replay-friendly and stable for Wave 3 snapshot/replay (feat: Trace replay and deterministic snapshot for agent runs #367).
Legacy session_digest.tool_calls preserved.
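The pairing logic implied by the builder tests (pair-by-ID, turn increments, missing complete, unknown complete, malformed start) can be sketched roughly as follows. The `ToolEvent` fields follow the JSON shape above plus the `error` field listed in the commit notes; the `omitempty` tags, the `sessionEvent` type, its `Kind` values, the run-wide 1-based `sequence`, and `BuildToolEvents` are all assumptions made for illustration, not the code in internal/orchestration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ToolEvent mirrors the JSON shape in the PR description.
// The omitempty choices here are assumptions.
type ToolEvent struct {
	Turn       int            `json:"turn"`
	Sequence   int            `json:"sequence"`
	ToolCallID string         `json:"tool_call_id"`
	ToolName   string         `json:"tool_name"`
	Args       map[string]any `json:"args,omitempty"`
	Result     string         `json:"result,omitempty"`
	Success    bool           `json:"success"`
	DurationMs int64          `json:"duration_ms"`
	Error      string         `json:"error,omitempty"`
}

// sessionEvent is a hypothetical, simplified stand-in for SDK session events.
type sessionEvent struct {
	Kind       string // "turn_start", "tool_start", or "tool_complete"
	ToolCallID string
	ToolName   string
	Args       map[string]any
	Result     string
	Success    bool
	AtMs       int64 // timestamp in milliseconds
}

// BuildToolEvents pairs start and complete events by tool call ID.
// Output order follows the order of start events, so it is deterministic
// for a given input slice.
func BuildToolEvents(events []sessionEvent) []ToolEvent {
	out := []ToolEvent{}
	index := map[string]int{}     // tool_call_id -> position in out
	started := map[string]int64{} // tool_call_id -> start timestamp
	turn := 0
	for _, ev := range events {
		switch ev.Kind {
		case "turn_start":
			turn++
		case "tool_start":
			if ev.ToolCallID == "" {
				continue // malformed start: nothing to correlate on
			}
			index[ev.ToolCallID] = len(out)
			started[ev.ToolCallID] = ev.AtMs
			out = append(out, ToolEvent{
				Turn:       turn,
				Sequence:   len(out) + 1, // assumed run-wide, 1-based counter
				ToolCallID: ev.ToolCallID,
				ToolName:   ev.ToolName,
				Args:       ev.Args,
			})
		case "tool_complete":
			i, ok := index[ev.ToolCallID]
			if !ok {
				continue // unknown complete: no matching start
			}
			out[i].Result = ev.Result
			out[i].Success = ev.Success
			out[i].DurationMs = ev.AtMs - started[ev.ToolCallID]
		}
	}
	return out // a start with no complete keeps success=false and duration 0
}

func main() {
	events := []sessionEvent{
		{Kind: "turn_start"},
		{Kind: "tool_start", ToolCallID: "call_abc", ToolName: "edit",
			Args: map[string]any{"path": "src/auth.go"}, AtMs: 1000},
		{Kind: "tool_complete", ToolCallID: "call_abc", Result: "edited",
			Success: true, AtMs: 1142},
	}
	b, err := json.MarshalIndent(BuildToolEvents(events), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```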

C. Aggregate tool-use metrics in `waza compare`

When inputs contain tool data, waza compare prints a new TOOL USE section:

total_calls, tasks_with_tools, avg_calls_per_task
success_rate (fraction of calls with success: true)
selection_accuracy (fraction of tasks where the tool_calls grader passed; denominator excludes tasks with no tool_calls grader)
call_count_histogram (buckets 0 / 1 / 2 / 3+)

Section is suppressed for legacy 1.0 results.
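As a rough illustration of how these aggregates relate to each other, the sketch below computes them over a handful of tasks. It is not the code in cmd/waza/cmd_compare.go: the `taskResult` and `toolUseMetrics` types and the `aggregate` function are hypothetical, it assumes one trial per task, and it assumes avg_calls_per_task divides by all tasks rather than only tasks that used tools.

```go
package main

import "fmt"

// taskResult is a hypothetical per-task summary used only for this sketch.
type taskResult struct {
	CallSuccess           []bool // one entry per tool call: did it succeed?
	HasToolCallsGrader    bool
	ToolCallsGraderPassed bool
}

type toolUseMetrics struct {
	TotalCalls        int
	TasksWithTools    int
	AvgCallsPerTask   float64
	SuccessRate       float64
	SelectionAccuracy float64
	Histogram         [4]int // buckets: 0, 1, 2, 3+ calls
}

func aggregate(tasks []taskResult) toolUseMetrics {
	var m toolUseMetrics
	succeeded, graded, passed := 0, 0, 0
	for _, t := range tasks {
		n := len(t.CallSuccess)
		m.TotalCalls += n
		if n > 0 {
			m.TasksWithTools++
		}
		for _, ok := range t.CallSuccess {
			if ok {
				succeeded++
			}
		}
		bucket := n
		if bucket > 3 {
			bucket = 3 // everything at or above 3 lands in the "3+" bucket
		}
		m.Histogram[bucket]++
		// Tasks without a tool_calls grader are excluded from the denominator.
		if t.HasToolCallsGrader {
			graded++
			if t.ToolCallsGraderPassed {
				passed++
			}
		}
	}
	// Guard every ratio against a zero denominator.
	if len(tasks) > 0 {
		m.AvgCallsPerTask = float64(m.TotalCalls) / float64(len(tasks))
	}
	if m.TotalCalls > 0 {
		m.SuccessRate = float64(succeeded) / float64(m.TotalCalls)
	}
	if graded > 0 {
		m.SelectionAccuracy = float64(passed) / float64(graded)
	}
	return m
}

func main() {
	tasks := []taskResult{
		{CallSuccess: []bool{true, true, false}, HasToolCallsGrader: true, ToolCallsGraderPassed: true},
		{CallSuccess: nil},
		{CallSuccess: []bool{true}, HasToolCallsGrader: true},
		{CallSuccess: []bool{true, true, true, true}},
	}
	fmt.Printf("%+v\n", aggregate(tasks))
}
```

In this example there are 8 calls across 4 tasks (7 successful), so the sketch yields an average of 2 calls per task, a success rate of 0.875, a selection accuracy of 0.5 (one of two graded tasks passed), and histogram buckets of 1 / 1 / 0 / 2.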

Test plan

✅ internal/graders/argmatcher — all 5 matcher kinds + invalid-regex / invalid-schema construct errors
✅ internal/orchestration/tool_events_test.go — 8 builder cases (empty, pair-by-ID, turn increments, missing complete, failure path, unknown complete, malformed start, nil args)
✅ internal/graders/tool_calls_grader_test.go — 8 new expect cases (equals/regex/contains/range/json_schema/missing tool/extra tool/strict ordering)
✅ internal/graders/tool_constraint_grader_test.go — 4 new args cases incl. invalid regex
✅ internal/models/outcome_schema_test.go — ToolEvents round-trip at schemaVersion 1.1
✅ cmd/waza/cmd_compare_test.go — 5 new metric tests (no data gate, totals/success rate, histogram, selection accuracy, hasAnyToolData)
✅ go build ./... && go test ./... && go vet ./... && golangci-lint run all clean

Docs

site/src/content/docs/guides/graders.mdx — expect: + matcher kinds reference under tool_calls, args: under tool_constraint
site/src/content/docs/reference/schema-changes.md — 1.1 changelog entry, table updated
site/src/content/docs/reference/cli.mdx — waza compare TOOL USE section documented
README.md — schemaVersion 1.1 mention

Wave coordination

Builds on the schema versioning work in #368 (feat: Schema versioning policy and migration tooling for public artifacts) and #382 (feat: schema versioning policy), and follows its MINOR additive policy.
tool_events[] ordering is stable and deterministic so that Wave 3 snapshot/replay (#367, feat: Trace replay and deterministic snapshot for agent runs) can rely on it.
OTel attribute mapping is documented inline in internal/models/tool_event.go to stay consistent with #362 (feat: OpenTelemetry trace export for agent runs (agentic-first observability)) and internal/telemetry.

Closes #366

Copilot

Pull request overview

This PR implements Wave 2 of #366 by adding structured tool-argument matching for tool-related graders, emitting a normalized per-run tool_events[] record into results.json (schemaVersion 1.1), and surfacing aggregate tool-use metrics in waza compare.

Changes:

Add internal/graders/argmatcher (equals/regex/contains/range/json_schema) and wire matchers into tool_calls (expect) and tool_constraint (args on tool specs).
Build deterministic runs[].tool_events[] from Copilot SDK session events and emit it in results.json (schemaVersion bumped to 1.1).
Extend waza compare to compute and print tool-use aggregate metrics (and include them in JSON output).

Show a summary per file

| File | Description |
| --- | --- |
| site/src/content/docs/reference/schema-changes.md | Documents the schema 1.1 `results.json` addition and updates the schema changelog. |
| site/src/content/docs/reference/cli.mdx | Documents the new `waza compare` TOOL USE metrics section (schema 1.1+). |
| site/src/content/docs/guides/graders.mdx | Documents `tool_calls.expect` and `tool_constraint.args` using structured argument matchers. |
| README.md | Notes `results.json` schema 1.1 and the new `tool_events[]` field. |
| internal/orchestration/tool_events.go | Adds deterministic builder for normalized `[]ToolEvent` from session events. |
| internal/orchestration/tool_events_test.go | Tests tool-events builder correlation/turn/sequence/error behaviors. |
| internal/orchestration/runner.go | Plumbs `ToolEvents` into `RunResult` during execution. |
| internal/models/tool_event.go | Defines the `ToolEvent` normalized schema and JSON shape. |
| internal/models/schema_version.go | Bumps `CurrentSchemaVersion` to `1.1`. |
| internal/models/outcome.go | Adds `RunResult.ToolEvents []ToolEvent` to results output. |
| internal/models/outcome_schema_test.go | Updates schema default test and adds tool-events JSON round-trip coverage. |
| internal/models/grader_params.go | Extends grader parameter models to accept matcher-based args (`Args`) and `tool_calls.expect`. |
| internal/graders/tool_constraint_grader.go | Compiles and evaluates arg matchers for tool_constraint tool specs. |
| internal/graders/tool_constraint_grader_test.go | Adds tests for tool_constraint arg matcher behavior and construct errors. |
| internal/graders/tool_calls_grader.go | Implements `expect` evaluation with tool-name regex + arg matcher evaluation. |
| internal/graders/tool_calls_grader_test.go | Adds tests for `expect` scenarios (missing tool, matcher pass/fail, invalid matcher). |
| internal/graders/tool_args.go | Adds shared argument normalization and matcher-evaluation helpers. |
| internal/graders/argmatcher/matcher.go | Implements matcher decoding/validation + match execution for 5 matcher kinds. |
| internal/graders/argmatcher/matcher_test.go | Unit tests for matcher decoding, matching semantics, and error cases. |
| cmd/waza/cmd_migrate_test.go | Updates migrate tests for new current schemaVersion and messaging. |
| cmd/waza/cmd_compare.go | Adds tool-use metrics aggregation, JSON report fields, and TOOL USE table output. |
| cmd/waza/cmd_compare_test.go | Adds tests for tool metrics aggregation, histogram, selection accuracy, and gating. |

Review details

Files reviewed: 22/22 changed files
Comments generated: 8
Review effort level: Low

- graders.mdx: matchers are single-key mappings (no kind field); graders evaluate session_digest.tool_calls (not tool_events[]); range matcher uses gte/lte/gt/lt (not [min, max]).
- tool_events.go: stringifyResult comment matches JSON-only behavior.
- cmd_compare.go: histogram bucketed per-run (not truncated per-task avg); added 'Tasks w/ tools' row; renamed 'Tasks w/' to 'Runs w/' labels; use tagged switch on runCalls.
- schema-changes.md / README.md / schema.mdx: missing schemaVersion is interpreted as the current schema version (1.1), not 1.0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Review details

Files reviewed: 23/24 changed files
Comments generated: 3
Review effort level: Low

- Add argmatcher package (equals/regex/contains/range/json_schema)
- Extend tool_calls grader with expect: [{tool, args}] block
- Extend tool_constraint grader with args: matchers on expect_tools
- Add normalized tool_events[] to RunResult (turn, sequence, tool_call_id, tool_name, args, result, success, duration_ms, error) populated from session events; replay-friendly for Wave 3 (#367), OTel-aligned
- Bump results.json schemaVersion to 1.1 (MINOR additive per #368/#382)
- waza compare prints aggregate TOOL USE section (total calls, success rate, avg/task, histogram, selection accuracy) when tool data present
- Unit tests for matchers, builder, both graders, schema round-trip, compare metrics + histogram
- Docs: graders.mdx (expect/args), schema-changes.md (1.1 entry), cli.mdx (compare TOOL USE), README.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The MCP mocks test in #387 used an empty schemaVersion and expected the 1.1 error path. Because LoadEvalSpec normalizes empty schemaVersion to the current version (1.1), the test passed validation instead of failing. Make the test explicit by setting schemaVersion: '1.0' to actually trigger the gate, then bump to '1.1' in the second half. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
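The interaction described above, where an empty schemaVersion is normalized to the current version and therefore passes a 1.1 gate, can be illustrated with a small sketch. The names `normalizeSchemaVersion`, `parseVersion`, and `gateFeature` are hypothetical and do not come from LoadEvalSpec; only the behavior (empty means current, 1.0 fails the gate) is taken from the commit message.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const currentSchemaVersion = "1.1"

// normalizeSchemaVersion treats a missing version as the current one.
func normalizeSchemaVersion(v string) string {
	if v == "" {
		return currentSchemaVersion
	}
	return v
}

// parseVersion splits "MAJOR.MINOR" into its integer parts.
func parseVersion(v string) (major, minor int, err error) {
	majStr, minStr, ok := strings.Cut(v, ".")
	if !ok {
		return 0, 0, fmt.Errorf("schemaVersion %q is not MAJOR.MINOR", v)
	}
	if major, err = strconv.Atoi(majStr); err != nil {
		return 0, 0, err
	}
	if minor, err = strconv.Atoi(minStr); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// gateFeature rejects specs whose (normalized) version is older than 1.1.
func gateFeature(schemaVersion string) error {
	v := normalizeSchemaVersion(schemaVersion)
	major, minor, err := parseVersion(v)
	if err != nil {
		return err
	}
	if major < 1 || (major == 1 && minor < 1) {
		return fmt.Errorf("feature requires schemaVersion 1.1 or later, got %s", v)
	}
	return nil
}

func main() {
	// An empty version is normalized to 1.1, so the gate does not fire.
	fmt.Println(`"":`, gateFeature(""))
	// An explicit 1.0 is what actually triggers the error path.
	fmt.Println(`"1.0":`, gateFeature("1.0"))
	fmt.Println(`"1.1":`, gateFeature("1.1"))
}
```

This is why the test has to set schemaVersion: '1.0' explicitly: relying on an empty value exercises the normalized path, not the gate.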

Copilot

Review details

Files reviewed: 23/23 changed files
Comments generated: 4
Review effort level: Low

- persist compiled matcher in validateToolSpecs (map value semantics)
- capture engine-specific tool args via ToolCallArgs.Extra (mapstructure ',remain')
- bucket call_count_histogram per-task across trials (not per-run)
- rename TOOL USE table label 'Runs w/' -> 'Tasks w/' to match metric
- sync README tool_events[] field list with ToolEvent struct
- add IsCompiled() accessor + tests for persisted compile, extra args, and per-task histogram with trials_per_task > 1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
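The "map value semantics" item refers to a common Go pitfall: ranging over a map of structs yields copies, so a field set on the loop variable is lost unless the value is written back. The sketch below reproduces that class of bug under assumed names; `toolSpec`, `compileLost`, and `compilePersisted` are hypothetical and are not the code in validateToolSpecs, and the `IsCompiled()` signature is a guess at the accessor mentioned above.

```go
package main

import (
	"fmt"
	"regexp"
)

// toolSpec is a hypothetical stand-in for a tool spec holding a matcher.
type toolSpec struct {
	Pattern  string
	compiled *regexp.Regexp
}

// IsCompiled reports whether the compiled matcher was persisted.
func (s toolSpec) IsCompiled() bool { return s.compiled != nil }

// compileLost shows the bug: spec is a copy of the map value, so the
// assignment below never reaches the map.
func compileLost(specs map[string]toolSpec) error {
	for name, spec := range specs {
		re, err := regexp.Compile(spec.Pattern)
		if err != nil {
			return fmt.Errorf("tool %q: %w", name, err)
		}
		spec.compiled = re // modifies the copy only
	}
	return nil
}

// compilePersisted writes the updated copy back under the same key.
// Overwriting an existing key while ranging over a map is safe in Go.
func compilePersisted(specs map[string]toolSpec) error {
	for name, spec := range specs {
		re, err := regexp.Compile(spec.Pattern)
		if err != nil {
			return fmt.Errorf("tool %q: %w", name, err)
		}
		spec.compiled = re
		specs[name] = spec
	}
	return nil
}

func main() {
	specs := map[string]toolSpec{"edit": {Pattern: `^src/`}}

	if err := compileLost(specs); err != nil {
		panic(err)
	}
	fmt.Println(specs["edit"].IsCompiled()) // false: the compiled regex was dropped

	if err := compilePersisted(specs); err != nil {
		panic(err)
	}
	fmt.Println(specs["edit"].IsCompiled()) // true: the value was written back
}
```

An alternative with the same effect is to store pointers (`map[string]*toolSpec`), which trades the write-back for shared mutable state.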

Copilot AI review requested due to automatic review settings June 28, 2026 12:19

Copilot started reviewing on behalf of spboyer June 28, 2026 12:20

Copilot AI reviewed Jun 28, 2026

Copilot AI review requested due to automatic review settings June 28, 2026 12:29

spboyer force-pushed the spboyer-feat-tool-metrics-366 branch from c755cc9 to 251ddde Compare June 28, 2026 12:29

Copilot started reviewing on behalf of spboyer June 28, 2026 12:30

Copilot AI reviewed Jun 28, 2026

Comment thread site/src/content/docs/reference/schema-changes.md Outdated

Comment thread internal/graders/tool_constraint_grader.go

Comment thread internal/graders/tool_args.go Outdated

Copilot AI added 3 commits June 28, 2026 08:45

Copilot AI review requested due to automatic review settings June 28, 2026 12:48

spboyer force-pushed the spboyer-feat-tool-metrics-366 branch from 42d233a to 42143d2 Compare June 28, 2026 12:48

Copilot started reviewing on behalf of spboyer June 28, 2026 12:49

Copilot AI reviewed Jun 28, 2026

Comment thread internal/graders/tool_constraint_grader.go

Comment thread cmd/waza/cmd_compare.go Outdated

Comment thread cmd/waza/cmd_compare.go Outdated

Comment thread README.md Outdated

spboyer merged commit 48421cc into main Jun 28, 2026
10 checks passed

spboyer deleted the spboyer-feat-tool-metrics-366 branch June 28, 2026 13:14

spboyer merged 4 commits into main from spboyer-feat-tool-metrics-366

3 participants
