feat: per-task tool metrics with structured arg matchers (closes #366) #388

Merged
spboyer merged 4 commits into main from spboyer-feat-tool-metrics-366 on Jun 28, 2026

@spboyer spboyer commented Jun 28, 2026

Member

Implements #366 — Wave 2.

Summary

Three additive capabilities, all gated by schemaVersion 1.1 (MINOR bump per #368/#382):

A. Structured argument matchers for tool_calls / tool_constraint

New internal/graders/argmatcher package with five matcher kinds:

| Kind | Field | Matches when… |
| --- | --- | --- |
| equals | equals | argument deeply equals supplied value |
| regex | regex | stringified arg matches RE2 pattern |
| contains | contains | stringified arg contains substring |
| range | range | numeric arg is within [min, max] |
| json_schema | json_schema | arg validates against JSON Schema |

  • tool_calls gains an expect: [{tool, args}] block — assert which tool ran and how it was called.
  • tool_constraint gains an args: map on each expect_tools entry.
  • Backward compatible: existing required_tools / forbidden_tools semantics unchanged.
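
To make the matcher kinds above concrete, here is a minimal Go sketch of four of them (equals, regex, contains, range) using only the standard library. It is not the actual internal/graders/argmatcher code: the `Matcher` interface, the struct names, `NewRegex`, `MatchArgs`, and the `stringify` helper are all hypothetical, and json_schema is left out because it would need a third-party validator. The range matcher uses the inclusive [min, max] form from the table; a follow-up commit on this PR notes that the documented form uses gte/lte/gt/lt.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"regexp"
	"strings"
)

// Matcher is a hypothetical interface for this sketch; the real
// argmatcher API may look different.
type Matcher interface {
	Match(arg any) bool
}

type equalsMatcher struct{ want any }
type regexMatcher struct{ re *regexp.Regexp }
type containsMatcher struct{ sub string }
type rangeMatcher struct{ min, max float64 }

// stringify renders an argument as text: strings pass through unchanged,
// everything else is JSON-encoded (an assumption of this sketch).
func stringify(arg any) string {
	if s, ok := arg.(string); ok {
		return s
	}
	b, err := json.Marshal(arg)
	if err != nil {
		return fmt.Sprint(arg)
	}
	return string(b)
}

func (m equalsMatcher) Match(arg any) bool   { return reflect.DeepEqual(m.want, arg) }
func (m regexMatcher) Match(arg any) bool    { return m.re.MatchString(stringify(arg)) }
func (m containsMatcher) Match(arg any) bool { return strings.Contains(stringify(arg), m.sub) }
func (m rangeMatcher) Match(arg any) bool {
	n, ok := arg.(float64) // JSON numbers decode to float64
	return ok && n >= m.min && n <= m.max
}

// NewRegex compiles the pattern up front so that an invalid regex is a
// construction error rather than a silent mismatch at grading time.
// Go's regexp package uses RE2 syntax.
func NewRegex(pattern string) (Matcher, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid regex %q: %w", pattern, err)
	}
	return regexMatcher{re: re}, nil
}

// MatchArgs reports whether every expected argument is present and matches.
func MatchArgs(expect map[string]Matcher, args map[string]any) bool {
	for name, m := range expect {
		v, ok := args[name]
		if !ok || !m.Match(v) {
			return false
		}
	}
	return true
}

func main() {
	pathRe, err := NewRegex(`^src/.*\.go$`)
	if err != nil {
		panic(err)
	}
	expect := map[string]Matcher{
		"path":    pathRe,
		"retries": rangeMatcher{min: 0, max: 3},
	}
	args := map[string]any{"path": "src/auth.go", "retries": 2.0}
	fmt.Println(MatchArgs(expect, args))
}
```

Compiling the regex at construction time mirrors the "invalid-regex construct errors" item in the test plan: a bad pattern should fail when the grader is built, not when a run is graded.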

B. Normalized tool_events[] in results.json

Each RunResult now carries an additive tool_events: []ToolEvent array, built deterministically from session events:

{
  "turn": 2,
  "sequence": 3,
  "tool_call_id": "call_abc",
  "tool_name": "edit",
  "args": { "path": "src/auth.go", "file_text": "" },
  "result": "edited",
  "success": true,
  "duration_ms": 142
}
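
For readers consuming results.json, the example above could be decoded with a struct like the one below. This is a sketch: the Go field names and types are inferred from the JSON example only and are assumptions, not the definition in internal/models/tool_event.go. `decodeToolEvent` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ToolEvent mirrors the JSON example above. The JSON keys come from that
// example; the Go names and types are assumptions made for this sketch.
type ToolEvent struct {
	Turn       int            `json:"turn"`
	Sequence   int            `json:"sequence"`
	ToolCallID string         `json:"tool_call_id"`
	ToolName   string         `json:"tool_name"`
	Args       map[string]any `json:"args"`
	Result     string         `json:"result"`
	Success    bool           `json:"success"`
	DurationMS int64          `json:"duration_ms"`
}

// decodeToolEvent parses a single tool event object.
func decodeToolEvent(data []byte) (ToolEvent, error) {
	var ev ToolEvent
	err := json.Unmarshal(data, &ev)
	return ev, err
}

func main() {
	raw := []byte(`{
	  "turn": 2,
	  "sequence": 3,
	  "tool_call_id": "call_abc",
	  "tool_name": "edit",
	  "args": { "path": "src/auth.go", "file_text": "" },
	  "result": "edited",
	  "success": true,
	  "duration_ms": 142
	}`)
	ev, err := decodeToolEvent(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s called at turn %d (success=%v, %dms)\n",
		ev.ToolName, ev.Turn, ev.Success, ev.DurationMS)
}
```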

C. Aggregate tool-use metrics in waza compare

When inputs contain tool data, waza compare prints a new TOOL USE section:

  • total_calls, tasks_with_tools, avg_calls_per_task
  • success_rate (fraction of calls with success: true)
  • selection_accuracy (fraction of tasks where the tool_calls grader passed; denominator excludes tasks with no tool_calls grader)
  • call_count_histogram (buckets 0 / 1 / 2 / 3+)

Section is suppressed for legacy 1.0 results.
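
As a rough illustration of how the metrics listed above relate to each other, the sketch below aggregates them from per-task counts. It is not the code in cmd/waza/cmd_compare.go: `TaskStats`, `ToolMetrics`, and `Aggregate` are hypothetical names, and dividing avg_calls_per_task by all tasks (rather than only tasks that used tools) is an assumption.

```go
package main

import "fmt"

// TaskStats is a hypothetical per-task summary used only in this sketch.
type TaskStats struct {
	Calls        int   // tool calls observed for the task
	Successes    int   // calls with success: true
	GraderPassed *bool // nil when the task has no tool_calls grader
}

// ToolMetrics holds the aggregate values named in the TOOL USE section.
type ToolMetrics struct {
	TotalCalls        int
	TasksWithTools    int
	AvgCallsPerTask   float64
	SuccessRate       float64
	SelectionAccuracy float64
	Histogram         [4]int // buckets: 0, 1, 2, 3+
}

func boolPtr(b bool) *bool { return &b }

// Aggregate folds per-task stats into the aggregate metrics.
func Aggregate(tasks []TaskStats) ToolMetrics {
	var m ToolMetrics
	successes, graded, passed := 0, 0, 0
	for _, t := range tasks {
		m.TotalCalls += t.Calls
		successes += t.Successes
		if t.Calls > 0 {
			m.TasksWithTools++
		}
		switch t.Calls { // tagged switch picks the histogram bucket
		case 0:
			m.Histogram[0]++
		case 1:
			m.Histogram[1]++
		case 2:
			m.Histogram[2]++
		default:
			m.Histogram[3]++
		}
		if t.GraderPassed != nil { // tasks without a grader are excluded
			graded++
			if *t.GraderPassed {
				passed++
			}
		}
	}
	if len(tasks) > 0 {
		m.AvgCallsPerTask = float64(m.TotalCalls) / float64(len(tasks))
	}
	if m.TotalCalls > 0 {
		m.SuccessRate = float64(successes) / float64(m.TotalCalls)
	}
	if graded > 0 {
		m.SelectionAccuracy = float64(passed) / float64(graded)
	}
	return m
}

func main() {
	tasks := []TaskStats{
		{Calls: 3, Successes: 3, GraderPassed: boolPtr(true)},
		{Calls: 1, Successes: 0, GraderPassed: boolPtr(false)},
		{Calls: 0}, // no tool calls and no tool_calls grader
	}
	fmt.Printf("%+v\n", Aggregate(tasks))
}
```

Every ratio is guarded against a zero denominator, so inputs with no tool data yield zero values instead of NaN.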

Test plan

  • internal/graders/argmatcher — all 5 matcher kinds + invalid-regex / invalid-schema construct errors
  • internal/orchestration/tool_events_test.go — 8 builder cases (empty, pair-by-ID, turn increments, missing complete, failure path, unknown complete, malformed start, nil args)
  • internal/graders/tool_calls_grader_test.go — 8 new expect cases (equals/regex/contains/range/json_schema/missing tool/extra tool/strict ordering)
  • internal/graders/tool_constraint_grader_test.go — 4 new args cases incl. invalid regex
  • internal/models/outcome_schema_test.go — ToolEvents round-trip at schemaVersion 1.1
  • cmd/waza/cmd_compare_test.go — 5 new metric tests (no data gate, totals/success rate, histogram, selection accuracy, hasAnyToolData)
  • go build ./... && go test ./... && go vet ./... && golangci-lint run: all clean
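
The builder cases named in the test plan (pair-by-ID, turn increments, missing complete, unknown complete, malformed start) suggest a start/complete pairing loop. The sketch below shows one way such a builder could work. The event kinds ("turn_start", "tool_start", "tool_complete"), the `SessionEvent` type, the `Completed` field, and the skip-on-malformed policy are all assumptions, not the behavior of internal/orchestration/tool_events.go.

```go
package main

import "fmt"

// SessionEvent is a hypothetical, simplified session event.
type SessionEvent struct {
	Kind       string // "turn_start", "tool_start", or "tool_complete" (assumed names)
	ToolCallID string
	ToolName   string
	Args       map[string]any
	Result     string
	Success    bool
}

// ToolEvent is a trimmed version of the normalized record for this sketch.
type ToolEvent struct {
	Turn       int
	Sequence   int
	ToolCallID string
	ToolName   string
	Args       map[string]any
	Result     string
	Success    bool
	Completed  bool // false when no matching complete event was seen
}

// BuildToolEvents pairs start and complete events by tool call ID.
// Output order follows input order, so the result is deterministic.
func BuildToolEvents(events []SessionEvent) []ToolEvent {
	out := []ToolEvent{}
	index := map[string]int{} // tool_call_id -> position in out
	turn := 0
	for _, ev := range events {
		switch ev.Kind {
		case "turn_start":
			turn++
		case "tool_start":
			if ev.ToolCallID == "" || ev.ToolName == "" {
				continue // malformed start: skipped in this sketch
			}
			index[ev.ToolCallID] = len(out)
			out = append(out, ToolEvent{
				Turn:       turn,
				Sequence:   len(out) + 1,
				ToolCallID: ev.ToolCallID,
				ToolName:   ev.ToolName,
				Args:       ev.Args, // nil args are kept as nil
			})
		case "tool_complete":
			i, ok := index[ev.ToolCallID]
			if !ok {
				continue // complete for an unknown call: ignored
			}
			out[i].Result = ev.Result
			out[i].Success = ev.Success
			out[i].Completed = true
		}
	}
	return out
}

func main() {
	events := []SessionEvent{
		{Kind: "turn_start"},
		{Kind: "tool_start", ToolCallID: "call_a", ToolName: "view"},
		{Kind: "tool_complete", ToolCallID: "call_a", Result: "ok", Success: true},
		{Kind: "turn_start"},
		{Kind: "tool_start", ToolCallID: "call_b", ToolName: "edit"}, // never completes
		{Kind: "tool_complete", ToolCallID: "call_zzz", Success: true},
	}
	for _, te := range BuildToolEvents(events) {
		fmt.Printf("%+v\n", te)
	}
}
```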

Docs

  • site/src/content/docs/guides/graders.mdx: expect: + matcher kinds reference under tool_calls, args: under tool_constraint
  • site/src/content/docs/reference/schema-changes.md: 1.1 changelog entry, table updated
  • site/src/content/docs/reference/cli.mdx: waza compare TOOL USE section documented
  • README.md: schemaVersion 1.1 mention

Wave coordination

Closes #366

Copilot AI review requested due to automatic review settings June 28, 2026 12:19

Copilot AI left a comment

Contributor

Pull request overview

This PR implements Wave 2 of #366 by adding structured tool-argument matching for tool-related graders, emitting a normalized per-run tool_events[] record into results.json (schemaVersion 1.1), and surfacing aggregate tool-use metrics in waza compare.

Changes:

  • Add internal/graders/argmatcher (equals/regex/contains/range/json_schema) and wire matchers into tool_calls (expect) and tool_constraint (args on tool specs).
  • Build deterministic runs[].tool_events[] from Copilot SDK session events and emit it in results.json (schemaVersion bumped to 1.1).
  • Extend waza compare to compute and print tool-use aggregate metrics (and include them in JSON output).

Summary per file

| File | Description |
| --- | --- |
| site/src/content/docs/reference/schema-changes.md | Documents the schema 1.1 results.json addition and updates the schema changelog. |
| site/src/content/docs/reference/cli.mdx | Documents the new waza compare TOOL USE metrics section (schema 1.1+). |
| site/src/content/docs/guides/graders.mdx | Documents tool_calls.expect and tool_constraint.args using structured argument matchers. |
| README.md | Notes results.json schema 1.1 and the new tool_events[] field. |
| internal/orchestration/tool_events.go | Adds deterministic builder for normalized []ToolEvent from session events. |
| internal/orchestration/tool_events_test.go | Tests tool-events builder correlation/turn/sequence/error behaviors. |
| internal/orchestration/runner.go | Plumbs ToolEvents into RunResult during execution. |
| internal/models/tool_event.go | Defines the ToolEvent normalized schema and JSON shape. |
| internal/models/schema_version.go | Bumps CurrentSchemaVersion to 1.1. |
| internal/models/outcome.go | Adds RunResult.ToolEvents []ToolEvent to results output. |
| internal/models/outcome_schema_test.go | Updates schema default test and adds tool-events JSON round-trip coverage. |
| internal/models/grader_params.go | Extends grader parameter models to accept matcher-based args (Args) and tool_calls.expect. |
| internal/graders/tool_constraint_grader.go | Compiles and evaluates arg matchers for tool_constraint tool specs. |
| internal/graders/tool_constraint_grader_test.go | Adds tests for tool_constraint arg matcher behavior and construct errors. |
| internal/graders/tool_calls_grader.go | Implements expect evaluation with tool-name regex + arg matcher evaluation. |
| internal/graders/tool_calls_grader_test.go | Adds tests for expect scenarios (missing tool, matcher pass/fail, invalid matcher). |
| internal/graders/tool_args.go | Adds shared argument normalization and matcher-evaluation helpers. |
| internal/graders/argmatcher/matcher.go | Implements matcher decoding/validation + match execution for 5 matcher kinds. |
| internal/graders/argmatcher/matcher_test.go | Unit tests for matcher decoding, matching semantics, and error cases. |
| cmd/waza/cmd_migrate_test.go | Updates migrate tests for new current schemaVersion and messaging. |
| cmd/waza/cmd_compare.go | Adds tool-use metrics aggregation, JSON report fields, and TOOL USE table output. |
| cmd/waza/cmd_compare_test.go | Adds tests for tool metrics aggregation, histogram, selection accuracy, and gating. |

Review details

  • Files reviewed: 22/22 changed files
  • Comments generated: 8
  • Review effort level: Low

Comment thread site/src/content/docs/guides/graders.mdx Outdated
Comment thread site/src/content/docs/guides/graders.mdx Outdated
Comment thread internal/orchestration/tool_events.go Outdated
Comment thread cmd/waza/cmd_compare.go Outdated
Comment thread cmd/waza/cmd_compare.go
Comment thread site/src/content/docs/reference/schema-changes.md Outdated
Comment thread README.md Outdated
Comment thread site/src/content/docs/guides/graders.mdx Outdated
spboyer pushed a commit that referenced this pull request Jun 28, 2026
- graders.mdx: matchers are single-key mappings (no kind field);
  graders evaluate session_digest.tool_calls (not tool_events[]);
  range matcher uses gte/lte/gt/lt (not [min, max]).
- tool_events.go: stringifyResult comment matches JSON-only behavior.
- cmd_compare.go: histogram bucketed per-run (not truncated per-task avg);
  added 'Tasks w/ tools' row; renamed 'Tasks w/' to 'Runs w/' labels;
  use tagged switch on runCalls.
- schema-changes.md / README.md / schema.mdx: missing schemaVersion is
  interpreted as the current schema version (1.1), not 1.0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 28, 2026 12:29
@spboyer spboyer force-pushed the spboyer-feat-tool-metrics-366 branch from c755cc9 to 251ddde Compare June 28, 2026 12:29

Copilot AI left a comment

Contributor

Review details

  • Files reviewed: 23/24 changed files
  • Comments generated: 3
  • Review effort level: Low

Comment thread site/src/content/docs/reference/schema-changes.md Outdated
Comment thread internal/graders/tool_constraint_grader.go
Comment thread internal/graders/tool_args.go Outdated
Copilot AI added 3 commits June 28, 2026 08:45
- Add argmatcher package (equals/regex/contains/range/json_schema)
- Extend tool_calls grader with expect: [{tool, args}] block
- Extend tool_constraint grader with args: matchers on expect_tools
- Add normalized tool_events[] to RunResult (turn, sequence, tool_call_id,
  tool_name, args, result, success, duration_ms, error) populated from
  session events — replay-friendly for Wave 3 (#367), OTel-aligned
- Bump results.json schemaVersion to 1.1 (MINOR additive per #368/#382)
- waza compare prints aggregate TOOL USE section (total calls, success
  rate, avg/task, histogram, selection accuracy) when tool data present
- Unit tests for matchers, builder, both graders, schema round-trip,
  compare metrics + histogram
- Docs: graders.mdx (expect/args), schema-changes.md (1.1 entry),
  cli.mdx (compare TOOL USE), README.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- graders.mdx: matchers are single-key mappings (no kind field);
  graders evaluate session_digest.tool_calls (not tool_events[]);
  range matcher uses gte/lte/gt/lt (not [min, max]).
- tool_events.go: stringifyResult comment matches JSON-only behavior.
- cmd_compare.go: histogram bucketed per-run (not truncated per-task avg);
  added 'Tasks w/ tools' row; renamed 'Tasks w/' to 'Runs w/' labels;
  use tagged switch on runCalls.
- schema-changes.md / README.md / schema.mdx: missing schemaVersion is
  interpreted as the current schema version (1.1), not 1.0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The MCP mocks test in #387 used an empty schemaVersion and expected
the 1.1 error path. Because LoadEvalSpec normalizes empty schemaVersion
to the current version (1.1), the test passed validation instead of
failing. Make the test explicit by setting schemaVersion: '1.0' to
actually trigger the gate, then bump to '1.1' in the second half.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
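
The interaction described in the commit message above (an empty schemaVersion normalizing to the current version and therefore passing a 1.1 gate) can be sketched as follows. `NormalizeSchemaVersion`, `parseVersion`, and `RequireAtLeast` are hypothetical helpers, not the LoadEvalSpec implementation. `CurrentSchemaVersion` is named in this PR, but treating it as a string constant is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// CurrentSchemaVersion is 1.1 per this PR; its Go type here is assumed.
const CurrentSchemaVersion = "1.1"

// NormalizeSchemaVersion treats a missing version as the current one.
func NormalizeSchemaVersion(v string) string {
	v = strings.TrimSpace(v)
	if v == "" {
		return CurrentSchemaVersion
	}
	return v
}

// parseVersion splits "MAJOR.MINOR" into integers so that 1.10 > 1.9.
func parseVersion(v string) (major, minor int, err error) {
	majorStr, minorStr, ok := strings.Cut(v, ".")
	if !ok {
		return 0, 0, fmt.Errorf("schemaVersion %q is not MAJOR.MINOR", v)
	}
	if major, err = strconv.Atoi(majorStr); err != nil {
		return 0, 0, fmt.Errorf("schemaVersion %q: bad major: %w", v, err)
	}
	if minor, err = strconv.Atoi(minorStr); err != nil {
		return 0, 0, fmt.Errorf("schemaVersion %q: bad minor: %w", v, err)
	}
	return major, minor, nil
}

// RequireAtLeast returns an error when the (normalized) spec version is
// older than the version a feature requires.
func RequireAtLeast(specVersion, required string) error {
	normalized := NormalizeSchemaVersion(specVersion)
	sMaj, sMin, err := parseVersion(normalized)
	if err != nil {
		return err
	}
	rMaj, rMin, err := parseVersion(required)
	if err != nil {
		return err
	}
	if sMaj < rMaj || (sMaj == rMaj && sMin < rMin) {
		return fmt.Errorf("feature requires schemaVersion %s or later, got %s", required, normalized)
	}
	return nil
}

func main() {
	// Empty normalizes to 1.1, so the gate does not fire.
	fmt.Println(RequireAtLeast("", "1.1"))
	// An explicit 1.0 is what actually triggers the gate.
	fmt.Println(RequireAtLeast("1.0", "1.1"))
}
```

This is why the test had to set schemaVersion: '1.0' explicitly: under normalization, an empty value never exercises the error path.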
Copilot AI review requested due to automatic review settings June 28, 2026 12:48
@spboyer spboyer force-pushed the spboyer-feat-tool-metrics-366 branch from 42d233a to 42143d2 Compare June 28, 2026 12:48

Copilot AI left a comment

Contributor

Review details

  • Files reviewed: 23/23 changed files
  • Comments generated: 4
  • Review effort level: Low

Comment thread internal/graders/tool_constraint_grader.go
Comment thread cmd/waza/cmd_compare.go Outdated
Comment thread cmd/waza/cmd_compare.go Outdated
Comment thread README.md Outdated

- persist compiled matcher in validateToolSpecs (map value semantics)
- capture engine-specific tool args via ToolCallArgs.Extra (mapstructure ',remain')
- bucket call_count_histogram per-task across trials (not per-run)
- rename TOOL USE table label 'Runs w/' -> 'Tasks w/' to match metric
- sync README tool_events[] field list with ToolEvent struct
- add IsCompiled() accessor + tests for persisted compile, extra args, and
  per-task histogram with trials_per_task > 1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
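
The "map value semantics" fix in the commit above refers to a common Go pitfall: ranging over a map of structs yields copies, so a field set on the loop variable is lost unless the copy is written back. The sketch below reproduces the pitfall and the fix with a hypothetical `toolSpec` type. It is not the code in validateToolSpecs, and the `IsCompiled` signature is assumed.

```go
package main

import (
	"fmt"
	"regexp"
)

// toolSpec is a hypothetical stand-in for a tool spec carrying a matcher.
type toolSpec struct {
	Pattern  string
	compiled *regexp.Regexp
}

// IsCompiled reports whether the compiled matcher was persisted.
func (s toolSpec) IsCompiled() bool { return s.compiled != nil }

// compileLost shows the bug: spec is a copy of the map value, so the
// compiled regex is dropped at the end of each iteration.
func compileLost(specs map[string]toolSpec) error {
	for _, spec := range specs {
		re, err := regexp.Compile(spec.Pattern)
		if err != nil {
			return err
		}
		spec.compiled = re // modifies the copy only
	}
	return nil
}

// compilePersisted writes the updated copy back under the same key.
func compilePersisted(specs map[string]toolSpec) error {
	for name, spec := range specs {
		re, err := regexp.Compile(spec.Pattern)
		if err != nil {
			return err
		}
		spec.compiled = re
		specs[name] = spec // persist the change in the map
	}
	return nil
}

func main() {
	specs := map[string]toolSpec{"edit": {Pattern: `^src/`}}

	if err := compileLost(specs); err != nil {
		panic(err)
	}
	fmt.Println("after compileLost:", specs["edit"].IsCompiled())

	if err := compilePersisted(specs); err != nil {
		panic(err)
	}
	fmt.Println("after compilePersisted:", specs["edit"].IsCompiled())
}
```

Storing pointers in the map (map[string]*toolSpec) would avoid the write-back, at the cost of shared mutable state; which of the two the PR chose is not stated here.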
@spboyer spboyer merged commit 48421cc into main Jun 28, 2026
10 checks passed
@spboyer spboyer deleted the spboyer-feat-tool-metrics-366 branch June 28, 2026 13:14

Development

Successfully merging this pull request may close these issues.

feat: Agentic metrics — tool-call accuracy, tool selection, tool-input correctness

3 participants