feat: Agentic metrics — tool-call accuracy, tool selection, tool-input correctness #366

Description

@spboyer

Problem

Waza already has tool_constraint, tool_calls, and action_sequence graders, plus a session_digest.tool_calls field. The gaps are at the edges:

  1. Argument matching is shallow. Current graders assert tool names and counts well; deep argument matching (JSON-schema, regex on specific fields, equality with tolerance) is hand-rolled per task.
  2. Cross-engine event normalization is uneven. Different engines surface tool calls in different shapes, so graders sometimes have to special-case individual engines.
  3. No aggregate "tool-use" metrics. No equivalent of "tool selection accuracy" or "tool input correctness" across a suite — only per-task pass/fail.

Proposal

Three focused enhancements to existing graders and the session digest:

A. Richer argument matching in tool_calls / tool_constraint

- type: tool_calls
  expect:
    - name: search_docs
      args:
        query: { regex: "^(authentication|auth)\\b" }
        limit: { equals: 10 }

Supported matchers: equals, regex, json_schema, contains, range.
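
To make the intended semantics concrete, here is a minimal Python sketch of how the matchers above could be evaluated against a tool call's arguments. It is illustrative only and not Waza's implementation: the function names `match_value` and `match_args` are hypothetical, the `range` matcher is assumed to take inclusive, optional `min`/`max` bounds, and `json_schema` is left out because it would need a third-party validator.

```python
import re
from typing import Any, Mapping


def match_value(matcher: Mapping[str, Any], actual: Any) -> bool:
    """Evaluate one single-key matcher, e.g. {"regex": "^auth"}, against a value."""
    if len(matcher) != 1:
        raise ValueError("a matcher must have exactly one key")
    kind, expected = next(iter(matcher.items()))
    if kind == "equals":
        return actual == expected
    if kind == "regex":
        # Non-string values never match a regex.
        return isinstance(actual, str) and re.search(expected, actual) is not None
    if kind == "contains":
        if isinstance(actual, str):
            return isinstance(expected, str) and expected in actual
        return isinstance(actual, (list, tuple)) and expected in actual
    if kind == "range":
        # Assumed shape: {"min": ..., "max": ...}, both inclusive and optional.
        if isinstance(actual, bool) or not isinstance(actual, (int, float)):
            return False
        low = expected.get("min", float("-inf"))
        high = expected.get("max", float("inf"))
        return low <= actual <= high
    raise ValueError(f"unsupported matcher kind: {kind}")


def match_args(expected_args: Mapping[str, Mapping[str, Any]],
               actual_args: Mapping[str, Any]) -> bool:
    """True when every expected argument is present and satisfies its matcher."""
    return all(
        name in actual_args and match_value(matcher, actual_args[name])
        for name, matcher in expected_args.items()
    )


# Mirrors the search_docs example above.
expected = {
    "query": {"regex": r"^(authentication|auth)\b"},
    "limit": {"equals": 10},
}
ok = match_args(expected, {"query": "auth tokens", "limit": 10})
```

Arguments that the expectation does not mention are ignored in this sketch; whether unexpected arguments should fail a match is a design decision the proposal leaves open.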

B. Normalized tool events in results.json

Add a stable, engine-agnostic tool_events[] array per task (alongside the existing session_digest). Each event: { turn, tool_name, args, result, success, duration_ms }. Existing fields preserved for backward compatibility.
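
As a rough illustration of the normalized shape, the Python sketch below models one event with the six fields listed above and maps two raw payloads onto it. The raw payload shapes, their keys (`tool`, `input`, `output`, `error`, `elapsed_ms`, `function`, `ok`) and the helper name `normalize_event` are assumptions made up for the example; they do not describe any real engine's output.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolEvent:
    """One normalized entry of tool_events[], using the fields proposed above."""
    turn: int
    tool_name: str
    args: dict
    result: Any
    success: bool
    duration_ms: Optional[float]


def normalize_event(raw: dict, turn: int) -> ToolEvent:
    """Map a raw, engine-specific tool call onto ToolEvent (hypothetical shapes)."""
    if "function" in raw:
        # Hypothetical shape B: nested function object, arguments as a JSON string.
        fn = raw["function"]
        args = fn.get("arguments") or {}
        if isinstance(args, str):
            args = json.loads(args)
        return ToolEvent(
            turn=turn,
            tool_name=fn["name"],
            args=args,
            result=raw.get("result"),
            success=bool(raw.get("ok", True)),
            duration_ms=raw.get("duration_ms"),
        )
    # Hypothetical shape A: flat record with an optional error field.
    return ToolEvent(
        turn=turn,
        tool_name=raw["tool"],
        args=raw.get("input") or {},
        result=raw.get("output"),
        success=raw.get("error") is None,
        duration_ms=raw.get("elapsed_ms"),
    )


event_a = normalize_event(
    {"tool": "search_docs", "input": {"query": "auth"}, "output": ["doc1"],
     "error": None, "elapsed_ms": 12},
    turn=1,
)
event_b = normalize_event(
    {"function": {"name": "search_docs",
                  "arguments": '{"query": "auth", "limit": 10}'},
     "result": None, "ok": False},
    turn=2,
)
```

Keeping `duration_ms` optional lets an engine that does not report timing still emit a valid event, which a grader can then treat as "unknown" rather than zero.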

C. Aggregate metrics in waza compare / dashboard

Surface per-suite metrics: tool selection accuracy (% of tasks calling the expected tool first), tool input correctness (% of tool calls with all required args matched), and call count distribution.
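
The three metrics can be derived from per-task tool events roughly as in the Python sketch below. The record layout is an assumption for illustration: each task is taken to carry a hypothetical `expected_tool` name, and each event a hypothetical `args_matched` flag that a grader has already computed. The function name `aggregate_tool_metrics` is likewise made up.

```python
from collections import Counter


def aggregate_tool_metrics(tasks):
    """Compute per-suite tool-use metrics from a list of task records."""
    n_tasks = len(tasks)
    first_hits = 0      # tasks whose first call was the expected tool
    total_calls = 0     # tool calls across the whole suite
    matched_calls = 0   # tool calls whose required args all matched
    call_counts = Counter()  # number of calls per task -> number of tasks

    for task in tasks:
        events = task["tool_events"]
        call_counts[len(events)] += 1
        if events and events[0]["tool_name"] == task["expected_tool"]:
            first_hits += 1
        total_calls += len(events)
        matched_calls += sum(1 for e in events if e["args_matched"])

    return {
        # Fractions in [0, 1]; None when there is nothing to divide by.
        "tool_selection_accuracy": first_hits / n_tasks if n_tasks else None,
        "tool_input_correctness": matched_calls / total_calls if total_calls else None,
        "call_count_distribution": dict(call_counts),
    }


suite = [
    {"expected_tool": "search_docs", "tool_events": [
        {"tool_name": "search_docs", "args_matched": True},
        {"tool_name": "read_file", "args_matched": False},
    ]},
    {"expected_tool": "search_docs", "tool_events": [
        {"tool_name": "read_file", "args_matched": True},
    ]},
    {"expected_tool": "search_docs", "tool_events": []},
]
metrics = aggregate_tool_metrics(suite)
```

In this sketch a task that makes no tool calls counts against selection accuracy but does not affect input correctness, since the latter is measured per call rather than per task.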

Why this matters for agentic-first

Tool use is the agentic behavior. Tightening argument matchers, normalizing events across engines, and exposing aggregate tool metrics gives authors the loop they need to debug how the agent works, not just whether the final answer was right.

Acceptance criteria

  • tool_calls and tool_constraint accept structured argument matchers (equals, regex, json_schema, contains, range).
  • results.json gains a normalized tool_events[] array per task; existing session_digest.tool_calls preserved.
  • waza compare outputs aggregate tool-use metrics (selection accuracy, input correctness).
  • Cross-engine event normalization depends on [E1] Decouple ExecutionResponse from Copilot SDK + Multi-Agent Engine Support #10; this issue specifies the waza-side shape.
  • Tests cover all matcher kinds and ensure existing tasks keep working.
  • Docs in site/ show migration patterns from hand-rolled argument checks to the structured matchers.

Non-goals (filed separately)

Related
