feat: per-trial token usage in results JSON (currently only aggregate summary.usage) #272

Description

@JayDoubleu

Summary

waza run --trials N -o results.json aggregates token usage across all N trials into a single summary.usage block. The JSON contains no per-trial token breakdown, so per-run variance, standard deviation, and t-statistics cannot be computed from a single results file.

Motivation

Comparing token usage of an eval with vs without a skill (--no-skills vs default) is one of the most natural waza use cases. To make that comparison statistically defensible you need per-trial token counts to compute mean ± sd and a Welch t-test. With only an aggregate you can compare the two means but you can't report uncertainty or significance, and any difference smaller than the per-run variance is indistinguishable from noise.

In a real experiment (n=10 per arm, claude-sonnet-4.5, a single architecture-describing prompt against a small codebase) the per-run total varied from 65k to 167k tokens, a coefficient of variation (CV) of about 27%. The aggregate alone would have hidden that completely.
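For reference, this is roughly the analysis that per-trial counts would enable. It is a minimal standard-library sketch of mean ± sd, CV, and Welch's t statistic, not anything waza provides; the function names `summarize` and `welch_t` are made up for illustration, and each argument is assumed to be a list of per-trial token totals for one arm.

```python
import math
import statistics


def summarize(xs):
    """Return (mean, sample sd, coefficient of variation) for one arm."""
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)  # sample standard deviation (n - 1 denominator)
    return mean, sd, sd / mean


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    # Squared standard error of each arm's mean.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df
```

Calling `welch_t(no_skills_totals, default_totals)` gives the statistic and degrees of freedom. Turning those into a p-value needs a t-distribution CDF, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)` if SciPy is available. None of this is possible when only the summed usage is in the file.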

Repro

waza run eval.yaml --no-skills --trials 10 -o agg.json
python3 -c "import json; print(json.dumps(json.load(open('agg.json'))['summary']['usage'], indent=2))"
# {
#   "turns": 42,
#   "input_tokens": 969340,
#   "output_tokens": 10612,
#   ...
# }

python3 -c "
import json
d = json.load(open('agg.json'))
r = d['tasks'][0]['runs'][0]
print(list(r.keys()))
# ['attempts', 'duration_ms', 'final_output', 'run_number', 'session_digest',
#  'status', 'transcript', 'validations', 'workspace_dir']
# — no token/usage field on individual runs
"

Inside the per-run transcript there are session.usage_info and assistant.usage events, but they arrive with only a type field; the token counts are not populated:

{ "type": "session.usage_info" }
{ "type": "assistant.usage" }

Workaround

Invoke waza run --trials 1 -o results/r${i}.json N times in parallel, then aggregate the N summary.usage blocks externally. This works, but it defeats the purpose of --trials N and repeats the setup cost for every run.
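The external aggregation step of this workaround could look something like the sketch below. It assumes the single-trial files were written as results/r*.json and that each has the summary.usage block shown in the repro. The helper name `load_per_run_totals` is hypothetical, and counting a run's "total" as input_tokens + output_tokens is an assumption made only for this example.

```python
import glob
import json


def load_per_run_totals(pattern="results/r*.json"):
    """Collect one token total per single-trial results file."""
    totals = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            usage = json.load(f)["summary"]["usage"]
        # Assumption: "total" means input + output tokens; cache fields ignored.
        totals.append(usage["input_tokens"] + usage["output_tokens"])
    return totals
```

The returned list has one entry per invocation, which is the per-trial data that a single --trials N results file currently cannot provide.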

Proposed fix

Either:

  • (a) Populate the existing assistant.usage / session.usage_info transcript events with token counts (they already exist; they're just empty), and/or
  • (b) Add a usage field on each entry of tasks[].runs[] that mirrors the summary.usage schema (input_tokens, output_tokens, cache_read_tokens, cache_write_tokens, turns, premium_requests).

(b) is the more useful for analysis since it's a flat field next to the other per-run stats (duration, status, etc.).
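To illustrate why (b) is convenient, here is a sketch of how analysis code might read such a field. The per-run `usage` key does not exist in waza 0.33.0; it is the proposed addition, and `per_run_usage` is a hypothetical helper. The `tasks`, `runs`, and `run_number` keys are the ones shown in the repro above.

```python
def per_run_usage(results, key="input_tokens"):
    """Return (run_number, value) pairs from a hypothetical per-run usage field."""
    rows = []
    for task in results["tasks"]:
        for run in task["runs"]:
            # "usage" is the PROPOSED field; it is absent in current output,
            # so fall back to None rather than raising KeyError.
            usage = run.get("usage")
            rows.append((run["run_number"], usage.get(key) if usage else None))
    return rows
```

With current output every value would be None. If (b) were adopted, the same loop would yield one token count per trial from a single results file.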

Environment

  • waza 0.33.0
  • Executor: copilot-sdk
  • Model: claude-sonnet-4.5
