Summary
waza run --trials N -o results.json aggregates token usage across all N trials into a single summary.usage block. The JSON contains no per-trial token breakdown, so per-run variance, standard deviation, and t-statistics cannot be computed from a single results file.
Motivation
Comparing token usage of an eval with vs without a skill (--no-skills vs default) is one of the most natural waza use cases. To make that comparison statistically defensible you need per-trial token counts to compute mean ± sd and a Welch t-test. With only an aggregate you can compare the two means but you can't report uncertainty or significance, and any difference smaller than the per-run variance is indistinguishable from noise.
In a real experiment (n=10 each arm, claude-sonnet-4.5, single architecture-describing prompt against a small codebase) the per-run total varied from 65k to 167k tokens — CV ≈ 27%. The aggregate alone would have hidden that completely.
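For reference, the statistics mentioned above need nothing more than a list of per-trial totals for each arm. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of waza, and it returns the t-statistic and Welch-Satterthwaite degrees of freedom rather than a p-value (which would need a t-distribution CDF, for example from SciPy).

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # statistics.variance is the sample variance (n - 1 denominator).
    se_a = statistics.variance(a) / len(a)
    se_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


def coefficient_of_variation(xs):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(xs) / statistics.mean(xs)
```

Both functions require at least two values per arm, which is exactly what a single aggregated summary.usage block cannot provide.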
Repro
waza run eval.yaml --no-skills --trials 10 -o agg.json
python3 -c "import json; print(json.dumps(json.load(open('agg.json'))['summary']['usage'], indent=2))"
# {
# "turns": 42,
# "input_tokens": 969340,
# "output_tokens": 10612,
# ...
# }
python3 -c "
import json
d = json.load(open('agg.json'))
r = d['tasks'][0]['runs'][0]
print(list(r.keys()))
# ['attempts', 'duration_ms', 'final_output', 'run_number', 'session_digest',
# 'status', 'transcript', 'validations', 'workspace_dir']
# — no token/usage field on individual runs
"
Inside the per-run transcript, there are session.usage_info and assistant.usage event types but they arrive with only {"type": ...} — the token counts are not populated:
{ "type": "session.usage_info" }
{ "type": "assistant.usage" }
Workaround
Invoke waza run --trials 1 -o results/r${i}.json N times in parallel, then aggregate the N summary.usage blocks externally. This works, but it defeats the purpose of --trials N and adds extra wall-clock setup time to every run.
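The external aggregation step could look roughly like the sketch below. It assumes the single-trial files were written as results/r*.json, as in the command above, and it treats "total tokens" as input_tokens + output_tokens, which is an assumption about what to count rather than anything waza defines. The helper names are hypothetical.

```python
import glob
import json
import statistics


def load_trial_totals(pattern):
    """Return one input+output token total per single-trial results file."""
    totals = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            usage = json.load(f)["summary"]["usage"]
        # Assumption: total = input + output; cache fields are ignored here.
        totals.append(usage["input_tokens"] + usage["output_tokens"])
    return totals


def summarize(totals):
    """Mean and sample standard deviation of the per-trial totals."""
    sd = statistics.stdev(totals) if len(totals) > 1 else 0.0
    return {"n": len(totals), "mean": statistics.mean(totals), "sd": sd}
```

Calling summarize(load_trial_totals("results/r*.json")) once per arm gives the mean ± sd that the aggregated output cannot.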
Proposed fix
Either:
- (a) Populate the existing assistant.usage / session.usage_info transcript events with token counts (they already exist; they're just empty), and/or
- (b) Add a usage field on each entry of tasks[].runs[] that mirrors the summary.usage schema (input_tokens, output_tokens, cache_read_tokens, cache_write_tokens, turns, premium_requests).
(b) is the more useful option for analysis, since it would be a flat field next to the other per-run stats (duration, status, etc.).
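To illustrate why (b) would be convenient, here is a sketch of how an analysis script might read per-run values if that field existed. The runs[].usage field is the proposal in this issue and is not present in waza 0.33.0 output, so the code is hypothetical by construction; per_run_usage is an illustrative name.

```python
def per_run_usage(results, key="input_tokens"):
    """Collect one value per run from the proposed runs[].usage field.

    `results` is the parsed results JSON (a dict). The 'usage' key on each
    run is hypothetical: it is the field option (b) asks for.
    """
    values = []
    for task in results["tasks"]:
        for run in task["runs"]:
            usage = run.get("usage")
            if usage is None:
                raise KeyError(
                    f"run {run.get('run_number')} has no 'usage' field"
                )
            values.append(usage[key])
    return values
```

With that in place, a single results file from --trials N would yield the N per-trial values directly, with no need for N separate invocations.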
Environment
- waza 0.33.0
- Executor: copilot-sdk
- Model: claude-sonnet-4.5