feat: per-trial token usage in results JSON (currently only aggregate summary.usage)

## Summary

`waza run --trials N -o results.json` aggregates token usage across all N trials into a single `summary.usage` block. The JSON contains no per-trial token breakdown, so per-run variance, standard deviation, and t-statistics cannot be computed from a single results file.

## Motivation

Comparing token usage of an eval with vs without a skill (`--no-skills` vs default) is one of the most natural waza use cases. To make that comparison statistically defensible you need per-trial token counts to compute mean ± sd and a Welch t-test. With only an aggregate you can compare the two means but you can't report uncertainty or significance, and any difference smaller than the per-run variance is indistinguishable from noise.

In a real experiment (n=10 per arm, `claude-sonnet-4.5`, a single architecture-describing prompt against a small codebase), the per-run total varied from **65k to 167k tokens** (CV ≈ 27%). The aggregate alone would have hidden that spread completely.
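For reference, the statistics mentioned above need nothing beyond per-trial totals. The following is a minimal stdlib-only sketch, not part of waza: the function names are hypothetical, and the sample numbers are made up for illustration rather than taken from the experiment.

```python
import math
import statistics


def summarize(samples):
    """Return (mean, sample sd, coefficient of variation) for one arm."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)  # sample sd, n - 1 denominator
    return mean, sd, sd / mean


def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical per-trial token totals, for illustration only (not measured data).
with_skill = [100_000, 120_000, 90_000, 110_000]
no_skills = [130_000, 150_000, 95_000, 165_000]

for name, arm in (("with skill", with_skill), ("no skills", no_skills)):
    mean, sd, cv = summarize(arm)
    print(f"{name}: mean={mean:.0f} sd={sd:.0f} cv={cv:.1%}")

t, df = welch_t(with_skill, no_skills)
print(f"Welch t={t:.3f}, df={df:.2f}")
```

A p-value would additionally need the t distribution's CDF (for example from SciPy), which is left out here to keep the sketch dependency-free.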

## Repro

```bash
waza run eval.yaml --no-skills --trials 10 -o agg.json
python3 -c "import json; print(json.dumps(json.load(open('agg.json'))['summary']['usage'], indent=2))"
# {
#   "turns": 42,
#   "input_tokens": 969340,
#   "output_tokens": 10612,
#   ...
# }

python3 -c "
import json
d = json.load(open('agg.json'))
r = d['tasks'][0]['runs'][0]
print(list(r.keys()))
# ['attempts', 'duration_ms', 'final_output', 'run_number', 'session_digest',
#  'status', 'transcript', 'validations', 'workspace_dir']
# — no token/usage field on individual runs
"
```

Inside the per-run `transcript`, there are `session.usage_info` and `assistant.usage` event types but they arrive with only `{"type": ...}` — the token counts are not populated:

```json
{ "type": "session.usage_info" }
{ "type": "assistant.usage" }
```
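The empty events can be counted with a short check like the one below. This is a sketch that assumes `transcript` is a list of JSON objects, each carrying a `type` key as in the excerpt above; the helper name `empty_usage_events` is hypothetical.

```python
# Event types named in this issue that are expected to carry token counts.
USAGE_EVENT_TYPES = {"session.usage_info", "assistant.usage"}


def empty_usage_events(transcript):
    """Return usage events whose only key is "type", i.e. with no payload."""
    return [
        event
        for event in transcript
        if event.get("type") in USAGE_EVENT_TYPES and set(event) == {"type"}
    ]


# Illustrative transcript; the non-usage event and its fields are invented.
sample = [
    {"type": "session.usage_info"},
    {"type": "assistant.message", "content": "hello"},
    {"type": "assistant.usage"},
]
print(len(empty_usage_events(sample)), "usage events carry no token counts")
```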

## Workaround

Invoke `waza run --trials 1 -o results/r${i}.json` N times (in parallel if desired), then aggregate the N `summary.usage` blocks externally. This works, but it defeats the purpose of `--trials N` and pays the setup cost again on every invocation.
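The external aggregation step might look like the sketch below. It assumes the files were written as `results/r*.json` by the loop above and that a trial's total is `input_tokens + output_tokens` (cache tokens are ignored here, which may or may not match how you want to count). The helper names are hypothetical.

```python
import glob
import json
import statistics


def load_usage(path):
    """Read the summary.usage block from one single-trial results file."""
    with open(path) as f:
        return json.load(f)["summary"]["usage"]


def per_trial_totals(usages):
    """Sum input and output tokens for each trial's usage block."""
    return [u["input_tokens"] + u["output_tokens"] for u in usages]


# One file per trial, as produced by the workaround loop.
paths = sorted(glob.glob("results/r*.json"))
totals = per_trial_totals(load_usage(p) for p in paths)

if len(totals) >= 2:  # stdev needs at least two samples
    print(f"n={len(totals)} mean={statistics.mean(totals):.0f} "
          f"sd={statistics.stdev(totals):.0f}")
```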

## Proposed fix

Either:

- (a) Populate the existing `assistant.usage` / `session.usage_info` transcript events with token counts (they already exist; they're just empty), **and/or**
- (b) Add a `usage` field on each entry of `tasks[].runs[]` that mirrors the `summary.usage` schema (`input_tokens`, `output_tokens`, `cache_read_tokens`, `cache_write_tokens`, `turns`, `premium_requests`).

(b) is the more useful option for analysis, since it puts a flat field next to the other per-run stats (duration, status, etc.).
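To illustrate what (b) would enable, here is a sketch of the analysis side. The `usage` key on each run is the proposed field and does not exist in waza `0.33.0`; the function name and the sample document are hypothetical.

```python
def per_run_totals(results):
    """Collect (run_number, input + output tokens) for every run with usage."""
    totals = []
    for task in results.get("tasks", []):
        for run in task.get("runs", []):
            usage = run.get("usage")  # proposed field, absent in waza 0.33.0
            if usage is None:
                continue  # older results files simply yield nothing
            total = usage["input_tokens"] + usage["output_tokens"]
            totals.append((run.get("run_number"), total))
    return totals


# Hypothetical results document showing the proposed shape (invented numbers).
example = {
    "tasks": [
        {
            "runs": [
                {"run_number": 1, "usage": {"input_tokens": 900, "output_tokens": 100}},
                {"run_number": 2, "usage": {"input_tokens": 1400, "output_tokens": 100}},
                {"run_number": 3},  # a run without the proposed field
            ]
        }
    ]
}
print(per_run_totals(example))
```

With this shape, a single `--trials N` results file would be enough input for the mean, sd, and t-test comparison described under Motivation.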

## Environment

- waza `0.33.0`
- Executor: `copilot-sdk`
- Model: `claude-sonnet-4.5`

