Tracking parent: #80171
Depends on: Phase 1 #80172 (which captures usage in the per-cell shape)
Goal
Surface per-runtime token cost in a side-by-side report so that cost regressions between Pi and Codex (in either direction) become PR-blockers, not silent operator surprises. Pash explicitly asked for this: "Would like a full report including token efficiency."
Scope
Live mode only. Mock-openai returns fixed token counts (extensions/qa-lab/src/providers/mock-openai/server.ts:482 and friends), so deltas in mock mode are meaningless and would mislead reviewers. The token-efficiency report runs against live-frontier, gated to scheduled cron.
Concrete deliverables
Code
- New: extensions/qa-lab/src/token-efficiency-report.ts, the per-scenario rollup:
    export type TokenEfficiencyRow = {
      scenarioId: string;
      pi: { inputTokens: number; outputTokens: number; totalTokens: number; toolCallCount: number };
      codex: { inputTokens: number; outputTokens: number; totalTokens: number; toolCallCount: number };
      deltaPercent: number; // ((codex.totalTokens - pi.totalTokens) / pi.totalTokens) * 100
      flagged: boolean; // true when |deltaPercent| > 15
      toolsUsed: string[]; // union across both runtimes
    };
    export type TokenEfficiencyReport = {
      rows: TokenEfficiencyRow[];
      aggregate: {
        pi: { totalTokens: number; p50PerTurn: number; p90PerTurn: number };
        codex: { totalTokens: number; p50PerTurn: number; p90PerTurn: number };
        deltaPercent: number;
        flaggedScenarios: string[];
      };
    };
- Extend: extensions/qa-lab/src/agentic-parity-report.ts to render the token-efficiency table when --token-efficiency is set.
- Extend: extensions/qa-lab/src/cli.ts to accept the qa parity-report --runtime-axis --token-efficiency flag combo.
- Extend: .github/workflows/qa-live-transports-convex.yml (the existing nightly cron home for live parity) with a live_runtime_parity_token_efficiency step that runs nightly and uploads the report as an artifact.
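As a rough sketch of how a row and the rendered table could fit together, assuming the field names from TokenEfficiencyRow and the 15% threshold stated in this issue. The names buildRow, renderTable, RuntimeTotals, and FLAG_THRESHOLD_PERCENT are hypothetical, as are the table column titles and the handling of a zero Pi baseline; none of them are confirmed by the repo.

```typescript
// Illustrative only: field names mirror TokenEfficiencyRow from this issue.
// buildRow, renderTable, and FLAG_THRESHOLD_PERCENT are hypothetical names.
type RuntimeTotals = {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  toolCallCount: number;
};

type TokenEfficiencyRow = {
  scenarioId: string;
  pi: RuntimeTotals;
  codex: RuntimeTotals;
  deltaPercent: number;
  flagged: boolean;
  toolsUsed: string[];
};

const FLAG_THRESHOLD_PERCENT = 15;

function buildRow(
  scenarioId: string,
  pi: RuntimeTotals,
  codex: RuntimeTotals,
  piTools: string[],
  codexTools: string[]
): TokenEfficiencyRow {
  // Assumption: with a zero Pi baseline, report 0% when Codex is also zero,
  // otherwise Infinity, so the row is flagged rather than producing NaN.
  let deltaPercent: number;
  if (pi.totalTokens === 0) {
    deltaPercent = codex.totalTokens === 0 ? 0 : Infinity;
  } else {
    deltaPercent = ((codex.totalTokens - pi.totalTokens) / pi.totalTokens) * 100;
  }
  return {
    scenarioId,
    pi,
    codex,
    deltaPercent,
    flagged: Math.abs(deltaPercent) > FLAG_THRESHOLD_PERCENT,
    // Union of tool names across both runtimes, sorted for stable snapshots.
    toolsUsed: Array.from(new Set(piTools.concat(codexTools))).sort(),
  };
}

function renderTable(rows: TokenEfficiencyRow[]): string {
  const lines = [
    "| Scenario | Pi tokens | Codex tokens | Delta % | Flagged | Tools used |",
    "| --- | ---: | ---: | ---: | --- | --- |",
  ];
  for (const row of rows) {
    const delta = isFinite(row.deltaPercent) ? row.deltaPercent.toFixed(1) : "n/a";
    lines.push(
      `| ${row.scenarioId} | ${row.pi.totalTokens} | ${row.codex.totalTokens} | ` +
        `${delta} | ${row.flagged ? "yes" : "no"} | ${row.toolsUsed.join(", ")} |`
    );
  }
  return lines.join("\n");
}
```

Sorting toolsUsed is a choice made here so that snapshot output does not depend on the order in which each runtime happened to call tools.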
Capture point
Per the extensions/qa-lab/src/CLAUDE.md rule on transport vs assistant-message shapes, capture usage at the assistant-message level (AssistantMessage.usage) rather than the transport level. The transport-level shapes differ between Pi and Codex, but the assistant-message shape is normalised by both runtimes.
The capture wires through the cells.{pi,codex}.usage field already added in Phase 1's RuntimeParityCell.
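A minimal sketch of the assistant-message-level capture described above, under stated assumptions. Only AssistantMessage.usage is named in this issue; the usage field names, the toolCalls field, the CellUsage shape, and the collectCellUsage helper are all hypothetical and may not match Phase 1's RuntimeParityCell.

```typescript
// Hypothetical shapes: the real AssistantMessage and usage fields live in the
// repo and may differ. Only `AssistantMessage.usage` is named in this issue.
type Usage = { inputTokens: number; outputTokens: number; totalTokens: number };

type AssistantMessage = {
  role: "assistant";
  usage?: Usage;
  toolCalls?: { name: string }[];
};

type CellUsage = Usage & { toolCallCount: number; perTurnTotals: number[] };

// Sum usage over assistant messages only, never over transport-level events,
// so the same function works for either runtime.
function collectCellUsage(messages: AssistantMessage[]): CellUsage {
  const cell: CellUsage = {
    inputTokens: 0,
    outputTokens: 0,
    totalTokens: 0,
    toolCallCount: 0,
    perTurnTotals: [],
  };
  for (const message of messages) {
    // Tool calls are counted even when the message carries no usage.
    cell.toolCallCount += message.toolCalls ? message.toolCalls.length : 0;
    // Assumption: a turn without usage contributes no tokens and no per-turn entry.
    if (!message.usage) continue;
    cell.inputTokens += message.usage.inputTokens;
    cell.outputTokens += message.usage.outputTokens;
    cell.totalTokens += message.usage.totalTokens;
    cell.perTurnTotals.push(message.usage.totalTokens);
  }
  return cell;
}
```

Keeping perTurnTotals alongside the sums is one way to feed the p50PerTurn and p90PerTurn aggregates without a second pass over the transcript.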
Tests
- Snapshot tests for the table rendering with three scenarios (one delta-low, one delta-flagged, one tool-call-difference).
- Aggregate-percentile test on a fixture with known per-turn distributions.
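For the aggregate-percentile test, a fixture with a known distribution makes the expected p50 and p90 easy to state by hand. The sketch below assumes the nearest-rank definition of a percentile and returns 0 for an empty input; this issue does not say which percentile method the report uses, and percentileNearestRank is a hypothetical helper name.

```typescript
// Nearest-rank percentile: the value at rank ceil(p/100 * n) in the sorted list.
// Assumption: the issue does not specify a percentile method; this is one option.
function percentileNearestRank(values: number[], p: number): number {
  if (values.length === 0) return 0; // assumption: an empty cell reports 0
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  // Clamp so p = 0 maps to the first element and p = 100 to the last.
  const index = Math.min(Math.max(rank, 1), sorted.length) - 1;
  return sorted[index];
}

// Fixture: ten per-turn totals, deliberately unsorted.
const perTurnTotals = [300, 100, 1000, 200, 900, 400, 800, 500, 700, 600];
const p50PerTurn = percentileNearestRank(perTurnTotals, 50); // 500
const p90PerTurn = percentileNearestRank(perTurnTotals, 90); // 900
```

Nearest-rank always returns a value that actually occurred in a turn, which keeps fixture expectations free of interpolation arithmetic.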
Acceptance criteria
- qa parity-report --runtime-axis --token-efficiency produces a Markdown side-by-side table with delta% and tool-coverage columns.
Out of scope
- Real-customer transcript token capture (Phase 5 is the harness, not a billing dashboard).
- Cross-vendor cost comparison: that's still on the model-axis gate.
- Per-tool token attribution: captured but rolled up to per-scenario in this PR; per-tool can be a follow-up if needed.
References
- .github/workflows/qa-live-transports-convex.yml
- extensions/qa-lab/src/providers/mock-openai/server.ts:482