[Codex×Pi parity Phase 4] Token-efficiency report #80175

**Tracking parent:** #80171
**Depends on:** Phase 1 #80172 (which captures `usage` in the per-cell shape)

## Goal

Surface per-runtime token cost in a side-by-side report so that token-cost regressions between Pi and Codex (in either direction) become PR-blockers, not silent operator surprises. Pash explicitly asked for this: "Would like a full report including token efficiency."

## Scope

**Live mode only.** Mock-openai returns fixed token counts (`extensions/qa-lab/src/providers/mock-openai/server.ts:482` and friends), so deltas in mock mode are meaningless and would mislead reviewers. The token-efficiency report runs against `live-frontier`, gated to scheduled cron.
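A minimal sketch of the live-only gate, assuming the provider mode is available as a plain string. The `ProviderMode` values, the `GateResult` type, and `gateTokenEfficiency` are hypothetical names for illustration, not existing qa-lab APIs; the skip wording is modeled on the acceptance criteria in this issue.

```typescript
// Hypothetical mode identifiers; the real qa-lab names may differ.
type ProviderMode = "mock-openai" | "live-frontier";

type GateResult =
  | { kind: "run" }
  | { kind: "skipped"; message: string };

// Decide whether the token-efficiency report should run for a provider mode.
function gateTokenEfficiency(mode: ProviderMode): GateResult {
  if (mode !== "live-frontier") {
    // Mock providers return fixed token counts, so any delta would be noise.
    return {
      kind: "skipped",
      message: "token-efficiency report skipped: mock provider returns fixed counts",
    };
  }
  return { kind: "run" };
}

const gate = gateTokenEfficiency("mock-openai");
if (gate.kind === "skipped") {
  console.log(gate.message);
}
```

Returning a tagged result rather than throwing lets the caller print the message and exit cleanly, so a mock-mode run is not reported as a failure.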

## Concrete deliverables

### Code

- **New** `extensions/qa-lab/src/token-efficiency-report.ts` — per-scenario rollup:
  ```ts
  export type TokenEfficiencyRow = {
    scenarioId: string;
    pi: { inputTokens: number; outputTokens: number; totalTokens: number; toolCallCount: number };
    codex: { inputTokens: number; outputTokens: number; totalTokens: number; toolCallCount: number };
    deltaPercent: number;          // ((codex.totalTokens - pi.totalTokens) / pi.totalTokens) * 100
    flagged: boolean;              // true when |deltaPercent| > 15
    toolsUsed: string[];           // union across both runtimes
  };
  export type TokenEfficiencyReport = {
    rows: TokenEfficiencyRow[];
    aggregate: {
      pi: { totalTokens: number; p50PerTurn: number; p90PerTurn: number };
      codex: { totalTokens: number; p50PerTurn: number; p90PerTurn: number };
      deltaPercent: number;
      flaggedScenarios: string[];
    };
  };
  ```
- **Extend** `extensions/qa-lab/src/agentic-parity-report.ts` — render the token-efficiency table when `--token-efficiency` is set.
- **Extend** `extensions/qa-lab/src/cli.ts` — `qa parity-report --runtime-axis --token-efficiency` flag combo.
- **Extend** `.github/workflows/qa-live-transports-convex.yml` (the existing nightly cron home for live parity) — add a `live_runtime_parity_token_efficiency` step that runs nightly and uploads the report as an artifact.
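The `deltaPercent` and `flagged` rules from the `TokenEfficiencyRow` comments can be sketched as follows. This is an illustration under assumptions, not the planned implementation: `RuntimeUsage`, `computeDeltaPercent`, `buildRow`, and `FLAG_THRESHOLD_PERCENT` are hypothetical names, and the zero-baseline handling is a choice the issue does not specify.

```typescript
type RuntimeUsage = {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  toolCallCount: number;
};

type TokenEfficiencyRow = {
  scenarioId: string;
  pi: RuntimeUsage;
  codex: RuntimeUsage;
  deltaPercent: number;
  flagged: boolean;
  toolsUsed: string[];
};

const FLAG_THRESHOLD_PERCENT = 15;

// Percentage change of Codex total tokens relative to the Pi baseline.
function computeDeltaPercent(piTotal: number, codexTotal: number): number {
  if (piTotal === 0) {
    // Assumption: 0 vs 0 is "no change"; any usage against a zero baseline is unbounded.
    return codexTotal === 0 ? 0 : Number.POSITIVE_INFINITY;
  }
  return ((codexTotal - piTotal) / piTotal) * 100;
}

function buildRow(
  scenarioId: string,
  pi: RuntimeUsage,
  codex: RuntimeUsage,
  piTools: string[],
  codexTools: string[],
): TokenEfficiencyRow {
  const deltaPercent = computeDeltaPercent(pi.totalTokens, codex.totalTokens);
  // Union of tool names across both runtimes, de-duplicated and sorted for stable output.
  const toolsUsed = piTools
    .concat(codexTools)
    .filter((tool, index, all) => all.indexOf(tool) === index)
    .sort();
  return {
    scenarioId,
    pi,
    codex,
    deltaPercent,
    flagged: Math.abs(deltaPercent) > FLAG_THRESHOLD_PERCENT,
    toolsUsed,
  };
}
```

Note that `Infinity` serializes to `null` in JSON, so a real implementation writing the JSON sidecar would need an explicit convention for the zero-baseline case.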

### Capture point

Per the rule in `extensions/qa-lab/src/CLAUDE.md` on transport vs assistant-message shapes, capture usage at the **assistant-message level** (`AssistantMessage.usage`) rather than at the transport level. The transport-level shapes differ between Pi and Codex, but the assistant-message shape is normalised by both runtimes.

The capture wires through the `cells.{pi,codex}.usage` field already added in Phase 1's `RuntimeParityCell`.
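As a rough sketch of assistant-message-level capture, the per-cell usage could be a sum over the assistant messages in a scenario transcript. The field names inside `usage` and the `toolCalls` array are assumptions for illustration; only `AssistantMessage.usage` itself is named in this issue, and `sumAssistantUsage` is a hypothetical helper.

```typescript
// Assumed normalised shape; the real AssistantMessage type may carry more fields.
type Usage = { inputTokens: number; outputTokens: number; totalTokens: number };

type AssistantMessage = {
  role: "assistant";
  usage?: Usage;
  toolCalls?: { name: string }[];
};

type CellUsage = Usage & { toolCallCount: number };

// Sum usage across all assistant messages of one runtime's run of a scenario.
function sumAssistantUsage(messages: AssistantMessage[]): CellUsage {
  const total: CellUsage = {
    inputTokens: 0,
    outputTokens: 0,
    totalTokens: 0,
    toolCallCount: 0,
  };
  for (const message of messages) {
    if (message.usage) {
      total.inputTokens += message.usage.inputTokens;
      total.outputTokens += message.usage.outputTokens;
      total.totalTokens += message.usage.totalTokens;
    }
    // Messages without tool calls contribute zero.
    total.toolCallCount += message.toolCalls ? message.toolCalls.length : 0;
  }
  return total;
}
```

Because the same function would run over both the Pi and Codex transcripts, any difference in the result reflects the runtimes rather than the capture path.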

### Tests

- Snapshot tests for the table rendering with three scenarios (one delta-low, one delta-flagged, one tool-call-difference).
- Aggregate-percentile test on a fixture with known per-turn distributions.
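For the aggregate-percentile test, one possible definition of `p50PerTurn` and `p90PerTurn` is the nearest-rank percentile over per-turn token totals. The issue does not say which percentile method is intended, so the method here and the `percentileNearestRank` name are assumptions.

```typescript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentileNearestRank(values: number[], p: number): number {
  if (values.length === 0) {
    // Assumption: an empty distribution reports 0 rather than throwing.
    return 0;
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be in the range (0, 100]");
  }
  // Copy before sorting so the caller's array is not mutated.
  const sorted = values.slice().sort((a, b) => a - b);
  // Integer arithmetic first, then divide, to avoid floating-point drift in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Fixture with a known per-turn distribution.
const perTurnTokens = [1000, 100, 900, 200, 800, 300, 700, 400, 600, 500];
const p50 = percentileNearestRank(perTurnTokens, 50); // 500
const p90 = percentileNearestRank(perTurnTokens, 90); // 900
console.log(p50, p90);
```

A fixture of ten evenly spaced values makes the expected ranks easy to check by hand: rank 5 for p50 and rank 9 for p90.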

## Acceptance criteria

- [ ] `qa parity-report --runtime-axis --token-efficiency` produces a side-by-side Markdown table with a delta% column and a tool-coverage column.
- [ ] Per-runtime aggregates: total, p50, p90 per turn.
- [ ] Scenario-level delta >15% flag fires (and is asserted in tests).
- [ ] Report is stored as a release artifact for week-over-week tracking: Markdown plus a JSON sidecar (both compress well).
- [ ] Nightly cron lane wired up; runs against live-frontier only.
- [ ] Mock-mode invocation produces a clear "skipped — mock provider returns fixed counts" message instead of a misleading zero-delta table.
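The side-by-side Markdown table in the first criterion might be rendered along these lines. The column set, the `yes` marker for flagged rows, and the `renderTokenEfficiencyTable` and `formatDelta` names are illustrative assumptions; the row type is trimmed to the fields the table needs.

```typescript
type TableRow = {
  scenarioId: string;
  pi: { totalTokens: number };
  codex: { totalTokens: number };
  deltaPercent: number;
  flagged: boolean;
  toolsUsed: string[];
};

// Format a delta with an explicit sign and one decimal place, e.g. "+20.0%".
function formatDelta(deltaPercent: number): string {
  const sign = deltaPercent > 0 ? "+" : "";
  return `${sign}${deltaPercent.toFixed(1)}%`;
}

function renderTokenEfficiencyTable(rows: TableRow[]): string {
  const header = [
    "| Scenario | Pi total | Codex total | Delta % | Flagged | Tools used |",
    "| --- | ---: | ---: | ---: | :---: | --- |",
  ];
  const body = rows.map((row) =>
    [
      "",
      row.scenarioId,
      String(row.pi.totalTokens),
      String(row.codex.totalTokens),
      formatDelta(row.deltaPercent),
      row.flagged ? "yes" : "",
      row.toolsUsed.join(", "),
      "",
    ]
      .join(" | ")
      .trim(),
  );
  return header.concat(body).join("\n");
}

console.log(
  renderTokenEfficiencyTable([
    { scenarioId: "s1", pi: { totalTokens: 1000 }, codex: { totalTokens: 1200 }, deltaPercent: 20, flagged: true, toolsUsed: ["bash", "read"] },
  ]),
);
```

Keeping the renderer a pure function of the rows makes the snapshot tests straightforward, since the same input always yields the same string.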

## Out of scope

- Real-customer transcript token capture (Phase 5 is the harness, not a billing dashboard).
- Cross-vendor cost comparison — that's still on the model-axis gate.
- Per-tool token attribution — captured but rolled up to per-scenario in this PR; per-tool can be a follow-up if needed.

## References

- Tracking parent: #80171
- Phase 1 (capture infrastructure): #80172
- Existing live-mode home: `.github/workflows/qa-live-transports-convex.yml`
- Mock-mode caveat: `extensions/qa-lab/src/providers/mock-openai/server.ts:482`

