[Codex×Pi parity Phase 4] Token-efficiency report #80175

@100yenadmin

Tracking parent: #80171
Depends on: Phase 1 #80172 (which captures usage in the per-cell shape)

Goal

Surface per-runtime token cost in a side-by-side report so that cost regressions when moving from Pi to Codex (or vice versa) become PR blockers rather than silent operator surprises. Pash explicitly asked for this: "Would like a full report including token efficiency."

Scope

Live mode only. Mock-openai returns fixed token counts (extensions/qa-lab/src/providers/mock-openai/server.ts:482 and friends), so deltas in mock mode are meaningless and would mislead reviewers. The token-efficiency report runs against live-frontier, gated to scheduled cron.

Concrete deliverables

Code

  • New extensions/qa-lab/src/token-efficiency-report.ts — per-scenario rollup:
    export type TokenEfficiencyRow = {
      scenarioId: string;
      pi: { inputTokens: number; outputTokens: number; totalTokens: number; toolCallCount: number };
      codex: { inputTokens: number; outputTokens: number; totalTokens: number; toolCallCount: number };
      deltaPercent: number;          // ((codex.totalTokens - pi.totalTokens) / pi.totalTokens) * 100
      flagged: boolean;              // |deltaPercent| > 15 → true
      toolsUsed: string[];           // union across both runtimes
    };
    export type TokenEfficiencyReport = {
      rows: TokenEfficiencyRow[];
      aggregate: {
        pi: { totalTokens: number; p50PerTurn: number; p90PerTurn: number };
        codex: { totalTokens: number; p50PerTurn: number; p90PerTurn: number };
        deltaPercent: number;
        flaggedScenarios: string[];
      };
    };
  • Extend extensions/qa-lab/src/agentic-parity-report.ts — render the token-efficiency table when --token-efficiency is set.
  • Extend extensions/qa-lab/src/cli.ts: add the qa parity-report --runtime-axis --token-efficiency flag combo.
  • Extend .github/workflows/qa-live-transports-convex.yml (the existing nightly cron home for live parity) — add a live_runtime_parity_token_efficiency step that runs nightly and uploads the report as an artifact.
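
As a rough illustration of the per-scenario rollup, the sketch below derives deltaPercent, flagged, and toolsUsed from two per-runtime usage records. The helper names (computeDeltaPercent, buildRow), the FLAG_THRESHOLD_PERCENT constant, and the zero-baseline handling are assumptions made for this example; the issue does not say how a Pi total of zero should be treated.

```typescript
// Per-runtime usage shape, mirroring the pi/codex fields of TokenEfficiencyRow.
type RuntimeUsage = {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  toolCallCount: number;
};

type TokenEfficiencyRow = {
  scenarioId: string;
  pi: RuntimeUsage;
  codex: RuntimeUsage;
  deltaPercent: number;
  flagged: boolean;
  toolsUsed: string[];
};

// Threshold taken from the "|deltaPercent| > 15" rule in the type comment.
const FLAG_THRESHOLD_PERCENT = 15;

// ASSUMPTION: with a zero Pi baseline, return 0 if Codex is also zero and
// Infinity otherwise (so the row is always flagged). Note JSON.stringify
// serialises Infinity as null, so a JSON sidecar may want another sentinel.
function computeDeltaPercent(piTotal: number, codexTotal: number): number {
  if (piTotal === 0) {
    return codexTotal === 0 ? 0 : Infinity;
  }
  return ((codexTotal - piTotal) / piTotal) * 100;
}

// Hypothetical helper: builds one row from the usage of both runtimes.
function buildRow(
  scenarioId: string,
  pi: RuntimeUsage,
  codex: RuntimeUsage,
  piTools: string[],
  codexTools: string[],
): TokenEfficiencyRow {
  const deltaPercent = computeDeltaPercent(pi.totalTokens, codex.totalTokens);
  return {
    scenarioId,
    pi,
    codex,
    deltaPercent,
    flagged: Math.abs(deltaPercent) > FLAG_THRESHOLD_PERCENT,
    // Union of tool names across both runtimes, de-duplicated and sorted.
    toolsUsed: Array.from(new Set(piTools.concat(codexTools))).sort(),
  };
}
```

Because the flag uses the absolute value, a scenario where Codex is much cheaper than Pi is flagged as well, which matches the "or vice versa" wording in the goal.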

Capture point

Per extensions/qa-lab/src/CLAUDE.md rule on transport vs assistant-message shapes: capture usage at the assistant-message level (AssistantMessage.usage) rather than the transport level. The transport-level shapes differ between Pi and Codex but the assistant-message shape is normalised by both runtimes.

The capture wires through the cells.{pi,codex}.usage field already added in Phase 1's RuntimeParityCell.
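
A minimal sketch of what assistant-message-level capture could look like. The AssistantMessage and Usage shapes below are hypothetical stand-ins (the real types live in the repo and may differ), as is the summarizeUsage helper; only the idea of summing AssistantMessage.usage per cell comes from this issue.

```typescript
// ASSUMPTION: simplified stand-ins for the real message types.
type Usage = { inputTokens: number; outputTokens: number };
type AssistantMessage = {
  role: "assistant";
  usage?: Usage;
  toolCalls?: { name: string }[];
};

type CellUsage = {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
  toolCallCount: number;
};

// Hypothetical helper: sums usage across every assistant message in one run.
// Messages without a usage field contribute zero tokens.
function summarizeUsage(messages: AssistantMessage[]): CellUsage {
  let inputTokens = 0;
  let outputTokens = 0;
  let toolCallCount = 0;
  for (const message of messages) {
    inputTokens += message.usage?.inputTokens ?? 0;
    outputTokens += message.usage?.outputTokens ?? 0;
    toolCallCount += message.toolCalls?.length ?? 0;
  }
  return {
    inputTokens,
    outputTokens,
    totalTokens: inputTokens + outputTokens,
    toolCallCount,
  };
}
```

The same function would be applied to the Pi and Codex transcripts independently, so neither runtime's transport-level shape leaks into the report.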

Tests

  • Snapshot tests for the table rendering with three scenarios (one delta-low, one delta-flagged, one tool-call-difference).
  • Aggregate-percentile test on a fixture with known per-turn distributions.
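
For the aggregate-percentile test, one possible approach is sketched below. The issue does not specify a percentile method, so the nearest-rank definition used here is an assumption, as are the percentile helper name and the empty-input behaviour.

```typescript
// Nearest-rank percentile over per-turn token counts.
// ASSUMPTION: nearest-rank is one of several common definitions; an
// interpolating method would give different values on small fixtures.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    return 0; // ASSUMPTION: an empty run reports 0 rather than throwing.
  }
  const sorted = values.slice().sort((a, b) => a - b);
  // Integer arithmetic first to avoid floating-point drift in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Fixture with a known distribution: ten turns, 10..100 tokens, shuffled.
const perTurnTokens = [100, 30, 10, 70, 50, 90, 20, 60, 40, 80];
const p50PerTurn = percentile(perTurnTokens, 50); // 5th smallest value
const p90PerTurn = percentile(perTurnTokens, 90); // 9th smallest value
```

A fixture like this keeps the expected values easy to verify by hand, which is the point of testing against a known per-turn distribution.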

Acceptance criteria

  • qa parity-report --runtime-axis --token-efficiency produces a Markdown side-by-side table with delta % and tool-coverage columns.
  • Per-runtime aggregates: total, p50, p90 per turn.
  • Scenario-level flag fires when |delta| > 15% (and is asserted in tests).
  • Stored as a release artifact for week-over-week tracking (compresses well, JSON sidecar alongside Markdown).
  • Nightly cron lane wired up; runs against live-frontier only.
  • Mock-mode invocation produces a clear "skipped — mock provider returns fixed counts" message instead of a misleading zero-delta table.
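
The rendering and mock-mode criteria above could be sketched as follows. The ReportRow shape, the renderTokenEfficiencyTable name, the column layout, and the exact skip wording are all assumptions for illustration; only the "live-frontier" mode name and the skip-instead-of-zero-delta behaviour come from this issue.

```typescript
// ASSUMPTION: flattened row shape used only for rendering in this sketch.
type ReportRow = {
  scenarioId: string;
  piTotalTokens: number;
  codexTotalTokens: number;
  deltaPercent: number;
  flagged: boolean;
  toolsUsed: string[];
};

// Formats a delta with an explicit sign and one decimal place.
function formatDelta(deltaPercent: number): string {
  const sign = deltaPercent > 0 ? "+" : "";
  return `${sign}${deltaPercent.toFixed(1)}%`;
}

// Hypothetical renderer: returns a skip message for any non-live provider
// mode, otherwise a Markdown table with one line per scenario.
function renderTokenEfficiencyTable(providerMode: string, rows: ReportRow[]): string {
  if (providerMode !== "live-frontier") {
    return "Token-efficiency report skipped: mock provider returns fixed counts.";
  }
  const header = [
    "| Scenario | Pi tokens | Codex tokens | Delta % | Flagged | Tools used |",
    "| --- | ---: | ---: | ---: | :---: | --- |",
  ];
  const body = rows.map(
    (r) =>
      `| ${r.scenarioId} | ${r.piTotalTokens} | ${r.codexTotalTokens} | ` +
      `${formatDelta(r.deltaPercent)} | ${r.flagged ? "yes" : "no"} | ${r.toolsUsed.join(", ")} |`,
  );
  return header.concat(body).join("\n");
}
```

Returning the skip message from the renderer itself, rather than relying on the caller to check the mode, makes it harder for a mock run to emit a misleading table by accident.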

Out of scope

  • Real-customer transcript token capture (Phase 5 is the harness, not a billing dashboard).
  • Cross-vendor cost comparison — that's still on the model-axis gate.
  • Per-tool token attribution — captured but rolled up to per-scenario in this PR; per-tool can be a follow-up if needed.
