[Codex×Pi parity Phase 1] Runtime axis in qa-lab parity machinery #80172

@100yenadmin

Tracking parent: #80171

Goal

Add a runtime axis (pi vs codex) to the existing qa-lab parity machinery so that the same scenario, with the same model ref, can be run under both runtimes and the outputs diffed. This unblocks the Codex-vs-Pi parity harness: every other phase composes on top of it.

Scope

Reuse current scenarios (no new fixtures yet — Phase 2 covers per-tool fixtures). Reuse current parity-report plumbing where possible. The deliverable is the orchestrator + drift classifier + report extension + CI wiring.

Concrete deliverables

Code

  • New extensions/qa-lab/src/runtime-parity.ts — orchestrator. Public API:

    export type RuntimeId = "pi" | "codex";
    export type RuntimeParityCell = {
      runtime: RuntimeId;
      transcriptBytes: string;
      toolCalls: Array<{ tool: string; argsHash: string; resultHash: string; errorClass?: string }>;
      finalText: string;
      usage: { inputTokens: number; outputTokens: number; totalTokens: number; cacheRead?: number; cacheWrite?: number };
      wallClockMs: number;
      transportErrorClass?: string;
      runtimeErrorClass?: string;
      bootStateLines: string[];
    };
    export type RuntimeParityResult = {
      scenarioId: string;
      cells: { pi: RuntimeParityCell; codex: RuntimeParityCell };
      drift: "none" | "text-only" | "tool-call-shape" | "tool-result-shape" | "structural" | "failure-mode";
      driftDetails?: string;
    };
    export function runRuntimeParityScenario(params: {...}): Promise<RuntimeParityResult>;
  • New extensions/qa-lab/src/runtime-parity.test.ts — unit tests for the drift classifier (each of the six drift categories has a fixture transcript pair).

  • Extend src/agents/model-runtime-policy.ts — add OPENCLAW_QA_FORCE_RUNTIME=pi|codex env-var override at the top of resolveModelRuntimePolicy. Gate read to process.env.OPENCLAW_BUILD_PRIVATE_QA === "1" so the seam cannot fire in production builds. Document as test-only in JSDoc.

  • Extend extensions/qa-lab/src/agentic-parity-report.ts — add a runtime field to per-cell summary, a runtimeDrift rollup section, and a Markdown rendering for the drift category table.

  • Extend extensions/qa-lab/src/cli.ts — new qa suite --runtime-pair pi,codex flag. When set, suite runner runs each scenario twice and feeds both cells through runRuntimeParityScenario.

  • Extend .github/workflows/openclaw-release-checks.yml — new step qa_lab_runtime_parity_release_checks parallel to the existing qa_lab_parity_lane_release_checks. Same gating (OPENCLAW_BUILD_PRIVATE_QA=1).
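
The gated override described above could be read along the following lines. This is a minimal sketch, not the actual change to resolveModelRuntimePolicy, whose signature is not shown in this issue. The helper name readForcedQaRuntime and the choice to pass the environment in as a parameter (rather than reading process.env directly) are assumptions made so the gate can be unit-tested in isolation.

```typescript
type RuntimeId = "pi" | "codex";
type Env = Record<string, string | undefined>;

/**
 * Test-only seam (hypothetical helper, not part of the existing codebase).
 * Returns a forced runtime only when the private-QA build gate is set.
 */
function readForcedQaRuntime(env: Env): RuntimeId | undefined {
  // Check the gate first: without it the override variable is never consulted.
  if (env["OPENCLAW_BUILD_PRIVATE_QA"] !== "1") return undefined;
  const raw = env["OPENCLAW_QA_FORCE_RUNTIME"];
  // Accept only the two documented values; anything else falls through
  // to the normal policy resolution (ASSUMPTION: invalid values are ignored).
  return raw === "pi" || raw === "codex" ? raw : undefined;
}
```

A caller at the top of resolveModelRuntimePolicy would presumably invoke this with process.env and return early when the result is defined; how the forced runtime maps onto the policy's return type is left open here.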

Tests

  • Unit tests for the drift classifier (six fixtures, one per drift category).
  • Integration test: run two existing agentic scenarios through runRuntimeParityScenario against mock-openai, assert the report shape is correct.
  • Snapshot test for the new Markdown rollup section.
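
For orientation, the classifier that the six fixtures exercise might look roughly like this. The issue names the drift categories but does not define them, so the interpretation of each category and the precedence order (most severe first) are assumptions, as are the names classifyDrift and CellView. Treat it as a sketch to argue with, not the intended implementation.

```typescript
type DriftCategory =
  | "none" | "text-only" | "tool-call-shape"
  | "tool-result-shape" | "structural" | "failure-mode";
type ToolCall = { tool: string; argsHash: string; resultHash: string; errorClass?: string };
// Subset of RuntimeParityCell that this sketch reads (hypothetical local type).
type CellView = {
  toolCalls: ToolCall[];
  finalText: string;
  transportErrorClass?: string;
  runtimeErrorClass?: string;
};

// ASSUMPTION: the first matching check wins, ordered from most to least severe.
function classifyDrift(pi: CellView, codex: CellView): DriftCategory {
  // Different error classes (including error vs no error) -> failure-mode.
  if (
    pi.transportErrorClass !== codex.transportErrorClass ||
    pi.runtimeErrorClass !== codex.runtimeErrorClass
  ) {
    return "failure-mode";
  }
  // Different sequence of tool names (count or order) -> structural.
  const piTools = JSON.stringify(pi.toolCalls.map((c) => c.tool));
  const codexTools = JSON.stringify(codex.toolCalls.map((c) => c.tool));
  if (piTools !== codexTools) return "structural";

  // Sequences match, so calls can be compared position by position.
  const pairs = pi.toolCalls.map((call, i) => ({ a: call, b: codex.toolCalls[i] }));
  if (pairs.some(({ a, b }) => a.argsHash !== b?.argsHash)) return "tool-call-shape";
  if (pairs.some(({ a, b }) => a.resultHash !== b?.resultHash || a.errorClass !== b?.errorClass)) {
    return "tool-result-shape";
  }
  // Same calls and results; only the final text can still differ.
  return pi.finalText === codex.finalText ? "none" : "text-only";
}
```

Under this ordering, each fixture pair only needs to differ in one respect to land in its category, which keeps the six unit-test fixtures small.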

Acceptance criteria

  • pnpm openclaw qa suite --provider-mode mock-openai --parity-pack agentic --runtime-pair pi,codex runs each existing agentic scenario twice (once per runtime) and emits a summary with a runtime field per cell.
  • The drift classifier emits one of {none, text-only, tool-call-shape, tool-result-shape, structural, failure-mode} per scenario.
  • New qa parity-report mode --runtime-axis produces a side-by-side Markdown table.
  • OPENCLAW_QA_FORCE_RUNTIME env var, gated to OPENCLAW_BUILD_PRIVATE_QA=1, is documented as test-only and has a unit test asserting it does NOT fire when the gate is unset.
  • CI wiring runs the runtime-pair on existing scenarios in mock-openai mode.
  • All existing parity tests still green; no behavior change for non-QA users.
  • pnpm exec oxlint --type-aware and pnpm check:test-types clean.
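
To make the side-by-side Markdown table criterion concrete, here is one possible rendering. The column set, the row shape (RuntimeAxisRow) and the function name renderRuntimeAxisTable are all assumptions for illustration; the issue does not specify which per-cell fields appear in the table.

```typescript
// Hypothetical row shape derived from a RuntimeParityResult.
type RuntimeAxisRow = {
  scenarioId: string;
  drift: string;
  pi: { toolCallCount: number; wallClockMs: number };
  codex: { toolCallCount: number; wallClockMs: number };
};

// Escape pipes so a scenario id cannot break the table layout.
function escapeCell(text: string): string {
  return text.replace(/\|/g, "\\|");
}

function renderRuntimeAxisTable(rows: RuntimeAxisRow[]): string {
  const header = "| Scenario | pi tool calls | codex tool calls | pi ms | codex ms | Drift |";
  const divider = "| --- | ---: | ---: | ---: | ---: | --- |";
  const body = rows.map(
    (r) =>
      `| ${escapeCell(r.scenarioId)} | ${r.pi.toolCallCount} | ${r.codex.toolCallCount} ` +
      `| ${r.pi.wallClockMs} | ${r.codex.wallClockMs} | ${escapeCell(r.drift)} |`,
  );
  return [header, divider, ...body].join("\n");
}
```

Keeping the renderer a pure function of plain rows should make the snapshot test straightforward, since the output depends only on its input.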

Out of scope (subsequent phases)

  • Per-tool fixture set → #PhaseTwoIssue
  • Codex-plugin lifecycle scenarios → #PhaseThreeIssue
  • Token efficiency reporting → Phase 4 (this phase only captures usage in the per-cell shape; the side-by-side report comes later)
  • JSONL replay → #PhaseFiveIssue
