[Codex×Pi parity Phase 1] Runtime axis in qa-lab parity machinery #80172

**Tracking parent:** #80171

## Goal

Add a `runtime` axis (`pi` vs `codex`) to the existing qa-lab parity machinery so that the same scenario with the same model ref can be run under both runtimes and the outputs diffed. This unblocks the Codex-vs-Pi parity harness: every other phase builds on top of it.

## Scope

Reuse the current scenarios; no new fixtures yet (Phase 2 covers per-tool fixtures). Reuse the current parity-report plumbing where possible. The deliverables are the orchestrator, the drift classifier, the report extension, and the CI wiring.

## Concrete deliverables

### Code

- **New** `extensions/qa-lab/src/runtime-parity.ts` — orchestrator. Public API:
  ```ts
  export type RuntimeId = "pi" | "codex";
  export type RuntimeParityCell = {
    runtime: RuntimeId;
    transcriptBytes: string;
    toolCalls: Array<{ tool: string; argsHash: string; resultHash: string; errorClass?: string }>;
    finalText: string;
    usage: { inputTokens: number; outputTokens: number; totalTokens: number; cacheRead?: number; cacheWrite?: number };
    wallClockMs: number;
    transportErrorClass?: string;
    runtimeErrorClass?: string;
    bootStateLines: string[];
  };
  export type RuntimeParityResult = {
    scenarioId: string;
    cells: { pi: RuntimeParityCell; codex: RuntimeParityCell };
    drift: "none" | "text-only" | "tool-call-shape" | "tool-result-shape" | "structural" | "failure-mode";
    driftDetails?: string;
  };
  export function runRuntimeParityScenario(params: {...}): Promise<RuntimeParityResult>;
  ```

- **New** `extensions/qa-lab/src/runtime-parity.test.ts` — unit tests for the drift classifier (each of the six drift categories has a fixture transcript pair).

- **Extend** `src/agents/model-runtime-policy.ts` — add `OPENCLAW_QA_FORCE_RUNTIME=pi|codex` env-var override at the top of `resolveModelRuntimePolicy`. Gate read to `process.env.OPENCLAW_BUILD_PRIVATE_QA === "1"` so the seam cannot fire in production builds. Document as test-only in JSDoc.

- **Extend** `extensions/qa-lab/src/agentic-parity-report.ts` — add a `runtime` field to per-cell summary, a `runtimeDrift` rollup section, and a Markdown rendering for the drift category table.

- **Extend** `extensions/qa-lab/src/cli.ts` — new `qa suite --runtime-pair pi,codex` flag. When set, suite runner runs each scenario twice and feeds both cells through `runRuntimeParityScenario`.

- **Extend** `.github/workflows/openclaw-release-checks.yml` — new step `qa_lab_runtime_parity_release_checks` parallel to the existing `qa_lab_parity_lane_release_checks`. Same gating (`OPENCLAW_BUILD_PRIVATE_QA=1`).
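
To make the drift categories concrete, here is a minimal sketch of how the classifier could rank them. The issue does not define the categories or their precedence, so everything below is an assumption: `classifyDrift`, `CellView`, and `ToolCallView` are hypothetical names, "structural" is read as "different tool sequence", and the most severe category wins.

```ts
// Trimmed view of the proposed RuntimeParityCell: only the fields this sketch reads.
type RuntimeId = "pi" | "codex";
type ToolCallView = { tool: string; argsHash: string; resultHash: string; errorClass?: string };
type CellView = {
  runtime: RuntimeId;
  toolCalls: ToolCallView[];
  finalText: string;
  transportErrorClass?: string;
  runtimeErrorClass?: string;
};
type Drift = "none" | "text-only" | "tool-call-shape" | "tool-result-shape" | "structural" | "failure-mode";

// Assumed precedence (not specified in the issue): check the most severe category first.
function classifyDrift(pi: CellView, codex: CellView): Drift {
  // The runtimes failed differently, or only one of them failed.
  if (
    pi.transportErrorClass !== codex.transportErrorClass ||
    pi.runtimeErrorClass !== codex.runtimeErrorClass
  ) {
    return "failure-mode";
  }
  const a = pi.toolCalls;
  const b = codex.toolCalls;
  // Assumption: "structural" means a different number or order of tools.
  if (a.length !== b.length || a.some((call, i) => call.tool !== b[i]?.tool)) {
    return "structural";
  }
  // Same tool sequence, but at least one call was made with different arguments.
  if (a.some((call, i) => call.argsHash !== b[i]?.argsHash)) {
    return "tool-call-shape";
  }
  // Same calls, but a result or a per-call error class differs.
  if (a.some((call, i) => call.resultHash !== b[i]?.resultHash || call.errorClass !== b[i]?.errorClass)) {
    return "tool-result-shape";
  }
  return pi.finalText === codex.finalText ? "none" : "text-only";
}

// Usage: identical tool traffic, different final wording.
const piCell: CellView = {
  runtime: "pi",
  toolCalls: [{ tool: "read", argsHash: "a1", resultHash: "r1" }],
  finalText: "Done.",
};
const codexCell: CellView = { ...piCell, runtime: "codex", finalText: "All done." };
console.log(classifyDrift(piCell, codexCell)); // prints "text-only"
```

Ordering the checks from most to least severe keeps the result a single category per scenario, which is what the `drift` field expects; `driftDetails` could carry the secondary differences.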

### Tests

- Unit tests for the drift classifier (six fixtures, one per drift category).
- Integration test: run two existing agentic scenarios through `runRuntimeParityScenario` against mock-openai, assert the report shape is correct.
- Snapshot test for the new Markdown rollup section.
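
For the snapshot test, the rollup renderer needs deterministic output. A rough sketch of one way to get that is below; `renderRuntimeDriftRollup`, `DriftRow`, and the two-column layout are illustrative assumptions, not the existing `agentic-parity-report.ts` API.

```ts
type DriftCategory =
  | "none"
  | "text-only"
  | "tool-call-shape"
  | "tool-result-shape"
  | "structural"
  | "failure-mode";

// Hypothetical row shape: one entry per scenario result.
type DriftRow = { scenarioId: string; drift: DriftCategory };

// A fixed category order keeps the rendered table stable across runs.
const DRIFT_ORDER: DriftCategory[] = [
  "none",
  "text-only",
  "tool-call-shape",
  "tool-result-shape",
  "structural",
  "failure-mode",
];

function renderRuntimeDriftRollup(rows: DriftRow[]): string {
  // Count scenarios per category, starting every category at zero.
  const counts = new Map<DriftCategory, number>();
  for (const category of DRIFT_ORDER) counts.set(category, 0);
  for (const row of rows) counts.set(row.drift, (counts.get(row.drift) ?? 0) + 1);

  const lines = ["| Drift category | Scenarios |", "| --- | ---: |"];
  for (const category of DRIFT_ORDER) {
    lines.push(`| \`${category}\` | ${counts.get(category) ?? 0} |`);
  }
  return lines.join("\n");
}
```

Always emitting all six rows, including zero counts, means a snapshot changes only when a scenario actually moves between categories.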

## Acceptance criteria

- [ ] `pnpm openclaw qa suite --provider-mode mock-openai --parity-pack agentic --runtime-pair pi,codex` runs each existing agentic scenario twice (once per runtime) and emits a summary with a `runtime` field per cell.
- [ ] The drift classifier emits one of `{none, text-only, tool-call-shape, tool-result-shape, structural, failure-mode}` per scenario.
- [ ] New `qa parity-report` mode `--runtime-axis` produces a side-by-side Markdown table.
- [ ] `OPENCLAW_QA_FORCE_RUNTIME` env var, gated to `OPENCLAW_BUILD_PRIVATE_QA=1`, is documented as test-only and has a unit test asserting it does NOT fire when the gate is unset.
- [ ] CI wiring runs the runtime-pair on existing scenarios in mock-openai mode.
- [ ] All existing parity tests still green; no behavior change for non-QA users.
- [ ] `pnpm exec oxlint --type-aware` and `pnpm check:test-types` clean.
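
The gating criterion for `OPENCLAW_QA_FORCE_RUNTIME` can be sketched as a small pure function, which also makes the "does NOT fire when the gate is unset" test straightforward. This is an illustration under assumptions: `readForcedQaRuntime` is a hypothetical helper name, and passing the environment in as a parameter (rather than reading `process.env` directly) is a choice made here for testability, not something the issue prescribes.

```ts
type ForcedRuntime = "pi" | "codex";
type EnvLike = Record<string, string | undefined>;

/**
 * Test-only seam (sketch). Returns the forced runtime only when the private-QA
 * build gate is set to "1" and the override holds a recognised value.
 */
function readForcedQaRuntime(env: EnvLike): ForcedRuntime | undefined {
  // Gate first: without OPENCLAW_BUILD_PRIVATE_QA=1 the override is never read.
  if (env["OPENCLAW_BUILD_PRIVATE_QA"] !== "1") return undefined;
  const forced = env["OPENCLAW_QA_FORCE_RUNTIME"];
  // Unknown values are ignored rather than coerced, so a typo cannot pick a runtime.
  return forced === "pi" || forced === "codex" ? forced : undefined;
}

// Usage: the override is honoured only when the gate is set.
console.log(readForcedQaRuntime({ OPENCLAW_QA_FORCE_RUNTIME: "codex" })); // prints undefined
console.log(
  readForcedQaRuntime({ OPENCLAW_BUILD_PRIVATE_QA: "1", OPENCLAW_QA_FORCE_RUNTIME: "codex" }),
); // prints "codex"
```

In this sketch, `resolveModelRuntimePolicy` would call the helper first and fall through to its normal resolution when it returns `undefined`.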

## Out of scope (subsequent phases)

- Per-tool fixture set → #PhaseTwoIssue
- Codex-plugin lifecycle scenarios → #PhaseThreeIssue
- Token efficiency reporting (this phase captures `usage` in the per-cell shape, but the side-by-side report lands in Phase 4)
- JSONL replay → #PhaseFiveIssue

## References

- Tracking parent: #80171
- Existing model-axis sibling: #74290 → #79347
- Bug cluster motivating: #78055, #78060, #78407, #78499
- Existing parity primitives to reuse: `extensions/qa-lab/src/agentic-parity.ts`, `extensions/qa-lab/src/agentic-parity-report.ts`, `extensions/qa-lab/src/character-eval.ts`

