Tracking parent: #80171
Goal
Add a runtime axis (pi vs codex) to the existing qa-lab parity machinery so the same scenario, same model ref can be run under both runtimes and the outputs diffed. This is the unblocker for the Codex-vs-Pi parity harness — every other phase composes on top.
Scope
Reuse current scenarios (no new fixtures yet — Phase 2 covers per-tool fixtures). Reuse current parity-report plumbing where possible. The deliverable is the orchestrator + drift classifier + report extension + CI wiring.
Concrete deliverables
Code
- New extensions/qa-lab/src/runtime-parity.ts: orchestrator. Public API:

      export type RuntimeId = "pi" | "codex";

      export type RuntimeParityCell = {
        runtime: RuntimeId;
        transcriptBytes: string;
        toolCalls: Array<{ tool: string; argsHash: string; resultHash: string; errorClass?: string }>;
        finalText: string;
        usage: {
          inputTokens: number;
          outputTokens: number;
          totalTokens: number;
          cacheRead?: number;
          cacheWrite?: number;
        };
        wallClockMs: number;
        transportErrorClass?: string;
        runtimeErrorClass?: string;
        bootStateLines: string[];
      };

      export type RuntimeParityResult = {
        scenarioId: string;
        cells: { pi: RuntimeParityCell; codex: RuntimeParityCell };
        drift: "none" | "text-only" | "tool-call-shape" | "tool-result-shape" | "structural" | "failure-mode";
        driftDetails?: string;
      };

      export function runRuntimeParityScenario(params: {...}): Promise<RuntimeParityResult>;

- New extensions/qa-lab/src/runtime-parity.test.ts: unit tests for the drift classifier (each of the six drift categories has a fixture transcript pair).
- Extend src/agents/model-runtime-policy.ts: add an OPENCLAW_QA_FORCE_RUNTIME=pi|codex env-var override at the top of resolveModelRuntimePolicy. Gate the read on process.env.OPENCLAW_BUILD_PRIVATE_QA === "1" so the seam cannot fire in production builds. Document it as test-only in JSDoc.
- Extend extensions/qa-lab/src/agentic-parity-report.ts: add a runtime field to the per-cell summary, a runtimeDrift rollup section, and a Markdown rendering for the drift category table.
- Extend extensions/qa-lab/src/cli.ts: new qa suite --runtime-pair pi,codex flag. When set, the suite runner runs each scenario twice and feeds both cells through runRuntimeParityScenario.
- Extend .github/workflows/openclaw-release-checks.yml: new step qa_lab_runtime_parity_release_checks, parallel to the existing qa_lab_parity_lane_release_checks. Same gating (OPENCLAW_BUILD_PRIVATE_QA=1).
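
For illustration, the gated override described for model-runtime-policy.ts could be factored into a small pure helper. This is only a sketch: the helper name readForcedQaRuntime is hypothetical, the choice to silently ignore unrecognized values is an assumption, and how the result is consumed inside resolveModelRuntimePolicy is not shown because that function's signature is not given in this issue.

```typescript
type RuntimeId = "pi" | "codex";

/**
 * Test-only seam (hypothetical helper, not an existing API).
 * Returns the forced runtime, or undefined when the private-QA gate is off
 * or the override value is not one of the two known runtimes.
 */
function readForcedQaRuntime(
  env: Record<string, string | undefined>,
): RuntimeId | undefined {
  // Gate first: unless OPENCLAW_BUILD_PRIVATE_QA is exactly "1",
  // the override variable is never consulted.
  if (env["OPENCLAW_BUILD_PRIVATE_QA"] !== "1") {
    return undefined;
  }
  const raw = env["OPENCLAW_QA_FORCE_RUNTIME"];
  // Assumption: unknown values are ignored rather than treated as errors.
  return raw === "pi" || raw === "codex" ? raw : undefined;
}
```

Taking the environment as a parameter (rather than reading process.env inside the helper) keeps the "does NOT fire when the gate is unset" case trivial to unit test; a caller in Node would presumably pass process.env.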
Tests
- Unit tests for the drift classifier (six fixtures, one per drift category).
- Integration test: run two existing agentic scenarios through runRuntimeParityScenario against mock-openai and assert the report shape is correct.
- Snapshot test for the new Markdown rollup section.
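
To make the six-fixture classifier tests concrete, here is one possible shape for the drift classifier. The issue names the six categories but does not define them, so both the meaning given to each category and the precedence order below are assumptions for illustration. The function name classifyDrift and the CellView type (a subset of RuntimeParityCell) are hypothetical.

```typescript
type ToolCall = { tool: string; argsHash: string; resultHash: string; errorClass?: string };

// Subset of RuntimeParityCell that this sketch inspects (hypothetical type).
type CellView = {
  toolCalls: ToolCall[];
  finalText: string;
  transportErrorClass?: string;
  runtimeErrorClass?: string;
};

type Drift =
  | "none"
  | "text-only"
  | "tool-call-shape"
  | "tool-result-shape"
  | "structural"
  | "failure-mode";

// Assumed precedence: the first category that matches wins.
function classifyDrift(pi: CellView, codex: CellView): Drift {
  // Assumed: the runtimes failed differently (or only one of them failed).
  if (
    pi.transportErrorClass !== codex.transportErrorClass ||
    pi.runtimeErrorClass !== codex.runtimeErrorClass
  ) {
    return "failure-mode";
  }
  // Assumed: a different number of tool calls, or different tools in some position.
  if (
    pi.toolCalls.length !== codex.toolCalls.length ||
    pi.toolCalls.some((call, i) => call.tool !== codex.toolCalls[i]?.tool)
  ) {
    return "structural";
  }
  // Assumed: same tool sequence, but at least one call had different arguments.
  if (pi.toolCalls.some((call, i) => call.argsHash !== codex.toolCalls[i]?.argsHash)) {
    return "tool-call-shape";
  }
  // Assumed: same calls, but a result hash or tool error class differs.
  if (
    pi.toolCalls.some(
      (call, i) =>
        call.resultHash !== codex.toolCalls[i]?.resultHash ||
        call.errorClass !== codex.toolCalls[i]?.errorClass,
    )
  ) {
    return "tool-result-shape";
  }
  // Tool traffic matches; only the final text can still differ.
  return pi.finalText === codex.finalText ? "none" : "text-only";
}

// Example fixture pair: identical tool traffic, different wording in the final text.
const baseline: CellView = {
  toolCalls: [{ tool: "read", argsHash: "a1", resultHash: "r1" }],
  finalText: "done",
};
const exampleDrift = classifyDrift(baseline, { ...baseline, finalText: "Done." }); // "text-only"
```

A fixture pair for each of the other five categories can be produced the same way, by mutating exactly one field of the baseline cell.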
Acceptance criteria
- pnpm openclaw qa suite --provider-mode mock-openai --parity-pack agentic --runtime-pair pi,codex runs each existing agentic scenario twice (once per runtime) and emits a summary with a runtime field per cell.
- Drift is classified as one of {none, text-only, tool-call-shape, tool-result-shape, structural, failure-mode} per scenario.
- qa parity-report mode --runtime-axis produces a side-by-side Markdown table.
- OPENCLAW_QA_FORCE_RUNTIME env var, gated to OPENCLAW_BUILD_PRIVATE_QA=1, is documented as test-only and has a unit test asserting it does NOT fire when the gate is unset.
- pnpm exec oxlint --type-aware and pnpm check:test-types are clean.
Out of scope (subsequent phases)
- Per-tool fixture set → #PhaseTwoIssue
- Codex-plugin lifecycle scenarios → #PhaseThreeIssue
- Token efficiency reporting (this PR captures usage in the per-cell shape, but the side-by-side report is in Phase 4)
- JSONL replay → #PhaseFiveIssue
References
- extensions/qa-lab/src/agentic-parity.ts
- extensions/qa-lab/src/agentic-parity-report.ts
- extensions/qa-lab/src/character-eval.ts