Codex-vs-Pi runtime parity QA harness (RFC + tracking) #80171

@100yenadmin

Context

Per yesterday's maintainer thread between @pash, @Eva-βš‘πŸ‘, and @ai-hpc: OpenClaw is moving to Codex as the default runtime for OpenAI agent turns. The Pi-built tool surface, doctor migrations, plugin install/version flows, and auth-profile selection have a known regression class when the runtime axis is flipped; recent issues #78055, #78060, #78407, and #78499 all cluster around this surface.

The maintainer ask:

  • @pash: stability + low-hanging optimisations ahead of the announcement; codex-plugin install/update ergonomics; version-pinning regression coverage; "fails clearly, remediation steps clear, ux is good"; token efficiency report.
  • @Eva-βš‘πŸ‘: full parity QA pi-vs-codex; loop 3 agents on difficult scenarios from real jsonl session history; all tools and long runs to 100% parity; debug logging on each cell.
  • @ai-hpc: already manually verified the 4-cell doctor-migration matrix on current main; needs to be codified as a harness that can't regress.

The existing model-axis parity gate (introduced in #74290, folded into release validation by #74622, baseline bump in flight at #79347) compares gpt-5.5 vs claude-opus-4-7: same runtime, two different models. The new harness is orthogonal: same model, two different runtimes.

This RFC sketches the architecture for the runtime-parity harness so implementation can be split into reviewable sub-issues. It builds on (and replaces) the proposal sketched in extensions/qa-lab/transport-parity-gate.md from closed PR #78512.

Architecture

The matrix

scenarios × runtimes × plugin-states × auth-shapes × provider-mode

Axis          | Values                                                                       | Purpose
--------------|------------------------------------------------------------------------------|--------
scenarios     | per-tool fixtures + jsonl-replay scenarios + existing agentic-parity scenarios | What the agent is asked to do
runtimes      | pi, codex                                                                    | The "primary subject" of the comparison: same model, forced runtime
plugin-states | codex-missing, codex-pinned-old, codex-current, codex-head                   | Stress codex-as-plugin lifecycle
auth-shapes   | oauth-only, apikey-only, mixed-profiles                                      | Catches auth-selection bugs (#78499 class)
provider-mode | mock-openai (hermetic, default), live-frontier (real, gated)                 | Cost/speed vs realism trade-off

The full Cartesian product is large, so we run a small hermetic subset on every PR (mock-openai × codex-current × oauth-only across the per-tool fixtures) and the full live matrix on a schedule (release-checks workflow, gated behind OPENCLAW_BUILD_PRIVATE_QA=1).
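To make the matrix size concrete, here is a minimal sketch of how the cells could be enumerated and how the on-PR subset could be selected. The axis values come from the table above; the type and function names (`MatrixCell`, `fullMatrix`, `hermeticPrSubset`) are illustrative assumptions, not existing qa-lab exports.

```typescript
// Axis values are copied from the matrix table; all names here are hypothetical.
const RUNTIMES = ["pi", "codex"] as const;
const PLUGIN_STATES = ["codex-missing", "codex-pinned-old", "codex-current", "codex-head"] as const;
const AUTH_SHAPES = ["oauth-only", "apikey-only", "mixed-profiles"] as const;
const PROVIDER_MODES = ["mock-openai", "live-frontier"] as const;

interface MatrixCell {
  scenario: string;
  runtime: (typeof RUNTIMES)[number];
  pluginState: (typeof PLUGIN_STATES)[number];
  authShape: (typeof AUTH_SHAPES)[number];
  providerMode: (typeof PROVIDER_MODES)[number];
}

// Full Cartesian product: every scenario crossed with every axis value.
function fullMatrix(scenarios: readonly string[]): MatrixCell[] {
  const cells: MatrixCell[] = [];
  for (const scenario of scenarios) {
    for (const runtime of RUNTIMES) {
      for (const pluginState of PLUGIN_STATES) {
        for (const authShape of AUTH_SHAPES) {
          for (const providerMode of PROVIDER_MODES) {
            cells.push({ scenario, runtime, pluginState, authShape, providerMode });
          }
        }
      }
    }
  }
  return cells;
}

// On-PR subset: hermetic provider, current codex plugin, oauth-only auth.
// Both runtimes are kept, since the runtime pair is the subject of the comparison.
function hermeticPrSubset(cells: readonly MatrixCell[]): MatrixCell[] {
  return cells.filter(
    (c) =>
      c.providerMode === "mock-openai" &&
      c.pluginState === "codex-current" &&
      c.authShape === "oauth-only",
  );
}

const demoCells = fullMatrix(["bash-list-files", "exec-approval-loop"]);
const demoSubset = hermeticPrSubset(demoCells);
```

With these axis values each scenario expands to 2 × 4 × 3 × 2 = 48 cells in the full matrix, while the on-PR subset keeps 2 cells per scenario (one per runtime).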

Per-cell capture

For every cell of the matrix, emit:

  • transcript-bytes: full JSONL of the turn chain (already produced by qa-lab; just needs runtime tagging).
  • tool-calls[]: ordered list of { tool_name, args_hash, result_hash, error_class? }.
  • final-text: assistant final answer text, normalized for whitespace.
  • usage: { input_tokens, output_tokens, total_tokens, cache_read?, cache_write? }. Aggregate per-turn and per-scenario.
  • wall-clock-ms, transport-error-class?, runtime-error-class?.
  • boot-state: gateway.err.log lines containing FailoverError, No API key found, Codex app-server, etc.
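The capture fields above could be typed roughly as follows. This is a sketch under assumptions: the field names mirror the list, but the interface names, the camelCase spellings, and the three helpers are hypothetical and do not describe the existing qa-lab types.

```typescript
// Hypothetical shapes mirroring the per-cell capture list; not the real qa-lab types.
interface ToolCallRecord {
  tool_name: string;
  args_hash: string;
  result_hash: string;
  error_class?: string;
}

interface Usage {
  input_tokens: number;
  output_tokens: number;
  total_tokens: number;
  cache_read?: number;
  cache_write?: number;
}

interface CellCapture {
  runtime: "pi" | "codex";
  transcriptJsonl: string;
  toolCalls: ToolCallRecord[];
  finalText: string;
  usage: Usage; // per-scenario aggregate
  wallClockMs: number;
  transportErrorClass?: string;
  runtimeErrorClass?: string;
  bootState: string[];
}

// Collapse every whitespace run to one space so formatting noise is not reported as drift.
function normalizeFinalText(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Sum per-turn usage into a per-scenario aggregate; cache fields stay absent if never reported.
function aggregateUsage(perTurn: readonly Usage[]): Usage {
  const total: Usage = { input_tokens: 0, output_tokens: 0, total_tokens: 0 };
  for (const u of perTurn) {
    total.input_tokens += u.input_tokens;
    total.output_tokens += u.output_tokens;
    total.total_tokens += u.total_tokens;
    if (u.cache_read !== undefined) total.cache_read = (total.cache_read ?? 0) + u.cache_read;
    if (u.cache_write !== undefined) total.cache_write = (total.cache_write ?? 0) + u.cache_write;
  }
  return total;
}

// Keep only the log lines that contain one of the boot-state markers named in the list.
const BOOT_MARKERS = ["FailoverError", "No API key found", "Codex app-server"];
function extractBootState(logText: string): string[] {
  return logText.split("\n").filter((line) => BOOT_MARKERS.some((m) => line.includes(m)));
}

const demoCapture: CellCapture = {
  runtime: "codex",
  transcriptJsonl: "",
  toolCalls: [],
  finalText: normalizeFinalText("  done \n"),
  usage: aggregateUsage([]),
  wallClockMs: 0,
  bootState: extractBootState(""),
};
```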

Drift classifier

When transcripts differ between the pi and codex cells of the same scenario, classify:

  • text-only: final answers differ in wording but mean the same thing (allowed within model-eval tolerance, same rubric the existing agentic-parity-report.test.ts uses).
  • tool-call-shape: different tools called, different arg shapes, different ordering.
  • tool-result-shape: same tool called but result is interpreted differently.
  • structural: different turn count, different phase structure, missing/extra final answer.
  • failure-mode: one cell errors, the other doesn't.

The harness reports drift category per scenario, not just pass/fail. This is what makes it actionable for "lots of tools break under codex": you see exactly which tool family drifts.
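One way the classifier could be structured is as a chain of checks ordered from most to least severe. The sketch below makes several assumptions that the RFC does not fix: the `CellSummary` shape is hypothetical, phase structure is not modelled, a differing `result_hash` stands in for "result interpreted differently", and any textual difference is reported as text-only (whether it is within tolerance is left to the existing rubric).

```typescript
type DriftCategory =
  | "none"
  | "text-only"
  | "tool-call-shape"
  | "tool-result-shape"
  | "structural"
  | "failure-mode";

interface ToolCall {
  tool_name: string;
  args_hash: string;
  result_hash: string;
}

// Hypothetical per-cell summary; the real capture carries more fields.
interface CellSummary {
  errorClass?: string;
  turnCount: number;
  hasFinalAnswer: boolean;
  toolCalls: ToolCall[];
  finalText: string; // already whitespace-normalized
}

// Assumed precedence: failure-mode > structural > tool-call-shape > tool-result-shape > text-only.
function classifyDrift(pi: CellSummary, codex: CellSummary): DriftCategory {
  // Exactly one cell errored. Two cells that both error are not failure-mode drift here.
  if ((pi.errorClass === undefined) !== (codex.errorClass === undefined)) {
    return "failure-mode";
  }
  if (pi.turnCount !== codex.turnCount || pi.hasFinalAnswer !== codex.hasFinalAnswer) {
    return "structural";
  }
  // Same tools, same argument hashes, same order.
  const sameCalls =
    pi.toolCalls.length === codex.toolCalls.length &&
    pi.toolCalls.every((call, i) => {
      const other = codex.toolCalls[i];
      return (
        other !== undefined &&
        call.tool_name === other.tool_name &&
        call.args_hash === other.args_hash
      );
    });
  if (!sameCalls) return "tool-call-shape";
  const sameResults = pi.toolCalls.every(
    (call, i) => call.result_hash === codex.toolCalls[i]?.result_hash,
  );
  if (!sameResults) return "tool-result-shape";
  return pi.finalText === codex.finalText ? "none" : "text-only";
}
```

Returning the first matching category keeps the report to one label per scenario, with the most severe difference winning.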

Token-efficiency report

For live-mode runs: per-scenario, side-by-side table:

scenario              | pi tokens | codex tokens | Δ      | tools used
----------------------|-----------|--------------|--------|----------
bash-list-files       |   1,240   |    1,180     | -4.8%  | bash
exec-approval-loop    |   3,840   |    4,210     | +9.6%  | exec, message-tool
web-search-then-fetch |   2,100   |    1,950     | -7.1%  | web_search, web_fetch
...
TOTAL                 |   N       |    M         | ±x%    | -

Plus per-runtime aggregates (total, p50, p90 per turn) and a flag when delta >15% so model-cost regressions surface as PR-blockers.
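The arithmetic behind the table and the 15% flag can be sketched as below. The function names are hypothetical. Two details are assumptions rather than decisions made in this RFC: the delta is measured relative to the pi cell, and the threshold is applied to the absolute value so that large swings in either direction are flagged. The percentile uses the nearest-rank method.

```typescript
// Signed delta of codex relative to pi, in percent (assumption: pi is the baseline).
function deltaPercent(piTokens: number, codexTokens: number): number {
  return ((codexTokens - piTokens) / piTokens) * 100;
}

// Render with an explicit sign and one decimal, matching the table's Δ column.
function formatDelta(delta: number): string {
  return `${delta >= 0 ? "+" : ""}${delta.toFixed(1)}%`;
}

// Assumption: the threshold applies to the absolute delta.
function exceedsThreshold(delta: number, thresholdPercent = 15): boolean {
  return Math.abs(delta) > thresholdPercent;
}

// Nearest-rank percentile over per-turn token counts.
function percentile(values: readonly number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of an empty list is undefined");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1]!;
}

// The illustrative figures from the table above (not measurements).
const rows = [
  { scenario: "bash-list-files", pi: 1240, codex: 1180 },
  { scenario: "exec-approval-loop", pi: 3840, codex: 4210 },
  { scenario: "web-search-then-fetch", pi: 2100, codex: 1950 },
];
const formatted = rows.map((r) => formatDelta(deltaPercent(r.pi, r.codex)));
```

Applied to the three example rows this reproduces the Δ column (-4.8%, +9.6%, -7.1%), and none of them crosses the 15% threshold.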

Components: file-level layout

New / extended file | Purpose | Phase
--------------------|---------|------
extensions/qa-lab/src/runtime-parity.ts (new) | Orchestrator: takes a scenario, runs it twice with pi and codex forced, returns per-cell capture | 1
extensions/qa-lab/src/runtime-parity.test.ts (new) | Unit tests for orchestrator + drift classifier | 1
src/agents/model-runtime-policy.ts (extend) | Add an OPENCLAW_QA_FORCE_RUNTIME env-var seam (test-only) so the harness can override agentRuntime.id resolution without mutating user config. Document as test-only in the export's JSDoc. | 1
extensions/qa-lab/src/agentic-parity-report.ts (extend) | Add runtime field to per-cell summary, runtimeDrift rollup section | 1
extensions/qa-lab/src/cli.ts (extend) | New qa suite --runtime-pair pi,codex flag, propagates to suite runner | 1
qa/scenarios/runtime/tools/<tool>.md (new) | One scenario per tool family (see the Phase 2 list below) | 2
extensions/qa-lab/src/codex-plugin-fixture.ts (new) | Helpers to seed ~/.openclaw/npm/node_modules/@openclaw/codex to a known version (or absent) before a cell | 3
extensions/qa-lab/src/codex-plugin-lifecycle.test.ts (new) | Asserts doctor + first-turn flow under each plugin-state | 3
extensions/qa-lab/src/token-efficiency-report.ts (new) | Side-by-side token report; integrates into qa parity-report | 4
extensions/qa-lab/src/jsonl-replay.ts (new) | Replays real captured session transcripts through both runtimes | 5
.github/workflows/openclaw-release-checks.yml (extend) | Wire the runtime-pair lane into the same matrix that already runs the model-pair lane | 1

Phasing: five PRs, staged

Sub-issues filed:

Phase 1: Runtime axis (smallest, lands first)

Scope: add the runtime dimension to the existing parity machinery. Reuse current scenarios; do not add new fixtures yet.

Files: runtime-parity.ts, model-runtime-policy.ts extension, agentic-parity-report.ts extension, cli.ts flag, workflow wiring, tests.

Acceptance:

  • pnpm openclaw qa suite --provider-mode mock-openai --parity-pack agentic --runtime-pair pi,codex runs each existing agentic scenario twice (once per runtime) and produces a summary with a runtime field per cell.
  • The drift classifier is implemented and emits one of {none, text-only, tool-call-shape, tool-result-shape, structural, failure-mode} per scenario.
  • New qa parity-report mode --runtime-axis produces a side-by-side table.
  • OPENCLAW_QA_FORCE_RUNTIME=pi|codex env var, set at policy resolution time, is documented as test-only and gated to OPENCLAW_BUILD_PRIVATE_QA=1.
  • CI wiring: a new step in .github/workflows/openclaw-release-checks.yml (folded into the same matrix as the parity lane from #74622, "ci: fold parity into QA release validation") running the runtime-pair on the existing scenarios.
  • All existing parity tests still green; no behavior change for non-QA users.
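The gated env-var seam from the acceptance list could look roughly like this. It is a sketch only: the function names are hypothetical, the environment is passed in as a plain record to keep the example self-contained, and throwing on an unrecognised value (rather than ignoring it) is an assumption about the desired behaviour.

```typescript
type RuntimeId = "pi" | "codex";
type Env = Readonly<Record<string, string | undefined>>;

// Returns the forced runtime, or undefined when the seam is inactive.
// The override is honoured only when the private-QA gate is also set.
function resolveForcedRuntime(env: Env): RuntimeId | undefined {
  if (env["OPENCLAW_BUILD_PRIVATE_QA"] !== "1") return undefined;
  const raw = env["OPENCLAW_QA_FORCE_RUNTIME"];
  if (raw === undefined || raw === "") return undefined;
  if (raw === "pi" || raw === "codex") return raw;
  // Assumption: a typo should fail loudly instead of silently testing the wrong runtime.
  throw new Error(`OPENCLAW_QA_FORCE_RUNTIME must be "pi" or "codex", got "${raw}"`);
}

// Hypothetical call site: the configured runtime wins unless the gated override is present.
function resolveRuntime(configured: RuntimeId, env: Env): RuntimeId {
  return resolveForcedRuntime(env) ?? configured;
}
```

Because the override is ignored unless the gate variable is set, a stray OPENCLAW_QA_FORCE_RUNTIME in a user's shell has no effect, which is the "no behavior change for non-QA users" property the acceptance list asks for.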

Phase 2: Per-tool fixture set

Scope: one fixture per tool family. Each fixture is deterministic: prompts the agent in a way that forces exactly one tool call with predictable arguments. Asserts the tool was invoked, completed, and result shape matches between runtimes.

Tool families (from src/agents/pi-tools.create-openclaw-coding-tools.ts and the codex harness contract; exact list to be confirmed in the PR):

  • bash
  • exec (approval flow)
  • fs.read, fs.write, fs.list
  • grep
  • edit / apply-patch
  • web_search, web_fetch
  • tavily_search, tavily_extract
  • image_generate
  • tts
  • message-tool (send + media variants)
  • session_status, sessions_spawn
  • memory.recall, memory.add (if pi-only, mark as expected drift)
  • skill_* invocations

Acceptance:

  • Each tool has a qa/scenarios/runtime/tools/<tool>.md fixture.
  • Each fixture passes both cells when run under --runtime-pair pi,codex against current main, OR is annotated with a known-broken marker that points at a tracking issue.
  • The runtime-parity report enumerates per-tool drift, not just per-scenario.
  • A pnpm openclaw qa tool-coverage --runtime-pair pi,codex command produces a Markdown table of "tool X: pi=✅ codex=❌ #issue" for the README of the harness.
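The Markdown output of the tool-coverage command might be rendered along these lines. The row shape and the function name are hypothetical, and the sample rows use placeholder tool names and a placeholder issue reference rather than real results.

```typescript
// Hypothetical row shape; `issue` is the tracking reference for a known-broken cell.
interface ToolCoverageRow {
  tool: string;
  pi: boolean;
  codex: boolean;
  issue?: string;
}

function renderToolCoverage(rows: readonly ToolCoverageRow[]): string {
  const mark = (ok: boolean): string => (ok ? "✅" : "❌");
  const lines = ["| tool | pi | codex | tracking issue |", "| --- | --- | --- | --- |"];
  for (const row of rows) {
    lines.push(`| ${row.tool} | ${mark(row.pi)} | ${mark(row.codex)} | ${row.issue ?? "-"} |`);
  }
  return lines.join("\n");
}

// Placeholder data for illustration only.
const coverageTable = renderToolCoverage([
  { tool: "tool-a", pi: true, codex: true },
  { tool: "tool-b", pi: true, codex: false, issue: "#issue" },
]);
```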

Phase 3: Codex-plugin lifecycle harness

Scope: stress the codex-plugin install / update / version-pinning flows that @pash flagged.

Cells (from the bug clusters and ai-hpc's manual matrix):

  1. Cold install: clean home, no codex plugin → openclaw doctor --fix from a config that needs codex → assert remediation message clear, install completes, retry succeeds, no cost leakage ($) to the api-key path.
  2. OAuth-only with mixed-profiles: both openai-codex:* and openai:* profiles in auth-profiles.json → assert codex auth picked, not the api-key (the #78499 case, "[Bug]: doctor --fix rewrites Codex runtime model refs to openai/* and breaks Codex auth profile selection").
  3. Pinned-old codex plugin + new openclaw: codex plugin pinned to release N-1, openclaw on N → assert version mismatch detected and reported with a clear remediation hint.
  4. Pinned-new codex plugin + old openclaw: same axis flipped.
  5. Codex plugin install racing first agent turn: concurrent install + agent run → assert ordering doesn't lose tokens or produce a duplicate response.
  6. Doctor migration safety: codify @ai-hpc's four manual cells as automated checks (oauth-only, mixed-profile, mixed + defaults pin, mixed + per-agent pin) → assert doctor --fix strips pins and codex auto-routes.

Acceptance:

  • Each cell is automated, runs in mock-openai mode, and completes in under 60 s.
  • Failure modes have asserted error messages (string match) so any wording regression is caught.
  • Live-mode variant gated to scheduled runs.
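As an illustration of cells 3 and 4 and of the string-match acceptance criterion, the sketch below checks a plugin release against the host release and returns a message a test can match on. Everything here is hypothetical: the function name, the integer release numbers, the rule that any release mismatch is reported, and the message wording are assumptions, not the current doctor output.

```typescript
interface CompatResult {
  ok: boolean;
  message: string;
}

// Hypothetical check: releases are modelled as plain integers (N, N-1, ...).
function checkCodexPluginRelease(
  hostRelease: number,
  pluginRelease: number | undefined,
): CompatResult {
  if (pluginRelease === undefined) {
    return {
      ok: false,
      message: "codex plugin is not installed; run `openclaw doctor --fix` to install it",
    };
  }
  if (pluginRelease === hostRelease) {
    return { ok: true, message: "codex plugin matches the openclaw release" };
  }
  // Name the side that is behind so the remediation hint is actionable.
  const behind = pluginRelease < hostRelease ? "the codex plugin" : "openclaw";
  return {
    ok: false,
    message:
      `version mismatch: codex plugin is on release ${pluginRelease}, ` +
      `openclaw is on release ${hostRelease}; update ${behind} so both are on the same release`,
  };
}

// Cell 3 (plugin pinned to N-1, openclaw on N) and cell 4 (axis flipped), with N = 10.
const pinnedOld = checkCodexPluginRelease(10, 9);
const pinnedNew = checkCodexPluginRelease(9, 10);
```

Matching on the exact remediation text is what makes a wording regression fail the test instead of slipping through.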

Phase 4: Token-efficiency report

Scope: capture and surface per-runtime token usage. Live mode only (mock-openai returns fixed counts so deltas there are meaningless).

Acceptance:

  • qa parity-report --runtime-axis --token-efficiency produces the side-by-side table described above.
  • Per-runtime aggregates: total, p50, p90, per-turn.
  • Flag when scenario-level delta >15%.
  • Stored as a release artifact for week-over-week tracking.

Phase 5: JSONL replay (lower priority, separate track)

Scope: Eva's "loop 3 agents on difficult scenarios from real jsonl session history."

Approach: take captured session transcripts (from a maintainer-supplied jsonl set, stripped of PII), extract user turns, replay through fresh sessions on each runtime. Diff trajectories.

Acceptance: harness accepts a directory of jsonl, runs each through --runtime-pair, produces a drift report with the same drift classifier from Phase 1. PR is gated behind a curated fixture set so it can land without a real-customer transcript dump.
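The "extract user turns" step could be sketched as follows. The transcript schema is an assumption: each JSONL line is taken to be an object with string `role` and `content` fields, which may not match the real session format. Lines that are blank, malformed, or shaped differently are skipped here; a real harness might prefer to report them.

```typescript
// Assumed entry shape; the real session JSONL schema may differ.
interface TranscriptEntry {
  role: string;
  content: string;
}

function isTranscriptEntry(value: unknown): value is TranscriptEntry {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record["role"] === "string" && typeof record["content"] === "string";
}

// Returns the user-turn texts in their original order, ready to replay on each runtime.
function extractUserTurns(jsonl: string): string[] {
  const turns: string[] = [];
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(trimmed);
    } catch {
      continue; // skip malformed lines in this sketch
    }
    if (isTranscriptEntry(parsed) && parsed.role === "user") {
      turns.push(parsed.content);
    }
  }
  return turns;
}

const sampleJsonl = [
  JSON.stringify({ role: "user", content: "list the files" }),
  JSON.stringify({ role: "assistant", content: "Here they are." }),
  "not json",
  JSON.stringify({ role: "user", content: "now delete tmp/" }),
].join("\n");
const userTurns = extractUserTurns(sampleJsonl);
```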

Performance / cost budget

  • Hermetic on-PR runs (mock-openai, single auth-shape, codex-current only): target <5 min total for all scenarios across both runtimes. Parallelizable per scenario.
  • Full live release-checks lane: target <30 min with parallelism, gated behind OPENCLAW_BUILD_PRIVATE_QA=1.
  • Token-efficiency live runs: separate scheduled cron, not on every release; nightly is fine.

Out of scope

Failure-mode taxonomy (for triage)

When the harness reports drift, the triage flow is:

  1. failure-mode drift = one runtime errors, the other doesn't → blocking. Open a P1 bug.
  2. structural drift = turn count or phase structure differs → likely blocking. Investigate before merging anything that touches that code path.
  3. tool-call-shape drift = wrong/missing tool → P1-P2 depending on the tool family.
  4. tool-result-shape drift = same tool, different parsing → P2 unless it changes outcomes.
  5. text-only drift within tolerance = expected; no action.
  6. text-only drift outside tolerance = model-eval rubric escalation.
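The triage flow above can be expressed as a small lookup so that a report can print the next step beside each drift label. The names are hypothetical and the action strings simply restate the six rules; the "none" entry is an addition for the no-drift case.

```typescript
type DriftCategory =
  | "none"
  | "text-only"
  | "tool-call-shape"
  | "tool-result-shape"
  | "structural"
  | "failure-mode";

// One entry per category; text-only is split on tolerance in triageAction below.
const TRIAGE_ACTIONS: Record<Exclude<DriftCategory, "text-only">, string> = {
  "none": "no drift; no action",
  "failure-mode": "blocking; open a P1 bug",
  "structural": "likely blocking; investigate before merging anything that touches that code path",
  "tool-call-shape": "P1-P2 depending on the tool family",
  "tool-result-shape": "P2 unless it changes outcomes",
};

// `withinTolerance` only matters for text-only drift and comes from the model-eval rubric.
function triageAction(category: DriftCategory, withinTolerance: boolean): string {
  if (category === "text-only") {
    return withinTolerance ? "expected; no action" : "model-eval rubric escalation";
  }
  return TRIAGE_ACTIONS[category];
}
```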

References

Handoff notes for the implementing agent

  • Read extensions/qa-lab/AGENTS.md and the scoped extensions/qa-lab/src/CLAUDE.md (if present) before touching code.
  • The OPENCLAW_QA_FORCE_RUNTIME seam is the only runtime-mutation surface added by Phase 1; keep it gated and test-only. Do not let it leak into production code paths.
  • Phase 1 is the unblocker. Phases 2-4 can be parallelised once Phase 1 lands. Phase 5 is independent and lowest priority.
  • The drift classifier in Phase 1 must use the same rubric as the existing agentic-parity-report.test.ts for text-only drift to keep tolerance consistent across the two parity gates.
  • For Phase 3 cell 5 (install race), avoid timing-based assertions; use deterministic ordering primitives.
  • For Phase 4, capture usage at the assistant-message level (AssistantMessage.usage) rather than at the transport level: the transport-level shapes differ between Pi and Codex, but the assistant-message shape is normalized.
  • Sub-issues will be filed for each phase. This issue is the tracking parent.

cc @pash @Eva-βš‘πŸ‘ @ai-hpc
