# Codex-vs-Pi runtime parity QA harness (RFC + tracking)

## Context

Per yesterday's maintainer thread between @pash, @Eva, and @ai-hpc: OpenClaw is moving to Codex as the default runtime for OpenAI agent turns. The Pi-built tool surface, doctor migrations, plugin install/version flows, and auth-profile selection have a known regression class when the runtime axis is flipped; recent issues #78055, #78060, #78407, and #78499 cluster around exactly this surface.

The maintainer ask:

- @pash: stability and low-hanging optimisations ahead of the announcement; codex-plugin install/update ergonomics; version-pinning regression coverage; "fails clearly, remediation steps clear, ux is good"; token efficiency report.
- @Eva: full parity QA pi-vs-codex; loop 3 agents on difficult scenarios from real jsonl session history; all tools and long runs to 100% parity; debug logging on each cell.
- @ai-hpc: already manually verified the 4-cell doctor-migration matrix on current main; needs to be codified as a harness that can't regress.

The existing model-axis parity gate (introduced in #74290, folded into release validation by #74622, baseline bump in flight at #79347) compares `gpt-5.5` vs `claude-opus-4-7`: same runtime, two different models. The new harness is orthogonal: same model, two different runtimes.

This RFC sketches the architecture for the runtime-parity harness so implementation can be split into reviewable sub-issues. It builds on (and replaces) the proposal sketched in `extensions/qa-lab/transport-parity-gate.md` from closed PR #78512.

## Architecture
### The matrix

| Axis | Values |
| --- | --- |
| Runtime | `pi`, `codex` |
| Codex plugin state | `codex-missing`, `codex-pinned-old`, `codex-current`, `codex-head` |
| Auth shape | `oauth-only`, `apikey-only`, `mixed-profiles` |
| Provider mode | `mock-openai` (hermetic, default), `live-frontier` (real, gated) |

The full Cartesian product is huge; we run a small hermetic subset on every PR (`mock-openai` × `codex-current` × `oauth-only` across the per-tool fixtures) and the full live matrix on schedule (release-checks workflow, gated behind `OPENCLAW_BUILD_PRIVATE_QA=1`).

### Per-cell capture
For every cell of the matrix, emit:

- `transcript-bytes`: full JSONL of the turn chain (already produced by qa-lab; just needs runtime tagging).
- `tool-calls[]`: ordered list of `{ tool_name, args_hash, result_hash, error_class? }`.
- `final-text`: assistant final answer text, normalized for whitespace.
- `usage`: `{ input_tokens, output_tokens, total_tokens, cache_read?, cache_write? }`. Aggregate per-turn and per-scenario.
- `wall-clock-ms`, `transport-error-class?`, `runtime-error-class?`.
- `boot-state`: `gateway.err.log` lines containing `FailoverError`, `No API key found`, `Codex app-server`, etc.

### Drift classifier
When transcripts differ between the `pi` and `codex` cells of the same scenario, classify:

- `text-only`: final answers differ in wording but mean the same thing (allowed within model-eval tolerance, same rubric the existing `agentic-parity-report.test.ts` uses).
- `tool-call-shape`: different tools called, different arg shapes, different ordering.
- `tool-result-shape`: same tool called but result is interpreted differently.
- `structural`: different turn count, different phase structure, missing/extra final answer.
- `failure-mode`: one cell errors, the other doesn't.

The harness reports a drift category per scenario, not just pass/fail. This is what makes it actionable for "lots of tools break under codex": you see exactly which tool family drifts.
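As a rough illustration of how the per-cell capture could feed the classifier, here is a minimal sketch. The `CellCapture` type, the `classifyDrift` name, the check order, and the use of `result_hash` inequality as a stand-in for "result is interpreted differently" are all assumptions for illustration; the real `text-only` decision would go through the existing rubric rather than the plain string comparison used here.

```typescript
// Hypothetical shapes, loosely following the per-cell capture fields listed above.
type ToolCall = { tool_name: string; args_hash: string; result_hash: string; error_class?: string };
type CellCapture = {
  runtime: "pi" | "codex";
  toolCalls: ToolCall[];
  finalText: string | null; // null when the cell produced no final answer
  turnCount: number;
  errorClass?: string; // set when the cell errored
};
type Drift = "none" | "text-only" | "tool-call-shape" | "tool-result-shape" | "structural" | "failure-mode";

// Collapse whitespace so formatting-only differences do not count as drift.
const normalize = (s: string | null): string => (s ?? "").replace(/\s+/g, " ").trim();

// Assumed precedence: most severe category first.
function classifyDrift(pi: CellCapture, codex: CellCapture): Drift {
  if (Boolean(pi.errorClass) !== Boolean(codex.errorClass)) return "failure-mode";
  const missingFinal = (pi.finalText === null) !== (codex.finalText === null);
  if (pi.turnCount !== codex.turnCount || missingFinal) return "structural";
  const sameCalls =
    pi.toolCalls.length === codex.toolCalls.length &&
    pi.toolCalls.every((call, i) => {
      const other = codex.toolCalls[i];
      return other !== undefined && call.tool_name === other.tool_name && call.args_hash === other.args_hash;
    });
  if (!sameCalls) return "tool-call-shape";
  const sameResults = pi.toolCalls.every((call, i) => call.result_hash === codex.toolCalls[i]?.result_hash);
  if (!sameResults) return "tool-result-shape";
  // Placeholder for the rubric-based comparison: exact match after normalization.
  return normalize(pi.finalText) === normalize(codex.finalText) ? "none" : "text-only";
}
```

Checking the most severe category first means a scenario that both errors and diverges in tool calls is reported once, as `failure-mode`.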
### Token-efficiency report

For live-mode runs, the report is a per-scenario, side-by-side table of the two runtimes, plus per-runtime aggregates (total, p50, p90 per turn) and a flag when the delta exceeds 15%, so model-cost regressions surface as PR-blockers.
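The aggregates and the 15% flag could be computed along these lines. This is a sketch under assumptions: the function names are hypothetical, percentiles use the nearest-rank method, and the delta is measured relative to the `pi` total and flagged in either direction; the RFC does not pin down any of these choices.

```typescript
// Hypothetical per-turn usage record (a subset of the `usage` capture fields).
type Usage = { input_tokens: number; output_tokens: number; total_tokens: number };

// Nearest-rank percentile over a copy of the input; returns 0 for an empty list.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)] ?? 0;
}

// Per-runtime aggregates over the turns of one scenario.
function summarize(turns: Usage[]): { total: number; p50: number; p90: number } {
  const totals = turns.map((t) => t.total_tokens);
  return {
    total: totals.reduce((acc, n) => acc + n, 0),
    p50: percentile(totals, 50),
    p90: percentile(totals, 90),
  };
}

// Relative delta of codex vs pi (assumed baseline: pi), flagged above the threshold.
function tokenDelta(piTotal: number, codexTotal: number, threshold = 0.15): { delta: number; flagged: boolean } {
  const delta = piTotal === 0 ? 0 : (codexTotal - piTotal) / piTotal;
  return { delta, flagged: Math.abs(delta) > threshold };
}
```

A report row for one scenario would then pair `summarize(piTurns)` with `summarize(codexTurns)` and attach `tokenDelta` of the two totals.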
## Components: file-level layout

| File | Status | Purpose |
| --- | --- | --- |
| `extensions/qa-lab/src/runtime-parity.ts` | new | […] `pi` and `codex` forced, returns per-cell capture |
| `extensions/qa-lab/src/runtime-parity.test.ts` | new | […] |
| `src/agents/model-runtime-policy.ts` | extend | Add an `OPENCLAW_QA_FORCE_RUNTIME` env-var seam (test-only) so the harness can override `agentRuntime.id` resolution without mutating user config. Document as test-only in the export's JSDoc. |
| `extensions/qa-lab/src/agentic-parity-report.ts` | extend | […] `runtime` field to per-cell summary, `runtimeDrift` rollup section |
| `extensions/qa-lab/src/cli.ts` | extend | […] `qa suite --runtime-pair pi,codex` flag, propagates to suite runner |
| `qa/scenarios/runtime/tools/<tool>.md` | new | […] |
| `extensions/qa-lab/src/codex-plugin-fixture.ts` | new | […] `~/.openclaw/npm/node_modules/@openclaw/codex` to a known version (or absent) before a cell |
| `extensions/qa-lab/src/codex-plugin-lifecycle.test.ts` | new | […] |
| `extensions/qa-lab/src/token-efficiency-report.ts` | new | […] `qa parity-report` |
| `extensions/qa-lab/src/jsonl-replay.ts` | new | […] |
| `.github/workflows/openclaw-release-checks.yml` | extend | […] |

## Phasing: five PRs, staged
Sub-issues filed: […]

### Phase 1: Runtime axis (smallest, lands first)

Scope: add the `runtime` dimension to the existing parity machinery. Reuse current scenarios; do not add new fixtures yet.

Files: `runtime-parity.ts`, `model-runtime-policy.ts` extension, `agentic-parity-report.ts` extension, `cli.ts` flag, workflow wiring, tests.

Acceptance:

- `pnpm openclaw qa suite --provider-mode mock-openai --parity-pack agentic --runtime-pair pi,codex` runs each existing agentic scenario twice (once per runtime) and produces a summary with a `runtime` field per cell.
- The drift classifier is implemented and emits one of `{none, text-only, tool-call-shape, tool-result-shape, structural, failure-mode}` per scenario.
- New `qa parity-report` mode `--runtime-axis` produces a side-by-side table.
- The `OPENCLAW_QA_FORCE_RUNTIME=pi|codex` env var, set at policy resolution time, is documented as test-only and gated to `OPENCLAW_BUILD_PRIVATE_QA=1`.
- CI wiring: a new step in `.github/workflows/openclaw-release-checks.yml` (folded into the same matrix as the parity lane from #74622, "ci: fold parity into QA release validation") running the runtime-pair on the existing scenarios.
- All existing parity tests still green; no behavior change for non-QA users.

### Phase 2: Per-tool fixture set
Scope: one fixture per tool family. Each fixture is deterministic: prompts the agent in a way that forces exactly one tool call with predictable arguments. Asserts the tool was invoked, completed, and result shape matches between runtimes.
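What "result shape matches" means is left open here. One possible reading, sketched below as an assumption, is to compare a key-and-type skeleton of the two results while ignoring concrete values. `shapeOf` and `sameShape` are hypothetical helpers, not existing qa-lab exports.

```typescript
// A shape is a type name, a one-element array holding the element shape, or a keyed record.
type Shape = string | Shape[] | { [key: string]: Shape };

// Reduce a JSON-like value to its structural skeleton (keys and primitive types only).
function shapeOf(value: unknown): Shape {
  if (value === null) return "null";
  if (Array.isArray(value)) {
    // Only the first element is inspected; arrays are assumed homogeneous.
    return value.length > 0 ? [shapeOf(value[0])] : [];
  }
  if (typeof value === "object") {
    const record = value as Record<string, unknown>;
    const out: { [key: string]: Shape } = {};
    // Sort keys so the serialized skeleton does not depend on insertion order.
    for (const key of Object.keys(record).sort()) {
      out[key] = shapeOf(record[key]);
    }
    return out;
  }
  return typeof value;
}

// Two results match when their skeletons serialize identically.
function sameShape(a: unknown, b: unknown): boolean {
  return JSON.stringify(shapeOf(a)) === JSON.stringify(shapeOf(b));
}
```

For example, `sameShape({ stdout: "a", exitCode: 0 }, { exitCode: 1, stdout: "b" })` holds because only keys and value types are compared, whereas a string on one side and an array on the other would count as a mismatch.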
Tool families (from `src/agents/pi-tools.create-openclaw-coding-tools.ts` and the codex harness contract; exact list to be confirmed in the PR):

- `bash`
- `exec` (approval flow)
- `fs.read`, `fs.write`, `fs.list`
- `grep`
- `edit` / `apply-patch`
- `web_search`, `web_fetch`
- `tavily_search`, `tavily_extract`
- `image_generate`
- `tts`
- `message-tool` (send + media variants)
- `session_status`, `sessions_spawn`
- `memory.recall`, `memory.add` (if pi-only, mark as expected drift)
- `skill_*` invocations

Acceptance:

- Each tool has a `qa/scenarios/runtime/tools/<tool>.md` fixture.
- Each fixture passes both cells when run under `--runtime-pair pi,codex` against current main, or is annotated with a known-broken marker that points at a tracking issue.
- The runtime-parity report enumerates per-tool drift, not just per-scenario.
- A `pnpm openclaw qa tool-coverage --runtime-pair pi,codex` command produces a Markdown table of "tool X: pi=✓ codex=✗ #issue" for the README of the harness.

### Phase 3: Codex-plugin lifecycle harness
Scope: stress the codex-plugin install / update / version-pinning flows that pash flagged.
Cells (from the bug clusters and ai-hpc's manual matrix):
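1. Cold install: clean home, no codex plugin → `openclaw doctor --fix` from a config that needs codex → assert remediation message clear, install completes, retry succeeds, no $ leakage to the api-key path.
2. […] `openai-codex:*` and `openai:*` profiles in `auth-profiles.json` → assert codex auth picked, not the api-key (the #78499 case, "[Bug]: doctor --fix rewrites Codex runtime model refs to openai/* and breaks Codex auth profile selection").
3. Pinned-old codex plugin + new openclaw: codex plugin pinned to release N-1, openclaw on N → assert version mismatch detected and reported with a clear remediation hint.
4. Pinned-new codex plugin + old openclaw: same axis flipped.
5. Codex plugin install racing first agent turn: concurrent install + agent run → assert ordering doesn't lose tokens or produce a duplicate response.
6. Doctor migration safety: codify @ai-hpc's four manual cells as automated checks (oauth-only, mixed-profile, mixed + defaults pin, mixed + per-agent pin) → assert `doctor --fix` strips pins and codex auto-routes.

Acceptance:

- Each cell is automated, runs in mock-openai mode, and completes in <60s.
- Failure modes have asserted error messages (string match) so any wording regression is caught.
- Live-mode variant gated to scheduled runs.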
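For the install-race cell, one way to avoid timing-based assertions is an explicit gate that the test opens itself, so the interleaving is decided by the test rather than by sleeps. The sketch below is illustrative only: `Gate` and `runInstallRace` are hypothetical names, and the install and agent turn are stand-ins that only record events.

```typescript
// One-shot gate: awaiting `opened` blocks until `release()` is called.
class Gate {
  readonly opened: Promise<void>;
  private resolveOpened: () => void = () => {};

  constructor() {
    this.opened = new Promise<void>((resolve) => {
      this.resolveOpened = resolve;
    });
  }

  release(): void {
    this.resolveOpened();
  }
}

// Simulated race: the agent turn is requested while the install is still running,
// but it only proceeds once the install has signalled completion through the gate.
async function runInstallRace(): Promise<string[]> {
  const events: string[] = [];
  const installDone = new Gate();

  const install = (async () => {
    events.push("install:start");
    await Promise.resolve(); // stand-in for the real plugin install
    events.push("install:done");
    installDone.release();
  })();

  const turn = (async () => {
    events.push("turn:requested");
    await installDone.opened; // no timers: ordering comes from the gate
    events.push("turn:run");
  })();

  await Promise.all([install, turn]);
  return events;
}
```

A test can then assert that `install:done` precedes `turn:run` and that `turn:run` appears exactly once, without depending on wall-clock timing.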
### Phase 4: Token-efficiency report
Scope: capture and surface per-runtime token usage. Live mode only (mock-openai returns fixed counts so deltas there are meaningless).
Acceptance:
- `qa parity-report --runtime-axis --token-efficiency` produces the side-by-side table described above.
- Stored as a release artifact for week-over-week tracking.

### Phase 5: JSONL replay (lower priority, separate track)
Scope: Eva's "loop 3 agents on difficult scenarios from real jsonl session history."
Approach: take captured session transcripts (from a maintainer-supplied jsonl set, stripped of PII), extract user turns, replay through fresh sessions on each runtime. Diff trajectories.
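The "extract user turns" step might look roughly like the following. The record layout (`role` and string `content` fields, one JSON object per line) is an assumption for illustration; the actual session JSONL schema is not specified in this RFC, and `extractUserTurns` is a hypothetical name.

```typescript
// A user turn to replay, in original order.
type ReplayTurn = { index: number; text: string };

// Parse a JSONL transcript and keep only user-authored text turns.
// Blank lines, malformed lines, and non-user records are skipped.
function extractUserTurns(jsonl: string): ReplayTurn[] {
  const turns: ReplayTurn[] = [];
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    let record: unknown;
    try {
      record = JSON.parse(trimmed);
    } catch {
      continue; // tolerate truncated or corrupt lines in captured history
    }
    if (typeof record !== "object" || record === null) continue;
    // Assumed record fields: `role` and `content`.
    const { role, content } = record as { role?: unknown; content?: unknown };
    if (role === "user" && typeof content === "string") {
      turns.push({ index: turns.length, text: content });
    }
  }
  return turns;
}
```

Each extracted turn would then be sent through a fresh session on each runtime, with the resulting trajectories handed to the drift classifier.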
Acceptance: harness accepts a directory of jsonl, runs each through `--runtime-pair`, and produces a drift report with the same drift classifier from Phase 1. The PR is gated behind a curated fixture set so it can land without a real-customer transcript dump.

## Performance / cost budget
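- Hermetic on-PR runs (mock-openai, single auth-shape, codex-current only): target <5 min total for all scenarios across both runtimes. Parallelizable per scenario.
- Full live release-checks lane: target <30 min with parallelism, gated behind `OPENCLAW_BUILD_PRIVATE_QA=1`.
- Token-efficiency live runs: separate scheduled cron, not on every release; nightly is fine.

## Out of scope

[…]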
## Failure-mode taxonomy (for triage)

When the harness reports drift, the triage flow is:

- `failure-mode` drift = one runtime errors, the other doesn't → blocking. Open a P1 bug.
- `structural` drift = turn count or phase structure differs → likely blocking. Investigate before merging anything that touches that code path.
- `tool-call-shape` drift = wrong/missing tool → P1-P2 depending on the tool family.
- `tool-result-shape` drift = same tool, different parsing → P2 unless it changes outcomes.
- `text-only` drift within tolerance = expected; no action.
- `text-only` drift outside tolerance = model-eval rubric escalation.

## References
- […] `openai/*` routes).
- […] `transport-parity-gate.md` design doc + reproduction test (the test was reframed-out by #79238, "Keep OpenAI Codex migrations on automatic runtime routing"; the doc lifts forward into this RFC).
- `extensions/qa-lab/transport-parity-gate.md`: a design-doc-only PR will be filed extracting the doc content from #78512 ("test(doctor): reproduce #78407 openai-codex model-ref rewrite without auth") and updating it for current main.
- @ai-hpc's four-cell manual matrix verification (yesterday).

## Handoff notes for the implementing agent
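- Read `extensions/qa-lab/AGENTS.md` and the scoped `extensions/qa-lab/src/CLAUDE.md` (if present) before touching code.
- The `OPENCLAW_QA_FORCE_RUNTIME` seam is the only runtime-mutation surface added by Phase 1; keep it gated and test-only. Do not let it leak into production code paths.
- Phase 1 is the unblocker. Phases 2-4 can be parallelised once Phase 1 lands. Phase 5 is independent and lowest priority.
- The drift classifier in Phase 1 must use the same rubric as the existing `agentic-parity-report.test.ts` for text-only drift, to keep tolerance consistent across the two parity gates.
- For Phase 3 cell 5 (install race), avoid timing-based assertions; use deterministic ordering primitives.
- For Phase 4, capture usage at the assistant-message level (`AssistantMessage.usage`) rather than at the transport level: the transport-level shapes differ between Pi and Codex, but the assistant-message shape is normalized.

Sub-issues will be filed for each phase. This issue is the tracking parent.

cc @pash @Eva @ai-hpc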