Codex-vs-Pi runtime parity QA harness (RFC + tracking)

# Codex-vs-Pi runtime parity QA harness (RFC + tracking)

## Context

Per yesterday's maintainer thread between `@pash`, `@Eva-⚡🐑`, and `@ai-hpc`: OpenClaw is moving to Codex as the default runtime for OpenAI agent turns. The Pi-built tool surface, doctor migrations, plugin install/version flows, and auth-profile selection share a known regression class when the runtime axis is flipped; recent issues #78055, #78060, #78407, and #78499 cluster around exactly this surface.

The maintainer ask:

- **`@pash`** — stability + low-hanging optimisations ahead of the announcement; codex-plugin install/update ergonomics; version-pinning regression coverage; "fails clearly, remediation steps clear, ux is good"; **token efficiency report**.
- **`@Eva-⚡🐑`** — full parity QA pi-vs-codex; loop 3 agents on difficult scenarios from real jsonl session history; all tools and long runs to 100% parity; debug logging on each cell.
- **`@ai-hpc`** — already manually verified the 4-cell doctor-migration matrix on current main; needs to be codified as a harness that can't regress.

The existing model-axis parity gate (introduced in #74290, folded into release validation by #74622, baseline bump in flight at #79347) compares `gpt-5.5` vs `claude-opus-4-7` — same runtime, two different models. **The new harness is orthogonal: same model, two different runtimes.**

This RFC sketches the architecture for the runtime-parity harness so implementation can be split into reviewable sub-issues. It builds on (and replaces) the proposal sketched in [`extensions/qa-lab/transport-parity-gate.md`](https://github.com/openclaw/openclaw/pull/78512/files) from closed PR #78512.

## Architecture

### The matrix

```
scenarios × runtimes × plugin-states × auth-shapes × provider-mode
```

| Axis | Values | Purpose |
|---|---|---|
| **scenarios** | per-tool fixtures + jsonl-replay scenarios + existing agentic-parity scenarios | What the agent is asked to do |
| **runtimes** | `pi`, `codex` | The "primary subject" of the comparison — same model, forced runtime |
| **plugin-states** | `codex-missing`, `codex-pinned-old`, `codex-current`, `codex-head` | Stress codex-as-plugin lifecycle |
| **auth-shapes** | `oauth-only`, `apikey-only`, `mixed-profiles` | Catches auth-selection bugs (#78499 class) |
| **provider-mode** | `mock-openai` (hermetic, default), `live-frontier` (real, gated) | Cost/speed vs realism trade-off |

The full Cartesian product is huge, so we run a **small hermetic subset on every PR** (`mock-openai` × `codex-current` × `oauth-only` across the per-tool fixtures) and the **full live matrix on a schedule** (release-checks workflow, gated behind `OPENCLAW_BUILD_PRIVATE_QA=1`).
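A minimal sketch of how the matrix and the on-PR subset could be enumerated. The axis values come from the table above; the `MatrixCell` type and the `fullMatrix` / `hermeticSubset` helpers are hypothetical names used for illustration, not existing qa-lab APIs.

```typescript
// Axis values are copied from the matrix table; all helper names are hypothetical.
const RUNTIMES = ["pi", "codex"] as const;
const PLUGIN_STATES = ["codex-missing", "codex-pinned-old", "codex-current", "codex-head"] as const;
const AUTH_SHAPES = ["oauth-only", "apikey-only", "mixed-profiles"] as const;
const PROVIDER_MODES = ["mock-openai", "live-frontier"] as const;

interface MatrixCell {
  scenario: string;
  runtime: (typeof RUNTIMES)[number];
  pluginState: (typeof PLUGIN_STATES)[number];
  authShape: (typeof AUTH_SHAPES)[number];
  providerMode: (typeof PROVIDER_MODES)[number];
}

// Full Cartesian product over every axis for the given scenarios.
function fullMatrix(scenarios: readonly string[]): MatrixCell[] {
  const cells: MatrixCell[] = [];
  for (const scenario of scenarios)
    for (const runtime of RUNTIMES)
      for (const pluginState of PLUGIN_STATES)
        for (const authShape of AUTH_SHAPES)
          for (const providerMode of PROVIDER_MODES)
            cells.push({ scenario, runtime, pluginState, authShape, providerMode });
  return cells;
}

// Hermetic on-PR subset: both runtimes, every other axis pinned to one value.
function hermeticSubset(scenarios: readonly string[]): MatrixCell[] {
  return fullMatrix(scenarios).filter(
    (c) =>
      c.providerMode === "mock-openai" &&
      c.pluginState === "codex-current" &&
      c.authShape === "oauth-only",
  );
}
```

Each scenario expands to 48 cells in the full product (2 × 4 × 3 × 2) but only 2 in the hermetic subset, which is why the two lanes are split.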

### Per-cell capture

For every cell of the matrix, emit:

- `transcript-bytes` — full JSONL of the turn chain (already produced by qa-lab; just needs runtime tagging).
- `tool-calls[]` — ordered list of `{ tool_name, args_hash, result_hash, error_class? }`.
- `final-text` — assistant final answer text, normalized for whitespace.
- `usage` — `{ input_tokens, output_tokens, total_tokens, cache_read?, cache_write? }`. Aggregate per-turn and per-scenario.
- `wall-clock-ms`, `transport-error-class?`, `runtime-error-class?`.
- `boot-state` — `gateway.err.log` lines containing `FailoverError`, `No API key found`, `Codex app-server`, etc.
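One possible shape for the per-cell capture, as a sketch. The `usage` and `tool-calls[]` field names follow the list above; the camelCase property names, the `CellCapture` type, and the two helpers are assumptions made for illustration.

```typescript
interface Usage {
  input_tokens: number;
  output_tokens: number;
  total_tokens: number;
  cache_read?: number;
  cache_write?: number;
}

interface ToolCallRecord {
  tool_name: string;
  args_hash: string;
  result_hash: string;
  error_class?: string;
}

// Hypothetical capture record; one instance per matrix cell.
interface CellCapture {
  runtime: "pi" | "codex";
  transcriptBytes: string; // full JSONL of the turn chain
  toolCalls: ToolCallRecord[]; // in call order
  finalText: string; // already whitespace-normalized
  usagePerTurn: Usage[];
  wallClockMs: number;
  transportErrorClass?: string;
  runtimeErrorClass?: string;
  bootState: string[]; // matching gateway.err.log lines
}

// Collapse every whitespace run to one space and trim the ends.
function normalizeFinalText(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Sum per-turn usage into a per-scenario total; optional fields stay absent if never reported.
function sumUsage(turns: readonly Usage[]): Usage {
  const total: Usage = { input_tokens: 0, output_tokens: 0, total_tokens: 0 };
  for (const t of turns) {
    total.input_tokens += t.input_tokens;
    total.output_tokens += t.output_tokens;
    total.total_tokens += t.total_tokens;
    if (t.cache_read !== undefined) total.cache_read = (total.cache_read ?? 0) + t.cache_read;
    if (t.cache_write !== undefined) total.cache_write = (total.cache_write ?? 0) + t.cache_write;
  }
  return total;
}

// Per-scenario usage for one cell.
function scenarioUsage(capture: CellCapture): Usage {
  return sumUsage(capture.usagePerTurn);
}
```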

### Drift classifier

When transcripts differ between the `pi` and `codex` cells of the same scenario, classify:

- `text-only` — final answers differ in wording but mean the same thing (allowed within model-eval tolerance, same rubric the existing `agentic-parity-report.test.ts` uses).
- `tool-call-shape` — different tools called, different arg shapes, different ordering.
- `tool-result-shape` — same tool called but result is interpreted differently.
- `structural` — different turn count, different phase structure, missing/extra final answer.
- `failure-mode` — one cell errors, the other doesn't.

The harness reports drift category per scenario, not just pass/fail. **This is what makes it actionable for "lots of tools break under codex" — you see exactly which tool family drifts.**
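A rough sketch of the classifier, under stated assumptions: the precedence order (most severe category wins), the use of `result_hash` as a proxy for "result interpreted differently", and the simplified `CellSummary` input are all choices made here for illustration. Text tolerance is deliberately left out; in the real harness it would come from the existing rubric.

```typescript
type Drift =
  | "none"
  | "text-only"
  | "tool-call-shape"
  | "tool-result-shape"
  | "structural"
  | "failure-mode";

interface ToolCall {
  tool_name: string;
  args_hash: string;
  result_hash: string;
}

// Hypothetical, simplified view of one cell.
interface CellSummary {
  errorClass?: string; // set when the cell errored
  turnCount: number;
  toolCalls: ToolCall[];
  finalText?: string; // undefined when the final answer is missing
}

function classifyDrift(pi: CellSummary, codex: CellSummary): Drift {
  // One cell errors, the other does not.
  if ((pi.errorClass === undefined) !== (codex.errorClass === undefined)) return "failure-mode";

  // Different turn count, or a final answer present on one side only.
  const finalMismatch = (pi.finalText === undefined) !== (codex.finalText === undefined);
  if (pi.turnCount !== codex.turnCount || finalMismatch) return "structural";

  // Different tools, argument shapes, or ordering.
  const sameCalls =
    pi.toolCalls.length === codex.toolCalls.length &&
    pi.toolCalls.every((call, i) => {
      const other = codex.toolCalls[i];
      return other !== undefined && call.tool_name === other.tool_name && call.args_hash === other.args_hash;
    });
  if (!sameCalls) return "tool-call-shape";

  // Same calls, but at least one result differs.
  const sameResults = pi.toolCalls.every((call, i) => call.result_hash === codex.toolCalls[i]?.result_hash);
  if (!sameResults) return "tool-result-shape";

  // Only the final wording differs; tolerance is judged elsewhere.
  return pi.finalText === codex.finalText ? "none" : "text-only";
}
```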

### Token-efficiency report

For live-mode runs, the report is a per-scenario, side-by-side table:

```
scenario              | pi tokens | codex tokens | Δ      | tools used
----------------------|-----------|--------------|--------|----------
bash-list-files       |   1,240   |    1,180     | -4.8%  | bash
exec-approval-loop    |   3,840   |    4,210     | +9.6%  | exec, message-tool
web-search-then-fetch |   2,100   |    1,950     | -7.1%  | web_search, web_fetch
                       ...
TOTAL                 |  N        |   M          |  ±x%   | -
```

Plus per-runtime aggregates (total, p50, p90 per turn) and a flag when delta >15% so model-cost regressions surface as PR-blockers.
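The Δ column and the 15% flag can be sketched as follows. The delta is taken relative to the `pi` count, which matches the sample rows above; applying the threshold to the magnitude of the change (rather than to increases only) is an assumption, and the helper names are hypothetical.

```typescript
interface TokenRow {
  scenario: string;
  piTokens: number;
  codexTokens: number;
}

// Relative change of codex versus pi, in percent.
function deltaPercent(row: TokenRow): number {
  if (row.piTokens === 0) throw new RangeError(`pi token count is zero for ${row.scenario}`);
  return ((row.codexTokens - row.piTokens) / row.piTokens) * 100;
}

// Render with an explicit sign and one decimal place, as in the table.
function formatDelta(pct: number): string {
  const sign = pct > 0 ? "+" : "";
  return `${sign}${pct.toFixed(1)}%`;
}

// Assumption: the threshold applies to the magnitude of the change.
function isFlagged(row: TokenRow, thresholdPct = 15): boolean {
  return Math.abs(deltaPercent(row)) > thresholdPct;
}

const sampleRows: TokenRow[] = [
  { scenario: "bash-list-files", piTokens: 1240, codexTokens: 1180 },
  { scenario: "exec-approval-loop", piTokens: 3840, codexTokens: 4210 },
  { scenario: "web-search-then-fetch", piTokens: 2100, codexTokens: 1950 },
];

for (const row of sampleRows) {
  console.log(row.scenario, formatDelta(deltaPercent(row)), isFlagged(row) ? "FLAG" : "");
}
```

None of the three sample rows crosses the threshold; a scenario going from 1,000 to 1,200 tokens (+20.0%) would.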

## Components — file-level layout

| New / extended file | Purpose | Phase |
|---|---|---|
| `extensions/qa-lab/src/runtime-parity.ts` (new) | Orchestrator: takes a scenario, runs it twice with `pi` and `codex` forced, returns per-cell capture | 1 |
| `extensions/qa-lab/src/runtime-parity.test.ts` (new) | Unit tests for orchestrator + drift classifier | 1 |
| `src/agents/model-runtime-policy.ts` (extend) | Add an `OPENCLAW_QA_FORCE_RUNTIME` env-var seam (test-only) so the harness can override `agentRuntime.id` resolution without mutating user config. Document as test-only in the export's JSDoc. | 1 |
| `extensions/qa-lab/src/agentic-parity-report.ts` (extend) | Add `runtime` field to per-cell summary, `runtimeDrift` rollup section | 1 |
| `extensions/qa-lab/src/cli.ts` (extend) | New `qa suite --runtime-pair pi,codex` flag, propagates to suite runner | 1 |
| `qa/scenarios/runtime/tools/<tool>.md` (new) | One scenario per tool family — see Phase 2 list below | 2 |
| `extensions/qa-lab/src/codex-plugin-fixture.ts` (new) | Helpers to seed `~/.openclaw/npm/node_modules/@openclaw/codex` to a known version (or absent) before a cell | 3 |
| `extensions/qa-lab/src/codex-plugin-lifecycle.test.ts` (new) | Asserts doctor + first-turn flow under each plugin-state | 3 |
| `extensions/qa-lab/src/token-efficiency-report.ts` (new) | Side-by-side token report; integrates into `qa parity-report` | 4 |
| `extensions/qa-lab/src/jsonl-replay.ts` (new) | Replays real captured session transcripts through both runtimes | 5 |
| `.github/workflows/openclaw-release-checks.yml` (extend) | Wire the runtime-pair lane into the same matrix that already runs the model-pair lane | 1 |

## Phasing — five PRs, staged

**Sub-issues filed:**
- Phase 1 — Runtime axis: #80172
- Phase 2 — Per-tool fixture set: #80173
- Phase 3 — Codex-plugin lifecycle: #80174
- Phase 4 — Token-efficiency report: #80175
- Phase 5 — JSONL replay: #80176


### Phase 1 — Runtime axis (smallest, lands first)

**Scope:** add the `runtime` dimension to the existing parity machinery. Reuse current scenarios; do not add new fixtures yet.

**Files:** `runtime-parity.ts`, `model-runtime-policy.ts` extension, `agentic-parity-report.ts` extension, `cli.ts` flag, workflow wiring, tests.

**Acceptance:**
- `pnpm openclaw qa suite --provider-mode mock-openai --parity-pack agentic --runtime-pair pi,codex` runs each existing agentic scenario twice (once per runtime) and produces a summary with a `runtime` field per cell.
- The drift classifier is implemented and emits one of `{none, text-only, tool-call-shape, tool-result-shape, structural, failure-mode}` per scenario.
- New `qa parity-report` mode `--runtime-axis` produces a side-by-side table.
- `OPENCLAW_QA_FORCE_RUNTIME=pi|codex` env var, read at policy resolution time, is documented as test-only and honored only when `OPENCLAW_BUILD_PRIVATE_QA=1`.
- CI wiring: a new step in `.github/workflows/openclaw-release-checks.yml` (folded into the same matrix as #74622's parity lane) running the runtime-pair on the existing scenarios.
- All existing parity tests still green; no behavior change for non-QA users.
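A minimal sketch of the env-var seam described in the acceptance list. The two environment variable names come from this RFC; the function names, the decision to throw on an unrecognized value, and passing the environment in as a parameter are assumptions. The real resolution logic in `model-runtime-policy.ts` is not reproduced here.

```typescript
type RuntimeId = "pi" | "codex";
type Env = Record<string, string | undefined>;

// Returns the forced runtime, or undefined when the seam is inactive.
function resolveForcedRuntime(env: Env): RuntimeId | undefined {
  // The seam is ignored entirely outside private QA builds.
  if (env.OPENCLAW_BUILD_PRIVATE_QA !== "1") return undefined;

  const forced = env.OPENCLAW_QA_FORCE_RUNTIME;
  if (forced === undefined || forced === "") return undefined;
  if (forced === "pi" || forced === "codex") return forced;

  // Assumption: fail loudly rather than silently fall back.
  throw new Error(`OPENCLAW_QA_FORCE_RUNTIME must be "pi" or "codex", got "${forced}"`);
}

// Hypothetical wrapper: the configured runtime wins unless the seam is active.
function resolveRuntime(configured: RuntimeId, env: Env): RuntimeId {
  return resolveForcedRuntime(env) ?? configured;
}
```

Taking the environment as an argument keeps the seam unit-testable without mutating `process.env`, and makes the "no behavior change for non-QA users" criterion easy to assert.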

### Phase 2 — Per-tool fixture set

**Scope:** one fixture per tool family. Each fixture is deterministic: it prompts the agent in a way that forces exactly one tool call with predictable arguments, then asserts that the tool was invoked, that it completed, and that the result shape matches between runtimes.

**Tool families** (from `src/agents/pi-tools.create-openclaw-coding-tools.ts` and codex harness contract — exact list to be confirmed in the PR):

- `bash`
- `exec` (approval flow)
- `fs.read`, `fs.write`, `fs.list`
- `grep`
- `edit` / `apply-patch`
- `web_search`, `web_fetch`
- `tavily_search`, `tavily_extract`
- `image_generate`
- `tts`
- `message-tool` (send + media variants)
- `session_status`, `sessions_spawn`
- `memory.recall`, `memory.add` (if pi-only, mark as expected drift)
- `skill_*` invocations

**Acceptance:**
- Each tool has a `qa/scenarios/runtime/tools/<tool>.md` fixture.
- Each fixture passes both cells when run under `--runtime-pair pi,codex` against current main, OR is annotated with a known-broken marker that points at a tracking issue.
- The runtime-parity report enumerates per-tool drift, not just per-scenario.
- A `pnpm openclaw qa tool-coverage --runtime-pair pi,codex` command produces a Markdown table of "tool X: pi=✅ codex=❌ #issue" for the README of the harness.
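The Markdown table produced by `tool-coverage` could be rendered roughly like this. The column layout, the `ToolCoverage` type, and `renderToolCoverage` are hypothetical; `#NNNNN` in the usage below is a placeholder, not a real issue number.

```typescript
interface ToolCoverage {
  tool: string;
  pi: boolean;
  codex: boolean;
  issue?: string; // tracking issue for a known-broken cell
}

// Render one Markdown table row per tool, with a header and separator row.
function renderToolCoverage(rows: readonly ToolCoverage[]): string {
  const mark = (ok: boolean): string => (ok ? "✅" : "❌");
  const lines = ["| Tool | pi | codex | Tracking |", "|---|---|---|---|"];
  for (const r of rows) {
    lines.push(`| \`${r.tool}\` | ${mark(r.pi)} | ${mark(r.codex)} | ${r.issue ?? "-"} |`);
  }
  return lines.join("\n");
}

console.log(
  renderToolCoverage([
    { tool: "bash", pi: true, codex: true },
    { tool: "exec", pi: true, codex: false, issue: "#NNNNN" },
  ]),
);
```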

### Phase 3 — Codex-plugin lifecycle harness

**Scope:** stress the codex-plugin install / update / version-pinning flows that pash flagged.

**Cells (from the bug clusters and ai-hpc's manual matrix):**

1. **Cold install** — clean home, no codex plugin → `openclaw doctor --fix` from a config that needs codex → assert remediation message clear, install completes, retry succeeds, no $ leakage to api-key path.
2. **OAuth-only with mixed-profiles** — both `openai-codex:*` and `openai:*` profiles in `auth-profiles.json` → assert codex auth picked, not the api-key (#78499 case).
3. **Pinned-old codex plugin + new openclaw** — codex plugin pinned to release N-1, openclaw on N → assert version mismatch detected and reported with a clear remediation hint.
4. **Pinned-new codex plugin + old openclaw** — same axis flipped.
5. **Codex plugin install racing first agent turn** — concurrent install + agent run → assert ordering doesn't lose tokens or produce a duplicate response.
6. **Doctor migration safety** — codify `@ai-hpc`'s four manual cells as automated checks: oauth-only, mixed-profile, mixed + defaults pin, mixed + per-agent pin → assert `doctor --fix` strips pins and codex auto-routes.

**Acceptance:**
- Each cell is automated, runs in `mock-openai` mode, and completes in under 60 s.
- Failure modes have asserted error messages (string match) so any wording regression is caught.
- Live-mode variant gated to scheduled runs.

### Phase 4 — Token-efficiency report

**Scope:** capture and surface per-runtime token usage. Live mode only (mock-openai returns fixed counts so deltas there are meaningless).

**Acceptance:**
- `qa parity-report --runtime-axis --token-efficiency` produces the side-by-side table described above.
- Per-runtime aggregates: total, plus p50 and p90 per turn.
- Flag when scenario-level delta >15%.
- Stored as a release artifact for week-over-week tracking.
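The per-runtime aggregates could be computed along these lines. This sketch uses the nearest-rank percentile definition, which is an assumption (the RFC does not specify a method), and `aggregateTurns` is a hypothetical helper name.

```typescript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: readonly number[], p: number): number {
  if (values.length === 0) throw new RangeError("percentile of an empty list");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1]!;
}

interface TurnAggregates {
  total: number;
  p50: number;
  p90: number;
}

// Aggregate the total token count of every turn run on one runtime.
function aggregateTurns(totalTokensPerTurn: readonly number[]): TurnAggregates {
  return {
    total: totalTokensPerTurn.reduce((sum, n) => sum + n, 0),
    p50: percentile(totalTokensPerTurn, 50),
    p90: percentile(totalTokensPerTurn, 90),
  };
}
```

For five turns of 100, 200, 300, 400, and 500 tokens this yields a total of 1,500, a p50 of 300, and a p90 of 500.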

### Phase 5 — JSONL replay (lower priority, separate track)

**Scope:** Eva's "loop 3 agents on difficult scenarios from real jsonl session history."

**Approach:** take captured session transcripts (from a maintainer-supplied jsonl set, stripped of PII), extract user turns, replay through fresh sessions on each runtime. Diff trajectories.

**Acceptance:** harness accepts a directory of jsonl, runs each through `--runtime-pair`, produces a drift report with the same drift classifier from Phase 1. PR is gated behind a curated fixture set so it can land without a real-customer transcript dump.
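The "extract user turns" step might look like the sketch below. The line schema (`role` and `content` as strings) is an assumption made for illustration; the actual session JSONL format may differ, and lines that do not match are simply skipped here.

```typescript
// Assumed line shape; the real transcript schema is not specified in this RFC.
interface TranscriptLine {
  role: string;
  content: string;
}

function isTranscriptLine(value: unknown): value is TranscriptLine {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record.role === "string" && typeof record.content === "string";
}

// Return the user-turn texts, in order, from one JSONL transcript.
function extractUserTurns(jsonl: string): string[] {
  const turns: string[] = [];
  for (const raw of jsonl.split("\n")) {
    const line = raw.trim();
    if (line === "") continue; // tolerate blank lines and a trailing newline
    const parsed: unknown = JSON.parse(line);
    if (isTranscriptLine(parsed) && parsed.role === "user") turns.push(parsed.content);
  }
  return turns;
}
```

The extracted turns would then be fed, one at a time, into a fresh session on each runtime, and the resulting trajectories compared with the Phase 1 drift classifier.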

## Performance / cost budget

- Hermetic on-PR runs (mock-openai, single auth-shape, codex-current only): target **<5 min total** for all scenarios across both runtimes. Parallelizable per scenario.
- Full live release-checks lane: target **<30 min** with parallelism, gated behind `OPENCLAW_BUILD_PRIVATE_QA=1`.
- Token-efficiency live runs: separate scheduled cron, not on every release; nightly is fine.

## Out of scope

- Cross-vendor model parity stays in the existing model-axis gate (#74290 / #79347).
- CLI surface / message-clarity work like #77221.
- Mobile/iOS replay — separate harness if needed.
- Real-customer transcript ingestion — Phase 5 uses curated fixtures only.

## Failure-mode taxonomy (for triage)

When the harness reports drift, the triage flow is:

1. **`failure-mode` drift** = one runtime errors, the other doesn't → blocking. Open a P1 bug.
2. **`structural` drift** = turn count or phase structure differs → likely blocking. Investigate before merging anything that touches that code path.
3. **`tool-call-shape` drift** = wrong/missing tool → P1-P2 depending on the tool family.
4. **`tool-result-shape` drift** = same tool, different parsing → P2 unless it changes outcomes.
5. **`text-only` drift within tolerance** = expected; no action.
6. **`text-only` drift outside tolerance** = model-eval rubric escalation.

## References

- #78457 — original transport-parity gate proposal (this RFC supersedes its scope).
- #78055, #78060, #78407 — bug cluster that motivates the harness.
- #78499 — Codex auth profile selection (residual of #78407).
- #79238 — most recent runtime-policy fix on main (changed how `openai/*` routes).
- #74290 (closed) → #79347 (slim follow-up in flight) — sibling model-axis parity.
- Closed #78512 — original `transport-parity-gate.md` design doc + reproduction test (test was reframed-out by #79238; doc lifts forward into this RFC).
- `extensions/qa-lab/transport-parity-gate.md` — design-doc-only PR will be filed extracting the doc content from #78512 and updating it for current main.
- `@ai-hpc`'s four-cell manual matrix verification (yesterday, in the maintainer thread).

## Handoff notes for the implementing agent

- Read `extensions/qa-lab/AGENTS.md` and the scoped `extensions/qa-lab/src/CLAUDE.md` (if present) before touching code.
- The `OPENCLAW_QA_FORCE_RUNTIME` seam is the **only** runtime-mutation surface added by Phase 1 — keep it gated and test-only. Do not let it leak into production code paths.
- Phase 1 is the unblocker. Phases 2–4 can be parallelised once Phase 1 lands. Phase 5 is independent and lowest priority.
- The drift classifier in Phase 1 must use the **same** rubric as the existing `agentic-parity-report.test.ts` for text-only drift to keep tolerance consistent across the two parity gates.
- For Phase 3 cell 5 (install race), avoid timing-based assertions — use deterministic ordering primitives.
- For Phase 4, capture usage at the assistant-message level (`AssistantMessage.usage`) rather than at the transport level — the transport-level shapes differ between Pi and Codex but the assistant-message shape is normalized.
- Sub-issues are filed for each phase (#80172 through #80176). This issue is the tracking parent.
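For the Phase 3 install-race cell, one way to get deterministic ordering without timers is an explicitly resolved gate. The sketch below is illustrative only: `createGate` and `runInstallRace` are hypothetical, and the two inner functions are stand-ins for the real plugin install and first agent turn.

```typescript
// A gate is a promise the test resolves explicitly, so ordering never depends on timing.
function createGate(): { opened: Promise<void>; open: () => void } {
  let open!: () => void;
  const opened = new Promise<void>((resolve) => {
    open = resolve;
  });
  return { opened, open };
}

// Run both operations concurrently while forcing which one finishes first.
async function runInstallRace(order: "install-first" | "turn-first"): Promise<string[]> {
  const events: string[] = [];
  const gate = createGate();

  // Stand-in for the codex plugin install.
  const install = async (): Promise<void> => {
    if (order === "turn-first") await gate.opened;
    events.push("install-done");
    if (order === "install-first") gate.open();
  };

  // Stand-in for the first agent turn.
  const turn = async (): Promise<void> => {
    if (order === "install-first") await gate.opened;
    events.push("turn-done");
    if (order === "turn-first") gate.open();
  };

  await Promise.all([install(), turn()]);
  return events;
}
```

Running the cell once per ordering covers both interleavings, and the assertions (no lost tokens, no duplicate response) can then be made against each outcome without sleeps or retries.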

cc `@pash` `@Eva-⚡🐑` `@ai-hpc`

Codex-vs-Pi runtime parity QA harness (RFC + tracking) #80171
