Summary
Propose a sibling QA gate to the existing model-parity gate (introduced in #74290, later folded into `openclaw-release-checks.yml` / `full-release-validation.yml` by #74622) that catches a different class of regression: silent drift between the two paths to the same logical model, and between runtime harnesses for the same model+provider.

This gate would have caught, or made trivially diagnosable, every issue in the cluster around #78055, including the doctor config-rewrite regression filed as #78407.
Motivation
The existing parity gate compares two different models:

- candidate `openai/gpt-5.5-alt` vs baseline `anthropic/claude-opus-4-7`

That answers a product question (do GPT-5.5 and Opus 4.7 give equivalent answers for a user choosing between them). It does not exercise the surfaces that have produced the recent run of regressions:

- … `response.completed` lineage on the `openai-codex` WebSocket transport. The same prompt routed through raw `openai` HTTP would have produced a divergent (correct) trajectory; a same-model-different-provider parity gate would have flagged the WS-only stale-final replay immediately.
- #78407 (`openclaw doctor --fix` rewrites `openai-codex/*` → `openai/*` on update): config migration silently flipped half the install from one transport to the other. A provider-parity gate would have failed when the post-doctor config produced a different (failing) auth resolution than the pre-doctor config for identical scenario inputs.
- #78060, "fix(subagents): keep thread-bound spawns isolated by default" (subagent thread-bound spawns implicitly forking requester history): the implicit-fork path differs between the `pi` native runtime and the `codex` CLI subprocess harness; a runtime-parity gate over the same scenarios would have surfaced the inconsistency.

#77221 (CLI tool-vs-subcommand error message) is in a different test family and is not in scope here.
Proposed scope
A new gate, structured as a matrix in `extensions/qa-lab/`, asserting equivalence across two axes for the same scenario inputs already used by the existing character-eval / agentic-parity suites:

- Axis 1, provider parity (same model, different transport): `openai/gpt-5.5` vs `openai-codex/gpt-5.5`. Same logical model, different auth surface, different request shape, different lineage code (HTTP vs WS, no `previous_response_id` vs `previous_response_id`-based incremental). Any divergence beyond a published tolerance is a bug.
- Axis 2, runtime parity (same model+provider, different harness): `pi` native runtime vs `codex` CLI subprocess. Different tool-loop, different streaming surface, different memory wiring. Any divergence is a bug in one of them.

Cell assertions per scenario:

- Final answer text equivalent (within the existing parity-report tolerance).
- … `auth-profiles.json` (catches #78407-class config corruption).

Implementation sketch
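To make the shape of the matrix concrete, here is a minimal sketch of the four cells (two providers by two runtimes) and a divergence check across them. Every name in it (`Cell`, `CellResult`, `buildMatrix`, `findDivergences`, the `pi-native` / `codex-cli` labels) is hypothetical and not taken from qa-lab, and the `normalize` callback only stands in for the existing parity-report tolerance, whose real definition is not reproduced here.

```typescript
// Hypothetical types for illustration; none of these exist in qa-lab.
type Provider = "openai" | "openai-codex";
type Runtime = "pi-native" | "codex-cli";

interface Cell {
  provider: Provider;
  runtime: Runtime;
  model: string;
}

interface CellResult {
  cell: Cell;
  finalAnswer: string;
}

// Build the 2x2 matrix: provider axis crossed with runtime axis
// for a single logical model.
function buildMatrix(model: string): Cell[] {
  const providers: Provider[] = ["openai", "openai-codex"];
  const runtimes: Runtime[] = ["pi-native", "codex-cli"];
  const cells: Cell[] = [];
  for (const provider of providers) {
    for (const runtime of runtimes) {
      cells.push({ provider, runtime, model });
    }
  }
  return cells;
}

function cellId(c: Cell): string {
  return `${c.provider}/${c.model}@${c.runtime}`;
}

// Compare every cell's final answer against the first cell's answer.
// `normalize` is an assumed stand-in for the parity-report tolerance.
function findDivergences(
  results: CellResult[],
  normalize: (s: string) => string = (s) => s.trim(),
): string[] {
  const baseline = results[0];
  if (baseline === undefined) return [];
  const expected = normalize(baseline.finalAnswer);
  const divergences: string[] = [];
  for (const r of results.slice(1)) {
    if (normalize(r.finalAnswer) !== expected) {
      divergences.push(`${cellId(r.cell)} diverges from ${cellId(baseline.cell)}`);
    }
  }
  return divergences;
}
```

In a real gate each `CellResult` would come from replaying one scenario fixture through the corresponding provider and runtime; an empty result from `findDivergences` would mean the scenario passes for that matrix.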
Reuse the qa-lab primitives that already exist in this clone:

- `extensions/qa-lab/src/providers/mock-openai/server.ts`: already extended in #74290 (Update QA lab parity gate for GPT-5.5 vs Opus 4.7); add a second profile variant exposing the openai-codex Responses surface.
- `extensions/qa-lab/src/providers/shared/mock-model-config.ts`: add `openai-codex/gpt-5.5` alongside the existing `openai/gpt-5.5-alt` entry.
- `extensions/qa-lab/src/qa-gateway-config.test.ts`: extend the gateway-boot test pattern with the four-cell matrix.
- New `extensions/qa-lab/src/transport-parity.ts` + `transport-parity.test.ts`: orchestrator that runs the matrix per fixture and produces a parity-report-style summary.
- New `extensions/qa-lab/src/runtime-parity.ts`: codex-CLI sandbox (mirror the pattern in `qa-live-transports-convex.yml` for transport sandboxing).

CI wiring: add a step in `openclaw-release-checks.yml` (the home that #74622 folded the parity gate into), gated behind the same `OPENCLAW_BUILD_PRIVATE_QA=1` build flag the existing parity tests use.

Concrete starter (would also close #78407 as a side-effect)
A narrow first slice, a fixture-replay regression for the doctor flow, can land independently of the broader matrix and is the smallest unit of value:

- New `src/commands/doctor-config-flow.codex-model-ref-preservation.test.ts` (sibling to the existing `doctor-config-flow.missing-default-account-bindings.test.ts`).
- … `openai-codex/{gpt-5.4,gpt-5.4-mini,gpt-5.4-pro,gpt-5.5,gpt-5.5-pro}` across `agents.defaults.modelOverride.{primary,fallbacks}`, `agents.modelCatalog`, and per-agent + per-channel `modelOverride` blocks (mirrors the 5-location footprint observed in #78407).
- … `auth-profiles.json` containing only `openai-codex:*` and `anthropic:*` (no raw `openai:*`).
- … `--fix` normalize pass.
- … `openai-codex/*` ref is rewritten to `openai/*`.
- … `auth-profiles.json` (general invariant; applies to any future migration too).
- … `modelCatalog` (catches the lost `openai-codex/gpt-5.4-pro` ghost in #78407).

I'm planning to open an umbrella draft PR that adds at least the doctor-flow fixture-replay test (failing, reproducing #78407) and lays out the transport-parity scaffolding as TODOs the maintainers can flesh out. Happy to split into smaller PRs if the maintainer prefers per-axis review.
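The invariants behind the doctor-flow fixture replay can be sketched as a pure check over the model refs before and after the `--fix` pass. This is an illustration only: `ModelRefs`, `prefixBefore`, and `checkMigration` are hypothetical names, the flattened path-to-ref map is an assumed simplification of the real config shape, and the auth rule (every referenced provider needs a matching profile id prefix) is my reading of the `auth-profiles.json` invariant, not something the issue spells out.

```typescript
// Hypothetical flattened view of a config: config path -> model ref.
// The real config is nested; this shape is assumed for illustration.
type ModelRefs = Record<string, string>;

// Return the part of `s` before the first `sep`, or all of `s` if absent.
function prefixBefore(s: string, sep: string): string {
  const i = s.indexOf(sep);
  return i === -1 ? s : s.slice(0, i);
}

// Returns human-readable violations; an empty array means the migration
// preserved every invariant checked here.
function checkMigration(
  before: ModelRefs,
  after: ModelRefs,
  authProfileIds: string[],
): string[] {
  const violations: string[] = [];
  // Assumption: profile ids look like "<provider>:<name>".
  const authedProviders = new Set(authProfileIds.map((id) => prefixBefore(id, ":")));

  for (const [path, oldRef] of Object.entries(before)) {
    const newRef = after[path];
    if (newRef === undefined) {
      // A ref silently disappearing (e.g. from the catalog) is a violation.
      violations.push(`${path}: ${oldRef} was dropped`);
      continue;
    }
    // No openai-codex/* ref may be moved to another provider.
    if (
      prefixBefore(oldRef, "/") === "openai-codex" &&
      prefixBefore(newRef, "/") !== "openai-codex"
    ) {
      violations.push(`${path}: ${oldRef} rewritten to ${newRef}`);
    }
  }

  // Assumed general invariant: every post-migration ref resolves to a
  // provider that has at least one auth profile.
  for (const [path, ref] of Object.entries(after)) {
    if (!authedProviders.has(prefixBefore(ref, "/"))) {
      violations.push(`${path}: ${ref} has no auth profile`);
    }
  }
  return violations;
}
```

A regression test built on this idea would load the fixture config, run the normalize pass, flatten both versions, and expect `checkMigration` to return an empty array; on the behaviour reported in #78407 it would instead report the rewritten refs.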
Related