Problem
The QA lab parity gate and related tests are only partially updated for the current frontier model targets as of 2026-04-29. The OpenAI candidate lane has moved to openai/gpt-5.5, but several workflow names, artifact paths, mock-model fixtures, reports, scenario docs, and the Anthropic baseline still encode GPT-5.4 / Opus 4.6 era assumptions.
That makes the expensive QA/parity gate harder to trust: it can be green while still validating the old Opus baseline, and stale labels like gpt54 / opus46 make it unclear what actually ran.
Current upstream evidence
Checked against current origin/main on 2026-04-29 at fa8a7d70ee (docs: fix clawsweeper skill metadata).
Parity workflow is mixed current/stale
.github/workflows/parity-gate.yml currently has:
- job/workflow text still referring to OpenAI / Opus 4.6
- OPENCLAW_CI_OPENAI_MODEL defaulting to openai/gpt-5.5
- candidate lane still using --alt-model openai/gpt-5.4-alt
- candidate output dir still .artifacts/qa-e2e/gpt54
- baseline lane still using --model anthropic/claude-opus-4-6
- baseline lane still using --alt-model anthropic/claude-sonnet-4-6
- baseline output dir still .artifacts/qa-e2e/opus46
- parity report baseline label still anthropic/claude-opus-4-6
QA lab defaults are partially updated
extensions/qa-lab/src/providers/live-frontier/catalog.ts now defaults the primary live frontier model to openai/gpt-5.5.
extensions/qa-lab/src/providers/live-frontier/index.ts and model-selection.runtime.ts also know about openai/gpt-5.5.
However, the related parity/reporting surfaces still carry Opus 4.6 assumptions:
extensions/qa-lab/src/providers/live-frontier/parity.ts
extensions/qa-lab/src/providers/live-frontier/character-eval.ts
extensions/qa-lab/src/agentic-parity-report.test.ts
extensions/qa-lab/src/providers/mock-openai/server.ts
Provider support for Opus 4.7 already exists elsewhere
The Anthropic provider layer already includes anthropic/claude-opus-4-7 / claude-opus-4.7 mappings in core provider code. This issue is therefore about QA-lab/parity wiring drift, not missing Anthropic provider support.
Scenario metadata still targets Opus 4.6
The Anthropic Opus live smoke scenarios still describe and require Opus 4.6:
qa/scenarios/models/anthropic-opus-api-key-smoke.md
qa/scenarios/models/anthropic-opus-setup-token-smoke.md
These should be moved to Opus 4.7, or made family/parameter driven if exact latest-model names are expected to change frequently.
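To make the family/parameter-driven option concrete, here is a minimal sketch of what a family-level model requirement could look like. The ModelRequirement type, the familyPrefix field, and the satisfiesRequirement function are all hypothetical names for illustration; they are not taken from the scenario schema or the qa-lab code.

```typescript
// Hypothetical requirement shape: either pin an exact model ref or accept
// any model from a provider family. Not the actual scenario schema.
type ModelRequirement =
  | { kind: "exact"; model: string }
  | { kind: "family"; provider: string; familyPrefix: string };

function satisfiesRequirement(modelRef: string, req: ModelRequirement): boolean {
  if (req.kind === "exact") {
    return modelRef === req.model;
  }
  const slash = modelRef.indexOf("/");
  if (slash <= 0) {
    return false; // not a provider/model reference
  }
  const provider = modelRef.slice(0, slash);
  const modelId = modelRef.slice(slash + 1);
  // Require a "-" after the family prefix so "claude-opus" cannot match
  // an unrelated id that merely starts with the same characters.
  return provider === req.provider && modelId.startsWith(`${req.familyPrefix}-`);
}

const opusFamily: ModelRequirement = {
  kind: "family",
  provider: "anthropic",
  familyPrefix: "claude-opus",
};

console.log(satisfiesRequirement("anthropic/claude-opus-4-7", opusFamily)); // true
console.log(satisfiesRequirement("anthropic/claude-sonnet-4-6", opusFamily)); // false
```

The trade-off is that a family-level requirement no longer proves which exact Opus version ran, so the report would still need to record the resolved model name.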
What still works
The harness is not dead. A recent upstream parity-gate run succeeded with the current mixed configuration:
- openai/gpt-5.5
- anthropic/claude-opus-4-6
Local focused validation also passed against current main:
pnpm exec vitest run --config test/vitest/vitest.extension-qa.config.ts \
extensions/qa-lab/src/providers/mock-openai/server.test.ts \
extensions/qa-lab/src/qa-gateway-config.test.ts \
extensions/qa-lab/src/suite-planning.test.ts \
extensions/qa-lab/src/agentic-parity-report.test.ts \
extensions/qa-lab/src/scenario-catalog.test.ts
Result: 5 files, 132 tests passed.
Full extension QA unit lane also passed:
pnpm exec vitest run --config test/vitest/vitest.extension-qa.config.ts
Result: 63 files, 524 tests passed.
Private QA runtime build also passed:
OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm build
What needs hardening
The quick preflight path did not reliably work locally. Running mock preflight twice with isolated OpenClaw state failed on approval-turn-tool-followthrough with:
gateway timeout after 25000ms
Representative command shape:
env \
HOME=/tmp/openclaw-origin-main-qa-home \
OPENCLAW_HOME=/tmp/openclaw-origin-main-qa-home \
OPENCLAW_STATE_DIR=/tmp/openclaw-origin-main-qa-state \
OPENCLAW_CONFIG_PATH=/tmp/openclaw-origin-main-qa-home/openclaw.json \
OPENCLAW_BUILD_PRIVATE_QA=1 \
OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 \
OPENCLAW_QA_SUITE_PROGRESS=1 \
OPENAI_API_KEY= \
ANTHROPIC_API_KEY= \
OPENCLAW_LIVE_OPENAI_KEY= \
OPENCLAW_LIVE_ANTHROPIC_KEY= \
OPENCLAW_LIVE_GEMINI_KEY= \
OPENCLAW_LIVE_SETUP_TOKEN_VALUE= \
pnpm openclaw qa suite \
--provider-mode mock-openai \
--parity-pack agentic \
--concurrency 1 \
--model openai/gpt-5.5 \
--alt-model openai/gpt-5.5-alt \
--preflight \
--output-dir .artifacts/qa-e2e/preflight
This does not look like an unknown-model failure. The full CI parity suite runs approval-turn-tool-followthrough later in the 12-scenario pack after the gateway/agent path is already warm, and it passes there. The preflight path runs it cold, with a short timeout, so it is not a reliable quick “does this even work?” sentinel.
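One way to tolerate the cold first run without hiding real breakage is to retry only when the error is a gateway timeout and the gateway still reports healthy. The sketch below assumes that shape; runWithColdRetry, ColdRetryOptions, and the isGatewayHealthy probe are hypothetical names, and only the "gateway timeout after 25000ms" message format comes from the failure above.

```typescript
// Matches the error message seen in the failing preflight run.
function isGatewayTimeout(error: unknown): boolean {
  return error instanceof Error && /^gateway timeout after \d+ms/.test(error.message);
}

interface ColdRetryOptions {
  maxAttempts: number; // total attempts, including the first cold one
  isGatewayHealthy: () => Promise<boolean>; // hypothetical health probe
}

// Retries only gateway timeouts, and only while the gateway is healthy,
// so unknown-model errors and a dead gateway still fail immediately.
async function runWithColdRetry<T>(
  run: (attempt: number) => Promise<T>,
  options: ColdRetryOptions,
): Promise<T> {
  for (let attempt = 1; ; attempt += 1) {
    try {
      return await run(attempt);
    } catch (error) {
      const attemptsLeft = attempt < options.maxAttempts;
      if (!attemptsLeft || !isGatewayTimeout(error) || !(await options.isGatewayHealthy())) {
        throw error;
      }
    }
  }
}
```

Keeping maxAttempts at 2 would bound the extra cost to one additional agent run, which fits the goal of a cheap preflight.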
Desired target state
The QA lab should clearly and truthfully validate the current target comparison:
openai/gpt-5.5 candidate vs anthropic/claude-opus-4-7 baseline
All user-facing labels, report labels, mock models, scenario docs, artifact paths, and workflow names should either use the current exact model names or be renamed to generic stable names like openai-candidate / anthropic-baseline to avoid repeated drift.
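As an illustration of the generic-names option, the sketch below derives every lane-specific string from one model reference, so a model bump touches a single value. The ParityLane shape and the buildLane helper are hypothetical; only the model names, the -alt suffix, and the openai-candidate / anthropic-baseline names come from this issue.

```typescript
type LaneRole = "candidate" | "baseline";

// Hypothetical single source of truth for one side of the parity comparison.
interface ParityLane {
  role: LaneRole;
  model: string; // exact model ref, used verbatim in reports
  altModel: string;
  outputDir: string; // generic name, stable across model bumps
  reportLabel: string;
}

function buildLane(role: LaneRole, model: string, altModel: string = `${model}-alt`): ParityLane {
  const slash = model.indexOf("/");
  if (slash <= 0) {
    throw new Error(`expected provider/model, got "${model}"`);
  }
  const provider = model.slice(0, slash);
  return {
    role,
    model,
    altModel,
    outputDir: `.artifacts/qa-e2e/${provider}-${role}`,
    reportLabel: model,
  };
}

const candidate = buildLane("candidate", "openai/gpt-5.5");
// The baseline alternate is passed explicitly because whether
// anthropic/claude-sonnet-4-6 stays the alternate is still an open decision.
const baseline = buildLane("baseline", "anthropic/claude-opus-4-7", "anthropic/claude-sonnet-4-6");

console.log(candidate.outputDir); // .artifacts/qa-e2e/openai-candidate
console.log(baseline.outputDir); // .artifacts/qa-e2e/anthropic-baseline
```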
Implementation checklist
- Update .github/workflows/parity-gate.yml:
  - rename job/workflow text from Opus 4.6 to Opus 4.7
  - use anthropic/claude-opus-4-7 for the baseline lane
  - decide whether anthropic/claude-sonnet-4-6 remains the correct alternate model or whether the alternate should also move
  - replace openai/gpt-5.4-alt with openai/gpt-5.5-alt, or derive the mock alt model from the primary model
  - rename .artifacts/qa-e2e/gpt54 and .artifacts/qa-e2e/opus46 to current or generic names
- Update QA lab parity/reporting code:
  - extensions/qa-lab/src/providers/live-frontier/parity.ts
  - extensions/qa-lab/src/providers/live-frontier/character-eval.ts
  - extensions/qa-lab/src/agentic-parity-report.test.ts
  - any report title, baseline label, summary, fixture, and expected snapshot text that still says Opus 4.6 or GPT-5.4
- Update mock provider fixtures/tests:
  - extensions/qa-lab/src/providers/mock-openai/server.ts
  - advertise claude-opus-4-7 in mock model lists where appropriate
  - keep compatibility aliases only where intentionally needed
  - ensure the mock provider variant resolver still maps openai/* and anthropic/* by provider family rather than brittle exact model strings
- Update model scenario metadata/docs:
  - qa/scenarios/models/anthropic-opus-api-key-smoke.md
  - qa/scenarios/models/anthropic-opus-setup-token-smoke.md
  - move requiredModel and expected summaries from claude-opus-4-6 to claude-opus-4-7, or introduce a parameterized/family-level requirement if that is the preferred QA contract
- Sweep for stale strings before opening the PR:
rg "gpt54|gpt-5\.4-alt|opus46|opus-4-6|Opus 4\.6|GPT-5\.4" \
.github/workflows \
extensions/qa-lab \
qa/scenarios \
test/helpers/auto-reply
- Harden --preflight:
  - increase the first cold agent-run timeout for preflight, or
  - add a lightweight warmup call before approval-turn-tool-followthrough, or
  - make the gateway child-call timeout retryable for QA preflight when the gateway is healthy but the first agent RPC times out
  - keep the preflight cheap; the point is to avoid paying for the full long parity gate just to discover obvious breakage
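For the mock provider item in the checklist, resolving by provider family could look roughly like the sketch below. The MockVariant type and resolveMockVariant function are hypothetical stand-ins, not the resolver that exists in mock-openai/server.ts; the point is only that the lookup keys on the provider prefix, so a model bump such as 4-6 to 4-7 needs no resolver change.

```typescript
type MockVariant = "openai" | "anthropic";

// Keys on the provider prefix of a provider/model reference and ignores
// the exact model id. Returns undefined for refs it does not recognize.
function resolveMockVariant(modelRef: string): MockVariant | undefined {
  const slash = modelRef.indexOf("/");
  const provider = slash > 0 ? modelRef.slice(0, slash) : "";
  switch (provider) {
    case "openai":
      return "openai";
    case "anthropic":
      return "anthropic";
    default:
      return undefined;
  }
}

console.log(resolveMockVariant("openai/gpt-5.5-alt")); // openai
console.log(resolveMockVariant("anthropic/claude-opus-4-7")); // anthropic
```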
Acceptance criteria
- rg "gpt54|gpt-5\.4-alt|opus46|opus-4-6|Opus 4\.6|GPT-5\.4" returns only intentional compatibility aliases or historical comments with explicit justification.
- pnpm exec vitest run --config test/vitest/vitest.extension-qa.config.ts passes.
- OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm build passes.
- Mock preflight passes with GPT-5.5 candidate naming:
pnpm openclaw qa suite \
--provider-mode mock-openai \
--parity-pack agentic \
--concurrency 1 \
--model openai/gpt-5.5 \
--alt-model openai/gpt-5.5-alt \
--preflight
- The full parity gate runs candidate openai/gpt-5.5 against baseline anthropic/claude-opus-4-7 and produces a report that truthfully names those models.
- The expensive 12-scenario parity suite remains mock-provider compatible and does not require live API keys in mock mode.
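The stale-string criterion could also be enforced from a unit test rather than a manual rg run. The sketch below reuses the same pattern as the rg command; findStaleModelStrings and StaleHit are hypothetical names, and how the file contents get loaded and how intentional aliases get allow-listed are left out.

```typescript
// Same alternation as the rg sweep; case-sensitive, like rg's default.
const STALE_MODEL_PATTERN = /gpt54|gpt-5\.4-alt|opus46|opus-4-6|Opus 4\.6|GPT-5\.4/g;

interface StaleHit {
  line: number; // 1-based line number within the scanned text
  match: string;
}

function findStaleModelStrings(text: string): StaleHit[] {
  const hits: StaleHit[] = [];
  text.split("\n").forEach((content, index) => {
    (content.match(STALE_MODEL_PATTERN) || []).forEach((match) => {
      hits.push({ line: index + 1, match });
    });
  });
  return hits;
}

const sample = [
  "--alt-model openai/gpt-5.4-alt",
  "--model openai/gpt-5.5",
  "--output-dir .artifacts/qa-e2e/opus46",
].join("\n");

console.log(findStaleModelStrings(sample).length); // 2
```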
Why this matters
These QA gates are long-running and expensive enough that they need to be unambiguous. A green parity gate should mean “current OpenAI candidate vs current Anthropic Opus baseline passed,” not “GPT-5.5 passed against an older Opus 4.6 baseline while several labels still say GPT-5.4/Opus 4.6.”
A follow-up PR can be mostly mechanical if it updates the model constants, report labels, mock model fixtures, scenario metadata, artifact names, and preflight timeout behavior together.