Update QA lab parity gate for GPT-5.5 vs Opus 4.7 and harden preflight #74262

@100yenadmin

Problem

The QA lab parity gate and related tests are only partially updated for the current frontier model targets as of 2026-04-29. The OpenAI candidate lane has moved to openai/gpt-5.5, but several workflow names, artifact paths, mock-model fixtures, reports, scenario docs, and the Anthropic baseline still encode GPT-5.4 / Opus 4.6 era assumptions.

That makes the expensive QA/parity gate harder to trust: it can be green while still validating the old Opus baseline, and stale labels like gpt54 / opus46 make it unclear what actually ran.

Current upstream evidence

Checked against current origin/main on 2026-04-29 at fa8a7d70ee (docs: fix clawsweeper skill metadata).

Parity workflow is mixed current/stale

.github/workflows/parity-gate.yml currently has:

  • job/workflow text still referring to OpenAI / Opus 4.6
  • OPENCLAW_CI_OPENAI_MODEL defaulting to openai/gpt-5.5
  • candidate lane still using --alt-model openai/gpt-5.4-alt
  • candidate output dir still .artifacts/qa-e2e/gpt54
  • baseline lane still using --model anthropic/claude-opus-4-6
  • baseline lane still using --alt-model anthropic/claude-sonnet-4-6
  • baseline output dir still .artifacts/qa-e2e/opus46
  • parity report baseline label still anthropic/claude-opus-4-6

QA lab defaults are partially updated

extensions/qa-lab/src/providers/live-frontier/catalog.ts now defaults the primary live frontier model to:

openai/gpt-5.5

extensions/qa-lab/src/providers/live-frontier/index.ts and model-selection.runtime.ts also know about openai/gpt-5.5.

However, the related parity/reporting surfaces still carry Opus 4.6 assumptions:

  • extensions/qa-lab/src/providers/live-frontier/parity.ts
  • extensions/qa-lab/src/providers/live-frontier/character-eval.ts
  • extensions/qa-lab/src/agentic-parity-report.test.ts
  • extensions/qa-lab/src/providers/mock-openai/server.ts

Provider support for Opus 4.7 already exists elsewhere

The Anthropic provider layer already includes anthropic/claude-opus-4-7 / claude-opus-4.7 mappings in core provider code. This issue is therefore about QA-lab/parity wiring drift, not missing Anthropic provider support.

Scenario metadata still targets Opus 4.6

The Anthropic Opus live smoke scenarios still describe and require Opus 4.6:

  • qa/scenarios/models/anthropic-opus-api-key-smoke.md
  • qa/scenarios/models/anthropic-opus-setup-token-smoke.md

These should be moved to Opus 4.7, or made family/parameter driven if exact latest-model names are expected to change frequently.
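
A family-level requirement could look roughly like the sketch below. This is only an illustration: the `ModelRequirement` shape, `parseModelRef`, and `meetsRequirement` are hypothetical names, and the real scenario metadata format may differ.

```typescript
// Hypothetical requirement shape; the actual scenario front matter may differ.
type ModelRequirement =
  | { kind: "exact"; model: string }
  | { kind: "family"; provider: string; family: string; minVersion?: [number, number] };

interface ParsedModelRef {
  provider: string;
  family: string;
  version: [number, number];
}

// Parse refs such as "anthropic/claude-opus-4-7" or "openai/gpt-5.5".
function parseModelRef(ref: string): ParsedModelRef | null {
  const match = /^([a-z0-9]+)\/([a-z-]+?)-(\d+)[-.](\d+)$/.exec(ref);
  if (!match) return null;
  const [, provider = "", family = "", major = "0", minor = "0"] = match;
  return { provider, family, version: [Number(major), Number(minor)] };
}

// True when `ref` is the exact model, or belongs to the required family
// at or above the optional minimum version.
function meetsRequirement(req: ModelRequirement, ref: string): boolean {
  if (req.kind === "exact") return req.model === ref;
  const parsed = parseModelRef(ref);
  if (!parsed) return false;
  if (parsed.provider !== req.provider || parsed.family !== req.family) return false;
  if (!req.minVersion) return true;
  const [major, minor] = parsed.version;
  const [minMajor, minMinor] = req.minVersion;
  return major > minMajor || (major === minMajor && minor >= minMinor);
}
```

With a requirement like `{ kind: "family", provider: "anthropic", family: "claude-opus", minVersion: [4, 7] }`, the smoke scenarios would accept `anthropic/claude-opus-4-7` and reject `anthropic/claude-opus-4-6` without naming an exact version in the scenario file.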

What still works

The harness is not dead. A recent upstream parity-gate run succeeded with the current mixed configuration.

Local focused validation also passed against current main:

pnpm exec vitest run --config test/vitest/vitest.extension-qa.config.ts \
  extensions/qa-lab/src/providers/mock-openai/server.test.ts \
  extensions/qa-lab/src/qa-gateway-config.test.ts \
  extensions/qa-lab/src/suite-planning.test.ts \
  extensions/qa-lab/src/agentic-parity-report.test.ts \
  extensions/qa-lab/src/scenario-catalog.test.ts

Result: 5 files, 132 tests passed.

Full extension QA unit lane also passed:

pnpm exec vitest run --config test/vitest/vitest.extension-qa.config.ts

Result: 63 files, 524 tests passed.

Private QA runtime build also passed:

OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm build

What needs hardening

The quick preflight path did not reliably work locally. Running mock preflight twice with isolated OpenClaw state failed on approval-turn-tool-followthrough with:

gateway timeout after 25000ms

Representative command shape:

env \
  HOME=/tmp/openclaw-origin-main-qa-home \
  OPENCLAW_HOME=/tmp/openclaw-origin-main-qa-home \
  OPENCLAW_STATE_DIR=/tmp/openclaw-origin-main-qa-state \
  OPENCLAW_CONFIG_PATH=/tmp/openclaw-origin-main-qa-home/openclaw.json \
  OPENCLAW_BUILD_PRIVATE_QA=1 \
  OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 \
  OPENCLAW_QA_SUITE_PROGRESS=1 \
  OPENAI_API_KEY= \
  ANTHROPIC_API_KEY= \
  OPENCLAW_LIVE_OPENAI_KEY= \
  OPENCLAW_LIVE_ANTHROPIC_KEY= \
  OPENCLAW_LIVE_GEMINI_KEY= \
  OPENCLAW_LIVE_SETUP_TOKEN_VALUE= \
  pnpm openclaw qa suite \
    --provider-mode mock-openai \
    --parity-pack agentic \
    --concurrency 1 \
    --model openai/gpt-5.5 \
    --alt-model openai/gpt-5.5-alt \
    --preflight \
    --output-dir .artifacts/qa-e2e/preflight

This does not look like an unknown-model failure. The full CI parity suite runs approval-turn-tool-followthrough later in the 12-scenario pack after the gateway/agent path is already warm, and it passes there. The preflight path runs it cold, with a short timeout, so it is not a reliable quick “does this even work?” sentinel.
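
One possible way to tolerate the cold first call is sketched below. It is an assumption-laden illustration, not the harness's actual code: `runWithColdStartRetry`, `withTimeout`, and the `GatewayTimeoutError` name tag are hypothetical, and the real gateway client may signal timeouts differently.

```typescript
// Hypothetical timeout marker; the message mirrors the failure seen in preflight.
function makeTimeoutError(timeoutMs: number): Error {
  const error = new Error(`gateway timeout after ${timeoutMs}ms`);
  error.name = "GatewayTimeoutError";
  return error;
}

function isTimeoutError(error: unknown): boolean {
  return error instanceof Error && error.name === "GatewayTimeoutError";
}

// Reject if `work` does not settle within `timeoutMs`.
// Note: the underlying work is abandoned, not cancelled.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(makeTimeoutError(timeoutMs)), timeoutMs);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

interface ColdRunOptions {
  coldTimeoutMs: number; // budget for the first, cold attempt
  warmTimeoutMs: number; // budget for each retry
  maxAttempts: number;
}

// Retry only on timeouts; any other error is treated as a real preflight failure.
async function runWithColdStartRetry<T>(
  run: () => Promise<T>,
  opts: ColdRunOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    const timeoutMs = attempt === 1 ? opts.coldTimeoutMs : opts.warmTimeoutMs;
    try {
      return await withTimeout(run(), timeoutMs);
    } catch (error) {
      if (!isTimeoutError(error)) throw error;
      lastError = error;
    }
  }
  throw lastError ?? new Error("maxAttempts must be at least 1");
}
```

Keeping `maxAttempts` small (for example 2) would preserve the cheapness of preflight while absorbing a single cold-start timeout.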

Desired target state

The QA lab should clearly and truthfully validate the current target comparison:

openai/gpt-5.5 candidate vs anthropic/claude-opus-4-7 baseline

All user-facing labels, report labels, mock models, scenario docs, artifact paths, and workflow names should either use the current exact model names or be renamed to generic stable names like openai-candidate / anthropic-baseline to avoid repeated drift.
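
As a rough sketch of the generic-name option, lane configuration could keep the exact model string in one place and name everything else after the lane. `Lane`, `LaneConfig`, `deriveMockAltModel`, and `buildLane` are hypothetical names for illustration, and the `<primary>-alt` rule is an assumption based on the mock alt models mentioned in this issue.

```typescript
type Lane = "openai-candidate" | "anthropic-baseline";

interface LaneConfig {
  lane: Lane;
  model: string;       // exact model ref, spelled out in one place only
  altModel: string;
  outputDir: string;   // named after the lane, not the model version
  reportLabel: string; // reports name the model that actually ran
}

// Assumption: mock alt models follow the "<primary>-alt" pattern.
function deriveMockAltModel(primary: string): string {
  return `${primary}-alt`;
}

function buildLane(
  lane: Lane,
  model: string,
  altModel: string = deriveMockAltModel(model),
): LaneConfig {
  return {
    lane,
    model,
    altModel,
    outputDir: `.artifacts/qa-e2e/${lane}`,
    reportLabel: model,
  };
}

const candidate = buildLane("openai-candidate", "openai/gpt-5.5");

// Whether the baseline alternate should move is still open;
// sonnet-4-6 is carried over unchanged here.
const baseline = buildLane(
  "anthropic-baseline",
  "anthropic/claude-opus-4-7",
  "anthropic/claude-sonnet-4-6",
);
```

With this shape, a future model bump touches only the two `buildLane` calls, and artifact paths never go stale.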

Implementation checklist

  • Update .github/workflows/parity-gate.yml:

    • rename job/workflow text from Opus 4.6 to Opus 4.7
    • use anthropic/claude-opus-4-7 for the baseline lane
    • decide whether anthropic/claude-sonnet-4-6 remains the correct alternate model or whether the alternate should also move
    • replace openai/gpt-5.4-alt with openai/gpt-5.5-alt, or derive the mock alt model from the primary model
    • rename .artifacts/qa-e2e/gpt54 and .artifacts/qa-e2e/opus46 to current or generic names
  • Update QA lab parity/reporting code:

    • extensions/qa-lab/src/providers/live-frontier/parity.ts
    • extensions/qa-lab/src/providers/live-frontier/character-eval.ts
    • extensions/qa-lab/src/agentic-parity-report.test.ts
    • any report title, baseline label, summary, fixture, and expected snapshot text that still says Opus 4.6 or GPT-5.4
  • Update mock provider fixtures/tests:

    • extensions/qa-lab/src/providers/mock-openai/server.ts
    • advertise claude-opus-4-7 in mock model lists where appropriate
    • keep compatibility aliases only where intentionally needed
    • ensure the mock provider variant resolver still maps openai/* and anthropic/* by provider family rather than brittle exact model strings
  • Update model scenario metadata/docs:

    • qa/scenarios/models/anthropic-opus-api-key-smoke.md
    • qa/scenarios/models/anthropic-opus-setup-token-smoke.md
    • move requiredModel and expected summaries from claude-opus-4-6 to claude-opus-4-7, or introduce a parameterized/family-level requirement if that is the preferred QA contract
  • Sweep for stale strings before opening the PR:

rg "gpt54|gpt-5\.4-alt|opus46|opus-4-6|Opus 4\.6|GPT-5\.4" \
  .github/workflows \
  extensions/qa-lab \
  qa/scenarios \
  test/helpers/auto-reply

  • Harden --preflight:
    • increase the first cold agent-run timeout for preflight, or
    • add a lightweight warmup call before approval-turn-tool-followthrough, or
    • make the gateway child-call timeout retryable for QA preflight when the gateway is healthy but the first agent RPC times out
    • whichever option is chosen, keep the preflight cheap; the point is to avoid paying for the full long parity gate just to discover obvious breakage
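
For the checklist item about resolving mock variants by provider family, a minimal sketch might look like the following. `MockVariant` and `resolveMockVariant` are hypothetical names; the actual resolver in extensions/qa-lab/src/providers/mock-openai/server.ts may be shaped differently.

```typescript
type MockVariant = "openai" | "anthropic";

// Resolve by the provider prefix of "<provider>/<model>" so that new model
// versions (gpt-5.5, claude-opus-4-7, ...) need no resolver change.
function resolveMockVariant(modelRef: string): MockVariant | null {
  const slash = modelRef.indexOf("/");
  if (slash <= 0) return null; // no provider prefix
  const provider = modelRef.slice(0, slash).toLowerCase();
  switch (provider) {
    case "openai":
      return "openai";
    case "anthropic":
      return "anthropic";
    default:
      return null; // unknown provider family
  }
}
```

Both `openai/gpt-5.5` and `openai/gpt-5.5-alt` resolve to the same variant here, so renaming the alt model would not require touching the resolver.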

Acceptance criteria

  • rg "gpt54|gpt-5\.4-alt|opus46|opus-4-6|Opus 4\.6|GPT-5\.4" only returns intentional compatibility aliases or historical comments with explicit justification.
  • pnpm exec vitest run --config test/vitest/vitest.extension-qa.config.ts passes.
  • OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm build passes.
  • Mock preflight passes with GPT-5.5 candidate naming:

pnpm openclaw qa suite \
  --provider-mode mock-openai \
  --parity-pack agentic \
  --concurrency 1 \
  --model openai/gpt-5.5 \
  --alt-model openai/gpt-5.5-alt \
  --preflight

  • The full parity gate runs candidate openai/gpt-5.5 against baseline anthropic/claude-opus-4-7 and produces a report that truthfully names those models.
  • The expensive 12-scenario parity suite remains mock-provider compatible and does not require live API keys in mock mode.
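
The first criterion could in principle be enforced mechanically rather than by eye. The sketch below reuses the same pattern as the rg sweep; the `qa-compat-alias:` marker is a made-up convention for flagging intentional leftovers, and `findUnjustifiedStale` is a hypothetical helper that takes file contents as an in-memory map.

```typescript
// Same alternatives as the rg sweep in this issue.
const STALE_PATTERN = /gpt54|gpt-5\.4-alt|opus46|opus-4-6|Opus 4\.6|GPT-5\.4/;

// Hypothetical convention: a stale string is allowed only when the same line
// carries this marker followed by a reason.
const ALLOW_MARKER = /qa-compat-alias:\s*\S+/;

interface Finding {
  file: string;
  line: number; // 1-based
  text: string;
}

// Return every stale line that lacks an explicit justification marker.
function findUnjustifiedStale(files: Record<string, string>): Finding[] {
  const findings: Finding[] = [];
  for (const [file, content] of Object.entries(files)) {
    content.split("\n").forEach((text, index) => {
      if (STALE_PATTERN.test(text) && !ALLOW_MARKER.test(text)) {
        findings.push({ file, line: index + 1, text: text.trim() });
      }
    });
  }
  return findings;
}
```

A CI step could fail when the returned list is non-empty, which would turn "intentional aliases only" into a checked property rather than a review convention.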

Why this matters

These QA gates are long-running and expensive enough that they need to be unambiguous. A green parity gate should mean “current OpenAI candidate vs current Anthropic Opus baseline passed,” not “GPT-5.5 passed against an older Opus 4.6 baseline while several labels still say GPT-5.4/Opus 4.6.”

A follow-up PR can be mostly mechanical if it updates the model constants, report labels, mock model fixtures, scenario metadata, artifact names, and preflight timeout behavior together.
