Update QA lab parity gate for GPT-5.5 vs Opus 4.7 and harden preflight

## Problem

The QA lab parity gate and related tests are only partially updated for the current frontier model targets as of 2026-04-29. The OpenAI candidate lane has moved to `openai/gpt-5.5`, but several workflow names, artifact paths, mock-model fixtures, reports, scenario docs, and the Anthropic baseline still encode GPT-5.4 / Opus 4.6 era assumptions.

That makes the expensive QA/parity gate harder to trust: it can be green while still validating the old Opus baseline, and stale labels like `gpt54` / `opus46` make it unclear what actually ran.

## Current upstream evidence

Checked against current `origin/main` on 2026-04-29 at `fa8a7d70ee` (`docs: fix clawsweeper skill metadata`).

### Parity workflow is mixed current/stale

`.github/workflows/parity-gate.yml` currently has:

- job/workflow text still referring to `OpenAI / Opus 4.6`
- `OPENCLAW_CI_OPENAI_MODEL` defaulting to `openai/gpt-5.5` (already current)
- candidate lane still using `--alt-model openai/gpt-5.4-alt`
- candidate output dir still `.artifacts/qa-e2e/gpt54`
- baseline lane still using `--model anthropic/claude-opus-4-6`
- baseline lane still using `--alt-model anthropic/claude-sonnet-4-6`
- baseline output dir still `.artifacts/qa-e2e/opus46`
- parity report baseline label still `anthropic/claude-opus-4-6`

### QA lab defaults are partially updated

`extensions/qa-lab/src/providers/live-frontier/catalog.ts` now defaults the primary live frontier model to:

```text
openai/gpt-5.5
```

`extensions/qa-lab/src/providers/live-frontier/index.ts` and `model-selection.runtime.ts` also know about `openai/gpt-5.5`.

However, the related parity/reporting surfaces still carry Opus 4.6 assumptions:

- `extensions/qa-lab/src/providers/live-frontier/parity.ts`
- `extensions/qa-lab/src/providers/live-frontier/character-eval.ts`
- `extensions/qa-lab/src/agentic-parity-report.test.ts`
- `extensions/qa-lab/src/providers/mock-openai/server.ts`

### Provider support for Opus 4.7 already exists elsewhere

The Anthropic provider layer already includes `anthropic/claude-opus-4-7` / `claude-opus-4.7` mappings in core provider code. This issue is therefore about QA-lab/parity wiring drift, not missing Anthropic provider support.

### Scenario metadata still targets Opus 4.6

The Anthropic Opus live smoke scenarios still describe and require Opus 4.6:

- `qa/scenarios/models/anthropic-opus-api-key-smoke.md`
- `qa/scenarios/models/anthropic-opus-setup-token-smoke.md`

These should be moved to Opus 4.7, or made family/parameter driven if exact latest-model names are expected to change frequently.

## What still works

The harness is not dead. A recent upstream parity-gate run succeeded with the current mixed configuration:

- Run: https://github.com/openclaw/openclaw/actions/runs/25100029375
- Candidate: `openai/gpt-5.5`
- Baseline: `anthropic/claude-opus-4-6`
- Result: candidate passed 12/12, baseline passed 12/12, parity verdict passed

Local focused validation also passed against current main:

```bash
pnpm exec vitest run --config test/vitest/vitest.extension-qa.config.ts \
  extensions/qa-lab/src/providers/mock-openai/server.test.ts \
  extensions/qa-lab/src/qa-gateway-config.test.ts \
  extensions/qa-lab/src/suite-planning.test.ts \
  extensions/qa-lab/src/agentic-parity-report.test.ts \
  extensions/qa-lab/src/scenario-catalog.test.ts
```

Result: 5 files, 132 tests passed.

Full extension QA unit lane also passed:

```bash
pnpm exec vitest run --config test/vitest/vitest.extension-qa.config.ts
```

Result: 63 files, 524 tests passed.

Private QA runtime build also passed:

```bash
OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm build
```

## What needs hardening

The quick preflight path did not reliably work locally. Running mock preflight twice with isolated OpenClaw state failed on `approval-turn-tool-followthrough` with:

```text
gateway timeout after 25000ms
```

Representative command shape:

```bash
env \
  HOME=/tmp/openclaw-origin-main-qa-home \
  OPENCLAW_HOME=/tmp/openclaw-origin-main-qa-home \
  OPENCLAW_STATE_DIR=/tmp/openclaw-origin-main-qa-state \
  OPENCLAW_CONFIG_PATH=/tmp/openclaw-origin-main-qa-home/openclaw.json \
  OPENCLAW_BUILD_PRIVATE_QA=1 \
  OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 \
  OPENCLAW_QA_SUITE_PROGRESS=1 \
  OPENAI_API_KEY= \
  ANTHROPIC_API_KEY= \
  OPENCLAW_LIVE_OPENAI_KEY= \
  OPENCLAW_LIVE_ANTHROPIC_KEY= \
  OPENCLAW_LIVE_GEMINI_KEY= \
  OPENCLAW_LIVE_SETUP_TOKEN_VALUE= \
  pnpm openclaw qa suite \
    --provider-mode mock-openai \
    --parity-pack agentic \
    --concurrency 1 \
    --model openai/gpt-5.5 \
    --alt-model openai/gpt-5.5-alt \
    --preflight \
    --output-dir .artifacts/qa-e2e/preflight
```

This does not look like an unknown-model failure. The full CI parity suite runs `approval-turn-tool-followthrough` later in the 12-scenario pack after the gateway/agent path is already warm, and it passes there. The preflight path runs it cold, with a short timeout, so it is not a reliable quick “does this even work?” sentinel.
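A minimal sketch of one way to tolerate a cold first call: wrap the agent RPC in a timeout and retry only when that timeout fires. Every name here (`raceTimeout`, `callWithColdStartRetry`, the parameters) is hypothetical and not taken from the OpenClaw codebase; the real gateway client may expose timeouts differently.

```typescript
// Hypothetical helpers for illustration; not the OpenClaw gateway API.
const TIMED_OUT = Symbol("timed-out");

// Resolve with TIMED_OUT if `work` has not settled within `ms`.
// The underlying work is not cancelled; it is only no longer awaited.
async function raceTimeout<T>(
  work: Promise<T>,
  ms: number,
): Promise<T | typeof TIMED_OUT> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<typeof TIMED_OUT>((resolve) => {
    timer = setTimeout(() => resolve(TIMED_OUT), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Retry `call` only when it times out. Any other failure is rethrown
// immediately so real breakage still fails the preflight fast.
async function callWithColdStartRetry<T>(
  call: () => Promise<T>,
  timeoutMs: number,
  maxAttempts: number,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await raceTimeout(call(), timeoutMs);
    if (result !== TIMED_OUT) return result as T;
  }
  throw new Error(
    `gateway timeout after ${timeoutMs}ms (${maxAttempts} attempts)`,
  );
}
```

With `timeoutMs` at the 25000 ms seen in the failure and `maxAttempts` at 2, the worst case adds one extra timeout window to the preflight, which keeps it far cheaper than the full 12-scenario pack.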

## Desired target state

The QA lab should clearly and truthfully validate the current target comparison:

```text
openai/gpt-5.5 candidate vs anthropic/claude-opus-4-7 baseline
```

All user-facing labels, report labels, mock models, scenario docs, artifact paths, and workflow names should either use the current exact model names or be renamed to generic stable names like `openai-candidate` / `anthropic-baseline` to avoid repeated drift.
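As a rough illustration of the generic-name option, artifact paths could be derived from the provider family and the lane role, while the exact model reference is kept only for reporting. The `ParityLane` type and `buildParityLane` helper are hypothetical and do not exist in the repository.

```typescript
// Hypothetical types and helper for illustration only.
type LaneRole = "candidate" | "baseline";

interface ParityLane {
  role: LaneRole;
  model: string; // exact model ref, reported as-is
  outputDir: string; // stable path that does not encode a model version
}

function buildParityLane(role: LaneRole, model: string): ParityLane {
  const slash = model.indexOf("/");
  if (slash <= 0) {
    throw new Error(`expected "<provider>/<model>", got "${model}"`);
  }
  const provider = model.slice(0, slash);
  return { role, model, outputDir: `.artifacts/qa-e2e/${provider}-${role}` };
}

const candidate = buildParityLane("candidate", "openai/gpt-5.5");
const baseline = buildParityLane("baseline", "anthropic/claude-opus-4-7");
console.log(candidate.outputDir, baseline.outputDir);
```

Under this scheme a later model bump changes only the `model` field; the `openai-candidate` and `anthropic-baseline` directories stay put.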

## Implementation checklist

- Update `.github/workflows/parity-gate.yml`:
  - rename job/workflow text from Opus 4.6 to Opus 4.7
  - use `anthropic/claude-opus-4-7` for the baseline lane
  - decide whether `anthropic/claude-sonnet-4-6` remains the correct alternate model or whether the alternate should also move
  - replace `openai/gpt-5.4-alt` with `openai/gpt-5.5-alt`, or derive the mock alt model from the primary model
  - rename `.artifacts/qa-e2e/gpt54` and `.artifacts/qa-e2e/opus46` to current or generic names

- Update QA lab parity/reporting code:
  - `extensions/qa-lab/src/providers/live-frontier/parity.ts`
  - `extensions/qa-lab/src/providers/live-frontier/character-eval.ts`
  - `extensions/qa-lab/src/agentic-parity-report.test.ts`
  - any report title, baseline label, summary, fixture, and expected snapshot text that still says Opus 4.6 or GPT-5.4

- Update mock provider fixtures/tests:
  - `extensions/qa-lab/src/providers/mock-openai/server.ts`
  - advertise `claude-opus-4-7` in mock model lists where appropriate
  - keep compatibility aliases only where intentionally needed
  - ensure the mock provider variant resolver still maps `openai/*` and `anthropic/*` by provider family rather than brittle exact model strings

- Update model scenario metadata/docs:
  - `qa/scenarios/models/anthropic-opus-api-key-smoke.md`
  - `qa/scenarios/models/anthropic-opus-setup-token-smoke.md`
  - move `requiredModel` and expected summaries from `claude-opus-4-6` to `claude-opus-4-7`, or introduce a parameterized/family-level requirement if that is the preferred QA contract

- Sweep for stale strings before opening the PR:

```bash
rg "gpt54|gpt-5\.4-alt|opus46|opus-4-6|Opus 4\.6|GPT-5\.4" \
  .github/workflows \
  extensions/qa-lab \
  qa/scenarios \
  test/helpers/auto-reply
```

- Harden `--preflight`:
  - increase the first cold agent-run timeout for preflight, or
  - add a lightweight warmup call before `approval-turn-tool-followthrough`, or
  - make the gateway child-call timeout retryable for QA preflight when the gateway is healthy but the first agent RPC times out
  - keep the preflight cheap; the point is to avoid paying for the full long parity gate just to discover obvious breakage
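For the mock-provider item in the checklist above, resolving the variant by provider family could look roughly like the sketch below. `MockVariant` and `resolveMockVariant` are hypothetical names; the actual resolver in `mock-openai/server.ts` may be shaped differently.

```typescript
// Hypothetical resolver for illustration; the real one may differ.
type MockVariant = "openai" | "anthropic";

// Pick the mock variant from the provider prefix only, so a model
// version bump (4-6 to 4-7, 5.4 to 5.5) needs no resolver change.
function resolveMockVariant(modelRef: string): MockVariant {
  const slash = modelRef.indexOf("/");
  const provider = slash > 0 ? modelRef.slice(0, slash) : "";
  if (provider === "openai" || provider === "anthropic") {
    return provider;
  }
  throw new Error(`unsupported provider family in "${modelRef}"`);
}

console.log(resolveMockVariant("openai/gpt-5.5-alt"));
console.log(resolveMockVariant("anthropic/claude-opus-4-7"));
```

Because only the prefix is inspected, old and new model names resolve identically, and any compatibility aliases can live in the mock model list rather than in the resolver.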

## Acceptance criteria

- `rg "gpt54|gpt-5\.4-alt|opus46|opus-4-6|Opus 4\.6|GPT-5\.4"`, run over the same directories as the sweep in the implementation checklist, only returns intentional compatibility aliases or historical comments with explicit justification.
- `pnpm exec vitest run --config test/vitest/vitest.extension-qa.config.ts` passes.
- `OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm build` passes.
- Mock preflight passes with GPT-5.5 candidate naming:

```bash
pnpm openclaw qa suite \
  --provider-mode mock-openai \
  --parity-pack agentic \
  --concurrency 1 \
  --model openai/gpt-5.5 \
  --alt-model openai/gpt-5.5-alt \
  --preflight
```

- The full parity gate runs candidate `openai/gpt-5.5` against baseline `anthropic/claude-opus-4-7` and produces a report that truthfully names those models.
- The expensive 12-scenario parity suite remains mock-provider compatible and does not require live API keys in mock mode.
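If the first criterion is ever automated, one possible shape is a filter that flags stale model strings unless the line carries an explicit justification marker. This is only a sketch: the `compat-alias:` marker and the `unjustifiedStaleLines` helper are assumptions, not existing repository conventions. The pattern mirrors the `rg` expression above.

```typescript
// Same alternatives as the rg sweep. No `g` flag, so test() is stateless.
const STALE = /gpt54|gpt-5\.4-alt|opus46|opus-4-6|Opus 4\.6|GPT-5\.4/;

// Hypothetical marker an author would add next to an intentional alias.
const JUSTIFIED = "compat-alias:";

// Return the lines that mention a stale model name without justification.
function unjustifiedStaleLines(text: string): string[] {
  return text
    .split("\n")
    .filter((line) => STALE.test(line) && !line.includes(JUSTIFIED));
}

const sample = [
  'const baseline = "anthropic/claude-opus-4-7";',
  'const legacy = "anthropic/claude-opus-4-6"; // compat-alias: old reports',
  'const outputDir = ".artifacts/qa-e2e/gpt54";',
].join("\n");

console.log(unjustifiedStaleLines(sample));
```

For the sample above, only the `gpt54` output-directory line is reported; the annotated Opus 4.6 alias is allowed through.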

## Why this matters

These QA gates are long-running and expensive enough that they need to be unambiguous. A green parity gate should mean “current OpenAI candidate vs current Anthropic Opus baseline passed,” not “GPT-5.5 passed against an older Opus 4.6 baseline while several labels still say GPT-5.4/Opus 4.6.”

A follow-up PR can be mostly mechanical if it updates the model constants, report labels, mock model fixtures, scenario metadata, artifact names, and preflight timeout behavior together.

