fix(qa-lab): bump parity baseline to Opus 4.7 / GPT-5.5 and lengthen approval-followthrough timeouts #79347

Closed
100yenadmin wants to merge 1 commit into openclaw:main from electricsheephq:qa-lab-parity-opus47-gpt55-bump

Conversation

@100yenadmin

Contributor

Summary

Slim follow-up to closed PR #74290 picking up only the parts that are still load-bearing on current main:

  • Bump the live mock-openai parity baseline from anthropic/claude-opus-4-6 / anthropic/claude-sonnet-4-6 to claude-opus-4-7 / claude-sonnet-4-7, and the candidate alt from openai/gpt-5.4-alt to openai/gpt-5.5-alt, in openclaw-release-checks.yml:708-712,782 and qa-live-transports-convex.yml:190,199-200,210. The Opus 4.7 / GPT-5.5 defaults are already active elsewhere on main (CHANGELOG.md:803, docs/providers/anthropic.md:108,262, openclaw-live-and-e2e-checks-reusable.yml:1894). The parity baseline was the last surface still one model-generation behind.
  • Raise the four liveTurnTimeoutMs fallbacks in qa/scenarios/runtime/approval-turn-tool-followthrough.md from 20s/30s to 60s. The mock provider's resolveTurnTimeoutMs returns the fallback unchanged (extensions/qa-lab/src/providers/shared/mock-provider-definition.ts:36), so cold mock-gateway parity runs were timing out exactly where the retired parity-gate.yml env-var comments said they would.
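
A minimal sketch of why the fallback value is the effective budget on mock runs, assuming (as described above) that the mock provider's `resolveTurnTimeoutMs` hands back `fallbackMs` untouched. The request type, the `resolveMockTurnTimeoutMs` name, and the helper are hypothetical and are not copied from `mock-provider-definition.ts`:

```typescript
// Hypothetical request shape; the real signature in the repo may differ.
type TurnTimeoutRequest = {
  fallbackMs: number;
};

// Assumption: mirrors the behavior described in this PR, where the mock
// provider returns the scenario's fallback unchanged.
function resolveMockTurnTimeoutMs(req: TurnTimeoutRequest): number {
  return req.fallbackMs;
}

// The scenario already waits up to 60s for the gateway to become healthy.
const GATEWAY_HEALTHY_BUDGET_MS = 60_000;

// True when a turn budget is tighter than the gateway health budget,
// which is the mismatch this PR removes.
function isShorterThanHealthBudget(fallbackMs: number): boolean {
  return resolveMockTurnTimeoutMs({ fallbackMs }) < GATEWAY_HEALTHY_BUDGET_MS;
}

const oldFallbacksMs = [20_000, 30_000];
const newFallbackMs = 60_000;

for (const ms of oldFallbacksMs) {
  console.log(`${ms}ms shorter than health budget:`, isShorterThanHealthBudget(ms));
}
console.log(`${newFallbackMs}ms shorter than health budget:`, isShorterThanHealthBudget(newFallbackMs));
```

Because nothing scales the value on the mock path, the literal in the scenario file is the whole budget, so raising it there is the only lever for cold-start runs.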

#74290 was the natural home for this but is closed because most of its diff was against .github/workflows/parity-gate.yml, which b9eb31b54c (#74622 'ci: fold parity into QA release validation') deleted on main. Rebasing #74290 would have conflicted on every workflow file and many fixtures already updated. This follow-up keeps just the bumps that survived.

Refs #74290 / #74262.

What this PR does NOT touch

  • The retired .github/workflows/parity-gate.yml (deleted on main by ci: fold parity into QA release validation #74622).
  • Internal artifact directory names gpt54/opus46 (cosmetic; renaming would require updating consumer paths and is out of scope for a slim bump).
  • The Discord QA scenario lane (qa-live-transports-convex.yml:550-551) and the release-validation lane (openclaw-release-checks.yml:487) that intentionally pin openai/gpt-5.4 for separate purposes.

Real behavior proof

  • Behavior addressed: parity report on release/nightly should compare current Opus 4.7 against current GPT-5.5; cold-startup approval-turn-tool-followthrough should not flake on the 20s/30s budgets.
  • Tested locally: ran pnpm exec vitest run extensions/qa-lab/src/providers/mock-openai/server.test.ts extensions/qa-lab/src/qa-gateway-config.test.ts extensions/qa-lab/src/suite-planning.test.ts extensions/qa-lab/src/cli.runtime.test.ts (all green); pnpm check:test-types clean; pnpm exec oxfmt --check on all four modified files clean.
  • Workflow YAML changes are syntax-only (string substitution) — no semantic shape change. Validated visually that all consumer paths (--baseline-summary, --baseline-label, output_dir, --candidate-summary) line up after the rename.
  • The 60s timeout matches the existing 60s waitForGatewayHealthy budget at line 48 of the same scenario, so the cold-startup window is now consistent across the file.

Test plan

  • pnpm exec oxfmt --check --threads=1 on all four modified files
  • pnpm check:test-types clean
  • Targeted vitest passes on the qa-lab fixtures most likely to break on baseline-name changes
  • CI parity job runs successfully on at least one nightly cron after merge

…approval-turn-tool-followthrough timeouts

Carries forward the surface-bump portion of #74290 (closed in favor of
this slim follow-up since the parity-gate.yml workflow file the original
PR also touched was retired by #74622 'ci: fold parity into QA release
validation').

The mock-openai parity lanes that now live in
`openclaw-release-checks.yml` and `qa-live-transports-convex.yml`
were still pinned to `anthropic/claude-opus-4-6` /
`anthropic/claude-sonnet-4-6` for the baseline and
`openai/gpt-5.4-alt` for the candidate alt model. That left the parity
baseline one model-generation behind the active Opus 4.7 / GPT-5.5
defaults already used elsewhere on main (CHANGELOG.md:803,
docs/providers/anthropic.md:108, openclaw-live-and-e2e-checks-reusable.yml:1894).

The `approval-turn-tool-followthrough` scenario was using 20s/30s
`liveTurnTimeoutMs` fallbacks that timed out on cold mock-gateway
parity runs (the deleted `parity-gate.yml` env-var comments described
exactly this scenario flake). Bumping all four turn fallbacks to 60s
matches what the mock provider's `resolveTurnTimeoutMs` returns for
fallbackMs (it returns the fallback unchanged) so cold starts have
breathing room before the approval/follow-through chain has to
complete.

This PR does NOT touch:
- The retired `.github/workflows/parity-gate.yml` (deleted on main
  by #74622)
- Internal artifact directory names `gpt54`/`opus46` (cosmetic, out
  of scope for a slim follow-up)
- The Discord QA scenario lane and the release-validation lane that
  intentionally pin `openai/gpt-5.4` (separate concerns)

Refs #74290.
@openclaw-barnacle (bot) added the labels "triage: needs-real-behavior-proof" (Candidate: external PR needs after-fix proof from a real setup.) and "size: XS" on May 8, 2026
@clawsweeper

clawsweeper Bot commented May 8, 2026

Contributor

Codex review: needs real behavior proof before merge.

Summary
The PR updates QA parity workflow model strings and changelog entries, and raises the approval-turn-tool-followthrough scenario fallbacks to 60 seconds.

Reproducibility: Do we have a high-confidence way to reproduce the issue? Source inspection clearly reproduces the stale workflow model strings and short scenario fallbacks on current main, but I did not run the cold mock-preflight timeout path in this read-only review.

Real behavior proof
Needs real behavior proof before merge: The PR body lists tests/typecheck/format and visual inspection, but no after-fix parity or mock-preflight run output; an external PR needs real behavior proof before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, ask a maintainer to comment @clawsweeper re-review.

Next step before merge
Contributor action is needed: fix the unsupported Sonnet 4.7 alternate or add catalog/test support, then provide real after-fix runtime proof that automation cannot supply for them.

Security
Cleared: The diff changes workflow argument strings, scenario timeout literals, and changelog text without adding actions, permissions, secrets, dependency sources, or code execution paths.

Review findings

  • [P2] Keep the Anthropic alternate on a supported model — .github/workflows/openclaw-release-checks.yml:712
Review details

Best possible solution:

Land a narrow QA parity update that bumps the supported primary Opus baseline and GPT alternate, keeps or adds only supported Anthropic alternate catalog entries, raises the timeout budget, and includes a real mock-preflight or parity-run proof artifact.

Do we have a high-confidence way to reproduce the issue?

Source inspection clearly reproduces the stale workflow model strings and short scenario fallbacks on current main, but I did not run the cold mock-preflight timeout path in this read-only review.

Is this the best way to solve the issue?

Not yet: the Opus/GPT bump and 60s timeout are narrow, but the PR should not introduce claude-sonnet-4-7 unless the current provider/mock catalogs and tests support it, and it still needs real behavior proof.

Full review comments:

  • [P2] Keep the Anthropic alternate on a supported model — .github/workflows/openclaw-release-checks.yml:712
    This changes the parity alternate to anthropic/claude-sonnet-4-7, but current main has no claude-sonnet-4-7 provider, mock catalog, alias, or test fixture; the Anthropic plugin and QA mock catalogs still expose Sonnet 4.6. A scenario that switches to the alternate model would now record a model OpenClaw does not otherwise support, so keep anthropic/claude-sonnet-4-6 here or update the provider/mock catalogs and tests together. The same fix is needed in qa-live-transports-convex.yml.
    Confidence: 0.86

Overall correctness: patch is incorrect
Overall confidence: 0.84

Acceptance criteria:

  • pnpm exec oxfmt --check --threads=1 .github/workflows/openclaw-release-checks.yml .github/workflows/qa-live-transports-convex.yml CHANGELOG.md qa/scenarios/runtime/approval-turn-tool-followthrough.md
  • pnpm test extensions/qa-lab/src/qa-gateway-config.test.ts extensions/qa-lab/src/providers/mock-openai/server.test.ts extensions/qa-lab/src/suite-planning.test.ts extensions/qa-lab/src/cli.runtime.test.ts
  • OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa suite --provider-mode mock-openai --parity-pack agentic --concurrency 1 --model openai/gpt-5.5 --alt-model openai/gpt-5.5-alt --preflight

What I checked:

  • Current release parity workflow is still on the old comparison: Current main runs the release QA parity candidate alt as openai/gpt-5.4-alt, the baseline as anthropic/claude-opus-4-6, and the baseline alt as anthropic/claude-sonnet-4-6. (.github/workflows/openclaw-release-checks.yml:708, 3c6dd9fcb208)
  • Current Convex QA parity workflow is also stale: Current main's Convex QA parity lane uses openai/gpt-5.4-alt, anthropic/claude-opus-4-6, anthropic/claude-sonnet-4-6, and reports the Opus 4.6 label. (.github/workflows/qa-live-transports-convex.yml:190, 3c6dd9fcb208)
  • Timeout fallback mismatch exists on current main: The scenario waits 60s for gateway health, then uses 20s/30s fallbacks for the first agent turn, outbound wait, approval turn, and follow-up condition. (qa/scenarios/runtime/approval-turn-tool-followthrough.md:57, 3c6dd9fcb208)
  • Mock Anthropic catalog does not include Sonnet 4.7: The mock provider config only advertises claude-opus-4-6 and claude-sonnet-4-6 for Anthropic, while the PR changes workflow alternate arguments to anthropic/claude-sonnet-4-7. (extensions/qa-lab/src/providers/shared/mock-model-config.ts:72, 3c6dd9fcb208)
  • Current Anthropic plugin recognizes Opus 4.7 but not Sonnet 4.7: The Anthropic runtime constants include claude-opus-4-7 and claude-sonnet-4-6; rg found no claude-sonnet-4-7 or sonnet-4-7 support anywhere in the repo. (extensions/anthropic/register.runtime.ts:60, 3c6dd9fcb208)
  • PR proof is test-only: The PR body reports targeted Vitest, test types, formatter checks, and visual YAML inspection, but does not include a post-change mock preflight, release parity run, logs, screenshot, recording, or linked artifact showing the changed behavior. (23dc8ce98801)
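
The catalog mismatch noted above could be caught mechanically with a check along these lines. This is only a sketch: the `supportedModels` set is an illustrative stand-in built from the model IDs the review says the Anthropic runtime constants include, and `findUnsupportedModels` is a hypothetical helper, not an existing function in the repo:

```typescript
// Illustrative catalog only: the two Anthropic IDs the review reports as
// supported, written with the provider prefix used in workflow arguments.
const supportedModels: ReadonlySet<string> = new Set([
  "anthropic/claude-opus-4-7",
  "anthropic/claude-sonnet-4-6",
]);

// Hypothetical helper: returns every model ref that the catalog does not list.
function findUnsupportedModels(
  refs: readonly string[],
  catalog: ReadonlySet<string>,
): string[] {
  return refs.filter((ref) => !catalog.has(ref));
}

// Baseline and baseline-alt arguments as proposed by this PR.
const proposedBaselineRefs = [
  "anthropic/claude-opus-4-7",
  "anthropic/claude-sonnet-4-7",
];

const unsupported = findUnsupportedModels(proposedBaselineRefs, supportedModels);
console.log(unsupported); // ["anthropic/claude-sonnet-4-7"]
```

Under these assumptions only the Sonnet 4.7 alternate is flagged, which matches the P2 finding: either keep `anthropic/claude-sonnet-4-6` or extend the catalogs and tests in the same change.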

Likely related people:

  • vincentkoc: Merged ci: fold parity into QA release validation #74622, which folded the parity gate into QA release validation and touched the same workflow surface this PR now adjusts. (role: recent maintainer; confidence: high; commits: b9eb31b54cfa; files: .github/workflows/openclaw-release-checks.yml, .github/workflows/qa-live-transports-convex.yml)
  • 100yenadmin: Current changelog credits the QA/parity gate work in benchmarks: add first-wave GPT-5.4 vs Opus 4.6 parity harness #64441 to this contributor, and the related open issue plus closed follow-up PR focus on the same QA-lab parity model drift. (role: original parity contributor / adjacent owner; confidence: medium; commits: PR 64441 (referenced in CHANGELOG.md); files: CHANGELOG.md, extensions/qa-lab/src/agentic-parity-report.ts, extensions/qa-lab/src/providers/mock-openai/server.ts)
  • Shakker: Current checkout blame for the affected parity workflow and scenario lines points to commit bc5a4bdb47, which recently touched those files on main. (role: recent line maintainer; confidence: low; commits: bc5a4bdb4763; files: .github/workflows/openclaw-release-checks.yml, .github/workflows/qa-live-transports-convex.yml, qa/scenarios/runtime/approval-turn-tool-followthrough.md)

Remaining risk / open question:

  • The PR has no real parity or mock-preflight run proof, so the changed workflow arguments are not demonstrated in the runtime path.
  • Current repo catalogs do not expose claude-sonnet-4-7; using that name may make parity output claim a model OpenClaw does not otherwise support.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 3c6dd9fcb208.

@steipete

steipete commented May 9, 2026

Contributor

Thanks @100yenadmin. I could not push the rebased maintainer fixups back to the fork (403), so I landed this via maintainer replacement #79698.

It keeps your Opus 4.7 / GPT-5.5 parity refresh and 60s approval-turn timeout change, adds changelog contributor credit, and includes mock-openai proof for the changed approval-turn scenario on both candidate and baseline refs.

Landed in 44d7d6f.
