fix(qa-lab): bump parity baseline to Opus 4.7 / GPT-5.5 and lengthen approval-followthrough timeouts #79347
…approval-turn-tool-followthrough timeouts

Carries forward the surface-bump portion of #74290 (closed in favor of this slim follow-up since the parity-gate.yml workflow file the original PR also touched was retired by #74622 'ci: fold parity into QA release validation').

The mock-openai parity lanes that now live in `openclaw-release-checks.yml` and `qa-live-transports-convex.yml` were still pinned to `anthropic/claude-opus-4-6` / `anthropic/claude-sonnet-4-6` for the baseline and `openai/gpt-5.4-alt` for the candidate alt model. That left the parity baseline one model generation behind the active Opus 4.7 / GPT-5.5 defaults already used elsewhere on main (CHANGELOG.md:803, docs/providers/anthropic.md:108, openclaw-live-and-e2e-checks-reusable.yml:1894).

The `approval-turn-tool-followthrough` scenario was using 20s/30s `liveTurnTimeoutMs` fallbacks that timed out on cold mock-gateway parity runs (the deleted `parity-gate.yml` env-var comments described exactly this scenario flake). Bumping all four turn fallbacks to 60s matches what the mock provider's `resolveTurnTimeoutMs` returns for `fallbackMs` (it returns the fallback unchanged), so cold starts have breathing room before the approval/follow-through chain has to complete.

This PR does NOT touch:
- The retired `.github/workflows/parity-gate.yml` (deleted on main by #74622)
- Internal artifact directory names `gpt54`/`opus46` (cosmetic, out of scope for a slim follow-up)
- The Discord QA scenario lane and the release-validation lane that intentionally pin `openai/gpt-5.4` (separate concerns)

Refs #74290.
Codex review: needs real behavior proof before merge.

Reproducibility: Do we have a high-confidence way to reproduce the issue? Source inspection clearly reproduces the stale workflow model strings and short scenario fallbacks on current main, but I did not run the cold mock-preflight timeout path in this read-only review.
Review details

Best possible solution: Land a narrow QA parity update that bumps the supported primary Opus baseline and GPT alternate, keeps or adds only supported Anthropic alternate catalog entries, raises the timeout budget, and includes a real mock-preflight or parity-run proof artifact.

Do we have a high-confidence way to reproduce the issue? Source inspection clearly reproduces the stale workflow model strings and short scenario fallbacks on current main, but I did not run the cold mock-preflight timeout path in this read-only review.

Is this the best way to solve the issue? Not yet: the Opus/GPT bump and 60s timeout are narrow, but the PR should not introduce …

Overall correctness: patch is incorrect
Codex review notes: model gpt-5.5, reasoning high; reviewed against 3c6dd9fcb208.

Thanks @100yenadmin. I could not push the rebased maintainer fixups back to the fork (403), so I landed this via maintainer replacement #79698. It keeps your Opus 4.7 / GPT-5.5 parity refresh and 60s approval-turn timeout change, adds changelog contributor credit, and includes mock-openai proof for the changed approval-turn scenario on both candidate and baseline refs. Landed in 44d7d6f.
Summary
Slim follow-up to closed PR #74290 picking up only the parts that are still load-bearing on current main:
- Bumps the mock-openai parity baseline from `anthropic/claude-opus-4-6` / `anthropic/claude-sonnet-4-6` to `claude-opus-4-7` / `claude-sonnet-4-7`, and the candidate alt from `openai/gpt-5.4-alt` to `openai/gpt-5.5-alt`, in `openclaw-release-checks.yml:708-712,782` and `qa-live-transports-convex.yml:190,199-200,210`. The Opus 4.7 / GPT-5.5 defaults are already active elsewhere on main (`CHANGELOG.md:803`, `docs/providers/anthropic.md:108,262`, `openclaw-live-and-e2e-checks-reusable.yml:1894`). The parity baseline was the last surface still one model generation behind.
- Raises the four `liveTurnTimeoutMs` fallbacks in `qa/scenarios/runtime/approval-turn-tool-followthrough.md` from 20s/30s to 60s. The mock provider's `resolveTurnTimeoutMs` returns the fallback unchanged (`extensions/qa-lab/src/providers/shared/mock-provider-definition.ts:36`), so cold mock-gateway parity runs were timing out exactly where the retired `parity-gate.yml` env-var comments said they would.

#74290 was the natural home for this but is closed because most of its diff was against `.github/workflows/parity-gate.yml`, which `b9eb31b54c` (#74622 'ci: fold parity into QA release validation') deleted on main. Rebasing #74290 would have conflicted on every workflow file and on many fixtures that had already been updated. This follow-up keeps just the bumps that survived.

Refs #74290 / #74262.
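The timeout reasoning in the summary can be illustrated with a minimal sketch. It assumes only what the PR states, namely that the mock provider's `resolveTurnTimeoutMs` returns the fallback unchanged. The input shape, the `fitsBudget` helper, and the 45s cold-start duration are hypothetical and chosen for illustration; they are not taken from `mock-provider-definition.ts` or from a measured run.

```typescript
// Hypothetical input shape; the real signature in the repo may differ.
interface TurnTimeoutInput {
  fallbackMs: number;
}

// Mock-style resolver: passes the scenario's fallback through unchanged,
// which is the behavior the PR description attributes to the mock provider.
function resolveTurnTimeoutMs(input: TurnTimeoutInput): number {
  return input.fallbackMs;
}

// Hypothetical helper: does a turn lasting `elapsedMs` fit the resolved budget?
function fitsBudget(elapsedMs: number, fallbackMs: number): boolean {
  return elapsedMs <= resolveTurnTimeoutMs({ fallbackMs });
}

// Illustrative cold-start turn duration (an assumption, not a measurement).
const coldTurnMs = 45000;

for (const fallbackMs of [20000, 30000, 60000]) {
  console.log(`${fallbackMs}ms budget fits cold turn: ${fitsBudget(coldTurnMs, fallbackMs)}`);
}
```

Because the mock resolver adds no headroom of its own, the scenario's fallback value is the entire budget, which is why the change is made in the scenario file.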
What this PR does NOT touch
- The retired `.github/workflows/parity-gate.yml` (deleted on main by #74622, 'ci: fold parity into QA release validation').
- Internal artifact directory names `gpt54`/`opus46` (cosmetic; renaming would require updating consumer paths and is out of scope for a slim bump).
- The Discord QA scenario lane (`qa-live-transports-convex.yml:550-551`) and the release-validation lane (`openclaw-release-checks.yml:487`) that intentionally pin `openai/gpt-5.4` for separate purposes.

Real behavior proof
- `approval-turn-tool-followthrough` should not flake on the 20s/30s budgets.
- `pnpm exec vitest run extensions/qa-lab/src/providers/mock-openai/server.test.ts extensions/qa-lab/src/qa-gateway-config.test.ts extensions/qa-lab/src/suite-planning.test.ts extensions/qa-lab/src/cli.runtime.test.ts` (all green); `pnpm check:test-types` clean; `pnpm exec oxfmt --check` on all four modified files clean.
- … `--baseline-summary`, `--baseline-label`, `output_dir`, `--candidate-summary`) line up after the rename.
- … `waitForGatewayHealthy` budget at line 48 of the same scenario, so the cold-startup window is now consistent across the file.

Test plan
- `pnpm exec oxfmt --check --threads=1` on all four modified files
- `pnpm check:test-types` clean
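
The source-inspection check described in the review (looking for stale model strings in the workflow files) could be scripted along the following lines. This is a sketch, not part of the PR: the function name, the result shape, and the sample workflow keys are hypothetical. Only the three stale model ids come from the PR text. The intentionally pinned `openai/gpt-5.4` is deliberately absent from the stale list, so those lanes are not flagged.

```typescript
// Model ids this PR replaces, as listed in the PR description.
const STALE_MODEL_IDS: string[] = [
  "anthropic/claude-opus-4-6",
  "anthropic/claude-sonnet-4-6",
  "openai/gpt-5.4-alt",
];

// Hypothetical result shape: 1-based line number plus the stale id found there.
interface StaleHit {
  line: number;
  modelId: string;
}

// Scan workflow text line by line and report every stale model id occurrence.
function findStaleModelRefs(workflowText: string): StaleHit[] {
  const hits: StaleHit[] = [];
  workflowText.split("\n").forEach((text, index) => {
    for (const modelId of STALE_MODEL_IDS) {
      if (text.includes(modelId)) {
        hits.push({ line: index + 1, modelId });
      }
    }
  });
  return hits;
}

// Made-up workflow fragment; the key names are illustrative only.
const sampleWorkflow = [
  "baseline_model: anthropic/claude-opus-4-6",
  "candidate_alt_model: openai/gpt-5.5-alt",
  "discord_lane_model: openai/gpt-5.4",
].join("\n");

console.log(findStaleModelRefs(sampleWorkflow));
```

In this sample only the first line is reported: the second already uses the bumped alt model, and the third is a plain `openai/gpt-5.4` pin that the substring match for `openai/gpt-5.4-alt` does not catch.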