
fix(qa-lab): refresh parity models and approval timeout #79698

Merged: steipete merged 1 commit into main from maint/79347-qa-parity-opus47-gpt55 on May 9, 2026

@steipete steipete commented May 9, 2026

Summary

Real behavior proof

  • Behavior addressed: QA parity workflows still used older Opus 4.6 / GPT-5.4-alt labels, and the approval-turn-tool-followthrough scenario had short 20s/30s mock fallback timeouts.
  • Real environment tested: local OpenClaw checkout with private QA CLI enabled, built with OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1.
  • Exact steps or command run after this patch: ran the changed approval-turn scenario through pnpm openclaw qa suite for both candidate openai/gpt-5.5 + openai/gpt-5.5-alt and baseline anthropic/claude-opus-4-7 + anthropic/claude-sonnet-4-7; then generated the focused parity report.
  • Evidence after fix: terminal output from the patched QA CLI runs:
$ OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa suite --provider-mode mock-openai --scenario approval-turn-tool-followthrough --concurrency 1 --model openai/gpt-5.5 --alt-model openai/gpt-5.5-alt --output-dir .artifacts/qa-e2e/pr79347-approval-gpt55
passed 1/1

$ OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa suite --provider-mode mock-openai --scenario approval-turn-tool-followthrough --concurrency 1 --model anthropic/claude-opus-4-7 --alt-model anthropic/claude-sonnet-4-7 --output-dir .artifacts/qa-e2e/pr79347-approval-opus47
passed 1/1
  • Observed result after fix: changed scenario passed with the new GPT-5.5-alt candidate ref and Opus 4.7/Sonnet 4.7 baseline refs; focused parity metrics were 100%/100%. The full candidate --parity-pack agentic run also used openai/gpt-5.5-alt and passed the changed approval-turn scenario, while two unrelated existing scenarios timed out.
  • What was not tested: a full parity-report pass for the whole pack. The focused one-scenario parity report exits nonzero because it intentionally lacks full-pack coverage.
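
To illustrate why a focused report can show 100%/100% and still exit nonzero, here is a minimal sketch of a coverage-aware gate. Everything in it is an assumption made for illustration: the type names, the `evaluateParity` function, the gating rule, and the second scenario name are hypothetical and are not taken from the qa-lab source.

```typescript
// Hypothetical sketch: none of these names come from the qa-lab code.
interface ScenarioResult {
  scenario: string;
  passed: boolean;
}

interface ParityVerdict {
  passRate: number;
  missingScenarios: string[];
  exitCode: 0 | 1;
}

// Assumed rule: the gate passes only when every required scenario was run
// and every scenario that was run passed.
function evaluateParity(
  requiredScenarios: string[],
  results: ScenarioResult[],
): ParityVerdict {
  const seen = new Set(results.map((result) => result.scenario));
  const missingScenarios = requiredScenarios.filter((name) => !seen.has(name));
  const passedCount = results.filter((result) => result.passed).length;
  const passRate = results.length === 0 ? 0 : passedCount / results.length;
  const exitCode = missingScenarios.length === 0 && passRate === 1 ? 0 : 1;
  return { passRate, missingScenarios, exitCode };
}

// A one-scenario run against a pack that requires more than one scenario.
// "hypothetical-other-scenario" is a placeholder, not a real scenario name.
const focused = evaluateParity(
  ["approval-turn-tool-followthrough", "hypothetical-other-scenario"],
  [{ scenario: "approval-turn-tool-followthrough", passed: true }],
);

console.log(focused);
```

Under these assumptions the focused run reports a pass rate of 1 but still fails the gate, because one required scenario has no result.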

Verification

  • OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm build
  • pnpm test extensions/qa-lab/src/providers/mock-openai/server.test.ts extensions/qa-lab/src/qa-gateway-config.test.ts extensions/qa-lab/src/suite-planning.test.ts extensions/qa-lab/src/cli.runtime.test.ts
  • pnpm check:workflows
  • pnpm check:test-types
  • pnpm exec oxfmt --check --threads=1 .github/workflows/openclaw-release-checks.yml .github/workflows/qa-live-transports-convex.yml CHANGELOG.md qa/scenarios/runtime/approval-turn-tool-followthrough.md
  • git diff --check origin/main...HEAD

Refs #74290 / #74262. Supersedes #79347.

@openclaw-barnacle (bot) added the size: XS and maintainer (Maintainer-authored PR) labels May 9, 2026
@steipete force-pushed the maint/79347-qa-parity-opus47-gpt55 branch from 0049f3a to a71e6b2 May 9, 2026 07:14
@openclaw-barnacle (bot) added the channel: qa-channel (Channel integration: qa-channel) label May 9, 2026
clawsweeper Bot commented May 9, 2026

ClawSweeper status: review started.

I am starting a fresh review of this pull request: fix(qa-lab): refresh parity models and approval timeout This is item 1/1 in the current shard. Shard 0/1.

This placeholder means the worker is alive and reading the current context. I will edit this same comment with the actual review when the claws are done clicking.

Crustacean status: shell secured, claws on keyboard, evidence pebbles being sorted.

…approval-turn-tool-followthrough timeouts

Carries forward the surface-bump portion of #74290 (closed in favor of
this slim follow-up since the parity-gate.yml workflow file the original
PR also touched was retired by #74622 'ci: fold parity into QA release
validation').

The mock-openai parity lanes that now live in
`openclaw-release-checks.yml` and `qa-live-transports-convex.yml`
were still pinned to `anthropic/claude-opus-4-6` /
`anthropic/claude-sonnet-4-6` for the baseline and
`openai/gpt-5.4-alt` for the candidate alt model. That left the parity
baseline one model-generation behind the active Opus 4.7 / GPT-5.5
defaults already used elsewhere on main (CHANGELOG.md:803,
docs/providers/anthropic.md:108, openclaw-live-and-e2e-checks-reusable.yml:1894).

The `approval-turn-tool-followthrough` scenario was using 20s/30s
`liveTurnTimeoutMs` fallbacks that timed out on cold mock-gateway
parity runs (the deleted `parity-gate.yml` env-var comments described
exactly this scenario flake). The mock provider's
`resolveTurnTimeoutMs` returns `fallbackMs` unchanged, so the fallback
is the effective turn timeout in these runs. Bumping all four turn
fallbacks to 60s gives cold starts breathing room before the
approval/follow-through chain has to complete.

This PR does NOT touch:
- The retired `.github/workflows/parity-gate.yml` (deleted on main
  by #74622)
- Internal artifact directory names `gpt54`/`opus46` (cosmetic, out
  of scope for a slim follow-up)
- The Discord QA scenario lane and the release-validation lane that
  intentionally pin `openai/gpt-5.4` (separate concerns)

Refs #74290.
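
The timeout reasoning in the commit message can be sketched as follows. This is a minimal illustration, assuming a simplified signature: the real `resolveTurnTimeoutMs` in qa-lab may take different arguments, and the `TurnTimeoutRequest` type is hypothetical. Only the pass-through behavior and the 20s/30s/60s values come from the commit message.

```typescript
// Hypothetical request shape; the real qa-lab signature may differ.
interface TurnTimeoutRequest {
  fallbackMs: number;
}

// Mock-provider behavior as described in the commit message:
// the fallback is returned unchanged.
function resolveTurnTimeoutMs(request: TurnTimeoutRequest): number {
  return request.fallbackMs;
}

// Old scenario fallbacks (20s/30s) versus the new 60s fallback.
const oldFallbacksMs = [20_000, 30_000];
const newFallbackMs = 60_000;

const oldEffectiveMs = oldFallbacksMs.map((fallbackMs) =>
  resolveTurnTimeoutMs({ fallbackMs }),
);
const newEffectiveMs = resolveTurnTimeoutMs({ fallbackMs: newFallbackMs });

console.log({ oldEffectiveMs, newEffectiveMs });
```

Because nothing in this path scales or overrides the value, the scenario's own fallback is the only knob for mock runs, which is why the change lands in the scenario file.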
@steipete force-pushed the maint/79347-qa-parity-opus47-gpt55 branch from a71e6b2 to d7210a2 May 9, 2026 07:15
@openclaw-barnacle (bot) removed the channel: qa-channel (Channel integration: qa-channel) label May 9, 2026
@steipete steipete merged commit 44d7d6f into main May 9, 2026
96 of 98 checks passed
@steipete steipete deleted the maint/79347-qa-parity-opus47-gpt55 branch May 9, 2026 07:22