fix(qa-lab): bump parity baseline to Opus 4.7 / GPT-5.5 and lengthen approval-turn-tool-followthrough timeouts

claude · claude · commit 23dc8ce98801 · 2026-05-08T17:20:44.000+07:00
Carries forward the surface-bump portion of #74290. That PR was closed in favor of this slim follow-up because the `parity-gate.yml` workflow file it also touched was retired by #74622 ('ci: fold parity into QA release validation').

The mock-openai parity lanes that now live in `openclaw-release-checks.yml` and `qa-live-transports-convex.yml` were still pinned to `anthropic/claude-opus-4-6` / `anthropic/claude-sonnet-4-6` for the baseline and `openai/gpt-5.4-alt` for the candidate alt model. That left the parity baseline one model generation behind the active Opus 4.7 / GPT-5.5 defaults already used elsewhere on main (CHANGELOG.md:803, docs/providers/anthropic.md:108, openclaw-live-and-e2e-checks-reusable.yml:1894).

The `approval-turn-tool-followthrough` scenario was using 20s/30s `liveTurnTimeoutMs` fallbacks that timed out on cold mock-gateway parity runs (the env-var comments in the deleted `parity-gate.yml` described exactly this scenario flake). The mock provider's `resolveTurnTimeoutMs` returns `fallbackMs` unchanged, so the fallback is the effective timeout in these lanes. Bumping all four turn fallbacks to 60s gives cold starts breathing room before the approval/follow-through chain has to complete.

This PR does NOT touch:

- The retired `.github/workflows/parity-gate.yml` (deleted on main by #74622)
- Internal artifact directory names `gpt54`/`opus46` (cosmetic, out of scope for a slim follow-up)
- The Discord QA scenario lane and the release-validation lane that intentionally pin `openai/gpt-5.4` (separate concerns)

Refs #74290.
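To illustrate the timeout reasoning, here is a minimal sketch of how a fallback could flow through to the effective per-turn timeout in mock mode. The function names echo the commit text, but the bodies, the `QaEnv` shape, and the `turnTimeoutOverrideMs` field are assumptions made for illustration, not the repository's actual implementation.

```typescript
// Hypothetical sketch: names come from the commit text, bodies are assumptions.
type ProviderMode = "mock-openai" | "live";

interface QaEnv {
  providerMode: ProviderMode;
  // Assumed optional override for live runs; not confirmed by the commit.
  turnTimeoutOverrideMs?: number;
}

// Per the commit message, the mock provider returns the fallback unchanged.
function resolveTurnTimeoutMs(env: QaEnv, fallbackMs: number): number {
  if (env.providerMode === "mock-openai") {
    return fallbackMs;
  }
  // Assumption for illustration only: live runs prefer an override when set.
  return env.turnTimeoutOverrideMs ?? fallbackMs;
}

// Stand-in for the scenario helper used as `liveTurnTimeoutMs(env, 60000)`.
function liveTurnTimeoutMs(env: QaEnv, fallbackMs: number): number {
  return resolveTurnTimeoutMs(env, fallbackMs);
}

// The four per-turn fallbacks in the scenario, before and after this change.
const oldFallbacksMs = [20_000, 20_000, 30_000, 20_000];
const newFallbacksMs = [60_000, 60_000, 60_000, 60_000];

const mockEnv: QaEnv = { providerMode: "mock-openai" };
const oldEffective = oldFallbacksMs.map((ms) => liveTurnTimeoutMs(mockEnv, ms));
const newEffective = newFallbacksMs.map((ms) => liveTurnTimeoutMs(mockEnv, ms));

console.log(oldEffective, newEffective);
```

Under these assumptions, the mock lanes see exactly the fallback value, which is why raising the fallbacks is the lever that widens the window for cold starts.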
diff --git a/.github/workflows/openclaw-release-checks.yml b/.github/workflows/openclaw-release-checks.yml
@@ -705,11 +705,11 @@ jobs:
           case "${QA_PARITY_LANE}" in
             candidate)
               model="${OPENCLAW_CI_OPENAI_MODEL}"
-              alt_model="openai/gpt-5.4-alt"
+              alt_model="openai/gpt-5.5-alt"
               ;;
             baseline)
-              model="anthropic/claude-opus-4-6"
-              alt_model="anthropic/claude-sonnet-4-6"
+              model="anthropic/claude-opus-4-7"
+              alt_model="anthropic/claude-sonnet-4-7"
               ;;
             *)
               echo "Unknown QA parity lane: ${QA_PARITY_LANE}" >&2
@@ -779,7 +779,7 @@ jobs:
             --candidate-summary .artifacts/qa-e2e/gpt54/qa-suite-summary.json \
             --baseline-summary .artifacts/qa-e2e/opus46/qa-suite-summary.json \
             --candidate-label "${OPENCLAW_CI_OPENAI_MODEL}" \
-            --baseline-label anthropic/claude-opus-4-6 \
+            --baseline-label anthropic/claude-opus-4-7 \
             --output-dir .artifacts/qa-e2e/parity
 
       - name: Upload parity artifacts
diff --git a/.github/workflows/qa-live-transports-convex.yml b/.github/workflows/qa-live-transports-convex.yml
@@ -187,17 +187,17 @@ jobs:
             --parity-pack agentic \
             --concurrency "${QA_PARITY_CONCURRENCY}" \
             --model "${OPENCLAW_CI_OPENAI_MODEL}" \
-            --alt-model openai/gpt-5.4-alt \
+            --alt-model openai/gpt-5.5-alt \
             --output-dir .artifacts/qa-e2e/gpt54
 
-      - name: Run Opus 4.6 lane
+      - name: Run Opus 4.7 lane
         run: |
           pnpm openclaw qa suite \
             --provider-mode mock-openai \
             --parity-pack agentic \
             --concurrency "${QA_PARITY_CONCURRENCY}" \
-            --model anthropic/claude-opus-4-6 \
-            --alt-model anthropic/claude-sonnet-4-6 \
+            --model anthropic/claude-opus-4-7 \
+            --alt-model anthropic/claude-sonnet-4-7 \
             --output-dir .artifacts/qa-e2e/opus46
 
       - name: Generate parity report
@@ -207,7 +207,7 @@ jobs:
             --candidate-summary .artifacts/qa-e2e/gpt54/qa-suite-summary.json \
             --baseline-summary .artifacts/qa-e2e/opus46/qa-suite-summary.json \
             --candidate-label "${OPENCLAW_CI_OPENAI_MODEL}" \
-            --baseline-label anthropic/claude-opus-4-6 \
+            --baseline-label anthropic/claude-opus-4-7 \
             --output-dir .artifacts/qa-e2e/parity
 
       - name: Upload parity artifacts
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -184,6 +184,8 @@ Docs: https://docs.openclaw.ai
 
 ### Fixes
 
+- QA-lab/parity: bump the live mock-openai parity baseline from `claude-opus-4-6`/`claude-sonnet-4-6` to `claude-opus-4-7`/`claude-sonnet-4-7` and the candidate alt from `gpt-5.4-alt` to `gpt-5.5-alt` in `openclaw-release-checks.yml` and `qa-live-transports-convex.yml`, matching the active Opus 4.7 / GPT-5.5 defaults already used elsewhere on main. Carries forward the surface-bump portion of #74290.
+- QA-lab/scenarios: raise the `approval-turn-tool-followthrough` per-turn fallback timeouts from 20s/30s to 60s so cold mock-gateway parity runs do not flake on the approval-turn chain. Carries forward the timeout-bump portion of #74290.
 - Agents/compaction: keep the recent tail after manual `/compact` when Pi returns an empty or no-op compaction summary, preventing blank checkpoints from replacing the live context.
 - fix(discord): gate user allowlist name resolution [AI]. (#79002) Thanks @pgondhi987.
 - fix(msteams): gate startup user allowlist resolution [AI]. (#79003) Thanks @pgondhi987.
diff --git a/qa/scenarios/runtime/approval-turn-tool-followthrough.md b/qa/scenarios/runtime/approval-turn-tool-followthrough.md
@@ -54,14 +54,14 @@ steps:
             message:
               expr: config.preActionPrompt
             timeoutMs:
-              expr: liveTurnTimeoutMs(env, 20000)
+              expr: liveTurnTimeoutMs(env, 60000)
       - call: waitForOutboundMessage
         args:
           - ref: state
           - lambda:
               params: [candidate]
               expr: "candidate.conversation.id === 'qa-operator'"
-          - expr: liveTurnTimeoutMs(env, 20000)
+          - expr: liveTurnTimeoutMs(env, 60000)
       - set: beforeApprovalCursor
         value:
           expr: state.getSnapshot().messages.length
@@ -72,7 +72,7 @@ steps:
             message:
               expr: config.approvalPrompt
             timeoutMs:
-              expr: liveTurnTimeoutMs(env, 30000)
+              expr: liveTurnTimeoutMs(env, 60000)
       - set: expectedReplyAny
         value:
           expr: config.expectedReplyAny.map(normalizeLowercaseStringOrEmpty)
@@ -81,7 +81,7 @@ steps:
         args:
           - lambda:
               expr: "state.getSnapshot().messages.slice(beforeApprovalCursor).filter((candidate) => candidate.direction === 'outbound' && candidate.conversation.id === 'qa-operator' && expectedReplyAny.some((needle) => normalizeLowercaseStringOrEmpty(candidate.text).includes(needle))).at(-1)"
-          - expr: liveTurnTimeoutMs(env, 20000)
+          - expr: liveTurnTimeoutMs(env, 60000)
           - expr: "env.providerMode === 'mock-openai' ? 100 : 250"
     detailsExpr: outbound.text
 ```