agents: GPT-5.4 parity proof rollup #65224
100yenadmin wants to merge 48 commits into openclaw:main
Greptile Summary

This rollup consolidates the GPT-5.4 / Opus 4.6 agentic parity proof work: an Anthropic

Confidence Score: 5/5

Safe to merge: all findings are P2 (style/cleanup) with no impact on core runtime behavior.

The PR is scoped entirely to the qa-lab extension and qa/scenarios. No core runtime code is touched. The three comments flag a stale doc string, a module-level phase counter that is a test-isolation smell, and an unreachable empty-string fallthrough. None of these affect production behavior or the happy-path test coverage that the PR's own 143/143 suite validates.

extensions/qa-lab/src/mock-openai-server.ts: stale streaming comment, shared subagentFanoutPhase, and empty-string fallthrough in repo-contract handler.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: extensions/qa-lab/src/mock-openai-server.ts
Line: 885-888
Comment:
**Stale comment contradicts implementation**
The block comment says "Streaming is intentionally out of scope for this mock because the suite runner supports non-streaming fallback," but the handler at line 1397 does serve SSE when `body.stream === true`. A future reader following the comment will be confused about whether the SSE path is intentional or accidental dead code.
```suggestion
// Scope: handles Anthropic Messages requests with text and tool_result content
// blocks, supporting both non-streaming (JSON response) and streaming
// (SSE) modes. The scenario dispatch is shared with the /v1/responses route
// so both lanes exercise identical mock scenario logic.
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: extensions/qa-lab/src/mock-openai-server.ts
Line: 126
Comment:
**Module-level `subagentFanoutPhase` shared across server instances**
`subagentFanoutPhase` is a module-level `let`, so it's shared across every `startQaMockOpenAiServer()` call in the same Node.js process. `startQaMockOpenAiServer()` resets it to `0` at startup (line 1247), but if a second server is instantiated while the first is still serving requests — for instance in a parallel Vitest worker that imports the same module — the phase counter from the first server bleeds into the second and vice-versa.
Moving the variable inside `startQaMockOpenAiServer` (passed by reference into the request handler closure) eliminates the shared state entirely:
```typescript
// inside startQaMockOpenAiServer:
let subagentFanoutPhase = 0;
// pass it to buildResponsesPayload via a closure or parameter
```
How can I resolve this? If you propose a fix, please make it concise.
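The closure-based fix suggested in this comment can be sketched as follows. This is a minimal illustration rather than the PR's code: `createMockServerState`, `nextFanoutPhase`, and `currentFanoutPhase` are hypothetical names, and the real `startQaMockOpenAiServer()` does much more than hold a counter.

```typescript
// Hypothetical sketch: per-instance scenario state instead of a module-level `let`.
interface MockServerState {
  // Advances and returns the fanout phase for this server instance only.
  nextFanoutPhase(): number;
  // Reads the current phase without advancing it.
  currentFanoutPhase(): number;
}

function createMockServerState(): MockServerState {
  // The counter lives in this closure, so each call gets an independent copy.
  let subagentFanoutPhase = 0;
  return {
    nextFanoutPhase: () => {
      subagentFanoutPhase += 1;
      return subagentFanoutPhase;
    },
    currentFanoutPhase: () => subagentFanoutPhase,
  };
}

// Two "servers" in the same process no longer share the phase.
const firstServer = createMockServerState();
const secondServer = createMockServerState();
firstServer.nextFanoutPhase();
firstServer.nextFanoutPhase();
console.log(firstServer.currentFanoutPhase(), secondServer.currentFanoutPhase()); // prints "2 0"
```

With state scoped this way, no reset at startup is needed because a fresh instance always begins at phase 0.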
---
This is a comment left during a code review.
Path: extensions/qa-lab/src/mock-openai-server.ts
Line: 517-529
Comment:
**Silent empty-string fallthrough for unmatched write output**
If `toolOutput` is truthy and the prompt matches `repo contract followthrough check`, but the output doesn't satisfy either success pattern (`successfully (?:wrote|...)/i` or `status:\s*complete/i`), this returns `""`. `buildAssistantEvents("")` produces a valid zero-text assistant turn, so the `waitForCondition` for "read:", "wrote:", "status:" would silently time out instead of failing fast.
In the expected flow this is unreachable (the gateway's write tool returns output matching `successfully wrote`), but a write error (e.g., permission denied) would produce an unexpected format and cause an opaque timeout rather than a clear failure.
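A fail-fast variant of this branch might look like the sketch below. It is an assumption-laden illustration: `replyForRepoContractToolOutput` and the reply strings are hypothetical, and the success pattern keeps only the `wrote` alternative because the rest of the alternation is elided in the comment above.

```typescript
// Hypothetical sketch of a fail-fast branch for the repo-contract handler.
// Patterns follow the review comment; the elided alternatives are not reproduced.
const WRITE_SUCCESS = /successfully wrote/i;
const STATUS_COMPLETE = /status:\s*complete/i;

function replyForRepoContractToolOutput(toolOutput: string): string {
  if (WRITE_SUCCESS.test(toolOutput)) {
    // Illustrative reply text, not the mock's real wording.
    return "wrote: repo contract file";
  }
  if (STATUS_COMPLETE.test(toolOutput)) {
    return "status: complete";
  }
  // Instead of returning "" (a valid but empty assistant turn that only
  // surfaces later as a timeout), raise an error that names the bad output.
  throw new Error(`repo-contract mock: unexpected tool output: ${toolOutput}`);
}
```

Throwing here turns a write error such as a permission failure into an immediate, attributable failure instead of an opaque `waitForCondition` timeout.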
How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "qa: roll up parity proof closeout"
Current maintainer score on this rollup: 10/10 ready to merge on branch-owned proof / release-certification scope. Why:
Suggested follow-ups are documented in the PR body as non-blocking enhancement paths, not reasons to hold this merge.
Pull request overview
Rolls up the remaining GPT-5.4 / Codex parity “proof + release-certification” work into a single, reviewable QA-lab/docs change set, including offline structural parity support (OpenAI + Anthropic mock lanes) and stronger evidence requirements for scenario passes.
Changes:
- Strengthens the agentic parity pack by adding new scenarios and tightening scenario assertions to require real tool evidence (via `/debug/requests`) where appropriate.
- Extends the QA-lab mock infrastructure and gateway config to support an Anthropic baseline lane (`/v1/messages`, SSE), mock auth staging, and provider-qualified model refs in mock mode.
- Makes `qa-suite-summary.json` self-describing (`run` provenance) and updates parity-report gating semantics + docs/workflow to reflect the mock structural gate vs live proof split.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| qa/scenarios/subagent-handoff.md | Requires sessions_spawn evidence in mock runs to prevent prose-only “delegation” passes. |
| qa/scenarios/subagent-fanout-synthesis.md | Adds mock-only assertions to ensure multiple real sessions_spawn dispatches occurred. |
| qa/scenarios/source-docs-discovery-report.md | Requires at least one real read tool call in mock mode before accepting the prose report. |
| qa/scenarios/model-switch-tool-continuity.md | Generalizes the “alternate model was used” assertion to match configured alternate model. |
| qa/scenarios/memory-recall.md | Documents why the scenario remains prose-only (no tool-call gating) and how fake-success is still addressed. |
| qa/scenarios/instruction-followthrough-repo-contract.md | Adds a new repo-contract followthrough scenario that asserts read→write ordering and no “permission bounce” behavior. |
| qa/scenarios/image-understanding-attachment.md | Improves mock evidence by reusing a single debug request snapshot and asserting image attachment presence. |
| qa/scenarios/config-restart-capability-flip.md | Requires recorded image_generate tool evidence in mock mode after capability restoration. |
| extensions/qa-lab/src/suite.ts | Adds qa-suite-summary.json run provenance via buildQaSuiteSummaryJson() and records executed scenario ids when filtered. |
| extensions/qa-lab/src/suite.summary-json.test.ts | New tests covering summary provenance fields and scenarioIds semantics. |
| extensions/qa-lab/src/scenario-catalog.test.ts | Adds regressions for guarding mock-only assertions and validates the new repo-contract scenario config. |
| extensions/qa-lab/src/qa-gateway-config.ts | Adds mock Anthropic provider config and enables provider-qualified refs (openai/*, anthropic/*) through mock lane. |
| extensions/qa-lab/src/qa-gateway-config.test.ts | Verifies new provider mappings and allowPrivateNetwork settings for mock providers. |
| extensions/qa-lab/src/mock-openai-server.ts | Adds provider-variant tagging, Anthropic /v1/messages adapter (incl. SSE), and routing fixes for remember/exact-reply behavior. |
| extensions/qa-lab/src/mock-openai-server.test.ts | Expands coverage for Anthropic lane routing, SSE streaming, tool_result ordering, remember-prompt routing, and providerVariant tagging. |
| extensions/qa-lab/src/gateway-child.ts | Stages placeholder mock auth profiles for offline parity runs and defaults providerMode consistently. |
| extensions/qa-lab/src/gateway-child.test.ts | Adds coverage for mock auth staging and providerMode defaulting. |
| extensions/qa-lab/src/cli.runtime.test.ts | Updates expected agentic parity scenario ids to include the expanded pack. |
| extensions/qa-lab/src/agentic-parity.ts | Expands parity pack to 11 scenarios and marks which scenarios count toward valid tool-call rate. |
| extensions/qa-lab/src/agentic-parity-report.ts | Adds provenance verification (label ↔ run metadata), strengthens required-scenario gate semantics, and refines fake-success detection + tool-call rate denominator. |
| extensions/qa-lab/src/agentic-parity-report.test.ts | Adds regressions for required-scenario failures, provenance mismatch errors, tool-call metric exclusions, and report header parametrization. |
| docs/help/gpt54-codex-agentic-parity.md | Rewrites docs to reflect the 2-PR closeout framing, 11-scenario pack, and mock structural gate vs live proof distinction. |
| docs/help/gpt54-codex-agentic-parity-maintainers.md | Updates maintainer guidance/checklists to match the rollup structure and new proof requirements. |
| .github/workflows/parity-gate.yml | Adds a PR workflow that runs the offline mock structural parity gate and uploads artifacts. |
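The mock-only evidence checks described in the scenario rows above can be sketched as a small helper. The snapshot shape is an assumption: `plannedToolName` is a hypothetical field name standing in for whatever `/debug/requests` actually records, and `assertToolEvidence` is not a function from the PR.

```typescript
// Hypothetical shape of one recorded request from the mock's /debug/requests endpoint.
interface DebugRequestSnapshot {
  plannedToolName?: string;
}

// Counts how many recorded requests planned a call to the given tool.
function countPlannedToolCalls(requests: readonly DebugRequestSnapshot[], toolName: string): number {
  return requests.filter((request) => request.plannedToolName === toolName).length;
}

// Fails unless the tool was actually planned at least `minimum` times,
// so a prose-only "I delegated this" reply cannot pass the scenario.
function assertToolEvidence(
  requests: readonly DebugRequestSnapshot[],
  toolName: string,
  minimum: number,
): void {
  const seen = countPlannedToolCalls(requests, toolName);
  if (seen < minimum) {
    throw new Error(`expected at least ${minimum} ${toolName} call(s), saw ${seen}`);
  }
}
```

For example, a fanout scenario could require `assertToolEvidence(requests, "sessions_spawn", 2)` while a handoff scenario requires only one.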
Hardening pass complete on the current head

Addressed the remaining proof-layer review feedback by:

Current branch-owned validation:

    CI=1 pnpm exec vitest run \
      extensions/qa-lab/src/mock-openai-server.test.ts \
      extensions/qa-lab/src/agentic-parity-report.test.ts \
      extensions/qa-lab/src/scenario-catalog.test.ts \
      extensions/qa-lab/src/cli.runtime.test.ts \
      extensions/qa-lab/src/qa-gateway-config.test.ts \
      extensions/qa-lab/src/suite.summary-json.test.ts \
      extensions/qa-lab/src/gateway-child.test.ts

Result:

This should leave the rollup thread-clean on current head.
Pull request overview
Adds the wave-2 “parity proof” layer for GPT‑5.4 vs Opus 4.6 by expanding the QA parity scenario pack, strengthening mock-mode evidence (tool-call/debug assertions + run provenance), and wiring a PR CI workflow that runs the offline structural parity gate.
Changes:
- Expands the agentic parity pack (now includes subagent, memory, capability-flip, and repo-instruction followthrough scenarios) and adds mock-mode evidence assertions via `/debug/requests`.
- Extends the QA mock server to support an Anthropic `/v1/messages` lane and tags request snapshots with provider variants for downstream verification.
- Writes self-describing `qa-suite-summary.json` run metadata and adds a `.github/workflows/parity-gate.yml` CI gate + updated documentation.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| qa/scenarios/subagent-handoff.md | Adds mock /debug/requests assertion requiring sessions_spawn to prevent prose-only fake delegation. |
| qa/scenarios/subagent-fanout-synthesis.md | Adds mock /debug/requests assertions to ensure real fanout spawns occurred. |
| qa/scenarios/source-docs-discovery-report.md | Adds mock /debug/requests assertion requiring a read call before prose report. |
| qa/scenarios/model-switch-tool-continuity.md | Makes alternate-model assertion compare against configured alternate model. |
| qa/scenarios/memory-recall.md | Documents why this scenario remains prose-only (no tool-call assertion). |
| qa/scenarios/instruction-followthrough-repo-contract.md | New scenario to validate repo-instruction followthrough (read-order + write + no permission bounce). |
| qa/scenarios/image-understanding-attachment.md | Strengthens mock evidence by asserting the request carried image inputs (via debug snapshot). |
| qa/scenarios/config-restart-capability-flip.md | Adds mock /debug/requests assertion requiring image_generate tool call post-restart. |
| extensions/qa-lab/src/suite.ts | Adds typed qa-suite-summary.json builder with run metadata and scenarioId recording. |
| extensions/qa-lab/src/suite.summary-json.test.ts | Unit tests for the new summary JSON builder and run metadata semantics. |
| extensions/qa-lab/src/scenario-catalog.test.ts | Catalog regression tests for mock-guarded debug assertions and new scenario presence. |
| extensions/qa-lab/src/qa-gateway-config.ts | Adds mock Anthropic provider config + strips /v1 for Messages base URL; enables private-network requests for mock providers. |
| extensions/qa-lab/src/qa-gateway-config.test.ts | Tests for mock provider mapping (openai + anthropic) and request.allowPrivateNetwork settings. |
| extensions/qa-lab/src/mock-openai-server.ts | Adds provider-variant tagging and Anthropic /v1/messages adapter (incl. SSE), plus scenario-state isolation. |
| extensions/qa-lab/src/mock-openai-server.test.ts | Expands coverage for Anthropic adapter, provider-variant tagging, and new scenario flows. |
| extensions/qa-lab/src/gateway-child.ts | Stages mock auth profiles and normalizes provider mode handling for mock runs. |
| extensions/qa-lab/src/gateway-child.test.ts | Tests for default provider mode and mock-auth staging behavior. |
| extensions/qa-lab/src/cli.runtime.test.ts | Updates runtime parity-pack scenario ID expectations to the expanded pack. |
| extensions/qa-lab/src/agentic-parity.ts | Expands the parity scenario registry and distinguishes tool-backed scenarios for metrics. |
| extensions/qa-lab/src/agentic-parity-report.ts | Adds run-provenance verification, expands fake-success detection, updates tool-call rate calculation, and improves report header. |
| extensions/qa-lab/src/agentic-parity-report.test.ts | Adds regressions for new gate semantics (required failures, label mismatches, fake-success patterns, tool-rate behavior). |
| docs/help/gpt54-codex-agentic-parity.md | Rewrites parity docs for the closeout structure, expanded pack, and gate/proof distinction. |
| docs/help/gpt54-codex-agentic-parity-maintainers.md | Maintainer-focused guidance for review units, checklist, and proof expectations. |
| .github/workflows/parity-gate.yml | New PR workflow running the mock structural parity gate and uploading artifacts. |
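The `/v1` stripping mentioned for `qa-gateway-config.ts` can be illustrated with a one-line normalizer. This is a sketch under assumptions: `toAnthropicBaseUrl` is a hypothetical name, and the premise that the mock base URL arrives with an OpenAI-style `/v1` suffix while the Anthropic lane appends `/v1/messages` itself is inferred from the table row, not confirmed by the PR.

```typescript
// Hypothetical helper: derive an Anthropic-style base URL from a mock base URL
// that may end in "/v1" or "/v1/".
function toAnthropicBaseUrl(mockBaseUrl: string): string {
  // Remove only a trailing "/v1" segment; leave every other URL untouched.
  return mockBaseUrl.replace(/\/v1\/?$/, "");
}

console.log(toAnthropicBaseUrl("http://127.0.0.1:4010/v1")); // prints "http://127.0.0.1:4010"
```

Without this step, a client that appends `/v1/messages` would request `/v1/v1/messages` on the mock.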
Pull request overview
Adds the “parity proof” layer for the GPT-5.4 parity program by expanding the QA-lab agentic parity pack, strengthening scenario/tool-evidence assertions (offline via /debug/requests), adding an Anthropic /v1/messages mock lane + mock auth staging, emitting self-describing qa-suite-summary.json run metadata, and wiring a CI parity gate workflow + updated docs.
Changes:
- Expand the agentic parity pack to 11 scenarios and add scenario-level tool-evidence assertions for mock runs.
- Extend the QA mock server to support Anthropic `/v1/messages` (including streaming), plus provider-variant tagging and additional scenario branches.
- Add `qa-suite-summary.json` `run` provenance metadata, parity gate precondition checks, and a PR-triggered CI workflow that runs the offline structural gate.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| qa/scenarios/subagent-handoff.md | Adds /debug/requests assertion requiring sessions_spawn during handoff. |
| qa/scenarios/subagent-fanout-synthesis.md | Adds /debug/requests assertion requiring multiple sessions_spawn calls in fanout. |
| qa/scenarios/source-docs-discovery-report.md | Adds /debug/requests assertion requiring at least one read tool call. |
| qa/scenarios/model-switch-tool-continuity.md | Makes alternate-model assertion dynamic (based on config) instead of hardcoded. |
| qa/scenarios/memory-recall.md | Documents why this scenario intentionally remains prose-only. |
| qa/scenarios/instruction-followthrough-repo-contract.md | New scenario validating repo-instruction followthrough and tool ordering. |
| qa/scenarios/image-understanding-attachment.md | Refactors mock image-input assertion to reuse a cached debug request lookup. |
| qa/scenarios/config-restart-capability-flip.md | Adds mock-only /debug/requests assertion requiring image_generate post-restart. |
| extensions/qa-lab/src/suite.ts | Introduces qa-suite-summary.json typed builder + embeds run metadata and scenarioIds semantics. |
| extensions/qa-lab/src/suite.summary-json.test.ts | New unit tests for buildQaSuiteSummaryJson. |
| extensions/qa-lab/src/scenario-catalog.test.ts | Adds regression tests for mock-only guards and new repo-contract scenario presence. |
| extensions/qa-lab/src/qa-gateway-config.ts | Adds mock Anthropic provider config + private-network allowance for mock providers. |
| extensions/qa-lab/src/qa-gateway-config.test.ts | Validates provider-qualified model refs route through the mock lane and request config. |
| extensions/qa-lab/src/mock-openai-server.ts | Adds provider-variant tagging, per-instance scenario state, and /v1/messages Anthropic adapter with SSE. |
| extensions/qa-lab/src/mock-openai-server.test.ts | Expands coverage for Anthropic adapter, streaming, variant tagging, and new scenario behaviors. |
| extensions/qa-lab/src/gateway-child.ts | Adds mock auth profile staging and providerMode defaulting for gateway child. |
| extensions/qa-lab/src/gateway-child.test.ts | Tests mock auth staging and providerMode defaulting behavior. |
| extensions/qa-lab/src/cli.runtime.test.ts | Updates parity pack list used by the CLI runtime tests. |
| extensions/qa-lab/src/agentic-parity.ts | Expands parity scenario registry and adds tool-backed scenario title list. |
| extensions/qa-lab/src/agentic-parity-report.ts | Adds run provenance handling, label verification, tool-backed tool-call-rate semantics, and fake-success detection changes. |
| extensions/qa-lab/src/agentic-parity-report.test.ts | Adds extensive tests for new parity-gate semantics (required failures, provenance checks, fake-success heuristics). |
| docs/help/gpt54-codex-agentic-parity.md | Rewrites parity program documentation for the rollup structure and updated pack. |
| docs/help/gpt54-codex-agentic-parity-maintainers.md | Updates maintainer review guidance and gate/proof distinctions. |
| .github/workflows/parity-gate.yml | New PR workflow running the offline “mock structural” parity gate and uploading artifacts. |
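The "tool-backed tool-call-rate" semantics in the `agentic-parity-report.ts` row can be sketched as a rate whose denominator counts only tool-backed scenarios. The types and the pass-equals-valid-tool-call simplification below are assumptions for illustration; the real report may compute the numerator differently.

```typescript
// Hypothetical, simplified scenario result.
interface ScenarioResult {
  name: string;
  status: "pass" | "fail";
}

// Rate over tool-backed scenarios only. Prose-only scenarios (for example
// memory recall) are excluded so they cannot dilute or inflate the metric.
// Returns null when the run contains no tool-backed scenarios.
function validToolCallRate(
  results: readonly ScenarioResult[],
  toolBackedTitles: ReadonlySet<string>,
): number | null {
  const toolBacked = results.filter((result) => toolBackedTitles.has(result.name));
  if (toolBacked.length === 0) {
    return null;
  }
  const passed = toolBacked.filter((result) => result.status === "pass").length;
  return passed / toolBacked.length;
}
```

Returning `null` rather than `0` or `1` for an empty denominator keeps "nothing measured" distinguishable from a real score.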
Pull request overview
Adds the “parity proof” layer for the GPT‑5.4 parity program by expanding the QA parity pack, strengthening mock/offline execution (including an Anthropic baseline lane), and wiring CI to run a mock structural parity gate on relevant PRs.
Changes:
- Expands the agentic parity pack to 11 scenarios and adds scenario-level tool/evidence assertions (via `/debug/requests`) for tool-mediated lanes.
- Extends the QA mock server to support an Anthropic `/v1/messages` route (including SSE) and adds mock auth staging so the gate runs without real keys.
- Writes self-describing `qa-suite-summary.json` run metadata and adds a PR workflow (`parity-gate.yml`) to run the mock structural gate and upload artifacts.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| qa/scenarios/subagent-handoff.md | Adds /debug/requests assertion requiring sessions_spawn during handoff in mock mode. |
| qa/scenarios/subagent-fanout-synthesis.md | Adds /debug/requests assertion requiring ≥2 sessions_spawn calls in mock mode. |
| qa/scenarios/source-docs-discovery-report.md | Adds /debug/requests assertion requiring a read call in mock mode. |
| qa/scenarios/model-switch-tool-continuity.md | Adjusts assertion to validate the alternate model dynamically. |
| qa/scenarios/memory-recall.md | Adds rationale comment for prose-only coverage (no tool-call assertion). |
| qa/scenarios/instruction-followthrough-repo-contract.md | New scenario enforcing “read-first then write” repo-instruction followthrough with ordering assertions. |
| qa/scenarios/image-understanding-attachment.md | Caches the matched debug request and asserts imageInputCount in mock mode. |
| qa/scenarios/config-restart-capability-flip.md | Adds /debug/requests assertion requiring image_generate post-restart in mock mode. |
| extensions/qa-lab/src/suite.ts | Exports summary JSON types + builder with a run provenance block; records scenarioIds; reuses QaProviderMode. |
| extensions/qa-lab/src/suite.summary-json.test.ts | Adds unit tests for the new summary JSON builder/run metadata. |
| extensions/qa-lab/src/scenario-catalog.test.ts | Adds regressions for mock-only debug assertion guards + new scenario presence. |
| extensions/qa-lab/src/qa-gateway-config.ts | Adds mock Anthropic provider config and trims /v1 for Messages base URL; allows private network for mock providers. |
| extensions/qa-lab/src/qa-gateway-config.test.ts | Tests new provider mappings and request settings. |
| extensions/qa-lab/src/mock-openai-server.ts | Adds providerVariant tagging and an Anthropic /v1/messages adapter (non-stream + SSE), plus scenario-state isolation. |
| extensions/qa-lab/src/mock-openai-server.test.ts | Adds extensive tests for new Anthropic route, SSE, providerVariant tagging, and new scenario flows. |
| extensions/qa-lab/src/gateway-child.ts | Adds mock auth profile staging and defaults provider mode for gateway-child. |
| extensions/qa-lab/src/gateway-child.test.ts | Tests mock auth staging + providerMode defaulting. |
| extensions/qa-lab/src/cli.runtime.test.ts | Updates expected parity-pack scenario IDs to include the new scenarios. |
| extensions/qa-lab/src/agentic-parity.ts | Expands parity scenario list and introduces tool-backed scenario title subset. |
| extensions/qa-lab/src/agentic-parity-report.ts | Adds run provenance typing + label verification, refines valid-tool-call metric, and strengthens required-scenario failure semantics. |
| extensions/qa-lab/src/agentic-parity-report.test.ts | Updates tests for expanded pack and new gate semantics. |
| docs/help/gpt54-codex-agentic-parity.md | Rewrites parity program docs/runbook for the expanded pack and mock-vs-live proof model. |
| docs/help/gpt54-codex-agentic-parity-maintainers.md | Updates maintainer review notes to match the rollup structure and evidence sources. |
| .github/workflows/parity-gate.yml | Adds CI workflow to run mock structural parity gate and upload artifacts. |
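For the SSE half of the `/v1/messages` adapter, a text-only streamed reply can be framed roughly as below. This is a simplified sketch, not the mock's implementation: the helper names are hypothetical, and the `id` and `usage` values are placeholders. The event sequence follows the documented Anthropic Messages streaming order for a single text block.

```typescript
// One streamed event; `type` doubles as the SSE event name.
type AnthropicStreamEvent = { type: string } & Record<string, unknown>;

// Serializes one event as an SSE frame: event name, JSON data, blank line.
function toSseFrame(event: AnthropicStreamEvent): string {
  return `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`;
}

// Builds the event sequence for a reply consisting of one text block.
function textMessageEvents(text: string, model: string): AnthropicStreamEvent[] {
  return [
    {
      type: "message_start",
      message: {
        id: "msg_mock", // placeholder id
        type: "message",
        role: "assistant",
        model,
        content: [],
        stop_reason: null,
        stop_sequence: null,
        usage: { input_tokens: 0, output_tokens: 0 }, // placeholder usage
      },
    },
    { type: "content_block_start", index: 0, content_block: { type: "text", text: "" } },
    { type: "content_block_delta", index: 0, delta: { type: "text_delta", text } },
    { type: "content_block_stop", index: 0 },
    { type: "message_delta", delta: { stop_reason: "end_turn", stop_sequence: null }, usage: { output_tokens: 0 } },
    { type: "message_stop" },
  ];
}
```

A handler could write `textMessageEvents(reply, model).map(toSseFrame).join("")` to the response with a `text/event-stream` content type when `body.stream === true`, and fall back to a JSON body otherwise.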
Pull request overview
Adds the “proof layer” for the GPT‑5.4 parity program by expanding the QA parity scenario pack, strengthening mock-mode evidence (tool-call assertions + provider-lane mocking), and wiring a CI gate that runs fully offline with self-describing artifacts.
Changes:
- Expanded the agentic parity pack to 11 scenarios and added mock-only assertions (via `/debug/requests`) to prevent “prose-only” fake progress in tool-mediated scenarios.
- Added an Anthropic `/v1/messages` mock adapter plus mock-auth staging so both candidate/baseline lanes run offline without real provider credentials.
- Extended QA suite artifacts with `run` provenance metadata and added a PR workflow (`parity-gate.yml`) that runs the mock parity gate and uploads artifacts.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| qa/scenarios/subagent-handoff.md | Adds /debug/requests assertion requiring sessions_spawn during handoff. |
| qa/scenarios/subagent-fanout-synthesis.md | Adds mock-only tool-call assertion requiring 2× sessions_spawn. |
| qa/scenarios/source-docs-discovery-report.md | Adds mock-only assertion requiring at least one read tool call. |
| qa/scenarios/model-switch-tool-continuity.md | Caches /debug/requests once and reuses it for assertions. |
| qa/scenarios/memory-recall.md | Documents why this scenario remains prose-only (no tool-call gate). |
| qa/scenarios/instruction-followthrough-repo-contract.md | New scenario enforcing instruction-file read order + write + explicit reporting. |
| qa/scenarios/image-understanding-attachment.md | Strengthens mock evidence by asserting image attachment reached provider (imageInputCount). |
| qa/scenarios/config-restart-capability-flip.md | Adds mock-only assertion requiring image_generate tool call post-restart. |
| extensions/qa-lab/src/suite.ts | Exports summary JSON types and builds a run provenance block into qa-suite-summary.json. |
| extensions/qa-lab/src/suite.summary-json.test.ts | Tests qa-suite-summary.json run metadata and scenarioIds semantics. |
| extensions/qa-lab/src/scenario-catalog.test.ts | Adds regression checks for mock-only guards and the new repo-contract scenario. |
| extensions/qa-lab/src/qa-gateway-config.ts | Adds mock provider entries for openai and anthropic and allows private-network requests in mock mode. |
| extensions/qa-lab/src/qa-gateway-config.test.ts | Validates provider-qualified model refs map to the mock provider lanes. |
| extensions/qa-lab/src/mock-openai-server.ts | Adds provider-variant tagging and an Anthropic /v1/messages adapter sharing the same dispatcher. |
| extensions/qa-lab/src/mock-openai-server.test.ts | Expands coverage for Anthropic adapter, provider tagging, and new scenario flows. |
| extensions/qa-lab/src/gateway-child.ts | Stages placeholder auth profiles in mock-openai mode and defaults providerMode. |
| extensions/qa-lab/src/gateway-child.test.ts | Tests mock auth profile staging and providerMode defaulting. |
| extensions/qa-lab/src/cli.runtime.test.ts | Updates parity scenario list used by CLI runtime tests. |
| extensions/qa-lab/src/agentic-parity.ts | Expands parity pack list and tracks which scenarios count toward tool-call-rate metrics. |
| extensions/qa-lab/src/agentic-parity-report.ts | Adds run-label verification, required-scenario failure semantics, and tool-backed tool-call-rate calculation. |
| extensions/qa-lab/src/agentic-parity-report.test.ts | Updates tests for new scenario pack, label verification, and required-scenario gate behavior. |
| docs/help/gpt54-codex-agentic-parity.md | Rewrites parity documentation for the rollup, pack composition, and proof modes. |
| docs/help/gpt54-codex-agentic-parity-maintainers.md | Updates maintainer review notes and release checklist for the consolidated rollups. |
| .github/workflows/parity-gate.yml | Adds CI workflow to run both mock lanes, generate parity report, and upload artifacts. |
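The run-label verification listed for `agentic-parity-report.ts` can be pictured as a check that a caller-supplied label agrees with what the summary says it ran. The matching rule below is an assumption chosen for illustration (the label must equal the model ref or its model part, case-insensitively); the PR's actual rule, and the names `SummaryRunMetadata` and `assertLabelMatchesRun`, are hypothetical here.

```typescript
// Hypothetical subset of the summary's `run` block.
interface SummaryRunMetadata {
  primaryModel: string; // provider-qualified ref such as "openai/gpt-5.4"
}

// Throws when the label cannot be reconciled with the recorded model,
// which is what catches swapped candidate/baseline summary paths.
function assertLabelMatchesRun(label: string, run: SummaryRunMetadata): void {
  const normalizedLabel = label.trim().toLowerCase();
  const modelRef = run.primaryModel.trim().toLowerCase();
  const slash = modelRef.indexOf("/");
  const modelName = slash >= 0 ? modelRef.slice(slash + 1) : modelRef;
  if (normalizedLabel !== modelRef && normalizedLabel !== modelName) {
    throw new Error(`label "${label}" does not match run model "${run.primaryModel}"`);
  }
}
```

Running this once per side before comparing summaries means a mislabeled report fails loudly instead of producing a silently inverted verdict.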
Pull request overview
This PR implements the CI/test-harness “proof layer” for the GPT-5.4 parity program: it runs OpenAI and Anthropic lanes through the same agentic QA scenario pack in mock mode, emits self-describing artifacts, and enforces a pass/fail parity gate in CI.
Changes:
- Expand the agentic parity pack to 11 scenarios and add per-scenario `/debug/requests` assertions to prevent prose-only “fake tool use”.
- Add Anthropic `/v1/messages` support to the mock provider and stage mock auth profiles so the gate runs offline without real keys.
- Write richer `qa-suite-summary.json` run metadata and add a PR workflow (`parity-gate.yml`) that executes the mock structural parity gate and uploads artifacts.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| qa/scenarios/subagent-handoff.md | Adds /debug/requests assertion requiring a real sessions_spawn for the handoff scenario. |
| qa/scenarios/subagent-fanout-synthesis.md | Adds mock-only tool-call assertions verifying fanout truly spawned subagents. |
| qa/scenarios/source-docs-discovery-report.md | Adds mock-only assertion requiring a real read tool call before the discovery report prose. |
| qa/scenarios/model-switch-tool-continuity.md | Caches /debug/requests once in mock mode and asserts tool+model continuity post-switch. |
| qa/scenarios/memory-recall.md | Documents why the memory-recall scenario remains prose-only (no tool-call gating). |
| qa/scenarios/instruction-followthrough-repo-contract.md | Introduces a new repo-instruction followthrough scenario with ordering/tool-call checks. |
| qa/scenarios/image-understanding-attachment.md | Strengthens mock evidence by asserting imageInputCount on the scenario’s debug request. |
| qa/scenarios/config-restart-capability-flip.md | Adds mock-only assertion requiring an image_generate planned tool call post-restart. |
| extensions/qa-lab/src/suite.ts | Exports summary JSON types and adds run metadata via buildQaSuiteSummaryJson(). |
| extensions/qa-lab/src/suite.summary-json.test.ts | Adds tests for qa-suite-summary.json run metadata and scenarioIds encoding. |
| extensions/qa-lab/src/scenario-catalog.test.ts | Adds regression tests for mock-guarded debug assertions and the new scenario’s config. |
| extensions/qa-lab/src/qa-gateway-config.ts | Adds mock openai+anthropic provider configs and enables private-network requests for mock base URLs. |
| extensions/qa-lab/src/qa-gateway-config.test.ts | Tests provider-qualified model refs mapping through the mock lane and request config defaults. |
| extensions/qa-lab/src/mock-openai-server.ts | Adds provider-variant tagging and an Anthropic /v1/messages adapter (incl. SSE streaming) sharing the same dispatcher. |
| extensions/qa-lab/src/mock-openai-server.test.ts | Adds extensive coverage for the Anthropic adapter, providerVariant tagging, and new scenario branches. |
| extensions/qa-lab/src/gateway-child.ts | Stages placeholder auth profiles in mock mode so runs don’t require real keys. |
| extensions/qa-lab/src/gateway-child.test.ts | Tests default providerMode and mock auth profile staging behavior. |
| extensions/qa-lab/src/cli.runtime.test.ts | Updates expected parity scenario IDs list used by the CLI runtime tests. |
| extensions/qa-lab/src/agentic-parity.ts | Expands parity scenario list and defines which scenarios count toward valid tool-call rate. |
| extensions/qa-lab/src/agentic-parity-report.ts | Adds run-label verification, refines fake-success detection (failure-tone only), and updates tool-call rate denominator. |
| extensions/qa-lab/src/agentic-parity-report.test.ts | Updates and expands parity-report tests for the 11-scenario pack and new gate semantics. |
| docs/help/gpt54-codex-agentic-parity.md | Rewrites parity documentation for the rollup model and the 11-scenario pack. |
| docs/help/gpt54-codex-agentic-parity-maintainers.md | Updates maintainer notes to reflect the two-rollup structure and new proof responsibilities. |
| .github/workflows/parity-gate.yml | Adds CI workflow to run mock parity lanes, generate parity report, and upload artifacts. |
Closes the 'summary cannot be label-verified' half of criterion 5 on the GPT-5.4 parity completion gate in openclaw#64227.

Background: the parity gate in openclaw#64441 compares two qa-suite-summary.json files and trusts whatever candidateLabel / baselineLabel the caller passes. Today the summary JSON only contains { scenarios, counts }, so nothing in the summary records which provider/model the run actually used. If a maintainer swaps candidate and baseline summary paths in a parity-report call, the verdict is silently mislabeled and nobody can retroactively verify which run produced which summary.

Changes:
- Add a 'run' block to qa-suite-summary.json with startedAt, finishedAt, providerMode, primaryModel (+ provider and model splits), alternateModel (+ provider and model splits), fastMode, concurrency, scenarioIds (when explicitly filtered).
- Extract a pure 'buildQaSuiteSummaryJson(params)' helper so the summary JSON shape is unit-testable and the parity gate (and any future parity wrapper) can import the exact same type rather than reverse-engineering the JSON shape at runtime.
- Thread 'scenarioIds' from 'runQaSuite' into writeQaSuiteArtifacts so --scenario-ids flags are recorded in the summary.

Unit tests added (src/suite.summary-json.test.ts, 5 cases):
- records provider/model/mode so parity gates can verify labels
- includes scenarioIds in run metadata when provided
- records an Anthropic baseline lane cleanly for parity runs
- leaves split fields null when a model ref is malformed
- keeps scenarios and counts alongside the run metadata

This is additive: existing callers of qa-suite-summary.json continue to see the same { scenarios, counts } shape, just with an extra run field. No existing consumers of the JSON need to change.

The follow-up 'qa parity run' CLI wrapper (run the parity pack twice against candidate + baseline, emit two labeled summaries in one command) stacks cleanly on top of this change and will land as a separate PR once openclaw#64441 and openclaw#64662 merge so the wrapper can call runQaParityReportCommand directly.

Local validation:
- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (5/5 pass)
- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (34/34 pass)

Refs openclaw#64227

Unblocks the final parity run for openclaw#64441 / openclaw#64662 by making summaries self-describing.
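The 'run' block and its provider/model splits can be sketched as below. Only the top-level field list comes from the commit message; the split field names (`primaryProvider`, `primaryModelName`, and the alternate equivalents), the `buildRunBlock` helper, and the rule that a ref without a well-formed `provider/model` shape counts as malformed are assumptions for illustration.

```typescript
type QaProviderMode = "mock-openai" | "live-frontier";

interface RunBlockParams {
  startedAt: Date;
  finishedAt: Date;
  providerMode: QaProviderMode;
  primaryModel: string;
  alternateModel: string;
  fastMode: boolean;
  concurrency: number;
}

// Splits "provider/model" into its parts; malformed refs yield nulls
// instead of a guessed provider (assumed definition of "malformed").
function splitModelRef(ref: string): { provider: string | null; model: string | null } {
  const slash = ref.indexOf("/");
  if (slash <= 0 || slash === ref.length - 1) {
    return { provider: null, model: null };
  }
  return { provider: ref.slice(0, slash), model: ref.slice(slash + 1) };
}

// Hypothetical builder for the provenance block written next to
// { scenarios, counts } in qa-suite-summary.json.
function buildRunBlock(params: RunBlockParams) {
  const primary = splitModelRef(params.primaryModel);
  const alternate = splitModelRef(params.alternateModel);
  return {
    startedAt: params.startedAt.toISOString(),
    finishedAt: params.finishedAt.toISOString(),
    providerMode: params.providerMode,
    primaryModel: params.primaryModel,
    primaryProvider: primary.provider,
    primaryModelName: primary.model,
    alternateModel: params.alternateModel,
    alternateProvider: alternate.provider,
    alternateModelName: alternate.model,
    fastMode: params.fastMode,
    concurrency: params.concurrency,
  };
}
```

Keeping the builder pure (plain inputs in, plain object out) is what makes the JSON shape unit-testable without running a suite.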
…antics

Addresses 4 loop-6 Copilot / codex-connector findings on PR openclaw#64689 (re-opened as openclaw#64789):

1. P2 codex + Copilot: empty `scenarioIds` array was serialized as `[]` because of a truthiness check. The CLI passes an empty array when --scenario is omitted, so full-suite runs would incorrectly record an explicit empty selection. Fix: switch to a `length > 0` check so '[] or undefined' both encode as `null` in the summary run metadata.
2. Copilot: `buildQaSuiteSummaryJson` was exported for parity-gate consumers but its return type was `Record<string, unknown>`, which defeated the point of exporting it. Fix: introduce a concrete `QaSuiteSummaryJson` type that matches the JSON shape 1-for-1 and make the builder return it. Downstream code (parity gate, parity run wrapper) can now import the type and keep consumers type-checked.
3. Copilot: `QaSuiteSummaryJsonParams.providerMode` re-declared the `'mock-openai' | 'live-frontier'` string union even though `QaProviderMode` is already imported from model-selection.ts. Fix: reuse `QaProviderMode` so provider-mode additions flow through both types at once.
4. Copilot: test fixtures omitted `steps` from the fake scenario results, creating shape drift with the real suite scenario-result shape. Fix: pad the test fixtures with `steps: []` and tighten the scenarioIds assertion to read `json.run.scenarioIds` directly (the new concrete return type makes the type-cast unnecessary).

New regression: `treats an empty scenarioIds array as unspecified (no filter)`, which passes `scenarioIds: []` and asserts the summary records `scenarioIds: null`.

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass).

Refs openclaw#64227
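The truthiness pitfall in finding 1 can be shown with a small sketch. Both function names below are hypothetical; only the `length > 0` check and the `null` encoding come from the commit message.

```typescript
// Hypothetical before/after sketch of the scenarioIds encoding.

// Before: an empty array is truthy in JavaScript, so [] survives as [].
function encodeScenarioIdsTruthy(ids?: string[]): string[] | null {
  return ids ? [...ids] : null;
}

// After: an explicit length check maps both [] and undefined to null.
function encodeScenarioIdsByLength(ids?: string[]): string[] | null {
  return ids && ids.length > 0 ? [...ids] : null;
}

// With the old check, a full-suite run (CLI passes []) records an empty selection.
const beforeFix = JSON.stringify({ scenarioIds: encodeScenarioIdsTruthy([]) });
// With the new check, the same run records "no filter".
const afterFix = JSON.stringify({ scenarioIds: encodeScenarioIdsByLength([]) });
```

Here `beforeFix` serializes an empty array while `afterFix` serializes `null`, which is the distinction the regression test pins down.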
Addresses the pass-3 codex-connector P2 on openclaw#64789 (repl of openclaw#64689): `run.scenarioIds` was copied from the raw `params.scenarioIds` caller input, but `runQaSuite` normalizes that input through `selectQaSuiteScenarios` which dedupes via `Set` and reorders the selection to catalog order. When callers repeat --scenario ids or pass them in non-catalog order, the summary metadata drifted from the scenarios actually executed, which can make parity/report tooling treat equivalent runs as different or trust inaccurate provenance.

Fix: both writeQaSuiteArtifacts call sites in runQaSuite now pass `selectedCatalogScenarios.map(scenario => scenario.id)` instead of `params?.scenarioIds`, so the summary records the post-selection executed list. This also covers the full-suite case automatically (the executed list is the full lane-filtered catalog), giving parity consumers a stable record of exactly which scenarios landed in the run regardless of how the caller phrased the request.

buildQaSuiteSummaryJson's `length > 0 ? [...] : null` pass-2 semantics are preserved so the public helper still treats an empty array as 'unspecified' for any future caller that legitimately passes one.

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass).

Refs openclaw#64227
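To make the drift concrete, here is a minimal sketch of a selection step that dedupes via `Set` and returns ids in catalog order. The function `selectInCatalogOrderSketch` and the three-entry catalog are assumptions for illustration; the real `selectQaSuiteScenarios` may differ, for example in how it treats unknown ids (this sketch silently drops them).

```typescript
// Hypothetical sketch of Set-dedupe plus catalog-order normalization.
function selectInCatalogOrderSketch(
  catalog: readonly string[],
  requested: readonly string[],
): string[] {
  const wanted = new Set(requested); // duplicates collapse here
  // Filtering the catalog (not the request) yields catalog order.
  return catalog.filter((id) => wanted.has(id));
}

// Assumed catalog order, using scenario ids named elsewhere in this PR.
const catalogSketch = ["subagent-handoff", "memory-recall", "thread-memory-isolation"];

// The caller repeats an id and uses non-catalog order.
const rawRequest = ["memory-recall", "subagent-handoff", "memory-recall"];
const executedIds = selectInCatalogOrderSketch(catalogSketch, rawRequest);
```

In this sketch `rawRequest` has three entries while `executedIds` has two, in a different order, so recording the raw input would misdescribe the run.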
Addresses the pass-4 codex-connector P2 on openclaw#64789: the pass-3 fix always passed `selectedCatalogScenarios.map(...)` to writeQaSuiteArtifacts, which made unfiltered full-suite runs indistinguishable from an explicit all-scenarios selection in the summary metadata. The 'unfiltered → null' semantic (documented in the buildQaSuiteSummaryJson JSDoc and exercised by the "treats an empty scenarioIds array as unspecified" regression) was lost.

Fix: both writeQaSuiteArtifacts call sites now condition on the caller's original `params.scenarioIds`. When the caller passed an explicit non-empty filter, record the post-selection executed list (pass-3 behavior, preserving Set-dedupe + catalog-order normalization). When the caller passed undefined or an empty array, pass undefined to writeQaSuiteArtifacts so buildQaSuiteSummaryJson's length-check serializes null (pass-2 behavior, preserving unfiltered semantics).

This keeps both codex-connector findings satisfied simultaneously:
- explicit --scenario filter reorders/dedupes through the executed list, not the raw caller input
- unfiltered full-suite run records null, not a full catalog dump that would shadow "explicit all-scenarios" selections

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass).

Refs openclaw#64227
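The combined pass-3 and pass-4 rule can be sketched as a single decision at the call site. The helper name `scenarioIdsForSummarySketch` is hypothetical; the commit describes the condition inline at the two writeQaSuiteArtifacts call sites rather than as a named function.

```typescript
// Hypothetical sketch of the call-site decision described above.
function scenarioIdsForSummarySketch(
  requested: readonly string[] | undefined, // the caller's original filter
  executed: readonly string[],              // post-selection executed ids
): string[] | undefined {
  // Explicit non-empty filter: record what actually ran (deduped, catalog order).
  // No filter (undefined or []): pass undefined so the summary serializes null.
  return requested && requested.length > 0 ? [...executed] : undefined;
}

const executedSketch = ["subagent-handoff", "memory-recall"];

const filteredRun = scenarioIdsForSummarySketch(
  ["memory-recall", "subagent-handoff", "memory-recall"],
  executedSketch,
);
const unfilteredRun = scenarioIdsForSummarySketch(undefined, executedSketch);
const emptyFilterRun = scenarioIdsForSummarySketch([], executedSketch);
```

The caller's input decides only whether a filter is recorded; the executed list decides what is recorded, which is how both findings stay satisfied at once.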
…nd three architecture diagrams
…lag, diagram cycles, PR M Does-not-own)
…nd-to-end against the qa-lab mock
…ted type, ordering assertion, remove false-positive positive-tone detection
…he fetchJson in model-switch
Maintainer update: I rebased this rollup onto current What moved into the rescue branch:
What I intentionally left out of the rescue branch:
I’m treating |
84ff65d to a311b94
@pashpashpash I just cleared the conflicts btw; they must have hit while you were doing that. Let me know how I can help.
Thanks for driving the parity-proof work here. I split the stale/conflicted rollup and landed the proof slice via #65664:
That landed the qa-lab parity-proof pieces on current Closing this rollup as superseded by #65664 so the tracker stays aligned with what actually landed. |
Summary
The test harness and CI proof for the GPT-5.4 parity program. Runs GPT-5.4 and Opus 4.6 through the same 11 scenarios, compares the results, and produces a pass/fail verdict — all with real API keys (or without if you prefer).
Part of #64227. See the umbrella for how this fits with #65219 (runtime activation) and #65257 (behavioral fix).
What's in the rollup
This consolidates wave-2 PRs E, J, K, L, M, and N into one reviewable unit:
- `subagent-handoff`, `subagent-fanout-synthesis`, `memory-recall`, `thread-memory-isolation`, `config-restart-capability-flip`.
- `/debug/requests`: prose alone can't satisfy tool-mediated scenarios. `memory-recall` stays prose-only (justified in a comment: prior-turn recall is legitimate).
- `/v1/messages` mock route: baseline lane runs offline through the same scenario dispatcher as the OpenAI route. Supports streaming via `writeAnthropicSse`. Defaults empty-string model to `claude-opus-4-6`.
- `stageQaMockAuthProfiles()` writes placeholder credentials so the gate runs without real API keys. Also fixes the `legacy.registration.ts` bundler bug that blocked scenario execution.
- `run` metadata: each `qa-suite-summary.json` carries a self-describing `run` block (`primaryProvider`, `primaryModel`, `providerMode`, `scenarioIds`).
- `run.primaryProvider` label verification: `buildQaAgenticParityComparison` throws `QaParityLabelMismatchError` when the summary's provider doesn't match the caller label.
- `resolveProviderVariant`: tags mock request snapshots with `"openai" | "anthropic" | "unknown"` so parity consumers can verify which lane each request came from.
- (`.github/workflows/parity-gate.yml`): runs the full gate on every PR touching the parity surface. Uploads artifacts. Fails on `pass: false`.

How this PR relates to the others
Review status
0.