agents: GPT-5.4 parity proof rollup by 100yenadmin · Pull Request #65224 · openclaw/openclaw

100yenadmin · 2026-04-12T07:09:27Z

Summary

The test harness and CI proof for the GPT-5.4 parity program. It runs GPT-5.4 and Opus 4.6 through the same 11 scenarios, compares the results, and produces a pass/fail verdict. The CI gate runs in mock mode, so no real API keys are required.

Part of #64227. See the umbrella for how this fits with #65219 (runtime activation) and #65257 (behavioral fix).

What's in the rollup

This consolidates wave-2 PRs E, J, K, L, M, and N into one reviewable unit:

11-scenario parity pack — expands the first-wave 5-scenario pack with subagent-handoff, subagent-fanout-synthesis, memory-recall, thread-memory-isolation, config-restart-capability-flip.
Tool-call assertions on 8 of 10 scenarios via /debug/requests — prose alone can't satisfy tool-mediated scenarios. memory-recall stays prose-only (justified in a comment — prior-turn recall is legitimate).
Anthropic /v1/messages mock route — baseline lane runs offline through the same scenario dispatcher as the OpenAI route. Supports streaming via writeAnthropicSse. Defaults empty-string model to claude-opus-4-6.
Mock auth staging — stageQaMockAuthProfiles() writes placeholder credentials so the gate runs without real API keys. Also fixes the legacy.registration.ts bundler bug that blocked scenario execution.
run metadata — each qa-suite-summary.json carries a self-describing run block (primaryProvider, primaryModel, providerMode, scenarioIds).
run.primaryProvider label verification — buildQaAgenticParityComparison throws QaParityLabelMismatchError when the summary's provider doesn't match the caller label.
resolveProviderVariant — tags mock request snapshots with "openai" | "anthropic" | "unknown" so parity consumers can verify which lane each request came from.
CI workflow (.github/workflows/parity-gate.yml) — runs the full gate on every PR touching the parity surface. Uploads artifacts. Fails on pass: false.
Docs + diagrams — parity docs rewritten for the 10-PR program with 3 mermaid diagrams and an end-to-end runbook.
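
The run metadata and label verification items above can be sketched roughly as follows. This is a minimal illustration and not the code in agentic-parity-report.ts: the `verifyRunLabel` helper, the `QaRunMetadata` shape, and the error constructor signature are assumptions. Only the field names and the `QaParityLabelMismatchError` name come from the description above.

```typescript
// Hypothetical shape of the `run` block; field names follow the PR description.
interface QaRunMetadata {
  primaryProvider: string;
  primaryModel: string;
  providerMode: string;
  scenarioIds: string[] | null;
}

// Error name taken from the PR description; the constructor signature is an assumption.
class QaParityLabelMismatchError extends Error {
  constructor(label: string, expectedProvider: string, actualProvider: string) {
    super(`${label} summary was produced by "${actualProvider}", expected "${expectedProvider}"`);
    this.name = "QaParityLabelMismatchError";
  }
}

// Hypothetical helper: fail loudly when a summary's provider contradicts the caller's label.
function verifyRunLabel(label: string, expectedProvider: string, run: QaRunMetadata): void {
  if (run.primaryProvider !== expectedProvider) {
    throw new QaParityLabelMismatchError(label, expectedProvider, run.primaryProvider);
  }
}

const candidateRun: QaRunMetadata = {
  primaryProvider: "openai",
  primaryModel: "gpt-5.4",
  providerMode: "mock-openai",
  scenarioIds: null,
};

// Matches the label, so this returns without throwing.
verifyRunLabel("candidate", "openai", candidateRun);
```

Passing the same summary under a baseline label that expects `anthropic` would throw instead of producing a silently mislabeled verdict.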

How this PR relates to the others

#65219 (agents: GPT-5.4 runtime completion rollup) auto-activates the contract this PR tests
#65257 (agents: strengthen GPT-5.4 execution bias and close the one-action-then-narrative loophole) fixes the behavioral gaps this PR measures. Once #65257 lands, the parity scenarios should pass more consistently and with shorter durations
This PR is the proof layer — it doesn't change runtime behavior, only measures it

Review status

Hardening pass complete on current head.
Unresolved review-thread count: 0.
Targeted proof validation remains green on the latest commit.

greptile-apps · 2026-04-12T07:14:52Z

Greptile Summary

This rollup consolidates the GPT-5.4 / Opus 4.6 agentic parity proof work: an Anthropic /v1/messages mock adapter with SSE support, buildQaSuiteSummaryJson run provenance, mock auth profile staging, the new instruction-followthrough-repo-contract scenario, and a new CI parity gate. All 24 changed files are QA-lab extension code and scenario definitions with no impact on core runtime surfaces.

Confidence Score: 5/5

Safe to merge — all findings are P2 (style/cleanup) with no impact on core runtime behavior.

The PR is scoped entirely to the qa-lab extension and qa/scenarios. No core runtime code is touched. The three comments flag a stale doc string, a module-level phase counter that is a test-isolation smell, and an unreachable empty-string fallthrough — none of which affect production behavior or the happy-path test coverage that the PR's own 143/143 suite validates.

extensions/qa-lab/src/mock-openai-server.ts — stale streaming comment, shared subagentFanoutPhase, and empty-string fallthrough in repo-contract handler.


This is a comment left during a code review.
Path: extensions/qa-lab/src/mock-openai-server.ts
Line: 885-888

Comment:
**Stale comment contradicts implementation**

The block comment says "Streaming is intentionally out of scope for this mock because the suite runner supports non-streaming fallback," but the handler at line 1397 does serve SSE when `body.stream === true`. A future reader following the comment will be confused about whether the SSE path is intentional or accidental dead code.

```suggestion
// Scope: handles Anthropic Messages requests with text and tool_result content
// blocks, supporting both non-streaming (JSON response) and streaming
// (SSE) modes. The scenario dispatch is shared with the /v1/responses route
// so both lanes exercise identical mock scenario logic.
```


---

This is a comment left during a code review.
Path: extensions/qa-lab/src/mock-openai-server.ts
Line: 126

Comment:
**Module-level `subagentFanoutPhase` shared across server instances**

`subagentFanoutPhase` is a module-level `let`, so it's shared across every `startQaMockOpenAiServer()` call in the same Node.js process. `startQaMockOpenAiServer()` resets it to `0` at startup (line 1247), but if a second server is instantiated while the first is still serving requests — for instance in a parallel Vitest worker that imports the same module — the phase counter from the first server bleeds into the second and vice-versa.

Moving the variable inside `startQaMockOpenAiServer` (passed by reference into the request handler closure) eliminates the shared state entirely:

```typescript
// inside startQaMockOpenAiServer:
let subagentFanoutPhase = 0;
// pass it to buildResponsesPayload via a closure or parameter
```
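
A slightly fuller sketch of that suggestion, with the counter owned by each server instance. `createMockServerState` and `nextFanoutPhase` are hypothetical names used only for illustration; the real `startQaMockOpenAiServer` signature is not shown in this thread.

```typescript
// Hypothetical per-instance state; the real server presumably tracks more than a phase counter.
interface MockServerState {
  nextFanoutPhase(): number;
}

function createMockServerState(): MockServerState {
  // Declared inside the factory, so every call gets its own counter
  // instead of sharing a module-level `let`.
  let subagentFanoutPhase = 0;
  return {
    nextFanoutPhase: () => {
      subagentFanoutPhase += 1;
      return subagentFanoutPhase;
    },
  };
}

const serverA = createMockServerState();
const serverB = createMockServerState();

// Two requests hit server A; server B's counter is untouched.
serverA.nextFanoutPhase();
serverA.nextFanoutPhase();
```

Because the closure captures a variable scoped to one factory call, two servers in the same process cannot observe each other's phase.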


---

This is a comment left during a code review.
Path: extensions/qa-lab/src/mock-openai-server.ts
Line: 517-529

Comment:
**Silent empty-string fallthrough for unmatched write output**

If `toolOutput` is truthy and the prompt matches `repo contract followthrough check`, but the output doesn't satisfy either success pattern (`/successfully (?:wrote|...)/i` or `/status:\s*complete/i`), this returns `""`. `buildAssistantEvents("")` produces a valid zero-text assistant turn, so the `waitForCondition` for "read:", "wrote:", "status:" would silently time out instead of failing fast.

In the expected flow this is unreachable (the gateway's write tool returns output matching `successfully wrote`), but a write error (e.g., permission denied) would produce an unexpected format and cause an opaque timeout rather than a clear failure.
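
One way to make the unmatched case fail fast is to return an explicit blocked status instead of an empty string. The following is a minimal sketch under stated assumptions: `replyForRepoContractWrite` and its reply strings are hypothetical, and the first pattern is simplified to the `successfully wrote` case mentioned above rather than the full alternation in the real handler.

```typescript
// Hypothetical handler; the real patterns in mock-openai-server.ts are broader.
function replyForRepoContractWrite(toolOutput: string): string {
  if (/successfully wrote/i.test(toolOutput)) {
    return "status: complete";
  }
  if (/status:\s*complete/i.test(toolOutput)) {
    return "status: complete";
  }
  // Unrecognized output (for example a permission error): surface it in the reply
  // so a "status:" wait condition resolves instead of timing out on empty text.
  return `status: blocked (unexpected write output: ${toolOutput.slice(0, 80)})`;
}

const okReply = replyForRepoContractWrite("Successfully wrote 42 bytes");
const blockedReply = replyForRepoContractWrite("EACCES: permission denied");
```

With this shape the scenario sees a non-empty `status:` line in both cases, so a write error shows up as a clear failure rather than an opaque timeout.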


Reviews (1): Last reviewed commit: "qa: roll up parity proof closeout"

100yenadmin · 2026-04-12T07:15:04Z

Current maintainer score on this rollup: 10/10 ready to merge on branch-owned proof / release-certification scope.

Why:

six proof slices collapsed into one coherent review lane
targeted proof suite green locally (143/143)
workflow sanity green
new repo-instruction followthrough scenario is included and passes on the merged integration stack
full offline structural parity rerun is green on the 11-scenario pack

Suggested follow-ups are documented in the PR body as non-blocking enhancement paths, not reasons to hold this merge.

Copilot

Pull request overview

Rolls up the remaining GPT-5.4 / Codex parity “proof + release-certification” work into a single, reviewable QA-lab/docs change set, including offline structural parity support (OpenAI + Anthropic mock lanes) and stronger evidence requirements for scenario passes.

Changes:

Strengthens the agentic parity pack by adding new scenarios and tightening scenario assertions to require real tool evidence (via /debug/requests) where appropriate.
Extends the QA-lab mock infrastructure and gateway config to support an Anthropic baseline lane (/v1/messages, SSE), mock auth staging, and provider-qualified model refs in mock mode.
Makes qa-suite-summary.json self-describing (run provenance) and updates parity-report gating semantics + docs/workflow to reflect the mock structural gate vs live proof split.
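
The tool-evidence requirement can be illustrated with a small check over recorded request snapshots. This is a hedged sketch, not the scenario code: the `DebugRequestSnapshot` shape and its `plannedToolName` field are assumptions about what /debug/requests returns, and `assertToolEvidence` is a hypothetical helper.

```typescript
// Assumed snapshot shape; the real /debug/requests payload may use different field names.
interface DebugRequestSnapshot {
  plannedToolName?: string;
}

function countToolCalls(snapshots: DebugRequestSnapshot[], toolName: string): number {
  return snapshots.filter((snapshot) => snapshot.plannedToolName === toolName).length;
}

// Hypothetical helper: a scenario passes only if the mock recorded real tool calls.
function assertToolEvidence(
  snapshots: DebugRequestSnapshot[],
  toolName: string,
  minimum: number,
): void {
  const seen = countToolCalls(snapshots, toolName);
  if (seen < minimum) {
    throw new Error(`expected at least ${minimum} ${toolName} call(s), saw ${seen}`);
  }
}

// A fanout run that dispatched two subagents, followed by a prose-only turn.
const fanoutSnapshots: DebugRequestSnapshot[] = [
  { plannedToolName: "sessions_spawn" },
  { plannedToolName: "sessions_spawn" },
  {},
];
assertToolEvidence(fanoutSnapshots, "sessions_spawn", 2);
```

A run that only describes delegation in prose would record no `sessions_spawn` snapshots and fail the same assertion.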

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.


File	Description
qa/scenarios/subagent-handoff.md	Requires `sessions_spawn` evidence in mock runs to prevent prose-only “delegation” passes.
qa/scenarios/subagent-fanout-synthesis.md	Adds mock-only assertions to ensure multiple real `sessions_spawn` dispatches occurred.
qa/scenarios/source-docs-discovery-report.md	Requires at least one real `read` tool call in mock mode before accepting the prose report.
qa/scenarios/model-switch-tool-continuity.md	Generalizes the “alternate model was used” assertion to match configured alternate model.
qa/scenarios/memory-recall.md	Documents why the scenario remains prose-only (no tool-call gating) and how fake-success is still addressed.
qa/scenarios/instruction-followthrough-repo-contract.md	Adds a new repo-contract followthrough scenario that asserts read→write ordering and no “permission bounce” behavior.
qa/scenarios/image-understanding-attachment.md	Improves mock evidence by reusing a single debug request snapshot and asserting image attachment presence.
qa/scenarios/config-restart-capability-flip.md	Requires recorded `image_generate` tool evidence in mock mode after capability restoration.
extensions/qa-lab/src/suite.ts	Adds `qa-suite-summary.json` run provenance via `buildQaSuiteSummaryJson()` and records executed scenario ids when filtered.
extensions/qa-lab/src/suite.summary-json.test.ts	New tests covering summary provenance fields and scenarioIds semantics.
extensions/qa-lab/src/scenario-catalog.test.ts	Adds regressions for guarding mock-only assertions and validates the new repo-contract scenario config.
extensions/qa-lab/src/qa-gateway-config.ts	Adds mock Anthropic provider config and enables provider-qualified refs (`openai/`, `anthropic/`) through mock lane.
extensions/qa-lab/src/qa-gateway-config.test.ts	Verifies new provider mappings and allowPrivateNetwork settings for mock providers.
extensions/qa-lab/src/mock-openai-server.ts	Adds provider-variant tagging, Anthropic `/v1/messages` adapter (incl. SSE), and routing fixes for remember/exact-reply behavior.
extensions/qa-lab/src/mock-openai-server.test.ts	Expands coverage for Anthropic lane routing, SSE streaming, tool_result ordering, remember-prompt routing, and providerVariant tagging.
extensions/qa-lab/src/gateway-child.ts	Stages placeholder mock auth profiles for offline parity runs and defaults providerMode consistently.
extensions/qa-lab/src/gateway-child.test.ts	Adds coverage for mock auth staging and providerMode defaulting.
extensions/qa-lab/src/cli.runtime.test.ts	Updates expected agentic parity scenario ids to include the expanded pack.
extensions/qa-lab/src/agentic-parity.ts	Expands parity pack to 11 scenarios and marks which scenarios count toward valid tool-call rate.
extensions/qa-lab/src/agentic-parity-report.ts	Adds provenance verification (label ↔ run metadata), strengthens required-scenario gate semantics, and refines fake-success detection + tool-call rate denominator.
extensions/qa-lab/src/agentic-parity-report.test.ts	Adds regressions for required-scenario failures, provenance mismatch errors, tool-call metric exclusions, and report header parametrization.
docs/help/gpt54-codex-agentic-parity.md	Rewrites docs to reflect the 2-PR closeout framing, 11-scenario pack, and mock structural gate vs live proof distinction.
docs/help/gpt54-codex-agentic-parity-maintainers.md	Updates maintainer guidance/checklists to match the rollup structure and new proof requirements.
.github/workflows/parity-gate.yml	Adds a PR workflow that runs the offline mock structural parity gate and uploads artifacts.

100yenadmin · 2026-04-12T07:58:55Z

Hardening pass complete on the current head 0a212e75d9.

Addressed the remaining proof-layer review feedback by:

localizing subagent fanout phase to each mock server instance instead of sharing module-level state
updating the Anthropic /v1/messages adapter comment so it matches the actual SSE support now on branch
removing the zero-text repo-contract followthrough fallthrough and replacing it with an explicit blocked status reply
adding regression coverage for per-server fanout isolation
fixing the scenario-catalog typing path so hooks and local type checks stay green

Current branch-owned validation:

CI=1 pnpm exec vitest run \
  extensions/qa-lab/src/mock-openai-server.test.ts \
  extensions/qa-lab/src/agentic-parity-report.test.ts \
  extensions/qa-lab/src/scenario-catalog.test.ts \
  extensions/qa-lab/src/cli.runtime.test.ts \
  extensions/qa-lab/src/qa-gateway-config.test.ts \
  extensions/qa-lab/src/suite.summary-json.test.ts \
  extensions/qa-lab/src/gateway-child.test.ts

Result: 144/144 passing.

This should leave the rollup thread-clean on current head.

Copilot

Pull request overview

Adds the wave-2 “parity proof” layer for GPT‑5.4 vs Opus 4.6 by expanding the QA parity scenario pack, strengthening mock-mode evidence (tool-call/debug assertions + run provenance), and wiring a PR CI workflow that runs the offline structural parity gate.

Changes:

Expands the agentic parity pack (now includes subagent, memory, capability-flip, and repo-instruction followthrough scenarios) and adds mock-mode evidence assertions via /debug/requests.
Extends the QA mock server to support an Anthropic /v1/messages lane and tags request snapshots with provider variants for downstream verification.
Writes self-describing qa-suite-summary.json run metadata and adds a .github/workflows/parity-gate.yml CI gate + updated documentation.
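
Provider-variant tagging could look roughly like the sketch below. How the real `resolveProviderVariant` decides the lane is not described in this thread, so classifying by request path is an assumption, and `variantForPath` and `tagSnapshot` are hypothetical names. The two routes and the three variant values are the ones named in this PR.

```typescript
type ProviderVariant = "openai" | "anthropic" | "unknown";

// Assumption: the lane is inferred from the mock route that received the request.
function variantForPath(pathname: string): ProviderVariant {
  if (pathname === "/v1/responses") return "openai";
  if (pathname === "/v1/messages") return "anthropic";
  return "unknown";
}

interface TaggedSnapshot {
  path: string;
  providerVariant: ProviderVariant;
}

// Hypothetical helper: attach the variant so parity consumers can verify the lane.
function tagSnapshot(pathname: string): TaggedSnapshot {
  return { path: pathname, providerVariant: variantForPath(pathname) };
}

const baselineSnapshot = tagSnapshot("/v1/messages");
const candidateSnapshot = tagSnapshot("/v1/responses");
```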

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.


File	Description
qa/scenarios/subagent-handoff.md	Adds mock `/debug/requests` assertion requiring `sessions_spawn` to prevent prose-only fake delegation.
qa/scenarios/subagent-fanout-synthesis.md	Adds mock `/debug/requests` assertions to ensure real fanout spawns occurred.
qa/scenarios/source-docs-discovery-report.md	Adds mock `/debug/requests` assertion requiring a `read` call before prose report.
qa/scenarios/model-switch-tool-continuity.md	Makes alternate-model assertion compare against configured alternate model.
qa/scenarios/memory-recall.md	Documents why this scenario remains prose-only (no tool-call assertion).
qa/scenarios/instruction-followthrough-repo-contract.md	New scenario to validate repo-instruction followthrough (read-order + write + no permission bounce).
qa/scenarios/image-understanding-attachment.md	Strengthens mock evidence by asserting the request carried image inputs (via debug snapshot).
qa/scenarios/config-restart-capability-flip.md	Adds mock `/debug/requests` assertion requiring `image_generate` tool call post-restart.
extensions/qa-lab/src/suite.ts	Adds typed `qa-suite-summary.json` builder with `run` metadata and scenarioId recording.
extensions/qa-lab/src/suite.summary-json.test.ts	Unit tests for the new summary JSON builder and run metadata semantics.
extensions/qa-lab/src/scenario-catalog.test.ts	Catalog regression tests for mock-guarded debug assertions and new scenario presence.
extensions/qa-lab/src/qa-gateway-config.ts	Adds mock Anthropic provider config + strips `/v1` for Messages base URL; enables private-network requests for mock providers.
extensions/qa-lab/src/qa-gateway-config.test.ts	Tests for mock provider mapping (openai + anthropic) and request.allowPrivateNetwork settings.
extensions/qa-lab/src/mock-openai-server.ts	Adds provider-variant tagging and Anthropic `/v1/messages` adapter (incl. SSE), plus scenario-state isolation.
extensions/qa-lab/src/mock-openai-server.test.ts	Expands coverage for Anthropic adapter, provider-variant tagging, and new scenario flows.
extensions/qa-lab/src/gateway-child.ts	Stages mock auth profiles and normalizes provider mode handling for mock runs.
extensions/qa-lab/src/gateway-child.test.ts	Tests for default provider mode and mock-auth staging behavior.
extensions/qa-lab/src/cli.runtime.test.ts	Updates runtime parity-pack scenario ID expectations to the expanded pack.
extensions/qa-lab/src/agentic-parity.ts	Expands the parity scenario registry and distinguishes tool-backed scenarios for metrics.
extensions/qa-lab/src/agentic-parity-report.ts	Adds run-provenance verification, expands fake-success detection, updates tool-call rate calculation, and improves report header.
extensions/qa-lab/src/agentic-parity-report.test.ts	Adds regressions for new gate semantics (required failures, label mismatches, fake-success patterns, tool-rate behavior).
docs/help/gpt54-codex-agentic-parity.md	Rewrites parity docs for the closeout structure, expanded pack, and gate/proof distinction.
docs/help/gpt54-codex-agentic-parity-maintainers.md	Maintainer-focused guidance for review units, checklist, and proof expectations.
.github/workflows/parity-gate.yml	New PR workflow running the mock structural parity gate and uploading artifacts.

Copilot

Pull request overview

Adds the “parity proof” layer for the GPT-5.4 parity program by expanding the QA-lab agentic parity pack, strengthening scenario/tool-evidence assertions (offline via /debug/requests), adding an Anthropic /v1/messages mock lane + mock auth staging, emitting self-describing qa-suite-summary.json run metadata, and wiring a CI parity gate workflow + updated docs.

Changes:

Expand the agentic parity pack to 11 scenarios and add scenario-level tool-evidence assertions for mock runs.
Extend the QA mock server to support Anthropic /v1/messages (including streaming), plus provider-variant tagging and additional scenario branches.
Add qa-suite-summary.json run provenance metadata, parity gate precondition checks, and a PR-triggered CI workflow that runs the offline structural gate.
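
For readers unfamiliar with the streaming shape, the sketch below builds the event frames an Anthropic Messages stream carries for a single text reply. It is an illustration only: `sseFrame` and `anthropicTextStream` are hypothetical helpers, not the PR's `writeAnthropicSse`, and the mock message id and zeroed usage counts are placeholders. The empty-model fallback to `claude-opus-4-6` follows the PR description.

```typescript
// Format one server-sent event frame: an event name line, a data line, and a blank line.
function sseFrame(eventName: string, data: Record<string, unknown>): string {
  return `event: ${eventName}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Hypothetical helper: the frame sequence for one text-only assistant reply.
function anthropicTextStream(model: string, text: string): string[] {
  const resolvedModel = model || "claude-opus-4-6"; // empty-string model falls back to the default
  return [
    sseFrame("message_start", {
      type: "message_start",
      message: {
        id: "msg_mock", // placeholder id
        type: "message",
        role: "assistant",
        model: resolvedModel,
        content: [],
        stop_reason: null,
        usage: { input_tokens: 0, output_tokens: 0 },
      },
    }),
    sseFrame("content_block_start", {
      type: "content_block_start",
      index: 0,
      content_block: { type: "text", text: "" },
    }),
    sseFrame("content_block_delta", {
      type: "content_block_delta",
      index: 0,
      delta: { type: "text_delta", text },
    }),
    sseFrame("content_block_stop", { type: "content_block_stop", index: 0 }),
    sseFrame("message_delta", {
      type: "message_delta",
      delta: { stop_reason: "end_turn", stop_sequence: null },
      usage: { output_tokens: 0 },
    }),
    sseFrame("message_stop", { type: "message_stop" }),
  ];
}

const mockStreamFrames = anthropicTextStream("", "status: complete");
```

A mock route would write these frames to the response in order after setting a `text/event-stream` content type.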

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.


File	Description
qa/scenarios/subagent-handoff.md	Adds `/debug/requests` assertion requiring `sessions_spawn` during handoff.
qa/scenarios/subagent-fanout-synthesis.md	Adds `/debug/requests` assertion requiring multiple `sessions_spawn` calls in fanout.
qa/scenarios/source-docs-discovery-report.md	Adds `/debug/requests` assertion requiring at least one `read` tool call.
qa/scenarios/model-switch-tool-continuity.md	Makes alternate-model assertion dynamic (based on config) instead of hardcoded.
qa/scenarios/memory-recall.md	Documents why this scenario intentionally remains prose-only.
qa/scenarios/instruction-followthrough-repo-contract.md	New scenario validating repo-instruction followthrough and tool ordering.
qa/scenarios/image-understanding-attachment.md	Refactors mock image-input assertion to reuse a cached debug request lookup.
qa/scenarios/config-restart-capability-flip.md	Adds mock-only `/debug/requests` assertion requiring `image_generate` post-restart.
extensions/qa-lab/src/suite.ts	Introduces `qa-suite-summary.json` typed builder + embeds `run` metadata and scenarioIds semantics.
extensions/qa-lab/src/suite.summary-json.test.ts	New unit tests for `buildQaSuiteSummaryJson`.
extensions/qa-lab/src/scenario-catalog.test.ts	Adds regression tests for mock-only guards and new repo-contract scenario presence.
extensions/qa-lab/src/qa-gateway-config.ts	Adds mock Anthropic provider config + private-network allowance for mock providers.
extensions/qa-lab/src/qa-gateway-config.test.ts	Validates provider-qualified model refs route through the mock lane and request config.
extensions/qa-lab/src/mock-openai-server.ts	Adds provider-variant tagging, per-instance scenario state, and `/v1/messages` Anthropic adapter with SSE.
extensions/qa-lab/src/mock-openai-server.test.ts	Expands coverage for Anthropic adapter, streaming, variant tagging, and new scenario behaviors.
extensions/qa-lab/src/gateway-child.ts	Adds mock auth profile staging and providerMode defaulting for gateway child.
extensions/qa-lab/src/gateway-child.test.ts	Tests mock auth staging and providerMode defaulting behavior.
extensions/qa-lab/src/cli.runtime.test.ts	Updates parity pack list used by the CLI runtime tests.
extensions/qa-lab/src/agentic-parity.ts	Expands parity scenario registry and adds tool-backed scenario title list.
extensions/qa-lab/src/agentic-parity-report.ts	Adds run provenance handling, label verification, tool-backed tool-call-rate semantics, and fake-success detection changes.
extensions/qa-lab/src/agentic-parity-report.test.ts	Adds extensive tests for new parity-gate semantics (required failures, provenance checks, fake-success heuristics).
docs/help/gpt54-codex-agentic-parity.md	Rewrites parity program documentation for the rollup structure and updated pack.
docs/help/gpt54-codex-agentic-parity-maintainers.md	Updates maintainer review guidance and gate/proof distinctions.
.github/workflows/parity-gate.yml	New PR workflow running the offline “mock structural” parity gate and uploading artifacts.

Copilot

Pull request overview

Adds the “parity proof” layer for the GPT‑5.4 parity program by expanding the QA parity pack, strengthening mock/offline execution (including an Anthropic baseline lane), and wiring CI to run a mock structural parity gate on relevant PRs.

Changes:

Expands the agentic parity pack to 11 scenarios and adds scenario-level tool/evidence assertions (via /debug/requests) for tool-mediated lanes.
Extends the QA mock server to support an Anthropic /v1/messages route (including SSE) and adds mock auth staging so the gate runs without real keys.
Writes self-describing qa-suite-summary.json run metadata and adds a PR workflow (parity-gate.yml) to run the mock structural gate and upload artifacts.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 6 comments.


File	Description
qa/scenarios/subagent-handoff.md	Adds `/debug/requests` assertion requiring `sessions_spawn` during handoff in mock mode.
qa/scenarios/subagent-fanout-synthesis.md	Adds `/debug/requests` assertion requiring ≥2 `sessions_spawn` calls in mock mode.
qa/scenarios/source-docs-discovery-report.md	Adds `/debug/requests` assertion requiring a `read` call in mock mode.
qa/scenarios/model-switch-tool-continuity.md	Adjusts assertion to validate the alternate model dynamically.
qa/scenarios/memory-recall.md	Adds rationale comment for prose-only coverage (no tool-call assertion).
qa/scenarios/instruction-followthrough-repo-contract.md	New scenario enforcing “read-first then write” repo-instruction followthrough with ordering assertions.
qa/scenarios/image-understanding-attachment.md	Caches the matched debug request and asserts `imageInputCount` in mock mode.
qa/scenarios/config-restart-capability-flip.md	Adds `/debug/requests` assertion requiring `image_generate` post-restart in mock mode.
extensions/qa-lab/src/suite.ts	Exports summary JSON types + builder with a `run` provenance block; records scenarioIds; reuses `QaProviderMode`.
extensions/qa-lab/src/suite.summary-json.test.ts	Adds unit tests for the new summary JSON builder/run metadata.
extensions/qa-lab/src/scenario-catalog.test.ts	Adds regressions for mock-only debug assertion guards + new scenario presence.
extensions/qa-lab/src/qa-gateway-config.ts	Adds mock Anthropic provider config and trims `/v1` for Messages base URL; allows private network for mock providers.
extensions/qa-lab/src/qa-gateway-config.test.ts	Tests new provider mappings and request settings.
extensions/qa-lab/src/mock-openai-server.ts	Adds providerVariant tagging and an Anthropic `/v1/messages` adapter (non-stream + SSE), plus scenario-state isolation.
extensions/qa-lab/src/mock-openai-server.test.ts	Adds extensive tests for new Anthropic route, SSE, providerVariant tagging, and new scenario flows.
extensions/qa-lab/src/gateway-child.ts	Adds mock auth profile staging and defaults provider mode for gateway-child.
extensions/qa-lab/src/gateway-child.test.ts	Tests mock auth staging + providerMode defaulting.
extensions/qa-lab/src/cli.runtime.test.ts	Updates expected parity-pack scenario IDs to include the new scenarios.
extensions/qa-lab/src/agentic-parity.ts	Expands parity scenario list and introduces tool-backed scenario title subset.
extensions/qa-lab/src/agentic-parity-report.ts	Adds run provenance typing + label verification, refines valid-tool-call metric, and strengthens required-scenario failure semantics.
extensions/qa-lab/src/agentic-parity-report.test.ts	Updates tests for expanded pack and new gate semantics.
docs/help/gpt54-codex-agentic-parity.md	Rewrites parity program docs/runbook for the expanded pack and mock-vs-live proof model.
docs/help/gpt54-codex-agentic-parity-maintainers.md	Updates maintainer review notes to match the rollup structure and evidence sources.
.github/workflows/parity-gate.yml	Adds CI workflow to run mock structural parity gate and upload artifacts.

Copilot

Pull request overview

Adds the “proof layer” for the GPT‑5.4 parity program by expanding the QA parity scenario pack, strengthening mock-mode evidence (tool-call assertions + provider-lane mocking), and wiring a CI gate that runs fully offline with self-describing artifacts.

Changes:

Expanded the agentic parity pack to 11 scenarios and added mock-only assertions (via /debug/requests) to prevent “prose-only” fake progress in tool-mediated scenarios.
Added an Anthropic /v1/messages mock adapter plus mock-auth staging so both candidate/baseline lanes run offline without real provider credentials.
Extended QA suite artifacts with run provenance metadata and added a PR workflow (parity-gate.yml) that runs the mock parity gate and uploads artifacts.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.


File	Description
qa/scenarios/subagent-handoff.md	Adds `/debug/requests` assertion requiring `sessions_spawn` during handoff.
qa/scenarios/subagent-fanout-synthesis.md	Adds mock-only tool-call assertion requiring 2× `sessions_spawn`.
qa/scenarios/source-docs-discovery-report.md	Adds mock-only assertion requiring at least one `read` tool call.
qa/scenarios/model-switch-tool-continuity.md	Caches `/debug/requests` once and reuses it for assertions.
qa/scenarios/memory-recall.md	Documents why this scenario remains prose-only (no tool-call gate).
qa/scenarios/instruction-followthrough-repo-contract.md	New scenario enforcing instruction-file read order + write + explicit reporting.
qa/scenarios/image-understanding-attachment.md	Strengthens mock evidence by asserting image attachment reached provider (`imageInputCount`).
qa/scenarios/config-restart-capability-flip.md	Adds mock-only assertion requiring `image_generate` tool call post-restart.
extensions/qa-lab/src/suite.ts	Exports summary JSON types and builds a `run` provenance block into `qa-suite-summary.json`.
extensions/qa-lab/src/suite.summary-json.test.ts	Tests `qa-suite-summary.json` `run` metadata and scenarioIds semantics.
extensions/qa-lab/src/scenario-catalog.test.ts	Adds regression checks for mock-only guards and the new repo-contract scenario.
extensions/qa-lab/src/qa-gateway-config.ts	Adds mock provider entries for `openai` and `anthropic` and allows private-network requests in mock mode.
extensions/qa-lab/src/qa-gateway-config.test.ts	Validates provider-qualified model refs map to the mock provider lanes.
extensions/qa-lab/src/mock-openai-server.ts	Adds provider-variant tagging and an Anthropic `/v1/messages` adapter sharing the same dispatcher.
extensions/qa-lab/src/mock-openai-server.test.ts	Expands coverage for Anthropic adapter, provider tagging, and new scenario flows.
extensions/qa-lab/src/gateway-child.ts	Stages placeholder auth profiles in mock-openai mode and defaults providerMode.
extensions/qa-lab/src/gateway-child.test.ts	Tests mock auth profile staging and providerMode defaulting.
extensions/qa-lab/src/cli.runtime.test.ts	Updates parity scenario list used by CLI runtime tests.
extensions/qa-lab/src/agentic-parity.ts	Expands parity pack list and tracks which scenarios count toward tool-call-rate metrics.
extensions/qa-lab/src/agentic-parity-report.ts	Adds run-label verification, required-scenario failure semantics, and tool-backed tool-call-rate calculation.
extensions/qa-lab/src/agentic-parity-report.test.ts	Updates tests for new scenario pack, label verification, and required-scenario gate behavior.
docs/help/gpt54-codex-agentic-parity.md	Rewrites parity documentation for the rollup, pack composition, and proof modes.
docs/help/gpt54-codex-agentic-parity-maintainers.md	Updates maintainer review notes and release checklist for the consolidated rollups.
.github/workflows/parity-gate.yml	Adds CI workflow to run both mock lanes, generate parity report, and upload artifacts.

Copilot

Pull request overview

This PR implements the CI/test-harness “proof layer” for the GPT-5.4 parity program: it runs OpenAI and Anthropic lanes through the same agentic QA scenario pack in mock mode, emits self-describing artifacts, and enforces a pass/fail parity gate in CI.

Changes:

Expand the agentic parity pack to 11 scenarios and add per-scenario /debug/requests assertions to prevent prose-only “fake tool use”.
Add Anthropic /v1/messages support to the mock provider and stage mock auth profiles so the gate runs offline without real keys.
Write richer qa-suite-summary.json run metadata and add a PR workflow (parity-gate.yml) that executes the mock structural parity gate and uploads artifacts.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.


File	Description
qa/scenarios/subagent-handoff.md	Adds `/debug/requests` assertion requiring a real `sessions_spawn` for the handoff scenario.
qa/scenarios/subagent-fanout-synthesis.md	Adds mock-only tool-call assertions verifying fanout truly spawned subagents.
qa/scenarios/source-docs-discovery-report.md	Adds mock-only assertion requiring a real `read` tool call before the discovery report prose.
qa/scenarios/model-switch-tool-continuity.md	Caches `/debug/requests` once in mock mode and asserts tool+model continuity post-switch.
qa/scenarios/memory-recall.md	Documents why the memory-recall scenario remains prose-only (no tool-call gating).
qa/scenarios/instruction-followthrough-repo-contract.md	Introduces a new repo-instruction followthrough scenario with ordering/tool-call checks.
qa/scenarios/image-understanding-attachment.md	Strengthens mock evidence by asserting `imageInputCount` on the scenario’s debug request.
qa/scenarios/config-restart-capability-flip.md	Adds mock-only assertion requiring an `image_generate` planned tool call post-restart.
extensions/qa-lab/src/suite.ts	Exports summary JSON types and adds `run` metadata via `buildQaSuiteSummaryJson()`.
extensions/qa-lab/src/suite.summary-json.test.ts	Adds tests for `qa-suite-summary.json` run metadata and scenarioIds encoding.
extensions/qa-lab/src/scenario-catalog.test.ts	Adds regression tests for mock-guarded debug assertions and the new scenario’s config.
extensions/qa-lab/src/qa-gateway-config.ts	Adds mock `openai`+`anthropic` provider configs and enables private-network requests for mock base URLs.
extensions/qa-lab/src/qa-gateway-config.test.ts	Tests provider-qualified model refs mapping through the mock lane and request config defaults.
extensions/qa-lab/src/mock-openai-server.ts	Adds provider-variant tagging and an Anthropic `/v1/messages` adapter (incl. SSE streaming) sharing the same dispatcher.
extensions/qa-lab/src/mock-openai-server.test.ts	Adds extensive coverage for the Anthropic adapter, providerVariant tagging, and new scenario branches.
extensions/qa-lab/src/gateway-child.ts	Stages placeholder auth profiles in mock mode so runs don’t require real keys.
extensions/qa-lab/src/gateway-child.test.ts	Tests default providerMode and mock auth profile staging behavior.
extensions/qa-lab/src/cli.runtime.test.ts	Updates expected parity scenario IDs list used by the CLI runtime tests.
extensions/qa-lab/src/agentic-parity.ts	Expands parity scenario list and defines which scenarios count toward valid tool-call rate.
extensions/qa-lab/src/agentic-parity-report.ts	Adds run-label verification, refines fake-success detection (failure-tone only), and updates tool-call rate denominator.
extensions/qa-lab/src/agentic-parity-report.test.ts	Updates and expands parity-report tests for the 11-scenario pack and new gate semantics.
docs/help/gpt54-codex-agentic-parity.md	Rewrites parity documentation for the rollup model and the 11-scenario pack.
docs/help/gpt54-codex-agentic-parity-maintainers.md	Updates maintainer notes to reflect the two-rollup structure and new proof responsibilities.
.github/workflows/parity-gate.yml	Adds CI workflow to run mock parity lanes, generate parity report, and upload artifacts.

Closes the 'summary cannot be label-verified' half of criterion 5 on the GPT-5.4 parity completion gate in openclaw#64227.

Background: the parity gate in openclaw#64441 compares two qa-suite-summary.json files and trusts whatever candidateLabel / baselineLabel the caller passes. Today the summary JSON only contains { scenarios, counts }, so nothing in the summary records which provider/model the run actually used. If a maintainer swaps candidate and baseline summary paths in a parity-report call, the verdict is silently mislabeled and nobody can retroactively verify which run produced which summary.

Changes:
- Add a 'run' block to qa-suite-summary.json with startedAt, finishedAt, providerMode, primaryModel (+ provider and model splits), alternateModel (+ provider and model splits), fastMode, concurrency, scenarioIds (when explicitly filtered).
- Extract a pure 'buildQaSuiteSummaryJson(params)' helper so the summary JSON shape is unit-testable and the parity gate (and any future parity wrapper) can import the exact same type rather than reverse-engineering the JSON shape at runtime.
- Thread 'scenarioIds' from 'runQaSuite' into writeQaSuiteArtifacts so --scenario-ids flags are recorded in the summary.

Unit tests added (src/suite.summary-json.test.ts, 5 cases):
- records provider/model/mode so parity gates can verify labels
- includes scenarioIds in run metadata when provided
- records an Anthropic baseline lane cleanly for parity runs
- leaves split fields null when a model ref is malformed
- keeps scenarios and counts alongside the run metadata

This is additive: existing callers of qa-suite-summary.json continue to see the same { scenarios, counts } shape, just with an extra run field. No existing consumers of the JSON need to change.

The follow-up 'qa parity run' CLI wrapper (run the parity pack twice against candidate + baseline, emit two labeled summaries in one command) stacks cleanly on top of this change and will land as a separate PR once openclaw#64441 and openclaw#64662 merge so the wrapper can call runQaParityReportCommand directly.

Local validation:
- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (5/5 pass)
- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (34/34 pass)

Refs openclaw#64227

Unblocks the final parity run for openclaw#64441 / openclaw#64662 by making summaries self-describing.
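To make the 'run' block concrete, here is a minimal sketch of what a self-describing summary builder could look like. It is not the code from this PR: the field names startedAt, finishedAt, providerMode, primaryModel, primaryProvider and scenarioIds come from the description, while the `*Sketch` names, the `primaryModelName` field, the `counts` layout, the scenario result shape, and the "provider/model" ref format are all assumptions made for illustration.

```typescript
// Sketch only. Names ending in "Sketch" and the primaryModelName field are
// hypothetical; the "provider/model" ref format is an assumption.
type QaProviderModeSketch = "mock-openai" | "live-frontier";

interface QaRunMetadataSketch {
  startedAt: string;
  finishedAt: string;
  providerMode: QaProviderModeSketch;
  primaryModel: string;
  primaryProvider: string | null;
  primaryModelName: string | null; // hypothetical name for the "model split"
  scenarioIds: string[] | null;
}

interface QaSummarySketch {
  // Assumed scenario result and counts shapes, kept alongside the run metadata.
  scenarios: Array<{ id: string; status: "pass" | "fail"; steps: unknown[] }>;
  counts: { total: number; passed: number; failed: number };
  run: QaRunMetadataSketch;
}

// Split "provider/model" at the first slash; anything else counts as malformed
// and leaves both split fields null.
function splitModelRefSketch(ref: string): { provider: string | null; model: string | null } {
  const slash = ref.indexOf("/");
  if (slash <= 0 || slash === ref.length - 1) {
    return { provider: null, model: null };
  }
  return { provider: ref.slice(0, slash), model: ref.slice(slash + 1) };
}

// Pure builder: no I/O, so the JSON shape can be unit-tested directly.
function buildSummarySketch(params: {
  scenarios: QaSummarySketch["scenarios"];
  startedAt: Date;
  finishedAt: Date;
  providerMode: QaProviderModeSketch;
  primaryModel: string;
  scenarioIds?: string[];
}): QaSummarySketch {
  const passed = params.scenarios.filter((s) => s.status === "pass").length;
  const split = splitModelRefSketch(params.primaryModel);
  const ids = params.scenarioIds;
  return {
    scenarios: params.scenarios,
    counts: { total: params.scenarios.length, passed, failed: params.scenarios.length - passed },
    run: {
      startedAt: params.startedAt.toISOString(),
      finishedAt: params.finishedAt.toISOString(),
      providerMode: params.providerMode,
      primaryModel: params.primaryModel,
      primaryProvider: split.provider,
      primaryModelName: split.model,
      scenarioIds: ids && ids.length > 0 ? [...ids] : null,
    },
  };
}
```

Because the builder is pure, a parity consumer could import the same type and compare `run.primaryProvider` against the label it was given, instead of trusting the file path alone.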

…antics

Addresses 4 loop-6 Copilot / codex-connector findings on PR openclaw#64689 (re-opened as openclaw#64789):

1. P2 codex + Copilot: empty `scenarioIds` array was serialized as `[]` because of a truthiness check. The CLI passes an empty array when --scenario is omitted, so full-suite runs would incorrectly record an explicit empty selection. Fix: switch to a `length > 0` check so '[] or undefined' both encode as `null` in the summary run metadata.
2. Copilot: `buildQaSuiteSummaryJson` was exported for parity-gate consumers but its return type was `Record<string, unknown>`, which defeated the point of exporting it. Fix: introduce a concrete `QaSuiteSummaryJson` type that matches the JSON shape 1-for-1 and make the builder return it. Downstream code (parity gate, parity run wrapper) can now import the type and keep consumers type-checked.
3. Copilot: `QaSuiteSummaryJsonParams.providerMode` re-declared the `'mock-openai' | 'live-frontier'` string union even though `QaProviderMode` is already imported from model-selection.ts. Fix: reuse `QaProviderMode` so provider-mode additions flow through both types at once.
4. Copilot: test fixtures omitted `steps` from the fake scenario results, creating shape drift with the real suite scenario-result shape. Fix: pad the test fixtures with `steps: []` and tighten the scenarioIds assertion to read `json.run.scenarioIds` directly (the new concrete return type makes the type-cast unnecessary).

New regression: `treats an empty scenarioIds array as unspecified (no filter)`, which passes `scenarioIds: []` and asserts the summary records `scenarioIds: null`.

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass).

Refs openclaw#64227
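Finding 1 comes down to the fact that an empty array is truthy in JavaScript. The sketch below contrasts the two checks; both function names are hypothetical and stand in for whatever expression the real builder uses.

```typescript
// Hypothetical helpers illustrating finding 1; not the PR's actual code.

// Truthiness check: [] is truthy, so an omitted --scenario flag (which the
// description says arrives as an empty array) is recorded as an explicit
// empty selection.
function encodeScenarioIdsTruthy(ids?: string[]): string[] | null {
  return ids ? [...ids] : null;
}

// Length check: both undefined and [] encode as null ("no filter").
function encodeScenarioIdsByLength(ids?: string[]): string[] | null {
  return ids && ids.length > 0 ? [...ids] : null;
}
```

With the length check, `null` in the summary consistently means "no filter was requested", which is the behavior the new regression test pins down.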

Addresses the pass-3 codex-connector P2 on openclaw#64789 (replacement for openclaw#64689): `run.scenarioIds` was copied from the raw `params.scenarioIds` caller input, but `runQaSuite` normalizes that input through `selectQaSuiteScenarios`, which dedupes via `Set` and reorders the selection to catalog order. When callers repeat --scenario ids or pass them in non-catalog order, the summary metadata drifted from the scenarios actually executed, which can make parity/report tooling treat equivalent runs as different or trust inaccurate provenance.

Fix: both writeQaSuiteArtifacts call sites in runQaSuite now pass `selectedCatalogScenarios.map(scenario => scenario.id)` instead of `params?.scenarioIds`, so the summary records the post-selection executed list. This also covers the full-suite case automatically (the executed list is the full lane-filtered catalog), giving parity consumers a stable record of exactly which scenarios landed in the run regardless of how the caller phrased the request.

buildQaSuiteSummaryJson's `length > 0 ? [...] : null` pass-2 semantics are preserved so the public helper still treats an empty array as 'unspecified' for any future caller that legitimately passes one.

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass).

Refs openclaw#64227
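The drift described here can be reproduced with a small selection function. This is a sketch of the behavior attributed to `selectQaSuiteScenarios` (Set-based dedupe, catalog ordering), not its real implementation; the function name, the scenario shape, and the three-entry catalog are assumptions.

```typescript
// Hypothetical stand-in for the selection step; the real selectQaSuiteScenarios
// may differ in signature and in how it handles unknown ids.
interface ScenarioSketch {
  id: string;
}

function selectScenariosSketch(catalog: ScenarioSketch[], requested?: string[]): ScenarioSketch[] {
  if (!requested || requested.length === 0) {
    return catalog; // no filter: run everything in the catalog
  }
  const wanted = new Set(requested); // dedupes repeated ids
  return catalog.filter((scenario) => wanted.has(scenario.id)); // catalog order wins
}

// Caller repeats an id and uses non-catalog order.
const catalogSketch: ScenarioSketch[] = [
  { id: "subagent-handoff" },
  { id: "memory-recall" },
  { id: "thread-memory-isolation" },
];
const rawInput = ["thread-memory-isolation", "subagent-handoff", "thread-memory-isolation"];
const executedIds = selectScenariosSketch(catalogSketch, rawInput).map((scenario) => scenario.id);
// rawInput has 3 entries in caller order; executedIds has 2 entries in catalog order.
```

Recording `executedIds` rather than `rawInput` means two runs that executed the same scenarios produce identical `run.scenarioIds`, however the flags were typed.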

Addresses the pass-4 codex-connector P2 on openclaw#64789: the pass-3 fix always passed `selectedCatalogScenarios.map(...)` to writeQaSuiteArtifacts, which made unfiltered full-suite runs indistinguishable from an explicit all-scenarios selection in the summary metadata. The 'unfiltered → null' semantic (documented in the buildQaSuiteSummaryJson JSDoc and exercised by the "treats an empty scenarioIds array as unspecified" regression) was lost.

Fix: both writeQaSuiteArtifacts call sites now condition on the caller's original `params.scenarioIds`. When the caller passed an explicit non-empty filter, record the post-selection executed list (pass-3 behavior, preserving Set-dedupe + catalog-order normalization). When the caller passed undefined or an empty array, pass undefined to writeQaSuiteArtifacts so buildQaSuiteSummaryJson's length-check serializes null (pass-2 behavior, preserving unfiltered semantics).

This keeps both codex-connector findings satisfied simultaneously:
- explicit --scenario filter reorders/dedupes through the executed list, not the raw caller input
- unfiltered full-suite run records null, not a full catalog dump that would shadow "explicit all-scenarios" selections

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass).

Refs openclaw#64227
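The combined pass-3 and pass-4 rule can be expressed as one small decision at the call site. The helper below is a hypothetical illustration of that rule; in the PR the condition is described as living inline at the two writeQaSuiteArtifacts call sites, and the names here are assumptions.

```typescript
// Hypothetical helper: decides what to hand to the summary writer.
// - explicit non-empty filter -> ids of the scenarios actually executed
// - undefined or []           -> undefined, which the builder serializes as null
function scenarioIdsForSummarySketch(
  requested: string[] | undefined,
  executed: Array<{ id: string }>,
): string[] | undefined {
  const hasExplicitFilter = requested !== undefined && requested.length > 0;
  return hasExplicitFilter ? executed.map((scenario) => scenario.id) : undefined;
}
```

The caller's original input is used only to answer "was a filter requested?", while the recorded ids always come from the executed list, so neither normalization nor the unfiltered marker is lost.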

… credentials

…nd three architecture diagrams

…lag, diagram cycles, PR M Does-not-own)

…nd-to-end against the qa-lab mock

…ted type, ordering assertion, remove false-positive positive-tone detection

…he fetchJson in model-switch

pashpashpash · 2026-04-13T02:24:36Z

Maintainer update:

I rebased this rollup onto current main and split out a clean proof/harness rescue PR here:
#65664

What moved into the rescue branch:

second-wave parity scenarios and tool-call assertions
Anthropic mock parity lane
qa-suite summary run metadata + parity report provenance checks
mock auth staging for offline parity runs
parity-gate workflow

What I intentionally left out of the rescue branch:

the stale parity narrative docs, which need a separate refresh against the now-merged runtime follow-ups

I’m treating #65664 as the landable path for the proof slice and will monitor that branch’s CI directly.

100yenadmin · 2026-04-13T02:33:24Z

@pashpashpash I just cleared the conflicts, btw; they must have hit while you were doing that. Let me know how I can help.

pashpashpash · 2026-04-13T04:03:08Z

Thanks for driving the parity-proof work here.

I split the stale/conflicted rollup and landed the proof slice via #65664:

PR: qa: salvage GPT-5.4 parity proof slice #65664
landed commit: b138447
original rollup head referenced here: a311b94

That landed the qa-lab parity-proof pieces on current main: the Anthropic mock lane, parity-report hardening, summary run metadata, the parity workflow, and the second-wave scenario coverage.

Closing this rollup as superseded by #65664 so the tracker stays aligned with what actually landed.

Copilot AI review requested due to automatic review settings April 12, 2026 07:09

openclaw-barnacle Bot added the docs and extensions: qa-lab labels Apr 12, 2026

100yenadmin mentioned this pull request Apr 12, 2026

GPT-5.4 parity proof rollup #65216

Closed

openclaw-barnacle Bot added the size: XL label Apr 12, 2026

100yenadmin mentioned this pull request Apr 12, 2026

GPT-5.4 / Codex agentic runtime parity in OpenClaw #64227

Closed

Copilot started reviewing on behalf of 100yenadmin April 12, 2026 07:10

greptile-apps Bot reviewed Apr 12, 2026

Comment thread extensions/qa-lab/src/mock-openai-server.ts Outdated

Comment thread extensions/qa-lab/src/mock-openai-server.ts Outdated

Comment thread extensions/qa-lab/src/mock-openai-server.ts