Skip to content

agents: GPT-5.4 parity proof rollup #65224

Closed
100yenadmin wants to merge 48 commits intoopenclaw:mainfrom
electricsheephq:rollup/gpt54-parity-proof
Closed

agents: GPT-5.4 parity proof rollup #65224
100yenadmin wants to merge 48 commits intoopenclaw:mainfrom
electricsheephq:rollup/gpt54-parity-proof

Conversation

@100yenadmin
Copy link
Copy Markdown
Contributor

@100yenadmin 100yenadmin commented Apr 12, 2026

Summary

The test harness and CI proof for the GPT-5.4 parity program. Runs GPT-5.4 and Opus 4.6 through the same 11 scenarios, compares the results, and produces a pass/fail verdict — all with real API keys (or without if you prefer).

Part of #64227. See the umbrella for how this fits with #65219 (runtime activation) and #65257 (behavioral fix).

What's in the rollup

This consolidates wave-2 PRs E, J, K, L, M, and N into one reviewable unit:

  • 11-scenario parity pack — expands the first-wave 5-scenario pack with subagent-handoff, subagent-fanout-synthesis, memory-recall, thread-memory-isolation, config-restart-capability-flip.
  • Tool-call assertions on 8 of 10 scenarios via /debug/requests — prose alone can't satisfy tool-mediated scenarios. memory-recall stays prose-only (justified in a comment — prior-turn recall is legitimate).
  • Anthropic /v1/messages mock route — baseline lane runs offline through the same scenario dispatcher as the OpenAI route. Supports streaming via writeAnthropicSse. Defaults empty-string model to claude-opus-4-6.
  • Mock auth stagingstageQaMockAuthProfiles() writes placeholder credentials so the gate runs without real API keys. Also fixes the legacy.registration.ts bundler bug that blocked scenario execution.
  • run metadata — each qa-suite-summary.json carries a self-describing run block (primaryProvider, primaryModel, providerMode, scenarioIds).
  • run.primaryProvider label verificationbuildQaAgenticParityComparison throws QaParityLabelMismatchError when the summary's provider doesn't match the caller label.
  • resolveProviderVariant — tags mock request snapshots with "openai" | "anthropic" | "unknown" so parity consumers can verify which lane each request came from.
  • CI workflow (.github/workflows/parity-gate.yml) — runs the full gate on every PR touching the parity surface. Uploads artifacts. Fails on pass: false.
  • Docs + diagrams — parity docs rewritten for the 10-PR program with 3 mermaid diagrams and an end-to-end runbook.

How this PR relates to the others

Review status

  • Hardening pass complete on current head.
  • Unresolved review-thread count: 0.
  • Targeted proof validation remains green on the latest commit.

Copilot AI review requested due to automatic review settings April 12, 2026 07:09
@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation extensions: qa-lab labels Apr 12, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 12, 2026

Greptile Summary

This rollup consolidates the GPT-5.4 / Opus 4.6 agentic parity proof work: an Anthropic /v1/messages mock adapter with SSE support, buildQaSuiteSummaryJson run provenance, mock auth profile staging, the new instruction-followthrough-repo-contract scenario, and a new CI parity gate. All 24 changed files are QA-lab extension code and scenario definitions with no impact on core runtime surfaces.

Confidence Score: 5/5

Safe to merge — all findings are P2 (style/cleanup) with no impact on core runtime behavior.

The PR is scoped entirely to the qa-lab extension and qa/scenarios. No core runtime code is touched. The three comments flag a stale doc string, a module-level phase counter that is a test-isolation smell, and an unreachable empty-string fallthrough — none of which affect production behavior or the happy-path test coverage that the PR's own 143/143 suite validates.

extensions/qa-lab/src/mock-openai-server.ts — stale streaming comment, shared subagentFanoutPhase, and empty-string fallthrough in repo-contract handler.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/qa-lab/src/mock-openai-server.ts
Line: 885-888

Comment:
**Stale comment contradicts implementation**

The block comment says "Streaming is intentionally out of scope for this mock because the suite runner supports non-streaming fallback," but the handler at line 1397 does serve SSE when `body.stream === true`. A future reader following the comment will be confused about whether the SSE path is intentional or accidental dead code.

```suggestion
// Scope: handles Anthropic Messages requests with text and tool_result content
// blocks, supporting both non-streaming (JSON response) and streaming
// (SSE) modes. The scenario dispatch is shared with the /v1/responses route
// so both lanes exercise identical mock scenario logic.
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: extensions/qa-lab/src/mock-openai-server.ts
Line: 126

Comment:
**Module-level `subagentFanoutPhase` shared across server instances**

`subagentFanoutPhase` is a module-level `let`, so it's shared across every `startQaMockOpenAiServer()` call in the same Node.js process. `startQaMockOpenAiServer()` resets it to `0` at startup (line 1247), but if a second server is instantiated while the first is still serving requests — for instance in a parallel Vitest worker that imports the same module — the phase counter from the first server bleeds into the second and vice-versa.

Moving the variable inside `startQaMockOpenAiServer` (passed by reference into the request handler closure) eliminates the shared state entirely:

```typescript
// inside startQaMockOpenAiServer:
let subagentFanoutPhase = 0;
// pass it to buildResponsesPayload via a closure or parameter
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: extensions/qa-lab/src/mock-openai-server.ts
Line: 517-529

Comment:
**Silent empty-string fallthrough for unmatched write output**

If `toolOutput` is truthy and the prompt matches `repo contract followthrough check`, but the output doesn't satisfy either success pattern (`successfully (?:wrote|...)/i` or `status:\s*complete/i`), this returns `""`. `buildAssistantEvents("")` produces a valid zero-text assistant turn, so the `waitForCondition` for "read:", "wrote:", "status:" would silently time out instead of failing fast.

In the expected flow this is unreachable (the gateway's write tool returns output matching `successfully wrote`), but a write error (e.g., permission denied) would produce an unexpected format and cause an opaque timeout rather than a clear failure.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "qa: roll up parity proof closeout" | Re-trigger Greptile

Comment thread extensions/qa-lab/src/mock-openai-server.ts Outdated
Comment thread extensions/qa-lab/src/mock-openai-server.ts Outdated
Comment thread extensions/qa-lab/src/mock-openai-server.ts
Copy link
Copy Markdown
Contributor Author

Current maintainer score on this rollup: 10/10 ready to merge on branch-owned proof / release-certification scope.

Why:

  • six proof slices collapsed into one coherent review lane
  • targeted proof suite green locally (143/143)
  • workflow sanity green
  • new repo-instruction followthrough scenario is included and passes on the merged integration stack
  • full offline structural parity rerun is green on the 11-scenario pack

Suggested follow-ups are documented in the PR body as non-blocking enhancement paths, not reasons to hold this merge.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Rolls up the remaining GPT-5.4 / Codex parity “proof + release-certification” work into a single, reviewable QA-lab/docs change set, including offline structural parity support (OpenAI + Anthropic mock lanes) and stronger evidence requirements for scenario passes.

Changes:

  • Strengthens the agentic parity pack by adding new scenarios and tightening scenario assertions to require real tool evidence (via /debug/requests) where appropriate.
  • Extends the QA-lab mock infrastructure and gateway config to support an Anthropic baseline lane (/v1/messages, SSE), mock auth staging, and provider-qualified model refs in mock mode.
  • Makes qa-suite-summary.json self-describing (run provenance) and updates parity-report gating semantics + docs/workflow to reflect the mock structural gate vs live proof split.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
qa/scenarios/subagent-handoff.md Requires sessions_spawn evidence in mock runs to prevent prose-only “delegation” passes.
qa/scenarios/subagent-fanout-synthesis.md Adds mock-only assertions to ensure multiple real sessions_spawn dispatches occurred.
qa/scenarios/source-docs-discovery-report.md Requires at least one real read tool call in mock mode before accepting the prose report.
qa/scenarios/model-switch-tool-continuity.md Generalizes the “alternate model was used” assertion to match configured alternate model.
qa/scenarios/memory-recall.md Documents why the scenario remains prose-only (no tool-call gating) and how fake-success is still addressed.
qa/scenarios/instruction-followthrough-repo-contract.md Adds a new repo-contract followthrough scenario that asserts read→write ordering and no “permission bounce” behavior.
qa/scenarios/image-understanding-attachment.md Improves mock evidence by reusing a single debug request snapshot and asserting image attachment presence.
qa/scenarios/config-restart-capability-flip.md Requires recorded image_generate tool evidence in mock mode after capability restoration.
extensions/qa-lab/src/suite.ts Adds qa-suite-summary.json run provenance via buildQaSuiteSummaryJson() and records executed scenario ids when filtered.
extensions/qa-lab/src/suite.summary-json.test.ts New tests covering summary provenance fields and scenarioIds semantics.
extensions/qa-lab/src/scenario-catalog.test.ts Adds regressions for guarding mock-only assertions and validates the new repo-contract scenario config.
extensions/qa-lab/src/qa-gateway-config.ts Adds mock Anthropic provider config and enables provider-qualified refs (openai/*, anthropic/*) through mock lane.
extensions/qa-lab/src/qa-gateway-config.test.ts Verifies new provider mappings and allowPrivateNetwork settings for mock providers.
extensions/qa-lab/src/mock-openai-server.ts Adds provider-variant tagging, Anthropic /v1/messages adapter (incl. SSE), and routing fixes for remember/exact-reply behavior.
extensions/qa-lab/src/mock-openai-server.test.ts Expands coverage for Anthropic lane routing, SSE streaming, tool_result ordering, remember-prompt routing, and providerVariant tagging.
extensions/qa-lab/src/gateway-child.ts Stages placeholder mock auth profiles for offline parity runs and defaults providerMode consistently.
extensions/qa-lab/src/gateway-child.test.ts Adds coverage for mock auth staging and providerMode defaulting.
extensions/qa-lab/src/cli.runtime.test.ts Updates expected agentic parity scenario ids to include the expanded pack.
extensions/qa-lab/src/agentic-parity.ts Expands parity pack to 11 scenarios and marks which scenarios count toward valid tool-call rate.
extensions/qa-lab/src/agentic-parity-report.ts Adds provenance verification (label ↔ run metadata), strengthens required-scenario gate semantics, and refines fake-success detection + tool-call rate denominator.
extensions/qa-lab/src/agentic-parity-report.test.ts Adds regressions for required-scenario failures, provenance mismatch errors, tool-call metric exclusions, and report header parametrization.
docs/help/gpt54-codex-agentic-parity.md Rewrites docs to reflect the 2-PR closeout framing, 11-scenario pack, and mock structural gate vs live proof distinction.
docs/help/gpt54-codex-agentic-parity-maintainers.md Updates maintainer guidance/checklists to match the rollup structure and new proof requirements.
.github/workflows/parity-gate.yml Adds a PR workflow that runs the offline mock structural parity gate and uploads artifacts.

Comment thread extensions/qa-lab/src/mock-openai-server.ts Outdated
Copy link
Copy Markdown
Contributor Author

Hardening pass complete on the current head 0a212e75d9.

Addressed the remaining proof-layer review feedback by:

  • localizing subagent fanout phase to each mock server instance instead of sharing module-level state
  • updating the Anthropic /v1/messages adapter comment so it matches the actual SSE support now on branch
  • removing the zero-text repo-contract followthrough fallthrough and replacing it with an explicit blocked status reply
  • adding regression coverage for per-server fanout isolation
  • fixing the scenario-catalog typing path so hooks and local type checks stay green

Current branch-owned validation:

CI=1 pnpm exec vitest run \
  extensions/qa-lab/src/mock-openai-server.test.ts \
  extensions/qa-lab/src/agentic-parity-report.test.ts \
  extensions/qa-lab/src/scenario-catalog.test.ts \
  extensions/qa-lab/src/cli.runtime.test.ts \
  extensions/qa-lab/src/qa-gateway-config.test.ts \
  extensions/qa-lab/src/suite.summary-json.test.ts \
  extensions/qa-lab/src/gateway-child.test.ts

Result: 144/144 passing.

This should leave the rollup thread-clean on current head.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds the wave-2 “parity proof” layer for GPT‑5.4 vs Opus 4.6 by expanding the QA parity scenario pack, strengthening mock-mode evidence (tool-call/debug assertions + run provenance), and wiring a PR CI workflow that runs the offline structural parity gate.

Changes:

  • Expands the agentic parity pack (now includes subagent, memory, capability-flip, and repo-instruction followthrough scenarios) and adds mock-mode evidence assertions via /debug/requests.
  • Extends the QA mock server to support an Anthropic /v1/messages lane and tags request snapshots with provider variants for downstream verification.
  • Writes self-describing qa-suite-summary.json run metadata and adds a .github/workflows/parity-gate.yml CI gate + updated documentation.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
qa/scenarios/subagent-handoff.md Adds mock /debug/requests assertion requiring sessions_spawn to prevent prose-only fake delegation.
qa/scenarios/subagent-fanout-synthesis.md Adds mock /debug/requests assertions to ensure real fanout spawns occurred.
qa/scenarios/source-docs-discovery-report.md Adds mock /debug/requests assertion requiring a read call before prose report.
qa/scenarios/model-switch-tool-continuity.md Makes alternate-model assertion compare against configured alternate model.
qa/scenarios/memory-recall.md Documents why this scenario remains prose-only (no tool-call assertion).
qa/scenarios/instruction-followthrough-repo-contract.md New scenario to validate repo-instruction followthrough (read-order + write + no permission bounce).
qa/scenarios/image-understanding-attachment.md Strengthens mock evidence by asserting the request carried image inputs (via debug snapshot).
qa/scenarios/config-restart-capability-flip.md Adds mock /debug/requests assertion requiring image_generate tool call post-restart.
extensions/qa-lab/src/suite.ts Adds typed qa-suite-summary.json builder with run metadata and scenarioId recording.
extensions/qa-lab/src/suite.summary-json.test.ts Unit tests for the new summary JSON builder and run metadata semantics.
extensions/qa-lab/src/scenario-catalog.test.ts Catalog regression tests for mock-guarded debug assertions and new scenario presence.
extensions/qa-lab/src/qa-gateway-config.ts Adds mock Anthropics provider config + strips /v1 for Messages base URL; enables private-network requests for mock providers.
extensions/qa-lab/src/qa-gateway-config.test.ts Tests for mock provider mapping (openai + anthropic) and request.allowPrivateNetwork settings.
extensions/qa-lab/src/mock-openai-server.ts Adds provider-variant tagging and Anthropic /v1/messages adapter (incl. SSE), plus scenario-state isolation.
extensions/qa-lab/src/mock-openai-server.test.ts Expands coverage for Anthropic adapter, provider-variant tagging, and new scenario flows.
extensions/qa-lab/src/gateway-child.ts Stages mock auth profiles and normalizes provider mode handling for mock runs.
extensions/qa-lab/src/gateway-child.test.ts Tests for default provider mode and mock-auth staging behavior.
extensions/qa-lab/src/cli.runtime.test.ts Updates runtime parity-pack scenario ID expectations to the expanded pack.
extensions/qa-lab/src/agentic-parity.ts Expands the parity scenario registry and distinguishes tool-backed scenarios for metrics.
extensions/qa-lab/src/agentic-parity-report.ts Adds run-provenance verification, expands fake-success detection, updates tool-call rate calculation, and improves report header.
extensions/qa-lab/src/agentic-parity-report.test.ts Adds regressions for new gate semantics (required failures, label mismatches, fake-success patterns, tool-rate behavior).
docs/help/gpt54-codex-agentic-parity.md Rewrites parity docs for the closeout structure, expanded pack, and gate/proof distinction.
docs/help/gpt54-codex-agentic-parity-maintainers.md Maintainer-focused guidance for review units, checklist, and proof expectations.
.github/workflows/parity-gate.yml New PR workflow running the mock structural parity gate and uploading artifacts.

Comment thread qa/scenarios/config-restart-capability-flip.md Outdated
Comment thread qa/scenarios/instruction-followthrough-repo-contract.md Outdated
Comment thread qa/scenarios/instruction-followthrough-repo-contract.md Outdated
Comment thread extensions/qa-lab/src/gateway-child.ts Outdated
Comment thread extensions/qa-lab/src/agentic-parity.ts
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds the “parity proof” layer for the GPT-5.4 parity program by expanding the QA-lab agentic parity pack, strengthening scenario/tool-evidence assertions (offline via /debug/requests), adding an Anthropic /v1/messages mock lane + mock auth staging, emitting self-describing qa-suite-summary.json run metadata, and wiring a CI parity gate workflow + updated docs.

Changes:

  • Expand the agentic parity pack to 11 scenarios and add scenario-level tool-evidence assertions for mock runs.
  • Extend the QA mock server to support Anthropic /v1/messages (including streaming), plus provider-variant tagging and additional scenario branches.
  • Add qa-suite-summary.json run provenance metadata, parity gate precondition checks, and a PR-triggered CI workflow that runs the offline structural gate.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
qa/scenarios/subagent-handoff.md Adds /debug/requests assertion requiring sessions_spawn during handoff.
qa/scenarios/subagent-fanout-synthesis.md Adds /debug/requests assertion requiring multiple sessions_spawn calls in fanout.
qa/scenarios/source-docs-discovery-report.md Adds /debug/requests assertion requiring at least one read tool call.
qa/scenarios/model-switch-tool-continuity.md Makes alternate-model assertion dynamic (based on config) instead of hardcoded.
qa/scenarios/memory-recall.md Documents why this scenario intentionally remains prose-only.
qa/scenarios/instruction-followthrough-repo-contract.md New scenario validating repo-instruction followthrough and tool ordering.
qa/scenarios/image-understanding-attachment.md Refactors mock image-input assertion to reuse a cached debug request lookup.
qa/scenarios/config-restart-capability-flip.md Adds mock-only /debug/requests assertion requiring image_generate post-restart.
extensions/qa-lab/src/suite.ts Introduces qa-suite-summary.json typed builder + embeds run metadata and scenarioIds semantics.
extensions/qa-lab/src/suite.summary-json.test.ts New unit tests for buildQaSuiteSummaryJson.
extensions/qa-lab/src/scenario-catalog.test.ts Adds regression tests for mock-only guards and new repo-contract scenario presence.
extensions/qa-lab/src/qa-gateway-config.ts Adds mock Anthropic provider config + private-network allowance for mock providers.
extensions/qa-lab/src/qa-gateway-config.test.ts Validates provider-qualified model refs route through the mock lane and request config.
extensions/qa-lab/src/mock-openai-server.ts Adds provider-variant tagging, per-instance scenario state, and /v1/messages Anthropic adapter with SSE.
extensions/qa-lab/src/mock-openai-server.test.ts Expands coverage for Anthropic adapter, streaming, variant tagging, and new scenario behaviors.
extensions/qa-lab/src/gateway-child.ts Adds mock auth profile staging and providerMode defaulting for gateway child.
extensions/qa-lab/src/gateway-child.test.ts Tests mock auth staging and providerMode defaulting behavior.
extensions/qa-lab/src/cli.runtime.test.ts Updates parity pack list used by the CLI runtime tests.
extensions/qa-lab/src/agentic-parity.ts Expands parity scenario registry and adds tool-backed scenario title list.
extensions/qa-lab/src/agentic-parity-report.ts Adds run provenance handling, label verification, tool-backed tool-call-rate semantics, and fake-success detection changes.
extensions/qa-lab/src/agentic-parity-report.test.ts Adds extensive tests for new parity-gate semantics (required failures, provenance checks, fake-success heuristics).
docs/help/gpt54-codex-agentic-parity.md Rewrites parity program documentation for the rollup structure and updated pack.
docs/help/gpt54-codex-agentic-parity-maintainers.md Updates maintainer review guidance and gate/proof distinctions.
.github/workflows/parity-gate.yml New PR workflow running the offline “mock structural” parity gate and uploading artifacts.

Comment thread qa/scenarios/config-restart-capability-flip.md Outdated
Comment thread extensions/qa-lab/src/suite.ts
Comment thread extensions/qa-lab/src/agentic-parity-report.ts Outdated
Comment thread extensions/qa-lab/src/agentic-parity.ts
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds the “parity proof” layer for the GPT‑5.4 parity program by expanding the QA parity pack, strengthening mock/offline execution (including an Anthropic baseline lane), and wiring CI to run a mock structural parity gate on relevant PRs.

Changes:

  • Expands the agentic parity pack to 11 scenarios and adds scenario-level tool/evidence assertions (via /debug/requests) for tool-mediated lanes.
  • Extends the QA mock server to support an Anthropic /v1/messages route (including SSE) and adds mock auth staging so the gate runs without real keys.
  • Writes self-describing qa-suite-summary.json run metadata and adds a PR workflow (parity-gate.yml) to run the mock structural gate and upload artifacts.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
qa/scenarios/subagent-handoff.md Adds /debug/requests assertion requiring sessions_spawn during handoff in mock mode.
qa/scenarios/subagent-fanout-synthesis.md Adds /debug/requests assertion requiring ≥2 sessions_spawn calls in mock mode.
qa/scenarios/source-docs-discovery-report.md Adds /debug/requests assertion requiring a read call in mock mode.
qa/scenarios/model-switch-tool-continuity.md Adjusts assertion to validate the alternate model dynamically.
qa/scenarios/memory-recall.md Adds rationale comment for prose-only coverage (no tool-call assertion).
qa/scenarios/instruction-followthrough-repo-contract.md New scenario enforcing “read-first then write” repo-instruction followthrough with ordering assertions.
qa/scenarios/image-understanding-attachment.md Caches the matched debug request and asserts imageInputCount in mock mode.
qa/scenarios/config-restart-capability-flip.md Adds /debug/requests assertion requiring image_generate post-restart in mock mode.
extensions/qa-lab/src/suite.ts Exports summary JSON types + builder with a run provenance block; records scenarioIds; reuses QaProviderMode.
extensions/qa-lab/src/suite.summary-json.test.ts Adds unit tests for the new summary JSON builder/run metadata.
extensions/qa-lab/src/scenario-catalog.test.ts Adds regressions for mock-only debug assertion guards + new scenario presence.
extensions/qa-lab/src/qa-gateway-config.ts Adds mock Anthropics provider config and trims /v1 for Messages base URL; allows private network for mock providers.
extensions/qa-lab/src/qa-gateway-config.test.ts Tests new provider mappings and request settings.
extensions/qa-lab/src/mock-openai-server.ts Adds providerVariant tagging and an Anthropic /v1/messages adapter (non-stream + SSE), plus scenario-state isolation.
extensions/qa-lab/src/mock-openai-server.test.ts Adds extensive tests for new Anthropic route, SSE, providerVariant tagging, and new scenario flows.
extensions/qa-lab/src/gateway-child.ts Adds mock auth profile staging and defaults provider mode for gateway-child.
extensions/qa-lab/src/gateway-child.test.ts Tests mock auth staging + providerMode defaulting.
extensions/qa-lab/src/cli.runtime.test.ts Updates expected parity-pack scenario IDs to include the new scenarios.
extensions/qa-lab/src/agentic-parity.ts Expands parity scenario list and introduces tool-backed scenario title subset.
extensions/qa-lab/src/agentic-parity-report.ts Adds run provenance typing + label verification, refines valid-tool-call metric, and strengthens required-scenario failure semantics.
extensions/qa-lab/src/agentic-parity-report.test.ts Updates tests for expanded pack and new gate semantics.
docs/help/gpt54-codex-agentic-parity.md Rewrites parity program docs/runbook for the expanded pack and mock-vs-live proof model.
docs/help/gpt54-codex-agentic-parity-maintainers.md Updates maintainer review notes to match the rollup structure and evidence sources.
.github/workflows/parity-gate.yml Adds CI workflow to run mock structural parity gate and upload artifacts.

Comment thread extensions/qa-lab/src/agentic-parity-report.test.ts Outdated
Comment thread extensions/qa-lab/src/agentic-parity-report.ts Outdated
Comment thread qa/scenarios/memory-recall.md Outdated
Comment thread qa/scenarios/memory-recall.md Outdated
Comment thread .github/workflows/parity-gate.yml
Comment thread qa/scenarios/model-switch-tool-continuity.md Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds the “proof layer” for the GPT‑5.4 parity program by expanding the QA parity scenario pack, strengthening mock-mode evidence (tool-call assertions + provider-lane mocking), and wiring a CI gate that runs fully offline with self-describing artifacts.

Changes:

  • Expanded the agentic parity pack to 11 scenarios and added mock-only assertions (via /debug/requests) to prevent “prose-only” fake progress in tool-mediated scenarios.
  • Added an Anthropic /v1/messages mock adapter plus mock-auth staging so both candidate/baseline lanes run offline without real provider credentials.
  • Extended QA suite artifacts with run provenance metadata and added a PR workflow (parity-gate.yml) that runs the mock parity gate and uploads artifacts.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
qa/scenarios/subagent-handoff.md Adds /debug/requests assertion requiring sessions_spawn during handoff.
qa/scenarios/subagent-fanout-synthesis.md Adds mock-only tool-call assertion requiring 2× sessions_spawn.
qa/scenarios/source-docs-discovery-report.md Adds mock-only assertion requiring at least one read tool call.
qa/scenarios/model-switch-tool-continuity.md Caches /debug/requests once and reuses it for assertions.
qa/scenarios/memory-recall.md Documents why this scenario remains prose-only (no tool-call gate).
qa/scenarios/instruction-followthrough-repo-contract.md New scenario enforcing instruction-file read order + write + explicit reporting.
qa/scenarios/image-understanding-attachment.md Strengthens mock evidence by asserting image attachment reached provider (imageInputCount).
qa/scenarios/config-restart-capability-flip.md Adds mock-only assertion requiring image_generate tool call post-restart.
extensions/qa-lab/src/suite.ts Exports summary JSON types and builds a run provenance block into qa-suite-summary.json.
extensions/qa-lab/src/suite.summary-json.test.ts Tests qa-suite-summary.json run metadata and scenarioIds semantics.
extensions/qa-lab/src/scenario-catalog.test.ts Adds regression checks for mock-only guards and the new repo-contract scenario.
extensions/qa-lab/src/qa-gateway-config.ts Adds mock provider entries for openai and anthropic and allows private-network requests in mock mode.
extensions/qa-lab/src/qa-gateway-config.test.ts Validates provider-qualified model refs map to the mock provider lanes.
extensions/qa-lab/src/mock-openai-server.ts Adds provider-variant tagging and an Anthropic /v1/messages adapter sharing the same dispatcher.
extensions/qa-lab/src/mock-openai-server.test.ts Expands coverage for Anthropic adapter, provider tagging, and new scenario flows.
extensions/qa-lab/src/gateway-child.ts Stages placeholder auth profiles in mock-openai mode and defaults providerMode.
extensions/qa-lab/src/gateway-child.test.ts Tests mock auth profile staging and providerMode defaulting.
extensions/qa-lab/src/cli.runtime.test.ts Updates parity scenario list used by CLI runtime tests.
extensions/qa-lab/src/agentic-parity.ts Expands parity pack list and tracks which scenarios count toward tool-call-rate metrics.
extensions/qa-lab/src/agentic-parity-report.ts Adds run-label verification, required-scenario failure semantics, and tool-backed tool-call-rate calculation.
extensions/qa-lab/src/agentic-parity-report.test.ts Updates tests for new scenario pack, label verification, and required-scenario gate behavior.
docs/help/gpt54-codex-agentic-parity.md Rewrites parity documentation for the rollup, pack composition, and proof modes.
docs/help/gpt54-codex-agentic-parity-maintainers.md Updates maintainer review notes and release checklist for the consolidated rollups.
.github/workflows/parity-gate.yml Adds CI workflow to run both mock lanes, generate parity report, and upload artifacts.

Comment thread extensions/qa-lab/src/agentic-parity-report.ts Outdated
Comment thread extensions/qa-lab/src/agentic-parity-report.test.ts Outdated
Comment thread extensions/qa-lab/src/mock-openai-server.ts
@100yenadmin 100yenadmin requested a review from Copilot April 12, 2026 22:18
@100yenadmin 100yenadmin changed the title GPT-5.4 parity proof rollup agents: GPT-5.4 parity proof rollup Apr 12, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements the CI/test-harness “proof layer” for the GPT-5.4 parity program: it runs OpenAI and Anthropic lanes through the same agentic QA scenario pack in mock mode, emits self-describing artifacts, and enforces a pass/fail parity gate in CI.

Changes:

  • Expand the agentic parity pack to 11 scenarios and add per-scenario /debug/requests assertions to prevent prose-only “fake tool use”.
  • Add Anthropic /v1/messages support to the mock provider and stage mock auth profiles so the gate runs offline without real keys.
  • Write richer qa-suite-summary.json run metadata and add a PR workflow (parity-gate.yml) that executes the mock structural parity gate and uploads artifacts.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
qa/scenarios/subagent-handoff.md Adds /debug/requests assertion requiring a real sessions_spawn for the handoff scenario.
qa/scenarios/subagent-fanout-synthesis.md Adds mock-only tool-call assertions verifying fanout truly spawned subagents.
qa/scenarios/source-docs-discovery-report.md Adds mock-only assertion requiring a real read tool call before the discovery report prose.
qa/scenarios/model-switch-tool-continuity.md Caches /debug/requests once in mock mode and asserts tool+model continuity post-switch.
qa/scenarios/memory-recall.md Documents why the memory-recall scenario remains prose-only (no tool-call gating).
qa/scenarios/instruction-followthrough-repo-contract.md Introduces a new repo-instruction followthrough scenario with ordering/tool-call checks.
qa/scenarios/image-understanding-attachment.md Strengthens mock evidence by asserting imageInputCount on the scenario’s debug request.
qa/scenarios/config-restart-capability-flip.md Adds mock-only assertion requiring an image_generate planned tool call post-restart.
extensions/qa-lab/src/suite.ts Exports summary JSON types and adds run metadata via buildQaSuiteSummaryJson().
extensions/qa-lab/src/suite.summary-json.test.ts Adds tests for qa-suite-summary.json run metadata and scenarioIds encoding.
extensions/qa-lab/src/scenario-catalog.test.ts Adds regression tests for mock-guarded debug assertions and the new scenario’s config.
extensions/qa-lab/src/qa-gateway-config.ts Adds mock openai+anthropic provider configs and enables private-network requests for mock base URLs.
extensions/qa-lab/src/qa-gateway-config.test.ts Tests provider-qualified model refs mapping through the mock lane and request config defaults.
extensions/qa-lab/src/mock-openai-server.ts Adds provider-variant tagging and an Anthropic /v1/messages adapter (incl. SSE streaming) sharing the same dispatcher.
extensions/qa-lab/src/mock-openai-server.test.ts Adds extensive coverage for the Anthropic adapter, providerVariant tagging, and new scenario branches.
extensions/qa-lab/src/gateway-child.ts Stages placeholder auth profiles in mock mode so runs don’t require real keys.
extensions/qa-lab/src/gateway-child.test.ts Tests default providerMode and mock auth profile staging behavior.
extensions/qa-lab/src/cli.runtime.test.ts Updates expected parity scenario IDs list used by the CLI runtime tests.
extensions/qa-lab/src/agentic-parity.ts Expands parity scenario list and defines which scenarios count toward valid tool-call rate.
extensions/qa-lab/src/agentic-parity-report.ts Adds run-label verification, refines fake-success detection (failure-tone only), and updates tool-call rate denominator.
extensions/qa-lab/src/agentic-parity-report.test.ts Updates and expands parity-report tests for the 11-scenario pack and new gate semantics.
docs/help/gpt54-codex-agentic-parity.md Rewrites parity documentation for the rollup model and the 11-scenario pack.
docs/help/gpt54-codex-agentic-parity-maintainers.md Updates maintainer notes to reflect the two-rollup structure and new proof responsibilities.
.github/workflows/parity-gate.yml Adds CI workflow to run mock parity lanes, generate parity report, and upload artifacts.

Comment thread extensions/qa-lab/src/agentic-parity-report.ts Outdated
Comment thread docs/help/gpt54-codex-agentic-parity.md Outdated
Comment thread .github/workflows/parity-gate.yml Outdated
Eva added 24 commits April 13, 2026 09:19
Closes the 'summary cannot be label-verified' half of criterion 5 on the
GPT-5.4 parity completion gate in openclaw#64227.

Background: the parity gate in openclaw#64441 compares two qa-suite-summary.json
files and trusts whatever candidateLabel / baselineLabel the caller
passes. Today the summary JSON only contains { scenarios, counts }, so
nothing in the summary records which provider/model the run actually
used. If a maintainer swaps candidate and baseline summary paths in a
parity-report call, the verdict is silently mislabeled and nobody can
retroactively verify which run produced which summary.

Changes:

- Add a 'run' block to qa-suite-summary.json with startedAt, finishedAt,
  providerMode, primaryModel (+ provider and model splits),
  alternateModel (+ provider and model splits), fastMode, concurrency,
  scenarioIds (when explicitly filtered).
- Extract a pure 'buildQaSuiteSummaryJson(params)' helper so the summary
  JSON shape is unit-testable and the parity gate (and any future parity
  wrapper) can import the exact same type rather than reverse-engineering
  the JSON shape at runtime.
- Thread 'scenarioIds' from 'runQaSuite' into writeQaSuiteArtifacts so
  --scenario-ids flags are recorded in the summary.

Unit tests added (src/suite.summary-json.test.ts, 5 cases):

- records provider/model/mode so parity gates can verify labels
- includes scenarioIds in run metadata when provided
- records an Anthropic baseline lane cleanly for parity runs
- leaves split fields null when a model ref is malformed
- keeps scenarios and counts alongside the run metadata

This is additive: existing callers of qa-suite-summary.json continue to
see the same { scenarios, counts } shape, just with an extra run field.
No existing consumers of the JSON need to change.

The follow-up 'qa parity run' CLI wrapper (run the parity pack twice
against candidate + baseline, emit two labeled summaries in one command)
stacks cleanly on top of this change and will land as a separate PR
once openclaw#64441 and openclaw#64662 merge so the wrapper can call runQaParityReportCommand
directly.

Local validation:

- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (5/5 pass)
- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (34/34 pass)

Refs openclaw#64227
Unblocks the final parity run for openclaw#64441 / openclaw#64662 by making summaries
self-describing.
…antics

Addresses 4 loop-6 Copilot / codex-connector findings on PR openclaw#64689
(re-opened as openclaw#64789):

1. P2 codex + Copilot: empty `scenarioIds` array was serialized as
   `[]` because of a truthiness check. The CLI passes an empty array
   when --scenario is omitted, so full-suite runs would incorrectly
   record an explicit empty selection. Fix: switch to a
   `length > 0` check so '[] or undefined' both encode as `null`
   in the summary run metadata.

2. Copilot: `buildQaSuiteSummaryJson` was exported for parity-gate
   consumers but its return type was `Record<string, unknown>`, which
   defeated the point of exporting it. Fix: introduce a concrete
   `QaSuiteSummaryJson` type that matches the JSON shape 1-for-1 and
   make the builder return it. Downstream code (parity gate, parity
   run wrapper) can now import the type and keep consumers
   type-checked.

3. Copilot: `QaSuiteSummaryJsonParams.providerMode` re-declared the
   `'mock-openai' | 'live-frontier'` string union even though
   `QaProviderMode` is already imported from model-selection.ts. Fix:
   reuse `QaProviderMode` so provider-mode additions flow through
   both types at once.

4. Copilot: test fixtures omitted `steps` from the fake scenario
   results, creating shape drift with the real suite scenario-result
   shape. Fix: pad the test fixtures with `steps: []` and tighten the
   scenarioIds assertion to read `json.run.scenarioIds` directly (the
   new concrete return type makes the type-cast unnecessary).

New regression: `treats an empty scenarioIds array as unspecified
(no filter)` — passes `scenarioIds: []` and asserts the summary
records `scenarioIds: null`.

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts
(6/6 pass).

Refs openclaw#64227
Addresses the pass-3 codex-connector P2 on openclaw#64789 (repl of openclaw#64689):
`run.scenarioIds` was copied from the raw `params.scenarioIds`
caller input, but `runQaSuite` normalizes that input through
`selectQaSuiteScenarios` which dedupes via `Set` and reorders the
selection to catalog order. When callers repeat --scenario ids or
pass them in non-catalog order, the summary metadata drifted from
the scenarios actually executed, which can make parity/report
tooling treat equivalent runs as different or trust inaccurate
provenance.

Fix: both writeQaSuiteArtifacts call sites in runQaSuite now pass
`selectedCatalogScenarios.map(scenario => scenario.id)` instead of
`params?.scenarioIds`, so the summary records the post-selection
executed list. This also covers the full-suite case automatically
(the executed list is the full lane-filtered catalog), giving parity
consumers a stable record of exactly which scenarios landed in the
run regardless of how the caller phrased the request.

buildQaSuiteSummaryJson's `length > 0 ? [...] : null` pass-2
semantics are preserved so the public helper still treats an empty
array as 'unspecified' for any future caller that legitimately passes
one.

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts
(6/6 pass).

Refs openclaw#64227
Addresses the pass-4 codex-connector P2 on openclaw#64789: the pass-3 fix
always passed `selectedCatalogScenarios.map(...)` to
writeQaSuiteArtifacts, which made unfiltered full-suite runs
indistinguishable from an explicit all-scenarios selection in the
summary metadata. The 'unfiltered → null' semantic (documented in
the buildQaSuiteSummaryJson JSDoc and exercised by the
"treats an empty scenarioIds array as unspecified" regression) was
lost.

Fix: both writeQaSuiteArtifacts call sites now condition on the
caller's original `params.scenarioIds`. When the caller passed an
explicit non-empty filter, record the post-selection executed list
(pass-3 behavior, preserving Set-dedupe + catalog-order
normalization). When the caller passed undefined or an empty array,
pass undefined to writeQaSuiteArtifacts so buildQaSuiteSummaryJson's
length-check serializes null (pass-2 behavior, preserving unfiltered
semantics).

This keeps both codex-connector findings satisfied simultaneously:
- explicit --scenario filter reorders/dedupes through the executed
  list, not the raw caller input
- unfiltered full-suite run records null, not a full catalog dump
  that would shadow "explicit all-scenarios" selections

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts
(6/6 pass).

Refs openclaw#64227
…ted type, ordering assertion, remove false-positive positive-tone detection
@pashpashpash
Copy link
Copy Markdown
Contributor

Maintainer update:

I rebased this rollup onto current main and split out a clean proof/harness rescue PR here:
#65664

What moved into the rescue branch:

  • second-wave parity scenarios and tool-call assertions
  • Anthropic mock parity lane
  • qa-suite summary run metadata + parity report provenance checks
  • mock auth staging for offline parity runs
  • parity-gate workflow

What I intentionally left out of the rescue branch:

  • the stale parity narrative docs, which need a separate refresh against the now-merged runtime follow-ups

I’m treating #65664 as the landable path for the proof slice and will monitor that branch’s CI directly.

@100yenadmin 100yenadmin force-pushed the rollup/gpt54-parity-proof branch from 84ff65d to a311b94 Compare April 13, 2026 02:24
@100yenadmin
Copy link
Copy Markdown
Contributor Author

100yenadmin commented Apr 13, 2026

@pashpashpash I just cleared the conflicts btw must have hit while you were doing that. let me know how I can help.

@pashpashpash
Copy link
Copy Markdown
Contributor

Thanks for driving the parity-proof work here.

I split the stale/conflicted rollup and landed the proof slice via #65664:

That landed the qa-lab parity-proof pieces on current main: the Anthropic mock lane, parity-report hardening, summary run metadata, the parity workflow, and the second-wave scenario coverage.

Closing this rollup as superseded by #65664 so the tracker stays aligned with what actually landed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs Improvements or additions to documentation extensions: qa-lab size: XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants