[CI]: Add transport-parity gate (same-model cross-provider + cross-runtime) — sibling to QA parity-gate #78457

@100yenadmin

Summary

Propose a sibling QA gate to the existing model-parity gate (introduced in #74290, later folded into openclaw-release-checks.yml / full-release-validation.yml by #74622) that catches a different class of regression: silent drift between the two paths to the same logical model, and between runtime harnesses for the same model+provider.

This gate would have caught — or made trivially diagnosable — every issue in the cluster around #78055, including the doctor config-rewrite regression filed as #78407.

Motivation

The existing parity gate compares two different models:

  • candidate openai/gpt-5.5-alt vs baseline anthropic/claude-opus-4-7

That answers a product question (do GPT-5.5 and Opus 4.7 give equivalent answers for a user choosing between them). It does not exercise the surfaces that have produced the recent run of regressions:

  1. #78055 family ("[Bug]: Subagent announce can deliver stale output and subagent sessions may inherit unrelated history"; related: #78147 "test: guard websocket stale final turn lineage", #78146 "fix: trace OpenAI WebSocket response lineage", #78142 "fix: reset websocket lineage after final answers"): stale response.completed lineage on the openai-codex WebSocket transport. The same prompt routed through raw openai HTTP would have produced a divergent (correct) trajectory; a same-model-different-provider parity gate would have flagged the WS-only stale-final replay immediately.
  2. #78407 ("[Bug]: openclaw doctor --fix rewrites openai-codex/* model refs to openai/* on 2026.5.4 → 2026.5.5 update, locking out ChatGPT-OAuth users"): the config migration silently flipped half the install from one transport to the other. A provider-parity gate would have failed when the post-doctor config produced a different (failing) auth resolution than the pre-doctor config for identical scenario inputs.
  3. #78060 ("fix(subagents): keep thread-bound spawns isolated by default"; subagent thread-bound spawns implicitly forking requester history): the implicit-fork path differs between the pi native runtime and the codex CLI subprocess harness; a runtime-parity gate over the same scenarios would have surfaced the inconsistency.

#77221 (CLI tool-vs-subcommand error message) is in a different test family and is not in scope here.

Proposed scope

A new gate, structured as a matrix in extensions/qa-lab/, asserting equivalence across two axes for the same scenario inputs already used by the existing character-eval / agentic-parity suites:

fixtures × ( openai-api-http × openai-codex-ws ) × ( pi × codex )
  • Axis 1 — Provider parity (same model, different transport): openai/gpt-5.5 vs openai-codex/gpt-5.5. Same logical model, different auth surface, different request shape, different lineage code (HTTP vs WS, no previous_response_id vs previous_response_id-based incremental). Any divergence beyond a published tolerance is a bug.
  • Axis 2 — Runtime parity (same model+provider, different harness): pi native runtime vs codex CLI subprocess. Different tool-loop, different streaming surface, different memory wiring. Any divergence is a bug in one of them.
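As a rough illustration of the two axes above, the four cells and the pairs worth comparing could be enumerated as follows. This is only a sketch: `Cell`, `buildMatrix`, and `parityPairs` are hypothetical names, not existing qa-lab exports, and the provider and runtime labels are taken from the axis descriptions.

```typescript
// Hypothetical types for illustration; not part of qa-lab today.
type Provider = "openai" | "openai-codex";
type Runtime = "pi" | "codex";

interface Cell {
  provider: Provider;
  runtime: Runtime;
  modelRef: string; // e.g. "openai-codex/gpt-5.5"
}

interface ParityPair {
  axis: "provider" | "runtime";
  a: Cell;
  b: Cell;
}

const PROVIDERS: Provider[] = ["openai", "openai-codex"];
const RUNTIMES: Runtime[] = ["pi", "codex"];

// Build the 2 x 2 matrix of cells for one logical model.
function buildMatrix(model: string): Cell[] {
  return PROVIDERS.flatMap((provider) =>
    RUNTIMES.map((runtime) => ({
      provider,
      runtime,
      modelRef: `${provider}/${model}`,
    })),
  );
}

// Keep only pairs that differ along exactly one axis, so a divergence
// can be attributed to either the transport or the harness.
function parityPairs(cells: Cell[]): ParityPair[] {
  const pairs: ParityPair[] = [];
  for (let i = 0; i < cells.length; i++) {
    for (let j = i + 1; j < cells.length; j++) {
      const a = cells[i];
      const b = cells[j];
      const sameProvider = a.provider === b.provider;
      const sameRuntime = a.runtime === b.runtime;
      if (sameProvider !== sameRuntime) {
        pairs.push({ axis: sameRuntime ? "provider" : "runtime", a, b });
      }
    }
  }
  return pairs;
}
```

Comparing only single-axis pairs skips the two diagonal pairs, where both the provider and the runtime change and a failure would be ambiguous.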

Cell assertions per scenario:

Implementation sketch

Reuse the qa-lab primitives that already exist in this clone:

  • extensions/qa-lab/src/providers/mock-openai/server.ts — already extended in Update QA lab parity gate for GPT-5.5 vs Opus 4.7 #74290; add a second profile variant exposing the openai-codex Responses surface.
  • extensions/qa-lab/src/providers/shared/mock-model-config.ts — add openai-codex/gpt-5.5 alongside the existing openai/gpt-5.5-alt entry.
  • extensions/qa-lab/src/qa-gateway-config.test.ts — extend the gateway-boot test pattern with the four-cell matrix.
  • New extensions/qa-lab/src/transport-parity.ts + transport-parity.test.ts — orchestrator that runs the matrix per fixture and produces a parity-report-style summary.
  • New extensions/qa-lab/src/runtime-parity.ts — codex-CLI sandbox (mirror the pattern in qa-live-transports-convex.yml for transport sandboxing).

CI wiring: add a step in openclaw-release-checks.yml (the home that #74622 folded the parity gate into), gated behind the same OPENCLAW_BUILD_PRIVATE_QA=1 build flag the existing parity tests use.

Concrete starter (would also close #78407 as a side-effect)

A narrow first slice — fixture-replay regression for the doctor flow — can land independently of the broader matrix and is the smallest unit of value:

I'm planning to open an umbrella draft PR that adds at least the doctor-flow fixture-replay test (failing, reproducing #78407) and lays out the transport-parity scaffolding as TODOs for the maintainers to flesh out. Happy to split it into smaller PRs if the maintainers prefer per-axis review.
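One way the core assertion of the doctor-flow check could be phrased is sketched below: given the model refs before and after a migration, report every ref whose provider prefix changed. The flat `ModelRefs` map and the helper names are hypothetical; the real config shape and the doctor entry point are not shown in this issue.

```typescript
// Hypothetical flat view of a config: setting name -> model ref.
type ModelRefs = Record<string, string>;

interface ProviderFlip {
  key: string;
  before: string;
  after: string;
}

// "openai-codex/gpt-5.5" -> "openai-codex"; refs without a slash yield "".
function providerOf(ref: string): string {
  const slash = ref.indexOf("/");
  return slash === -1 ? "" : ref.slice(0, slash);
}

// List every setting whose provider prefix differs after the migration.
// Settings removed by the migration are ignored in this sketch.
function findProviderFlips(before: ModelRefs, after: ModelRefs): ProviderFlip[] {
  const flips: ProviderFlip[] = [];
  for (const [key, beforeRef] of Object.entries(before)) {
    const afterRef = after[key];
    if (afterRef === undefined) continue;
    if (providerOf(beforeRef) !== providerOf(afterRef)) {
      flips.push({ key, before: beforeRef, after: afterRef });
    }
  }
  return flips;
}
```

A fixture-replay test would then load a recorded pre-update config, run the migration under test, and expect `findProviderFlips` to return an empty list.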

Related
