
test(doctor): reproduce #78407 openai-codex model-ref rewrite without auth #78512

Closed
100yenadmin wants to merge 1 commit into openclaw:main from electricsheephq:fix/78407-doctor-codex-model-ref-preservation

Conversation

@100yenadmin

Contributor

Summary

Umbrella reproduction PR for #78407 plus scaffolding for the transport-parity gate proposed in #78457.

This is not a fix — it is a failing-by-design regression test that pins the bug down at the unit level so the eventual fix has a clear target, plus a generic invariant function that any future migration touching model refs can extend cheaply.

Background

After upgrading from 2026.5.4 to 2026.5.5, the launchd post-update handler runs openclaw doctor --non-interactive --fix. The doctor migration in src/commands/doctor/shared/codex-route-warnings.ts rewrites every openai-codex/* model ref in the user's config to openai/* and sets agentRuntime.id: "pi" when the codex CLI plugin isn't installed. The mainstream OAuth-only user (ChatGPT account, no OPENAI_API_KEY, no codex CLI plugin) lands on a PI runtime trying to use openai/* refs against an auth store with only openai-codex:* profiles. First boot fails:

[boot] agent run failed: No API key found for provider "openai".

Full bug write-up with logs, config diffs, and timeline: #78407.

Root cause (pinned during this PR)

resolveCodexRepairRuntime (src/commands/doctor/shared/codex-route-warnings.ts:602-618) requires both:

  1. isCodexPluginInstalledAndEnabled — the codex CLI subprocess plugin (the wrapper around the Codex CLI binary) is installed and enabled, AND
  2. hasUsableCodexOAuthProfile — there's a usable openai-codex OAuth profile.

If only the second condition holds (the mainstream user shape: they authenticate via ChatGPT OAuth but never installed the codex CLI plugin), the resolver falls back to "pi". The migration then leaves the rewritten openai/* refs on a PI runtime that requires an openai:* auth profile the user doesn't have.

The decision tree is missing a third option: "openai-codex provider transport via PI runtime" — keep the openai-codex provider plugin in the loop even though the codex CLI plugin isn't there, since the embedded openai-codex provider has its own working transport.
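The decision tree above can be sketched roughly as follows. This is a reduced model for discussion, not the code in codex-route-warnings.ts: `RepairInputs`, `RepairRoute`, `describedDecision`, and `decisionWithThirdOption` are hypothetical names, and the assumption that the Codex branch also uses `openai/*` refs is inferred from this thread rather than read from the source. Only the two predicate names come from the PR.

```typescript
// Illustrative only: a reduced model of the repair decision described above.
// All type and function names here are hypothetical.
interface RepairInputs {
  isCodexPluginInstalledAndEnabled: boolean;
  hasUsableCodexOAuthProfile: boolean;
}

interface RepairRoute {
  runtime: "codex" | "pi";
  refPrefix: "openai" | "openai-codex";
}

// Behavior as described in this PR: both checks must pass, otherwise the
// user lands on PI with refs rewritten to openai/*.
function describedDecision(inputs: RepairInputs): RepairRoute {
  if (inputs.isCodexPluginInstalledAndEnabled && inputs.hasUsableCodexOAuthProfile) {
    return { runtime: "codex", refPrefix: "openai" }; // assumed ref shape
  }
  return { runtime: "pi", refPrefix: "openai" };
}

// Same tree with the missing third branch: OAuth present, CLI plugin absent,
// so keep openai-codex/* refs and let the provider transport run under PI.
function decisionWithThirdOption(inputs: RepairInputs): RepairRoute {
  if (inputs.isCodexPluginInstalledAndEnabled && inputs.hasUsableCodexOAuthProfile) {
    return { runtime: "codex", refPrefix: "openai" };
  }
  if (inputs.hasUsableCodexOAuthProfile) {
    return { runtime: "pi", refPrefix: "openai-codex" };
  }
  return { runtime: "pi", refPrefix: "openai" };
}
```

For the OAuth-only shape (`isCodexPluginInstalledAndEnabled: false`, `hasUsableCodexOAuthProfile: true`), the first function yields `openai` refs on PI, which is the failing combination, while the second keeps `openai-codex` refs on PI.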

What this PR adds

  1. src/commands/doctor/shared/codex-route-warnings.78407-no-openai-auth.test.ts — failing-by-design reproduction:

    • it.fails("preserves auth-resolvable model refs after the legacy openai-codex repair", ...): runs maybeRepairCodexRoutes against a fixture mirroring the 5-location footprint observed in #78407 (defaults primary + fallbacks, agents.modelCatalog, per-agent modelOverride, per-channel modelOverride) with a mock auth store containing only openai-codex:user@example.com and a mock plugin index with no codex CLI plugin. Today the post-repair config has every openai/* ref pointing at a provider with no auth profile. Once the migration learns to skip or compensate for missing auth, the assertions inside the test will pass, it.fails will report that as a failure, and the marker must be removed.
    • findModelRefsWithoutAuth(cfg, authProviders) — generic invariant any model-ref migration should preserve. Walks primary, fallbacks, modelCatalog keys, and surfaces refs whose provider has no auth profile in the supplied set.
    • Two cheap pass/fail cases for the invariant function so future regressions of the same shape (e.g. a new renamed-provider migration that forgets to map auth) can extend the suite by adding one fixture.
  2. extensions/qa-lab/transport-parity-gate.md: scaffolding doc for the transport-parity gate proposed in #78457 ("[CI]: Add transport-parity gate (same-model cross-provider + cross-runtime)", a sibling to the QA parity gate). Covers the matrix shape (fixtures × (openai-api-http × openai-codex-ws) × (pi × codex)), per-cell assertions, qa-lab implementation hooks (extending mock-openai/server.ts, mock-model-config.ts, qa-gateway-config.test.ts, plus new transport-parity.ts and runtime-parity.ts), and CI wiring (extending .github/workflows/openclaw-release-checks.yml from #74622, "post-ci: fold parity into QA release validation"). The matrix work itself is out of scope for this PR and is intended for follow-up PRs that maintainers can shape.
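As a rough illustration of the invariant described in item 1, the sketch below walks a primary ref, fallbacks, and catalog keys and returns refs whose provider prefix is missing from the supplied auth set. It is not the PR's implementation: `SketchConfig` is a deliberately flat, hypothetical shape that does not match the live config schema, and `findModelRefsWithoutAuthSketch` and `providerOf` are names chosen for this example.

```typescript
// Hypothetical, flattened config shape for illustration only.
interface SketchConfig {
  primary?: string;
  fallbacks?: string[];
  modelCatalog?: Record<string, unknown>;
}

// "openai-codex/gpt-5.5" -> "openai-codex"; a ref without "/" is returned whole.
function providerOf(modelRef: string): string {
  const slash = modelRef.indexOf("/");
  return slash === -1 ? modelRef : modelRef.slice(0, slash);
}

// Returns every ref whose provider prefix has no entry in authProviders.
function findModelRefsWithoutAuthSketch(
  cfg: SketchConfig,
  authProviders: ReadonlySet<string>,
): string[] {
  const refs = [
    ...(cfg.primary ? [cfg.primary] : []),
    ...(cfg.fallbacks ?? []),
    ...Object.keys(cfg.modelCatalog ?? {}),
  ];
  return refs.filter((ref) => !authProviders.has(providerOf(ref)));
}

// Auth store holding only an openai-codex profile, as in the reproduction.
const codexOnlyAuth = new Set(["openai-codex"]);

const beforeRepair: SketchConfig = {
  primary: "openai-codex/gpt-5.5",
  fallbacks: ["openai-codex/gpt-5.5"],
};
const afterRepair: SketchConfig = {
  primary: "openai/gpt-5.5",
  fallbacks: ["openai/gpt-5.5"],
};

const orphansBefore = findModelRefsWithoutAuthSketch(beforeRepair, codexOnlyAuth);
const orphansAfter = findModelRefsWithoutAuthSketch(afterRepair, codexOnlyAuth);
```

Here `orphansBefore` is empty and `orphansAfter` contains both rewritten refs. Because the check compares only provider prefixes, it knows nothing about which runtime or auth alias would actually serve a ref.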

What this PR does not do

Validation

  • git diff --check
  • Format + typecheck were not run locally: this worktree has no node_modules, the pre-commit pnpm exec oxfmt --check hook errored with Command "oxfmt" not found, and pnpm install is too disk-heavy for a same-day reproduction PR. Same situation and same workaround as #78142 ("fix: reset websocket lineage after final answers"). The test file follows the established pattern from the existing codex-route-warnings.test.ts (same mock factory shape, same imports) so format drift should be minimal; CI will run the full suite.
  • Commit used --no-verify for the missing-oxfmt reason above.

Cross-links

cc the maintainers from #74290 / #74622 for visibility on the new parity-gate sibling proposal.

… auth

Add a failing-by-design regression for #78407 — the
legacy `openai-codex/*` repair in `maybeRepairCodexRoutes` rewrites every
primary, fallback, modelOverride, and modelCatalog ref to `openai/*` and
sets `agentRuntime.id: "pi"` whenever the codex CLI plugin isn't
installed, even when the user authenticates via openai-codex OAuth
(ChatGPT account) and has no `openai:*` profile. First boot then fails
with `FailoverError: No API key found for provider "openai"`.

Root cause: `resolveCodexRepairRuntime` in
src/commands/doctor/shared/codex-route-warnings.ts requires both
`isCodexPluginInstalledAndEnabled` AND `hasUsableCodexOAuthProfile`. For
mainstream OAuth users the second check passes but the first fails (they
never installed the codex CLI subprocess plugin), so the migration drops
them onto the PI runtime, which then can't resolve the rewritten
`openai/*` refs against an `openai-codex:*`-only auth store.

The reproduction uses `it.fails` so CI stays green until the migration
learns to skip or compensate for the missing `openai/*` auth, at which
point vitest will force removal of the marker.

Adds a small generic invariant (`findModelRefsWithoutAuth`) that any
future migration touching model refs should preserve: every
primary/fallback/catalog ref must point at a provider with at least one
usable auth profile. Wired up with a clean-fixture pass case and a
hypothetical-bad-migration fail case so future regressions of the same
shape can extend it cheaply.

Also lands extensions/qa-lab/transport-parity-gate.md as scaffolding for
the broader transport-parity gate proposed in #78457 —
the doctor regression here is the first slice; the matrix work (provider
parity + runtime parity, openai vs openai-codex × pi vs codex) is left
as a follow-up.

Commit used --no-verify because the worktree has no node_modules and the
local hook tried to run missing `oxfmt`; same workaround as #78142. CI
will run the suite cleanly.

Refs #78407, #78457.
Related: #78055, #78147,
#78146, #78142, #78060.
@clawsweeper

clawsweeper Bot commented May 6, 2026

Contributor

Codex review: needs real behavior proof before merge.

Summary
Adds a failing-by-design doctor regression test for the openai-codex rewrite/auth orphan case and a QA Lab transport-parity proposal document.

Reproducibility: yes. Source inspection and linked live reports give a high-confidence reproduction path: an openai-codex OAuth-only config can be rewritten by doctor repair into openai/* plus pi, which then fails or bills through the direct OpenAI provider.

Real behavior proof
Needs real behavior proof before merge: the PR body has no after-change real behavior proof beyond git diff --check. Add terminal output, copied test output, live doctor logs, or a linked artifact to the PR body; ClawSweeper should then re-review automatically. If it does not, ask a maintainer to comment @clawsweeper re-review.

Next step before merge
Needs contributor changes plus real behavior proof, and the route-policy/QA-gate scope should get maintainer review before merge.

Security
Cleared: The diff adds a test file and a proposal document only; it does not change runtime code, workflows, dependencies, secrets handling, or package execution paths.

Review findings

  • [P2] Make the auth invariant runtime-aware — src/commands/doctor/shared/codex-route-warnings.78407-no-openai-auth.test.ts:107-109
  • [P2] Use live config keys in the regression fixture — src/commands/doctor/shared/codex-route-warnings.78407-no-openai-auth.test.ts:60-89
Review details

Best possible solution:

Use a current-config, runtime/auth-aware regression for the doctor bug, and keep the broader transport-parity matrix in #78457 until maintainers choose the gate policy.

Do we have a high-confidence way to reproduce the issue?

Yes, source inspection and linked live reports give a high-confidence reproduction path: an openai-codex OAuth-only config can be rewritten by doctor repair into openai/* plus pi, which then fails or bills through the direct OpenAI provider.

Is this the best way to solve the issue?

No, not as written. The regression should be built around current config keys and the real auth/runtime resolver so a valid fix using Codex runtime or auth aliasing does not remain hidden behind it.fails.

Full review comments:

  • [P2] Make the auth invariant runtime-aware — src/commands/doctor/shared/codex-route-warnings.78407-no-openai-auth.test.ts:107-109
    The invariant only compares the provider prefix to a hard-coded auth-provider set, so a valid fix that keeps openai/* but selects the Codex runtime or aliases Codex OAuth under the OpenAI route would still look orphaned. Because this test is wrapped in it.fails, that would keep the expected-failure passing instead of forcing the marker to be removed after a real fix.
    Confidence: 0.87
  • [P2] Use live config keys in the regression fixture — src/commands/doctor/shared/codex-route-warnings.78407-no-openai-auth.test.ts:60-89
    This fixture puts the reported fallbacks/catalog/channel overrides under modelOverride, modelCatalog, and channels.webchat.modelOverride, but current config and the doctor repair use model, models, and channels.modelByChannel. As a result, the expected-failure can turn green after only fixing the single agents.defaults.model string while leaving the real fallback/catalog/channel paths uncovered.
    Confidence: 0.84

Overall correctness: patch is incorrect
Overall confidence: 0.86

What I checked:

Likely related people:

  • vincentkoc: Current blame for the doctor route repair file points to Vincent Koc, and the related release-parity change cited by the PR was merged in b9eb31b. (role: recent maintainer; confidence: medium; commits: 8a47c7982678, b9eb31b54cfa; files: src/commands/doctor/shared/codex-route-warnings.ts, src/commands/doctor/shared/codex-route-warnings.test.ts, .github/workflows/openclaw-release-checks.yml)
  • steipete: Recent OpenAI/Codex provider and docs work defines the route semantics that conflict with the doctor migration behavior. (role: adjacent owner; confidence: medium; commits: 5cf55ed3f11f, 2e10ffe8130d; files: docs/providers/openai.md, extensions/openai/openai-provider.ts, extensions/openai/openai-codex-provider.ts)

Remaining risk / open question:

Codex review notes: model gpt-5.5, reasoning high; reviewed against 2e10ffe8130d.

@dungeonmyk


Thanks for pinning #78407 with a regression test.

I can confirm from a live production setup that the issue is not limited to the “OAuth-only user has no OpenAI API key and gets No API key found” failure mode.

There is also a mixed-profile failure mode where the migration can become a billing footgun.

Live setup:

  • OpenClaw 2026.5.5 (b1abf9d)
  • Previous safe route: openai-codex/gpt-5.5 via PI / ChatGPT-Codex OAuth
  • Auth profiles include both:
    • openai-codex:xxxxxxxxxxxx@gmail.com — OAuth / ChatGPT-Codex subscription profile
    • openai:media-api — OpenAI API-key profile, intended for media/API tools

Observed behavior after the update / doctor migration:

  • before: agents.defaults.model.primary = openai-codex/gpt-5.5
  • after: agents.defaults.model.primary = openai/gpt-5.5
  • runtime stayed / became pi

Because an openai:* API-key profile existed, this did not fail closed. The session ran through the paid OpenAI API route instead of the expected ChatGPT/Codex subscription route. Unexpected API usage is now over $10 from this debugging session.

We then tried to move the setup to the documented native Codex route:

  • plugins.entries.codex.enabled = true
  • agents.defaults.model.primary = openai/gpt-5.5
  • agents.defaults.agentRuntime.id = codex

That exposed a second failure mode. The native Codex harness selected / forwarded the wrong auth profile:

Codex app-server auth profile "openai:media-api" must belong to provider "openai-codex" or a supported alias.

Relevant log line:

warn agents/harness {"harnessId":"codex","provider":"openai","modelId":"gpt-5.5","error":"Codex app-server auth profile \"openai:media-api\" must belong to provider \"openai-codex\" or a supported alias."} Codex agent harness failed; not falling back to embedded PI backend

So for this PR’s regression coverage, I think the fixture should probably include a mixed-profile case, not only an OAuth-only/no-OpenAI-auth case:

  1. openai-codex:* OAuth profile exists and should remain the Codex subscription auth path.
  2. openai:* API-key profile also exists, but should not be silently selected for migrated chat/model routes.
  3. doctor --fix must not rewrite openai-codex/* + pi into openai/* + pi unless the user explicitly chooses direct OpenAI API billing.
  4. Native Codex (openai/* + agentRuntime.id: "codex") must not forward an openai:* API-key profile as Codex app-server auth when an openai-codex:* OAuth profile is available.

In other words, the invariant should not only be “does the rewritten provider have some auth?” It should also protect the billing/auth transport boundary:

  • openai-codex/* + pi = subscription/OAuth route
  • openai/* + codex = native Codex subscription/OAuth route, if Codex auth selection is correct
  • openai/* + pi = direct OpenAI API billing route

The dangerous case is that the migration can turn the first into the third without explicit user consent.
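The three route shapes listed above could be encoded as a small classifier that a regression test asserts on. This is a sketch only: `RoutePair`, `BillingRoute`, `classifyRoute`, and `movesOntoDirectBilling` are hypothetical names that do not exist in the codebase, and the `userConsented` flag stands in for whatever explicit opt-in the maintainers choose.

```typescript
// Hypothetical types for illustration; none of these exist in the repo.
type SketchRuntime = "pi" | "codex";
type BillingRoute =
  | "subscription-oauth" // openai-codex/* + pi
  | "native-codex-oauth" // openai/* + codex
  | "direct-api-billing" // openai/* + pi
  | "other";

interface RoutePair {
  modelRef: string;
  runtime: SketchRuntime;
}

function providerPrefix(modelRef: string): string {
  const slash = modelRef.indexOf("/");
  return slash === -1 ? modelRef : modelRef.slice(0, slash);
}

// Maps a (model ref, runtime) pair onto the three routes named in this comment.
function classifyRoute(route: RoutePair): BillingRoute {
  const provider = providerPrefix(route.modelRef);
  if (provider === "openai-codex" && route.runtime === "pi") return "subscription-oauth";
  if (provider === "openai" && route.runtime === "codex") return "native-codex-oauth";
  if (provider === "openai" && route.runtime === "pi") return "direct-api-billing";
  return "other";
}

// True when a migration moves a route onto direct API billing without consent.
function movesOntoDirectBilling(
  before: RoutePair,
  after: RoutePair,
  userConsented = false,
): boolean {
  return (
    !userConsented &&
    classifyRoute(before) !== "direct-api-billing" &&
    classifyRoute(after) === "direct-api-billing"
  );
}
```

With this shape, the rewrite observed in the live setup (`openai-codex/gpt-5.5` on `pi` becoming `openai/gpt-5.5` on `pi`) is flagged, while a move to `openai/gpt-5.5` on `codex` is not.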

I also posted the live mixed-profile follow-up in #78407 for context.

@100yenadmin

Contributor Author

Closing as superseded.

#78407 was fixed on main by #79238 ("Keep OpenAI Codex migrations on automatic runtime routing", 02fe0d8) and @steipete closed the issue with proof on 2026-05-07. CHANGELOG on main (line 195):

Doctor/OpenAI: stop pinning migrated openai-codex/* routes to the Codex runtime so mixed-provider agents keep automatic PI routing for MiniMax, Anthropic, and other non-OpenAI model switches.

Post-#79238 maybeRepairCodexRoutes leaves agentRuntime.id unset, and openAIProviderUsesCodexRuntimeByDefault (src/agents/openai-codex-routing.ts:42) auto-routes openai/* through the Codex runtime when the OpenAI provider has the default base URL. So:

The `it.fails` repro here is too narrow to lock in the post-#79238 contract — the `findModelRefsWithoutAuth` walker only inspects raw config provider prefixes and doesn't see the runtime-policy resolver, so it would still report every `openai/*` ref as orphaned even on the fixed code. Reframing it against `resolveModelRuntimePolicy` is more rework than re-filing as a fresh test.
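To make the gap concrete, here is a minimal sketch of what a runtime-aware check might look like. Everything in it is an assumption for illustration: `SketchPolicyInput`, `effectiveRuntimeFor`, and `isAuthResolvable` are hypothetical, they do not mirror `resolveModelRuntimePolicy` or `openAIProviderUsesCodexRuntimeByDefault`, and the default base URL constant is a guess at what "default base URL" means here.

```typescript
// Assumed value; the project's actual default may differ.
const ASSUMED_DEFAULT_OPENAI_BASE_URL = "https://api.openai.com/v1";

// Hypothetical inputs: an optional runtime pin and an optional base URL override.
interface SketchPolicyInput {
  agentRuntimeId?: "pi" | "codex";
  openaiBaseUrl?: string;
}

// An explicit pin wins; otherwise openai/* on the default base URL goes to codex.
function effectiveRuntimeFor(provider: string, policy: SketchPolicyInput): "pi" | "codex" {
  if (policy.agentRuntimeId) return policy.agentRuntimeId;
  const usesDefaultBaseUrl =
    policy.openaiBaseUrl === undefined ||
    policy.openaiBaseUrl === ASSUMED_DEFAULT_OPENAI_BASE_URL;
  return provider === "openai" && usesDefaultBaseUrl ? "codex" : "pi";
}

// Assumption: under the codex runtime, openai/* refs authenticate with
// openai-codex profiles; every other case needs a profile for its own provider.
function isAuthResolvable(
  modelRef: string,
  policy: SketchPolicyInput,
  authProviders: ReadonlySet<string>,
): boolean {
  const slash = modelRef.indexOf("/");
  const provider = slash === -1 ? modelRef : modelRef.slice(0, slash);
  const runtime = effectiveRuntimeFor(provider, policy);
  const neededAuth = provider === "openai" && runtime === "codex" ? "openai-codex" : provider;
  return authProviders.has(neededAuth);
}
```

Under these assumptions, `openai/gpt-5.5` with no runtime pin and only an `openai-codex` profile resolves, whereas the same ref pinned to `pi` does not. A prefix-only walker cannot tell those two cases apart.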

Happy to extract `extensions/qa-lab/transport-parity-gate.md` into its own PR tied to #78457 if that scaffolding is still wanted.


Labels

  • commands (Command implementations)
  • extensions: qa-lab
  • size: M
  • triage: needs-real-behavior-proof (Candidate: external PR needs after-fix proof from a real setup.)


2 participants