fix(codex): route auth failures to fallback provider chain#5948
Closed
A-FdL-Prog wants to merge 1 commit into
Closed
fix(codex): route auth failures to fallback provider chain#5948A-FdL-Prog wants to merge 1 commit into
A-FdL-Prog wants to merge 1 commit into
Conversation
Two related paths where Codex auth failures silently swallowed the
fallback chain instead of switching to the next provider:
1. cli.py — _ensure_runtime_credentials() calls resolve_runtime_provider()
before each turn. When provider is explicitly configured (not "auto"),
an AuthError from token refresh is re-raised and printed as a bold-red
error, returning False before the agent ever starts. The fallback chain
was never tried. Fix: on AuthError, iterate fallback_providers and
switch to the first one that resolves successfully.
2. run_agent.py — inside the codex_responses validity gate (inner retry
loop), response.status in {"failed","cancelled"} with non-empty output
items was treated as a valid response and broke out of the retry loop,
reaching _normalize_codex_response() outside the fallback machinery.
That function raises RuntimeError on status="failed", which propagates
to the outer except with no fallback logic. Fix: detect terminal status
codes before the output_items check and set response_invalid=True so
the existing fallback chain fires normally.
2 tasks
Collaborator
Collaborator
|
Likely duplicate of #15104 (merged) — same root cause: codex auth failures not routed to fallback provider chain. |
Contributor
|
Closing as superseded by #15104. Triage notes (high confidence): Thanks for the contribution — the underlying problem this PR addresses has been resolved by the linked PR on current main. If you believe this was closed in error, please comment and we'll reopen. (Bulk-closed during a CLI PR triage sweep.) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Two code paths where Codex authentication failures silently bypassed the fallback provider chain instead of switching to the next configured provider.
1.
cli.py— pre-turn credential check blocks fallback entirely_ensure_runtime_credentials()is called before every agent turn. It callsresolve_runtime_provider(requested=self.requested_provider). When the provider is explicitly configured (e.g.provider: openai-codexin config.yaml — the typical setup),resolve_runtime_providerre-raisesAuthErroron token refresh failure rather than falling through. This error is caught, printed in bold red, and the function returnsFalse— so the agent never starts andfallback_providersis never consulted.Symptom: User sees a raw
Codex token refresh failed with status 401error in the Hermes TUI instead of a transparent switch to the fallback model.2.
run_agent.py—response.status=failedescapes the retry/fallback loopInside the
codex_responsesvalidity gate (inner retry loop), a response withstatus in {failed, cancelled}and non-emptyoutput_itemswas treated as valid (thenot isinstance(output_items, list)check passed) and broke out of the loop. It then reached_normalize_codex_response()— which is called outside the retry loop and raisesRuntimeErroron terminal status codes. That exception propagated to the outerexcept Exceptionhandler which has no fallback logic.Symptom: Quota exhaustion on the Codex provider (which returns
response.status=failed) showed a raw GPT error instead of switching to the fallback chain.Fix
cli.py: OnAuthErrorin_ensure_runtime_credentials, iteratefallback_providersand try each in order. On first success, swapself.requested_providerandself.modelto the fallback and continue normally. Only fail if every provider in the chain is exhausted.run_agent.py: Add an explicitelif _codex_resp_status in {failed, cancelled}:check before thenot isinstance(output_items, list)check inside the codex validity gate. Setsresponse_invalid = Truewith alogging.warningso the existing_try_activate_fallback()path fires.Testing
Reproduced locally with an expired Codex OAuth token (401 on refresh) and with quota exhaustion (
response.status=failed). Both now transparently switch to the first workingfallback_providersentry.