Skip to content

fix(codex): route auth failures to fallback provider chain#5948

Closed
A-FdL-Prog wants to merge 1 commit into
NousResearch:mainfrom
A-FdL-Prog:fix/codex-auth-failure-fallback
Closed

fix(codex): route auth failures to fallback provider chain#5948
A-FdL-Prog wants to merge 1 commit into
NousResearch:mainfrom
A-FdL-Prog:fix/codex-auth-failure-fallback

Conversation

@A-FdL-Prog

Copy link
Copy Markdown
Contributor

Problem

Two code paths where Codex authentication failures silently bypassed the fallback provider chain instead of switching to the next configured provider.

1. cli.py — pre-turn credential check blocks fallback entirely

_ensure_runtime_credentials() is called before every agent turn. It calls resolve_runtime_provider(requested=self.requested_provider). When the provider is explicitly configured (e.g. provider: openai-codex in config.yaml — the typical setup), resolve_runtime_provider re-raises AuthError on token refresh failure rather than falling through. This error is caught, printed in bold red, and the function returns False — so the agent never starts and fallback_providers is never consulted.

Symptom: User sees a raw Codex token refresh failed with status 401 error in the Hermes TUI instead of a transparent switch to the fallback model.

2. run_agent.pyresponse.status=failed escapes the retry/fallback loop

Inside the codex_responses validity gate (inner retry loop), a response with status in {failed, cancelled} and non-empty output_items was treated as valid (the not isinstance(output_items, list) check passed) and broke out of the loop. It then reached _normalize_codex_response() — which is called outside the retry loop and raises RuntimeError on terminal status codes. That exception propagated to the outer except Exception handler which has no fallback logic.

Symptom: Quota exhaustion on the Codex provider (which returns response.status=failed) showed a raw GPT error instead of switching to the fallback chain.

Fix

cli.py: On AuthError in _ensure_runtime_credentials, iterate fallback_providers and try each in order. On first success, swap self.requested_provider and self.model to the fallback and continue normally. Only fail if every provider in the chain is exhausted.

run_agent.py: Add an explicit elif _codex_resp_status in {failed, cancelled}: check before the not isinstance(output_items, list) check inside the codex validity gate. Sets response_invalid = True with a logging.warning so the existing _try_activate_fallback() path fires.

Testing

Reproduced locally with an expired Codex OAuth token (401 on refresh) and with quota exhaustion (response.status=failed). Both now transparently switch to the first working fallback_providers entry.

Two related paths where Codex auth failures silently swallowed the
fallback chain instead of switching to the next provider:

1. cli.py — _ensure_runtime_credentials() calls resolve_runtime_provider()
   before each turn. When provider is explicitly configured (not "auto"),
   an AuthError from token refresh is re-raised and printed as a bold-red
   error, returning False before the agent ever starts. The fallback chain
   was never tried. Fix: on AuthError, iterate fallback_providers and
   switch to the first one that resolves successfully.

2. run_agent.py — inside the codex_responses validity gate (inner retry
   loop), response.status in {"failed","cancelled"} with non-empty output
   items was treated as a valid response and broke out of the retry loop,
   reaching _normalize_codex_response() outside the fallback machinery.
   That function raises RuntimeError on status="failed", which propagates
   to the outer except with no fallback logic. Fix: detect terminal status
   codes before the output_items check and set response_invalid=True so
   the existing fallback chain fires normally.
@alt-glitch alt-glitch added type/bug Something isn't working comp/cli CLI entry point, hermes_cli/, setup wizard comp/agent Core agent loop, run_agent.py, prompt builder provider/copilot GitHub Copilot (ACP + Chat) P2 Medium — degraded but workaround exists duplicate This issue or pull request already exists labels Apr 30, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Likely duplicate of #15104 (merged) — same root cause: codex auth failures not routed to fallback provider chain. #15104 consolidated OAuth error parsing + failed-status fallback routing.

@alt-glitch

Copy link
Copy Markdown
Collaborator

Likely duplicate of #15104 (merged) — same root cause: codex auth failures not routed to fallback provider chain.

@teknium1

Copy link
Copy Markdown
Contributor

Closing as superseded by #15104.

Triage notes (high confidence):
cli.py:4501-4537 already contains the AuthError fallback chain logic the PR proposes; PR #15104 (merged 2026-04-24) consolidated codex failed-status fallback routing in run_agent.py.

Thanks for the contribution — the underlying problem this PR addresses has been resolved by the linked PR on current main. If you believe this was closed in error, please comment and we'll reopen.

(Bulk-closed during a CLI PR triage sweep.)

@teknium1 teknium1 closed this May 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard duplicate This issue or pull request already exists P2 Medium — degraded but workaround exists provider/copilot GitHub Copilot (ACP + Chat) type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants