fix(codex): route auth failures to fallback provider chain by A-FdL-Prog · Pull Request #5948 · NousResearch/hermes-agent

A-FdL-Prog · 2026-04-07T21:34:19Z

Problem

Two code paths where Codex authentication failures silently bypassed the fallback provider chain instead of switching to the next configured provider.

1. `cli.py` — pre-turn credential check blocks fallback entirely

_ensure_runtime_credentials() is called before every agent turn. It calls resolve_runtime_provider(requested=self.requested_provider). When the provider is explicitly configured (e.g. provider: openai-codex in config.yaml — the typical setup), resolve_runtime_provider re-raises AuthError on token refresh failure rather than falling through. This error is caught, printed in bold red, and the function returns False — so the agent never starts and fallback_providers is never consulted.

Symptom: User sees a raw Codex token refresh failed with status 401 error in the Hermes TUI instead of a transparent switch to the fallback model.

2. `run_agent.py` — `response.status=failed` escapes the retry/fallback loop

Inside the codex_responses validity gate (inner retry loop), a response with status in {failed, cancelled} and non-empty output_items was treated as valid (the not isinstance(output_items, list) check passed) and broke out of the loop. It then reached _normalize_codex_response() — which is called outside the retry loop and raises RuntimeError on terminal status codes. That exception propagated to the outer except Exception handler which has no fallback logic.

Symptom: Quota exhaustion on the Codex provider (which returns response.status=failed) showed a raw GPT error instead of switching to the fallback chain.

Fix

cli.py: On AuthError in _ensure_runtime_credentials, iterate fallback_providers and try each in order. On first success, swap self.requested_provider and self.model to the fallback and continue normally. Only fail if every provider in the chain is exhausted.

run_agent.py: Add an explicit elif _codex_resp_status in {failed, cancelled}: check before the not isinstance(output_items, list) check inside the codex validity gate. Sets response_invalid = True with a logging.warning so the existing _try_activate_fallback() path fires.

Testing

Reproduced locally with an expired Codex OAuth token (401 on refresh) and with quota exhaustion (response.status=failed). Both now transparently switch to the first working fallback_providers entry.

Two related paths where Codex auth failures silently swallowed the fallback chain instead of switching to the next provider: 1. cli.py — _ensure_runtime_credentials() calls resolve_runtime_provider() before each turn. When provider is explicitly configured (not "auto"), an AuthError from token refresh is re-raised and printed as a bold-red error, returning False before the agent ever starts. The fallback chain was never tried. Fix: on AuthError, iterate fallback_providers and switch to the first one that resolves successfully. 2. run_agent.py — inside the codex_responses validity gate (inner retry loop), response.status in {"failed","cancelled"} with non-empty output items was treated as a valid response and broke out of the retry loop, reaching _normalize_codex_response() outside the fallback machinery. That function raises RuntimeError on status="failed", which propagates to the outer except with no fallback logic. Fix: detect terminal status codes before the output_items check and set response_invalid=True so the existing fallback chain fires normally.

alt-glitch · 2026-04-30T15:48:04Z

Likely duplicate of #15104 (merged) — same root cause: codex auth failures not routed to fallback provider chain. #15104 consolidated OAuth error parsing + failed-status fallback routing.

alt-glitch · 2026-04-30T15:48:40Z

Likely duplicate of #15104 (merged) — same root cause: codex auth failures not routed to fallback provider chain.

teknium1 · 2026-05-24T11:05:02Z

Closing as superseded by #15104.

Triage notes (high confidence):
cli.py:4501-4537 already contains the AuthError fallback chain logic the PR proposes; PR #15104 (merged 2026-04-24) consolidated codex failed-status fallback routing in run_agent.py.

Thanks for the contribution — the underlying problem this PR addresses has been resolved by the linked PR on current main. If you believe this was closed in error, please comment and we'll reopen.

(Bulk-closed during a CLI PR triage sweep.)

This was referenced Apr 24, 2026

fix(codex): consolidated OAuth error parsing + failed-status fallback routing + reauth UX #15104

Merged

fix(agent): consolidated fallback/retry correctness — SSL retry, cooldown, config context, auth fallback parity #15134

Merged

alt-glitch mentioned this pull request Apr 24, 2026

fix(auth): parse OpenAI nested error shape in Codex token refresh #11512

Closed

2 tasks

teknium1 closed this May 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(codex): route auth failures to fallback provider chain#5948

fix(codex): route auth failures to fallback provider chain#5948
A-FdL-Prog wants to merge 1 commit into
NousResearch:mainfrom
A-FdL-Prog:fix/codex-auth-failure-fallback

A-FdL-Prog commented Apr 7, 2026

Uh oh!

alt-glitch commented Apr 30, 2026

Uh oh!

alt-glitch commented Apr 30, 2026

Uh oh!

teknium1 commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

A-FdL-Prog commented Apr 7, 2026

Problem

1. cli.py — pre-turn credential check blocks fallback entirely

2. run_agent.py — response.status=failed escapes the retry/fallback loop

Fix

Testing

Uh oh!

alt-glitch commented Apr 30, 2026

Uh oh!

alt-glitch commented Apr 30, 2026

Uh oh!

teknium1 commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

1. `cli.py` — pre-turn credential check blocks fallback entirely

2. `run_agent.py` — `response.status=failed` escapes the retry/fallback loop