fix(codex): consolidated OAuth error parsing + failed-status fallback routing + reauth UX#15104
Merged
Conversation
OpenAI's OAuth token endpoint returns errors in a nested shape —
{"error": {"code": "refresh_token_reused", "message": "..."}} —
not the OAuth spec's flat {"error": "...", "error_description": "..."}.
The existing parser only handled the flat shape, so:
- `err.get("error")` returned a dict, the `isinstance(str)` guard
rejected it, and `code` stayed `"codex_refresh_failed"`.
- The dedicated `refresh_token_reused` branch (with its actionable
"re-run codex + hermes auth" message and `relogin_required=True`)
never fired.
- Users saw the generic "Codex token refresh failed with status 401"
when another Codex client (CLI, VS Code extension) had consumed
their single-use refresh token — giving no hint that re-auth was
required.
Parse both shapes, mapping OpenAI's nested `code`/`type` onto the
existing `code` variable so downstream branches (`refresh_token_reused`,
`invalid_grant`, etc.) fire correctly.
Add regression tests covering:
- nested `refresh_token_reused` → actionable message + relogin_required
- nested generic code → code + message surfaced
- flat OAuth-spec `invalid_grant` still handled (back-compat)
- unparseable body → generic fallback message, relogin_required=False
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two related paths where Codex auth failures silently swallowed the
fallback chain instead of switching to the next provider:
1. cli.py — _ensure_runtime_credentials() calls resolve_runtime_provider()
before each turn. When provider is explicitly configured (not "auto"),
an AuthError from token refresh is re-raised and printed as a bold-red
error, returning False before the agent ever starts. The fallback chain
was never tried. Fix: on AuthError, iterate fallback_providers and
switch to the first one that resolves successfully.
2. run_agent.py — inside the codex_responses validity gate (inner retry
loop), response.status in {"failed","cancelled"} with non-empty output
items was treated as a valid response and broke out of the retry loop,
reaching _normalize_codex_response() outside the fallback machinery.
That function raises RuntimeError on status="failed", which propagates
to the outer except with no fallback logic. Fix: detect terminal status
codes before the output_items check and set response_invalid=True so
the existing fallback chain fires normally.
This was referenced Apr 24, 2026
Closed
This was referenced Apr 24, 2026
This was referenced May 12, 2026
jmmaloney4
added a commit
to jmmaloney4/hermes-agent
that referenced
this pull request
May 13, 2026
When the Codex Responses API returns HTTP 200 with response.status = "failed" or "cancelled" (e.g. quota exhaustion), the runtime correctly detects the soft failure but only attempts cross-provider fallback via _try_activate_fallback(). Same-provider credential pool rotation is never invoked, so the agent burns through all pool entries on one exhausted account before falling through to a different provider. Root cause: the response_invalid block (PR NousResearch#15104) added detection for codex soft failures but did not wire in _recover_with_credential_pool(). The exception handler path (429/402) does call pool recovery, but these soft failures bypass the exception handler entirely. Changes: - Add _classify_codex_soft_failure() method to AIAgent that pattern- matches the error message from response.error against billing and rate-limit signals, returning a FailoverReason or None. - Add early pool-recovery block in the response_invalid path for codex_responses mode: classify the error, attempt rotation, and continue on the new credential before falling through to cross- provider fallback. - Billing signals are checked first (insufficient, credits, quota, billing, payment required, usage limit, account deactivated). - Rate-limit signals checked second (rate limit, throttle, retry, too many requests, quota exceeded, requests/tokens per). - Non-quota soft failures (content policy, safety, cancelled) return None and do not trigger pool rotation. Fixes NousResearch#24159
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Consolidated salvage of 3 Codex/OpenAI auth PRs — error parsing, fallback routing, and reauth UX. Attribution preserved via rebase-merge.
Changes
hermes_cli/auth.py— parse OpenAI nested error shape{error: {code, message}}in Codex token refresh (was only reading flat OAuth spec shape); addforce_new_login=Trueparam to_login_openai_codexso callers can skip the reuse/import prompts for explicit re-auth flows. Cherry-picks from @j3ffffff (fix(auth): parse OpenAI nested error shape in Codex token refresh #11512) and @MattMaximo (fix(model): Fix command 'hermes model' to allow reuse, reauthentication, or new auth when for OpenAI Codex Auth to match the Anthropic flow #4522).cli.py— when primary provider auth fails at session start, walk the fallback chain before bailing. Previously_ensure_runtime_credentials()returned False immediately on AuthError. Cherry-pick from @A-FdL-Prog (fix(codex): route auth failures to fallback provider chain #5948).run_agent.py— in the codex_responses validate path, whenresponse.statusisfailedorcancelled(terminal provider failures like quota exhaustion), mark the response invalid so the fallback chain triggers instead of bubbling the error outside the retry loop. Same PR as above.Integrated with main's newer refactors: preserved the
_get_transport().validate_response()check, kept theoutput_textfallback for stream-backfill recovery, kept the token-expiry check in the reuse-prompt path.Credit
Validation
refresh_codex_oauth_purewith OpenAI nested errorcode/message, setsrelogin_requiredfallback_modelchain, switches to first working providerstatus=failed(quota out)_login_openai_codex(force_new_login=True)scripts/run_tests.sh tests/hermes_cli/test_auth_codex_provider.py tests/hermes_cli/test_codex_models.py tests/hermes_cli/test_auth_commands.py tests/agent/test_credential_pool.py tests/hermes_cli/test_runtime_provider_resolution.py tests/hermes_cli/test_codex_cli_model_picker.py— 181/181 passing.Closes as superseded
The following open PRs were obsoleted by #12360 ("Hermes owns its own Codex auth") which removed
_sync_codex_entry_from_cli(), the auto-import, and the post-refresh writeback. They will be closed with credit and a pointer to the merged design:run_agent.py:873-874)_sync_codex_entry_from_clirefinements~/.codex/auth.jsonbefore device code (explicit import path already exists)