fix(codex): consolidated OAuth error parsing + failed-status fallback routing + reauth UX by teknium1 · Pull Request #15104 · NousResearch/hermes-agent

teknium1 · 2026-04-24T11:53:27Z

Summary

Consolidated salvage of 3 Codex/OpenAI auth PRs — error parsing, fallback routing, and reauth UX. Attribution preserved via rebase-merge.

Changes

hermes_cli/auth.py — parse OpenAI nested error shape {error: {code, message}} in Codex token refresh (was only reading flat OAuth spec shape); add force_new_login=True param to _login_openai_codex so callers can skip the reuse/import prompts for explicit re-auth flows. Cherry-picks from @j3ffffff (fix(auth): parse OpenAI nested error shape in Codex token refresh #11512) and @MattMaximo (fix(model): Fix command 'hermes model' to allow reuse, reauthentication, or new auth when for OpenAI Codex Auth to match the Anthropic flow #4522).
cli.py — when primary provider auth fails at session start, walk the fallback chain before bailing. Previously _ensure_runtime_credentials() returned False immediately on AuthError. Cherry-pick from @A-FdL-Prog (fix(codex): route auth failures to fallback provider chain #5948).
run_agent.py — in the codex_responses validate path, when response.status is failed or cancelled (terminal provider failures like quota exhaustion), mark the response invalid so the fallback chain triggers instead of bubbling the error outside the retry loop. Same PR as above.

Integrated with main's newer refactors: preserved the _get_transport().validate_response() check, kept the output_text fallback for stream-backfill recovery, kept the token-expiry check in the reuse-prompt path.

Credit

Validation

	Before	After
`refresh_codex_oauth_pure` with OpenAI nested error	surfaced generic "status 401" message, no relogin hint	parses nested `code`/`message`, sets `relogin_required`
Primary Codex auth fails at startup with fallback configured	exits with red error message	walks `fallback_model` chain, switches to first working provider
Codex response arrives with `status=failed` (quota out)	error bubbles past retry loop	marked invalid, triggers fallback
`_login_openai_codex(force_new_login=True)`	ignored, still prompted to reuse	skips reuse/import prompts, runs device-code flow

scripts/run_tests.sh tests/hermes_cli/test_auth_codex_provider.py tests/hermes_cli/test_codex_models.py tests/hermes_cli/test_auth_commands.py tests/agent/test_credential_pool.py tests/hermes_cli/test_runtime_provider_resolution.py tests/hermes_cli/test_codex_cli_model_picker.py — 181/181 passing.

Closes as superseded

The following open PRs were obsoleted by #12360 ("Hermes owns its own Codex auth") which removed _sync_codex_entry_from_cli(), the auto-import, and the post-refresh writeback. They will be closed with credit and a pointer to the merged design:

Recover Codex auth from fresh CLI sessions #13912 @nkongecraig-max — Recover Codex auth from fresh CLI sessions (also scope creep: bundled unrelated optional-skills)
Fix Codex CLI sync clobbering manual pool entries #12924 @Alex-giao — Fix Codex CLI sync clobbering manual pool entries
test(conftest): isolate CODEX_HOME so token-refresh writeback doesn't leak between tests #10927 @luigileap — Isolate CODEX_HOME in tests (writeback removed)
fix: align codex auth status with usable credentials #10286 @redf0x1 — Align codex auth status with usable credentials (small unused-token guard worth revisiting separately)
fix(codex): force codex_responses api_mode for openai-codex provider #10044 @saamuelng601-pixel — Force codex_responses api_mode (already on main, run_agent.py:873-874)
[codex] fix provider auth compatibility #9444 @g-guthrie — _sync_codex_entry_from_cli refinements
fix: auth add openai-codex should import from ~/.codex/auth.json before device code #9284 @ASRagab — Import from ~/.codex/auth.json before device code (explicit import path already exists)
fix(codex): recover stale CLI auth sync state #6652 @lkyprogramer — Recover stale CLI auth sync state
fix(auth): recover stale codex auth from CLI session #3279 @lsaether — Recover stale codex auth from CLI session

OpenAI's OAuth token endpoint returns errors in a nested shape — {"error": {"code": "refresh_token_reused", "message": "..."}} — not the OAuth spec's flat {"error": "...", "error_description": "..."}. The existing parser only handled the flat shape, so: - `err.get("error")` returned a dict, the `isinstance(str)` guard rejected it, and `code` stayed `"codex_refresh_failed"`. - The dedicated `refresh_token_reused` branch (with its actionable "re-run codex + hermes auth" message and `relogin_required=True`) never fired. - Users saw the generic "Codex token refresh failed with status 401" when another Codex client (CLI, VS Code extension) had consumed their single-use refresh token — giving no hint that re-auth was required. Parse both shapes, mapping OpenAI's nested `code`/`type` onto the existing `code` variable so downstream branches (`refresh_token_reused`, `invalid_grant`, etc.) fire correctly. Add regression tests covering: - nested `refresh_token_reused` → actionable message + relogin_required - nested generic code → code + message surfaced - flat OAuth-spec `invalid_grant` still handled (back-compat) - unparseable body → generic fallback message, relogin_required=False Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Two related paths where Codex auth failures silently swallowed the fallback chain instead of switching to the next provider: 1. cli.py — _ensure_runtime_credentials() calls resolve_runtime_provider() before each turn. When provider is explicitly configured (not "auto"), an AuthError from token refresh is re-raised and printed as a bold-red error, returning False before the agent ever starts. The fallback chain was never tried. Fix: on AuthError, iterate fallback_providers and switch to the first one that resolves successfully. 2. run_agent.py — inside the codex_responses validity gate (inner retry loop), response.status in {"failed","cancelled"} with non-empty output items was treated as a valid response and broke out of the retry loop, reaching _normalize_codex_response() outside the fallback machinery. That function raises RuntimeError on status="failed", which propagates to the outer except with no fallback logic. Fix: detect terminal status codes before the output_items check and set response_invalid=True so the existing fallback chain fires normally.

When the Codex Responses API returns HTTP 200 with response.status = "failed" or "cancelled" (e.g. quota exhaustion), the runtime correctly detects the soft failure but only attempts cross-provider fallback via _try_activate_fallback(). Same-provider credential pool rotation is never invoked, so the agent burns through all pool entries on one exhausted account before falling through to a different provider. Root cause: the response_invalid block (PR NousResearch#15104) added detection for codex soft failures but did not wire in _recover_with_credential_pool(). The exception handler path (429/402) does call pool recovery, but these soft failures bypass the exception handler entirely. Changes: - Add _classify_codex_soft_failure() method to AIAgent that pattern- matches the error message from response.error against billing and rate-limit signals, returning a FailoverReason or None. - Add early pool-recovery block in the response_invalid path for codex_responses mode: classify the error, attempt rotation, and continue on the new credential before falling through to cross- provider fallback. - Billing signals are checked first (insufficient, credits, quota, billing, payment required, usage limit, account deactivated). - Rate-limit signals checked second (rate limit, throttle, retry, too many requests, quota exceeded, requests/tokens per). - Non-quota soft failures (content policy, safety, cancelled) return None and do not trigger pool rotation. Fixes NousResearch#24159

j3ffffff and others added 4 commits April 24, 2026 04:52

fix(model): let Codex setup reuse or reauthenticate

6c51614

chore(release): map j3ffffff and A-FdL-Prog in AUTHOR_MAP

2597c0f

teknium1 merged commit 2f39dbe into main Apr 24, 2026
10 of 11 checks passed

teknium1 deleted the hermes/hermes-172af8ae branch April 24, 2026 11:53

teknium1 mentioned this pull request Apr 24, 2026

fix(agent): consolidated fallback/retry correctness — SSL retry, cooldown, config context, auth fallback parity #15134

Merged

This was referenced Apr 24, 2026

fix(auth): parse OpenAI nested error shape in Codex token refresh #11512

Closed

fix(codex): route auth failures to fallback provider chain #5948

Closed

This was referenced May 12, 2026

[Bug] Codex Responses API soft failures (status=failed) bypass credential pool rotation #24159

Open

fix(codex): credential pool rotation on Responses API soft failures #24173

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(codex): consolidated OAuth error parsing + failed-status fallback routing + reauth UX#15104

fix(codex): consolidated OAuth error parsing + failed-status fallback routing + reauth UX#15104
teknium1 merged 4 commits into
mainfrom
hermes/hermes-172af8ae

teknium1 commented Apr 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

teknium1 commented Apr 24, 2026

Summary

Changes

Credit

Validation

Closes as superseded

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants