Skip to content

fix(codex): consolidated OAuth error parsing + failed-status fallback routing + reauth UX#15104

Merged
teknium1 merged 4 commits into
mainfrom
hermes/hermes-172af8ae
Apr 24, 2026
Merged

fix(codex): consolidated OAuth error parsing + failed-status fallback routing + reauth UX#15104
teknium1 merged 4 commits into
mainfrom
hermes/hermes-172af8ae

Conversation

@teknium1

Copy link
Copy Markdown
Contributor

Summary

Consolidated salvage of 3 Codex/OpenAI auth PRs — error parsing, fallback routing, and reauth UX. Attribution preserved via rebase-merge.

Changes

Integrated with main's newer refactors: preserved the _get_transport().validate_response() check, kept the output_text fallback for stream-backfill recovery, kept the token-expiry check in the reuse-prompt path.

Credit

Validation

Before After
refresh_codex_oauth_pure with OpenAI nested error surfaced generic "status 401" message, no relogin hint parses nested code/message, sets relogin_required
Primary Codex auth fails at startup with fallback configured exits with red error message walks fallback_model chain, switches to first working provider
Codex response arrives with status=failed (quota out) error bubbles past retry loop marked invalid, triggers fallback
_login_openai_codex(force_new_login=True) ignored, still prompted to reuse skips reuse/import prompts, runs device-code flow

scripts/run_tests.sh tests/hermes_cli/test_auth_codex_provider.py tests/hermes_cli/test_codex_models.py tests/hermes_cli/test_auth_commands.py tests/agent/test_credential_pool.py tests/hermes_cli/test_runtime_provider_resolution.py tests/hermes_cli/test_codex_cli_model_picker.py — 181/181 passing.

Closes as superseded

The following open PRs were obsoleted by #12360 ("Hermes owns its own Codex auth") which removed _sync_codex_entry_from_cli(), the auto-import, and the post-refresh writeback. They will be closed with credit and a pointer to the merged design:

j3ffffff and others added 4 commits April 24, 2026 04:52
OpenAI's OAuth token endpoint returns errors in a nested shape —
{"error": {"code": "refresh_token_reused", "message": "..."}} —
not the OAuth spec's flat {"error": "...", "error_description": "..."}.
The existing parser only handled the flat shape, so:

- `err.get("error")` returned a dict, the `isinstance(str)` guard
  rejected it, and `code` stayed `"codex_refresh_failed"`.
- The dedicated `refresh_token_reused` branch (with its actionable
  "re-run codex + hermes auth" message and `relogin_required=True`)
  never fired.
- Users saw the generic "Codex token refresh failed with status 401"
  when another Codex client (CLI, VS Code extension) had consumed
  their single-use refresh token — giving no hint that re-auth was
  required.

Parse both shapes, mapping OpenAI's nested `code`/`type` onto the
existing `code` variable so downstream branches (`refresh_token_reused`,
`invalid_grant`, etc.) fire correctly.

Add regression tests covering:
- nested `refresh_token_reused` → actionable message + relogin_required
- nested generic code → code + message surfaced
- flat OAuth-spec `invalid_grant` still handled (back-compat)
- unparseable body → generic fallback message, relogin_required=False

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two related paths where Codex auth failures silently swallowed the
fallback chain instead of switching to the next provider:

1. cli.py — _ensure_runtime_credentials() calls resolve_runtime_provider()
   before each turn. When provider is explicitly configured (not "auto"),
   an AuthError from token refresh is re-raised and printed as a bold-red
   error, returning False before the agent ever starts. The fallback chain
   was never tried. Fix: on AuthError, iterate fallback_providers and
   switch to the first one that resolves successfully.

2. run_agent.py — inside the codex_responses validity gate (inner retry
   loop), response.status in {"failed","cancelled"} with non-empty output
   items was treated as a valid response and broke out of the retry loop,
   reaching _normalize_codex_response() outside the fallback machinery.
   That function raises RuntimeError on status="failed", which propagates
   to the outer except with no fallback logic. Fix: detect terminal status
   codes before the output_items check and set response_invalid=True so
   the existing fallback chain fires normally.
@teknium1 teknium1 merged commit 2f39dbe into main Apr 24, 2026
10 of 11 checks passed
@teknium1 teknium1 deleted the hermes/hermes-172af8ae branch April 24, 2026 11:53
@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard area/auth Authentication, OAuth, credential pools provider/openai OpenAI / Codex Responses API labels Apr 24, 2026
jmmaloney4 added a commit to jmmaloney4/hermes-agent that referenced this pull request May 13, 2026
When the Codex Responses API returns HTTP 200 with response.status =
"failed" or "cancelled" (e.g. quota exhaustion), the runtime correctly
detects the soft failure but only attempts cross-provider fallback via
_try_activate_fallback().  Same-provider credential pool rotation is
never invoked, so the agent burns through all pool entries on one
exhausted account before falling through to a different provider.

Root cause: the response_invalid block (PR NousResearch#15104) added detection for
codex soft failures but did not wire in _recover_with_credential_pool().
The exception handler path (429/402) does call pool recovery, but these
soft failures bypass the exception handler entirely.

Changes:
- Add _classify_codex_soft_failure() method to AIAgent that pattern-
  matches the error message from response.error against billing and
  rate-limit signals, returning a FailoverReason or None.
- Add early pool-recovery block in the response_invalid path for
  codex_responses mode: classify the error, attempt rotation, and
  continue on the new credential before falling through to cross-
  provider fallback.
- Billing signals are checked first (insufficient, credits, quota,
  billing, payment required, usage limit, account deactivated).
- Rate-limit signals checked second (rate limit, throttle, retry,
  too many requests, quota exceeded, requests/tokens per).
- Non-quota soft failures (content policy, safety, cancelled) return
  None and do not trigger pool rotation.

Fixes NousResearch#24159
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/auth Authentication, OAuth, credential pools comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard P1 High — major feature broken, no workaround provider/openai OpenAI / Codex Responses API type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants