fix(codex-oauth): quarantine terminal refresh errors so dead tokens are not replayed across sessions #27911

Closed
EloquentBrush0x wants to merge 1 commit into NousResearch:main from EloquentBrush0x:fix/codex-oauth-terminal-error-quarantine

@EloquentBrush0x

Contributor

Problem

When a Codex OAuth refresh token is permanently invalidated (HTTP 400/401/403 — token revoked, invalid_grant, or refresh_token_reused), _mark_exhausted is called but auth.json is left unchanged. On the next Hermes session, _seed_from_singletons re-reads auth.json and re-seeds the pool with the same revoked token, which triggers the same terminal failure in a loop across restarts.

The same gap was fixed for Nous in c905562 and is pending for xAI in #27898. This PR closes the identical gap for openai-codex.

Changes

hermes_cli/auth.py

  • Add _is_terminal_codex_oauth_refresh_error(exc): returns True for AuthError instances with provider="openai-codex", a terminal error code (codex_refresh_failed, codex_auth_missing_refresh_token, invalid_grant, invalid_token, refresh_token_reused), and relogin_required=True. Transient failures (429, 5xx) carry relogin_required=False and are not matched.
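
The predicate described above can be sketched roughly as follows. This is a minimal illustration, not the code from hermes_cli/auth.py: the AuthError class here is a hypothetical stand-in (its attribute name `code` and its constructor are assumptions), and only the provider name, the error codes, and the relogin_required flag come from the description.

```python
# Hypothetical stand-in for the project's AuthError; the real class and its
# constructor signature may differ.
class AuthError(Exception):
    def __init__(self, message, *, provider=None, code=None, relogin_required=False):
        super().__init__(message)
        self.provider = provider
        self.code = code
        self.relogin_required = relogin_required


# Error codes the PR description lists as terminal for openai-codex.
TERMINAL_CODEX_CODES = frozenset({
    "codex_refresh_failed",
    "codex_auth_missing_refresh_token",
    "invalid_grant",
    "invalid_token",
    "refresh_token_reused",
})


def is_terminal_codex_oauth_refresh_error(exc):
    """Return True only for a Codex AuthError that needs a fresh login."""
    if not isinstance(exc, AuthError):
        return False  # generic exceptions are never treated as terminal
    return (
        exc.provider == "openai-codex"
        and exc.code in TERMINAL_CODEX_CODES
        and exc.relogin_required is True
    )


revoked = AuthError("revoked", provider="openai-codex",
                    code="invalid_grant", relogin_required=True)
rate_limited = AuthError("429", provider="openai-codex",
                         code="codex_refresh_failed", relogin_required=False)
```

Requiring all three conditions means a transient failure that happens to reuse a terminal-looking code (such as `rate_limited` above) still falls through to the ordinary retry path.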

agent/credential_pool.py

  • Add pre-refresh sync from auth.json before calling refresh_codex_oauth_pure, matching the xAI and Nous patterns, to avoid refresh_token_reused races when multiple Hermes processes share the same auth.json singleton.
  • Add race-recovery block in the exception handler: if auth.json has a newer refresh_token, adopt it and return the recovered entry (same pattern as xAI at line 892 and Nous at line 910).
  • Add terminal quarantine block: if _is_terminal_codex_oauth_refresh_error is true and auth.json holds no newer tokens, clear access_token/refresh_token from auth.json, write last_auth_error, and remove all device_code-sourced entries from the in-memory pool. Mirrors the Nous quarantine path exactly.
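
The race-recovery and quarantine steps above might look something like the sketch below. It is an assumption-heavy illustration rather than the actual _refresh_entry code: the auth.json layout (a top-level "openai-codex" object), the plain-dict pool entries with a "source" field, the string-valued last_auth_error, and the helper names are all hypothetical.

```python
import json
from pathlib import Path

PROVIDER_KEY = "openai-codex"  # assumed top-level key in auth.json


def read_provider_state(auth_path):
    """Return (whole_file, provider_section); tolerate a missing or bad file."""
    try:
        data = json.loads(Path(auth_path).read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        data = {}
    if not isinstance(data, dict):
        data = {}
    return data, dict(data.get(PROVIDER_KEY) or {})


def handle_refresh_failure(auth_path, entry, pool, error_code, terminal):
    """Return (recovered_entry_or_None, pool) after a failed refresh."""
    data, stored = read_provider_state(auth_path)
    newer = stored.get("refresh_token")

    # Race recovery: another process already rotated the token, so adopt it.
    if newer and newer != entry["refresh_token"]:
        recovered = {**entry,
                     "refresh_token": newer,
                     "access_token": stored.get("access_token")}
        return recovered, pool

    # Terminal quarantine: wipe the dead tokens so the next session
    # cannot re-seed the pool from them.
    if terminal:
        stored.pop("access_token", None)
        stored.pop("refresh_token", None)
        stored["last_auth_error"] = error_code
        data[PROVIDER_KEY] = stored
        Path(auth_path).write_text(json.dumps(data, indent=2))
        pool = [e for e in pool if e.get("source") != "device_code"]

    # Transient failures leave both auth.json and the pool untouched.
    return None, pool
```

Checking for a newer refresh_token before quarantining matters: without that ordering, a process that lost a rotation race would erase the valid token its sibling had just written.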

tests/agent/test_credential_pool.py

  • test_is_terminal_codex_oauth_refresh_error — predicate unit test (terminal vs transient, wrong provider, generic exception)
  • test_codex_oauth_terminal_refresh_clears_auth_json_and_removes_pool_entries — integration test: pool seeded from auth.json, terminal failure quarantines auth.json and removes device_code entries while manual entries survive; second try_refresh_current does not call the refresh function again
  • test_codex_oauth_nonterminal_refresh_does_not_quarantine — transient failure leaves auth.json tokens intact
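
The "second try_refresh_current does not call the refresh function again" check can be illustrated with a call-counting fake. The TinyPool class and TerminalRefreshError below are hypothetical stand-ins written for this sketch; they are not the project's credential pool, and the real test in tests/agent/test_credential_pool.py likely uses different fixtures.

```python
class TerminalRefreshError(Exception):
    """Hypothetical stand-in for a terminal AuthError."""


class TinyPool:
    """Toy pool: only device_code entries are treated as refreshable."""

    def __init__(self, entries, refresh_fn):
        self.entries = list(entries)
        self._refresh_fn = refresh_fn

    def try_refresh_current(self):
        current = next(
            (e for e in self.entries if e["source"] == "device_code"), None
        )
        if current is None:
            return None  # nothing left to refresh, so no network call
        try:
            return self._refresh_fn(current)
        except TerminalRefreshError:
            # Quarantine: drop device_code entries, keep manual ones.
            self.entries = [
                e for e in self.entries if e["source"] != "device_code"
            ]
            return None


calls = []


def failing_refresh(entry):
    calls.append(entry["id"])  # record every refresh attempt
    raise TerminalRefreshError("invalid_grant")


pool = TinyPool(
    [{"id": "seeded", "source": "device_code"},
     {"id": "manual-key", "source": "manual"}],
    failing_refresh,
)
first = pool.try_refresh_current()
second = pool.try_refresh_current()
```

After both calls, `calls` holds a single attempt and only the manual entry remains, which is the behaviour the integration test asserts against the real pool.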

Test plan

  • uv run pytest tests/agent/test_credential_pool.py -x -q → 49 passed
  • uv run python -c "from hermes_cli.auth import _is_terminal_codex_oauth_refresh_error; print('OK')" → OK

…re not replayed across sessions

When a Codex OAuth refresh token is permanently invalidated (HTTP 400/401/403,
token revoked or reused), _mark_exhausted is called but auth.json is left with
the dead credentials. On the next session, _seed_from_singletons re-reads
auth.json and re-seeds the pool with the same revoked token, triggering the
same terminal failure in a loop.

Add _is_terminal_codex_oauth_refresh_error to auth.py and a matching quarantine
block in _refresh_entry: when a terminal error is detected and auth.json holds
no newer tokens, clear access_token/refresh_token from auth.json and remove all
device_code-sourced pool entries from memory. Mirrors the Nous quarantine added
in c905562 and the xAI quarantine in NousResearch#27898.

Also add a pre-refresh sync from auth.json before calling refresh_codex_oauth_pure,
matching the xAI and Nous patterns, to avoid refresh_token_reused races when
multiple Hermes processes share the same auth.json singleton.
@EloquentBrush0x requested a review from a team on May 18, 2026, 09:53
@alt-glitch added labels on May 18, 2026:
  • type/bug (Something isn't working)
  • comp/agent (Core agent loop, run_agent.py, prompt builder)
  • provider/openai (OpenAI / Codex Responses API)
  • area/auth (Authentication, OAuth, credential pools)
  • P3 (Low, cosmetic, nice to have)
@BoardJames-Bot


Board James triage pass: check-attribution and e2e are green. The only red check is test, and the logs show cancellation late in the suite (~97% progress) rather than a branch-local assertion/error. I opened #27931 for the shared main-line CI drift found while triaging this batch; after that lands, please rebase/rerun checks.

teknium1 pushed a commit that referenced this pull request May 18, 2026
…re not replayed across sessions

When a Codex OAuth refresh token is permanently invalidated (HTTP 400/401/403,
token revoked or reused), _mark_exhausted was called but auth.json was left with
the dead credentials. On the next session, _seed_from_singletons re-read
auth.json and re-seeded the pool with the same revoked token, triggering the
same terminal failure in a loop.

Add _is_terminal_codex_oauth_refresh_error to auth.py and a matching quarantine
block in _refresh_entry: when a terminal error is detected and auth.json holds
no newer tokens, clear access_token/refresh_token from auth.json and remove all
device_code-sourced pool entries from memory. Mirrors the Nous quarantine added
in c905562 and the xAI quarantine in #28116.

Also add a pre-refresh sync from auth.json before calling refresh_codex_oauth_pure,
matching the xAI and Nous patterns, to avoid refresh_token_reused races when
multiple Hermes processes share the same auth.json singleton.

Salvaged from #27911 by @EloquentBrush0x — contributor's branch was severely
stale (would have reverted ~5000 LOC across azure/kanban/i18n subsystems);
fix re-applied surgically on current main with their predicate and tests preserved.
@teknium1

Contributor

Salvaged via #28118 — your branch was unfortunately very stale (would have reverted ~5000 LOC of unrelated subsystems on cherry-pick), so the fix was re-applied surgically on current main. Your predicate and tests are preserved verbatim, and you remain the commit author. Thanks for catching this parity gap.

Lillard01 pushed a commit to Lillard01/hermes-agent that referenced this pull request May 21, 2026
Mucky010 pushed a commit to Mucky010/hermes-agent that referenced this pull request May 24, 2026
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026