Skip to content

fix(xai-oauth): quarantine terminal refresh errors to prevent dead-token replay across sessions#27898

Closed
EloquentBrush0x wants to merge 1 commit into
NousResearch:mainfrom
EloquentBrush0x:fix/xai-oauth-terminal-error-quarantine
Closed

fix(xai-oauth): quarantine terminal refresh errors to prevent dead-token replay across sessions#27898
EloquentBrush0x wants to merge 1 commit into
NousResearch:mainfrom
EloquentBrush0x:fix/xai-oauth-terminal-error-quarantine

Conversation

@EloquentBrush0x

Copy link
Copy Markdown
Contributor

Problem

When refresh_xai_oauth_pure returns HTTP 400/401/403 (revoked token, invalid_grant, or a reused refresh token not caught by the pre-refresh race-recovery sync), _refresh_entry fell through to _mark_exhausted without clearing providers["xai-oauth"]["tokens"] from auth.json.

On the next Hermes startup _seed_from_singletons reads auth.json and re-seeds the pool with the same revoked credentials. The next refresh attempt fails identically, and the cycle repeats on every session restart until the user manually re-authenticates.

Root cause

c90556262 added a full terminal-error quarantine for Nous OAuth (_is_terminal_nous_refresh_error + _quarantine_nous_oauth_state + pool entry removal), but the parallel xAI OAuth error-handler path was not updated at the same time.

Fix

hermes_cli/auth.py

  • Add _is_terminal_xai_oauth_refresh_error(exc) — returns True for xai_refresh_failed (HTTP 400/401/403) and xai_auth_missing_refresh_token with relogin_required=True; transient errors (429, 5xx) carry relogin_required=False and are not matched

agent/credential_pool.py

  • After the existing auth.json race-recovery block in _refresh_entry, add a quarantine path for xAI OAuth:
    • Re-checks that auth.json still holds the same (dead) refresh token — if another process has already rotated it, no action is taken
    • Clears access_token and refresh_token from providers["xai-oauth"]["tokens"]
    • Writes last_auth_error for hermes doctor / hermes auth status diagnostics
    • Removes all loopback_pkce-sourced entries from the in-memory pool
    • Persists the updated pool; returns None

Tests

  • test_is_terminal_xai_oauth_refresh_error — unit-tests the predicate for terminal vs transient vs wrong-provider errors
  • test_xai_oauth_terminal_refresh_clears_auth_json_and_removes_pool_entries — end-to-end: terminal error clears tokens, sets last_auth_error, removes loopback entries, keeps manual entries; second try_refresh_current() does not call refresh_xai_oauth_pure again
  • test_xai_oauth_nonterminal_refresh_does_not_quarantine — 429/5xx error leaves auth.json tokens untouched

49/49 tests pass.

Checklist

  • Mirrors the Nous quarantine pattern from c90556262 exactly
  • Does not affect the existing race-recovery path (another process rotated tokens → adopt and return)
  • Does not quarantine on transient errors (rate limits, server errors)
  • Manual/API-key pool entries survive the quarantine
  • All existing credential pool tests pass

… not replayed across sessions

When refresh_xai_oauth_pure raises a terminal error (HTTP 400/401/403,
i.e. revoked or reused refresh token), _refresh_entry's existing race-
recovery path re-syncs from auth.json and returns if another process has
already rotated the tokens.  If auth.json still holds the same stale
token pair, the function fell through to _mark_exhausted — leaving the
dead credentials in auth.json.  On the next Hermes startup _seed_from_singletons
re-seeded the pool from those stale tokens, causing the same failure loop
on every session.

Fix: after the auth.json re-sync check in the xAI-oauth error handler,
detect terminal errors with the new _is_terminal_xai_oauth_refresh_error
helper and apply a quarantine:
- Clear access_token and refresh_token from providers["xai-oauth"]["tokens"]
  in auth.json so they are not re-seeded.
- Write a last_auth_error entry for hermes doctor / auth status diagnostics.
- Remove all loopback_pkce entries from the in-memory pool so the current
  session stops retrying with the dead credentials.

Mirrors the identical quarantine already in place for Nous OAuth
(c905562).

Closes the parity gap introduced when c905562 added Nous-only terminal
error handling without a corresponding xAI-oauth path.
@alt-glitch alt-glitch added type/bug Something isn't working comp/agent Core agent loop, run_agent.py, prompt builder area/auth Authentication, OAuth, credential pools provider/xai xAI (Grok) P3 Low — cosmetic, nice to have labels May 18, 2026
@BoardJames-Bot

Copy link
Copy Markdown

Board James triage pass: check-attribution and e2e are green. The only red check is test, and the logs show cancellation late in the suite (~95% progress) rather than a branch-local assertion/error. I opened #27931 for the shared main-line CI drift found while triaging this batch; after that lands, please rebase/rerun checks.

@teknium1

Copy link
Copy Markdown
Contributor

Merged via #28116. Cherry-picked onto current main with your authorship preserved. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/auth Authentication, OAuth, credential pools comp/agent Core agent loop, run_agent.py, prompt builder P3 Low — cosmetic, nice to have provider/xai xAI (Grok) type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants