fix(credential-pool): STATUS_DEAD for terminal OAuth failures#34412

Merged
teknium1 merged 1 commit into main from hermes/hermes-5bf34d29
May 29, 2026

Conversation

@teknium1

@teknium1 teknium1 commented May 29, 2026

Contributor

Summary

Add a STATUS_DEAD terminal state to the credential pool for permanently-failed OAuth credentials, and auto-prune dead manual entries after a 24-hour quiet window. Closes #32849.

Background

When OpenAI Codex returns a 401 with token_invalidated or token_revoked, the credential is broken upstream — retrying after a TTL cooldown cannot fix it. The current code marks every 401/429/402 the same way: STATUS_EXHAUSTED with a TTL cooldown (5 min for 401, 1 hour for 429/default). After the TTL elapses the credential re-enters rotation and immediately fails again with the same 401, surfacing on long sessions as Failed to generate context summary and breaking compaction every hour all day.

Reporter observed 7 separate 401 token_invalidated failures from the same revoked Codex OAuth credential between 10:27 and 17:50 UTC on a single day. Manual hermes auth → remove was the only way to stop it.

Fix

New STATUS_DEAD state alongside STATUS_OK and STATUS_EXHAUSTED:

  • What goes DEAD: 401 responses whose error.code / error.reason matches a known terminal OAuth state — token_invalidated, token_revoked, invalid_token, invalid_grant, unauthorized_client, refresh_token_reused. Sourced from OpenAI Codex, Anthropic, xAI, and OAuth 2.0 RFCs 6749/6750/7009.
  • What stays EXHAUSTED: 429 rate limits, 402 billing, 5xx, generic 401 with no specific reason (might be a transient server-side glitch worth retrying after TTL).
  • DEAD entries are excluded from rotation unconditionally — no TTL clears them.
  • DEAD entries only clear via an explicit re-auth write-side sync: _sync_codex_pool_entries and the _sync_*_entry_from_auth_store paths clear last_status to None, so a fresh device-code login or other re-auth automatically resurrects the credential.
  • Auth.json read-side sync extended to fire on DEAD too — if another process already has fresh tokens in auth.json, an in-flight DEAD entry adopts them on next pool selection without needing explicit re-auth.
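
The DEAD/EXHAUSTED split above can be sketched roughly as follows. This is a minimal illustration, not the code from agent/credential_pool.py: the string values of the status constants, the function signatures, and the case-insensitive matching are all assumptions. Only the names STATUS_DEAD, STATUS_EXHAUSTED, _TERMINAL_AUTH_REASONS, and the list of terminal reasons come from this PR.

```python
from typing import Optional

# Status values are hypothetical; the PR only names the constants.
STATUS_OK = "ok"
STATUS_EXHAUSTED = "exhausted"
STATUS_DEAD = "dead"

# Terminal OAuth reasons as listed in this PR description.
_TERMINAL_AUTH_REASONS = frozenset({
    "token_invalidated",
    "token_revoked",
    "invalid_token",
    "invalid_grant",
    "unauthorized_client",
    "refresh_token_reused",
})


def is_terminal_auth_failure(status_code: int, reason: Optional[str]) -> bool:
    """Return True only for a 401 that carries a known terminal OAuth reason.

    Hypothetical signature; the real _is_terminal_auth_failure() may differ.
    """
    if status_code != 401 or not reason:
        return False
    # Normalizing case and whitespace is an assumption of this sketch.
    return reason.strip().lower() in _TERMINAL_AUTH_REASONS


def status_for_failure(status_code: int, reason: Optional[str]) -> str:
    """Map a failed request to the pool status it should produce."""
    if is_terminal_auth_failure(status_code, reason):
        return STATUS_DEAD
    # 429, 402, 5xx, and generic 401s keep the TTL-based cooldown.
    return STATUS_EXHAUSTED
```

Gating on the status code first means a terminal-looking reason string attached to a 429 or 402 still takes the transient path, which matches the "only 401 responses" wording above.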

Auto-prune DEAD manual entries after 24h (the addition you asked about):

  • After 24 hours with no recovery, manual:* DEAD entries are removed from the pool so dead state doesn't accumulate. The user can always re-add via hermes auth add.
  • Singleton-seeded DEAD entries (device_code, loopback_pkce, claude_code) are NOT pruned because _seed_from_singletons would just recreate them on next load with the same stale singleton tokens. Pruning them would be a pointless dance. The DEAD marker stays visible so the audit trail (label, last_error_reason, timestamps) is preserved.
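
As a rough sketch of the prune rule described above: the Entry dataclass, its field names (in particular last_status_at), and the helper names are hypothetical stand-ins for whatever the pool actually stores. Only DEAD_MANUAL_PRUNE_TTL_SECONDS, the 24-hour window, and the manual:* versus singleton-seeded distinction come from this PR.

```python
import time
from dataclasses import dataclass
from typing import List, Optional

STATUS_DEAD = "dead"  # hypothetical value
DEAD_MANUAL_PRUNE_TTL_SECONDS = 24 * 60 * 60  # 24h, per the PR


@dataclass
class Entry:
    """Illustrative pool entry; field names are assumptions."""
    id: str
    source: str  # e.g. "manual:work" or "device_code"
    last_status: Optional[str] = None
    last_status_at: Optional[float] = None  # epoch seconds of the status change


def should_prune(entry: Entry, now: float) -> bool:
    """True for DEAD manual entries that have sat past the quiet window."""
    if entry.last_status != STATUS_DEAD:
        return False
    # Singleton-seeded sources would be re-created on the next load, so skip them.
    if not entry.source.startswith("manual:"):
        return False
    if entry.last_status_at is None:
        return False
    return now - entry.last_status_at >= DEAD_MANUAL_PRUNE_TTL_SECONDS


def prune_dead_manual(entries: List[Entry], now: Optional[float] = None) -> List[Entry]:
    """Return the entries that survive pruning, preserving order."""
    now = time.time() if now is None else now
    return [e for e in entries if not should_prune(e, now)]
```

Passing now explicitly keeps the rule easy to test without patching the clock.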

Why mark-dead-then-prune instead of remove-immediately:

  • Recoverability — within the 24h window, an explicit re-auth (write-side sync) resurrects the entry in-place with the same ID/priority.
  • False-positive safety — if OpenAI hiccups during their own outage and returns a spurious token_invalidated on a valid token (we have no evidence this happens, but the codex backend has been flaky), the user still has a window to recover without re-adding.
  • Audit trail — the user sees last_error_reason: token_invalidated in hermes auth list instead of an unexplained disappearance.
  • refresh_token_reused from inter-process races: the losing process's pool entry transitions back to OK on next selection via the auth.json sync.
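
The recoverability argument can be illustrated with a small sketch of selection and in-place resurrection. Everything here is assumed for illustration: the Entry fields, is_selectable, and apply_reauth are hypothetical names, and the real pool does this inside _available_entries() and the sync paths. The behaviour shown (DEAD never times out, EXHAUSTED returns after its cooldown, re-auth resets last_status to None) is what the description states.

```python
from dataclasses import dataclass
from typing import Optional

STATUS_EXHAUSTED = "exhausted"  # hypothetical value
STATUS_DEAD = "dead"            # hypothetical value


@dataclass
class Entry:
    """Illustrative pool entry; field names are assumptions."""
    id: str
    access_token: str
    last_status: Optional[str] = None
    cooldown_until: float = 0.0  # epoch seconds; only meaningful when EXHAUSTED


def is_selectable(entry: Entry, now: float) -> bool:
    """DEAD is excluded unconditionally; EXHAUSTED returns once its TTL elapses."""
    if entry.last_status == STATUS_DEAD:
        return False
    if entry.last_status == STATUS_EXHAUSTED:
        return now >= entry.cooldown_until
    return True


def apply_reauth(entry: Entry, new_access_token: str) -> None:
    """Write-side sync sketch: fresh tokens clear the status in place."""
    entry.access_token = new_access_token
    entry.last_status = None
    entry.cooldown_until = 0.0
```

Because the entry object is mutated rather than replaced, its ID (and, in the real pool, its priority) survives the round trip, which is the point of marking dead before pruning.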

Changes

  • agent/credential_pool.py:
    • new STATUS_DEAD constant + _TERMINAL_AUTH_REASONS set + DEAD_MANUAL_PRUNE_TTL_SECONDS (24h)
    • new _is_terminal_auth_failure() helper
    • _mark_exhausted() routes terminal reasons to STATUS_DEAD
    • _available_entries() excludes DEAD entries and prunes DEAD manual entries past the 24h TTL
    • the four _sync_*_entry_from_auth_store calls now fire on {EXHAUSTED, DEAD} so a write-side re-auth clears DEAD too
    • mark_exhausted_and_rotate() logs DEAD distinctly at warning level
  • tests/agent/test_credential_pool.py: seven new tests
    • token_invalidated marks DEAD and still rotates to a healthy entry
    • DEAD never re-enters rotation within the 24h window
    • 429 rate limits still use STATUS_EXHAUSTED (transient)
    • 401 without a specific terminal reason still uses STATUS_EXHAUSTED
    • DEAD manual entry pruned after 24h
    • DEAD manual entry kept within 24h (audit trail preserved)
    • DEAD singleton-seeded entry NOT pruned (would just re-seed)

Validation

  • 77/77 pass in tests/agent/test_credential_pool.py (was 70 — 7 new)
  • 338/338 pass across credential_pool + auxiliary_client + codex auth tests

Closes

@github-actions

github-actions Bot commented May 29, 2026

Contributor

🔎 Lint report: hermes/hermes-5bf34d29 vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9421 on HEAD, 9419 on base (🆕 +2)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4890 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

When OpenAI Codex returns 401 token_invalidated or token_revoked, the
credential is broken upstream — retrying after a TTL cooldown cannot
fix it. The existing code treated every 401/429 the same way:
STATUS_EXHAUSTED with a TTL cooldown (5 min for 401, 1 hour for 429).
After the TTL elapsed, the broken credential re-entered rotation and
immediately failed again with the same 401, surfacing as 'Failed to
generate context summary' on every context-compression cycle.

Reporter observed 7 separate 401 token_invalidated failures from the
same revoked credential in a single day; the only workaround was
removing it manually via 'hermes auth'.

Add a STATUS_DEAD terminal state. Only 401 responses whose
error.code/reason matches a known terminal OAuth state (token_invalidated,
token_revoked, invalid_token, invalid_grant, unauthorized_client,
refresh_token_reused) transition to DEAD. Everything else keeps the
existing TTL semantics — 429 rate limits are transient and should
recover.

DEAD entries are excluded from rotation unconditionally. They only
clear when an explicit write-side re-auth sync rewrites the tokens
(the existing _sync_codex_pool_entries / _sync_*_entry_from_auth_store
paths already clear last_status to None). The read-side
auth.json-sync paths also now fire on DEAD so an in-flight pool entry
can adopt fresh tokens written by another process without needing
explicit re-auth.

After 24 hours, DEAD manual entries (source='manual:*') are pruned
from the pool automatically so dead state doesn't accumulate forever.
Singleton-seeded DEAD entries (source='device_code' etc.) are kept
because _seed_from_singletons would recreate them on the next load
with the same stale tokens — pruning would be pointless. The audit
trail stays visible (label, last_error_reason, timestamps).

Closes #32849.
@teknium1 teknium1 force-pushed the hermes/hermes-5bf34d29 branch from 4f3143d to 9b183d5 Compare May 29, 2026 06:27
@alt-glitch alt-glitch added type/bug Something isn't working comp/agent Core agent loop, run_agent.py, prompt builder area/auth Authentication, OAuth, credential pools P2 Medium — degraded but workaround exists labels May 29, 2026
@alt-glitch

Collaborator

Competing with #33904 (same STATUS_DEAD concept, different author/approach). Both fix #32849. This PR is by a maintainer and uses error code matching; #33904 uses error_classifier + three-layer fix.

Comment thread agent/credential_pool.py Dismissed
@teknium1 teknium1 merged commit 86a389f into main May 29, 2026
25 checks passed
@teknium1 teknium1 deleted the hermes/hermes-5bf34d29 branch May 29, 2026 06:45
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
…search#32849) (NousResearch#34412)

dev-xyz-0-0 added a commit to dev-xyz-0-0/hermes-agent that referenced this pull request Jun 7, 2026
…tedly coming back into use and breaking context summarization or model calls. NousResearch/hermes-agent#34412

Labels

area/auth Authentication, OAuth, credential pools comp/agent Core agent loop, run_agent.py, prompt builder P2 Medium — degraded but workaround exists type/bug Something isn't working

Development

Successfully merging this pull request may close these issues.

Credential pool: token_invalidated / token_revoked should be terminal failures, not 1-hour cooldowns

3 participants