fix(credential-pool): STATUS_DEAD for terminal OAuth failures by teknium1 · Pull Request #34412 · NousResearch/hermes-agent

teknium1 · 2026-05-29T06:17:14Z

Summary

Add a STATUS_DEAD terminal state to the credential pool for permanently-failed OAuth credentials, and auto-prune dead manual entries after a 24-hour quiet window. Closes #32849.

Background

When OpenAI Codex returns a 401 with token_invalidated or token_revoked, the credential is broken upstream, so retrying after a TTL cooldown cannot fix it. The current code marks every 401/429/402 the same way: STATUS_EXHAUSTED with a TTL cooldown (5 min for 401, 1 hour for 429/default). After the TTL elapses, the credential re-enters rotation and immediately fails again with the same 401. On long sessions this surfaces as "Failed to generate context summary" and breaks compaction every hour, all day.
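The retry loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in agent/credential_pool.py: the constant and function names are hypothetical, and only the two cooldown durations come from the description.

```python
# Hypothetical names; only the durations (5 min / 1 hour) come from the PR text.
EXHAUSTED_TTL_401_SECONDS = 5 * 60
EXHAUSTED_TTL_DEFAULT_SECONDS = 60 * 60


def cooldown_seconds(status_code: int) -> int:
    """Cooldown applied before an EXHAUSTED credential is retried."""
    if status_code == 401:
        return EXHAUSTED_TTL_401_SECONDS
    return EXHAUSTED_TTL_DEFAULT_SECONDS


def is_back_in_rotation(status_code: int, marked_at: float, now: float) -> bool:
    """True once the cooldown has elapsed, regardless of why the call failed."""
    return now - marked_at >= cooldown_seconds(status_code)
```

Because the check looks only at elapsed time, a revoked token is treated exactly like a rate-limited one and is handed back to callers once the cooldown expires.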

Reporter observed 7 separate 401 token_invalidated failures from the same revoked Codex OAuth credential between 10:27 and 17:50 UTC on a single day. Manual hermes auth → remove was the only way to stop it.

Fix

New STATUS_DEAD state alongside STATUS_OK and STATUS_EXHAUSTED:

- What goes DEAD: 401 responses whose error.code / error.reason matches a known terminal OAuth state: token_invalidated, token_revoked, invalid_token, invalid_grant, unauthorized_client, refresh_token_reused. Sourced from OpenAI Codex, Anthropic, xAI, and OAuth 2.0 RFCs 6749/6750/7009.
- What stays EXHAUSTED: 429 rate limits, 402 billing, 5xx, and generic 401s with no specific reason (these might be a transient server-side glitch worth retrying after the TTL).
- DEAD entries are excluded from rotation unconditionally; no TTL clears them.
- DEAD entries clear only via an explicit re-auth write-side sync: _sync_codex_pool_entries and the _sync_*_entry_from_auth_store paths clear last_status to None, so a fresh device-code login or other re-auth automatically resurrects the credential.
- The auth.json read-side sync is extended to fire on DEAD too: if another process already has fresh tokens in auth.json, an in-flight DEAD entry adopts them on the next pool selection without needing explicit re-auth.
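The DEAD versus EXHAUSTED split above could look something like the sketch below. The names STATUS_DEAD, STATUS_EXHAUSTED, _TERMINAL_AUTH_REASONS and _is_terminal_auth_failure appear in this PR, but the string values, the function signature, and the classify_failure helper are assumptions made for illustration.

```python
from typing import Optional

# String values are assumed; the PR only names the constants.
STATUS_EXHAUSTED = "exhausted"
STATUS_DEAD = "dead"

# Terminal OAuth reasons as listed in the PR description.
_TERMINAL_AUTH_REASONS = frozenset({
    "token_invalidated",
    "token_revoked",
    "invalid_token",
    "invalid_grant",
    "unauthorized_client",
    "refresh_token_reused",
})


def _is_terminal_auth_failure(status_code: int, reason: Optional[str]) -> bool:
    """Assumed signature: True only for a 401 with a known terminal reason."""
    if status_code != 401 or not reason:
        return False
    return reason.strip().lower() in _TERMINAL_AUTH_REASONS


def classify_failure(status_code: int, reason: Optional[str] = None) -> str:
    """Hypothetical helper: choose the status for a failed credential."""
    if _is_terminal_auth_failure(status_code, reason):
        return STATUS_DEAD
    # 429, 402, 5xx and reason-less 401s keep the TTL cooldown semantics.
    return STATUS_EXHAUSTED
```

Gating on both the 401 status and the reason string means a terminal-looking reason attached to a 429 or 5xx still takes the transient path.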

Auto-prune DEAD manual entries after 24h (the addition you asked about):

- After 24 hours with no recovery, manual:* DEAD entries are removed from the pool so dead state doesn't accumulate. The user can always re-add via hermes auth add.
- Singleton-seeded DEAD entries (device_code, loopback_pkce, claude_code) are NOT pruned because _seed_from_singletons would just recreate them on next load with the same stale singleton tokens. Pruning them would be a pointless dance. The DEAD marker stays visible so the audit trail (label, last_error_reason, timestamps) is preserved.
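A simplified sketch of the prune rule follows. DEAD_MANUAL_PRUNE_TTL_SECONDS and the manual: source prefix come from this PR; the Entry fields (notably last_status_at), the string value of STATUS_DEAD, and both function names are assumptions, and the EXHAUSTED cooldown handling of the real _available_entries() is left out.

```python
import time
from dataclasses import dataclass
from typing import List, Optional

STATUS_DEAD = "dead"  # assumed string value
DEAD_MANUAL_PRUNE_TTL_SECONDS = 24 * 60 * 60


@dataclass
class Entry:
    id: str
    source: str                             # e.g. "manual:..." or "device_code"
    last_status: Optional[str] = None
    last_status_at: Optional[float] = None  # assumed timestamp field


def should_prune(entry: Entry, now: float) -> bool:
    """Prune only DEAD manual entries older than the 24h window."""
    if entry.last_status != STATUS_DEAD:
        return False
    if not entry.source.startswith("manual:"):
        return False  # singleton-seeded entries would just be re-seeded
    if entry.last_status_at is None:
        return False
    return now - entry.last_status_at >= DEAD_MANUAL_PRUNE_TTL_SECONDS


def available_entries(entries: List[Entry], now: Optional[float] = None) -> List[Entry]:
    """Drop expired DEAD manual entries in place, then return selectable ones."""
    now = time.time() if now is None else now
    entries[:] = [e for e in entries if not should_prune(e, now)]
    return [e for e in entries if e.last_status != STATUS_DEAD]
```

In this sketch a DEAD entry inside the window stays in the pool list (so it can still be listed and resurrected) but is never returned for selection.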

Why mark-dead-then-prune instead of remove-immediately:

- Recoverability: within the 24h window, an explicit re-auth (write-side sync) resurrects the entry in-place with the same ID/priority.
- False-positive safety: if OpenAI hiccups during their own outage and returns a spurious token_invalidated on a valid token (we have no evidence this happens, but the codex backend has been flaky), the user still has a window to recover without re-adding.
- Audit trail: the user sees last_error_reason: token_invalidated in hermes auth list instead of an unexplained disappearance.
- refresh_token_reused from inter-process races: the losing process's pool entry transitions back to OK on the next selection via the auth.json sync.
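The in-place recovery described above might be sketched as follows. This does not reproduce the real _sync_*_entry_from_auth_store functions: the Entry fields, the shape of the stored token dict, the status string values, and the function name are all assumptions. Only the behavior (fire on EXHAUSTED or DEAD, adopt fresh tokens, clear last_status to None, keep ID and priority) is taken from the description.

```python
from dataclasses import dataclass, replace
from typing import Mapping, Optional

STATUS_EXHAUSTED = "exhausted"  # assumed string values
STATUS_DEAD = "dead"


@dataclass(frozen=True)
class Entry:
    id: str
    priority: int
    access_token: str
    refresh_token: Optional[str] = None
    last_status: Optional[str] = None


def sync_entry_from_auth_store(entry: Entry, stored: Mapping[str, str]) -> Entry:
    """Hypothetical sync: adopt fresher tokens and clear the failure status."""
    if entry.last_status not in {STATUS_EXHAUSTED, STATUS_DEAD}:
        return entry  # healthy entries are left alone
    access = stored.get("access_token")
    refresh = stored.get("refresh_token")
    if not access:
        return entry  # nothing usable in the store
    if access == entry.access_token and refresh == entry.refresh_token:
        return entry  # same stale tokens, so the entry stays DEAD/EXHAUSTED
    # Same id and priority; only the tokens and status change.
    return replace(
        entry,
        access_token=access,
        refresh_token=refresh or entry.refresh_token,
        last_status=None,
    )
```

Comparing tokens before clearing the status is what keeps a DEAD entry dead when the store still holds the revoked credential.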

Changes

agent/credential_pool.py:
- new STATUS_DEAD constant + _TERMINAL_AUTH_REASONS set + DEAD_MANUAL_PRUNE_TTL_SECONDS (24h)
- new _is_terminal_auth_failure() helper
- _mark_exhausted() routes terminal reasons to STATUS_DEAD
- _available_entries() excludes DEAD entries and prunes DEAD manual entries past the 24h TTL
- the four _sync_*_entry_from_auth_store calls now fire on {EXHAUSTED, DEAD} so a write-side re-auth clears DEAD too
- mark_exhausted_and_rotate() logs DEAD distinctly at warning level
tests/agent/test_credential_pool.py: seven new tests
- token_invalidated marks DEAD and still rotates to a healthy entry
- DEAD never re-enters rotation within the 24h window
- 429 rate limits still use STATUS_EXHAUSTED (transient)
- 401 without a specific terminal reason still uses STATUS_EXHAUSTED
- DEAD manual entry pruned after 24h
- DEAD manual entry kept within 24h (audit trail preserved)
- DEAD singleton-seeded entry NOT pruned (would just re-seed)

Validation

77/77 pass in tests/agent/test_credential_pool.py (was 70 — 7 new)
338/338 pass across credential_pool + auxiliary_client + codex auth tests

Closes

Credential pool: token_invalidated / token_revoked should be terminal failures, not 1-hour cooldowns #32849

github-actions · 2026-05-29T06:19:14Z

🔎 Lint report: `hermes/hermes-5bf34d29` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9421 on HEAD, 9419 on base (🆕 +2)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4890 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

When OpenAI Codex returns 401 token_invalidated or token_revoked, the credential is broken upstream; retrying after a TTL cooldown cannot fix it. The existing code treated every 401/429 the same way: STATUS_EXHAUSTED with a TTL cooldown (5 min for 401, 1 hour for 429). After the TTL elapsed, the broken credential re-entered rotation and immediately failed again with the same 401, surfacing as 'Failed to generate context summary' on every context-compression cycle. Reporter observed 7 separate 401 token_invalidated failures from the same revoked credential in a single day; the only workaround was removing it manually via 'hermes auth'.

Add a STATUS_DEAD terminal state. Only 401 responses whose error.code/reason matches a known terminal OAuth state (token_invalidated, token_revoked, invalid_token, invalid_grant, unauthorized_client, refresh_token_reused) transition to DEAD. Everything else keeps the existing TTL semantics: 429 rate limits are transient and should recover.

DEAD entries are excluded from rotation unconditionally. They only clear when an explicit write-side re-auth sync rewrites the tokens (the existing _sync_codex_pool_entries / _sync_*_entry_from_auth_store paths already clear last_status to None). The read-side auth.json-sync paths also now fire on DEAD so an in-flight pool entry can adopt fresh tokens written by another process without needing explicit re-auth.

After 24 hours, DEAD manual entries (source='manual:*') are pruned from the pool automatically so dead state doesn't accumulate forever. Singleton-seeded DEAD entries (source='device_code' etc.) are kept because _seed_from_singletons would recreate them on the next load with the same stale tokens, so pruning would be pointless. The audit trail stays visible (label, last_error_reason, timestamps).

Closes #32849.

alt-glitch · 2026-05-29T06:28:35Z

Competing with #33904 (same STATUS_DEAD concept, different author/approach). Both fix #32849. This PR is by a maintainer and uses error code matching; #33904 uses error_classifier + three-layer fix.

…tedly coming back into use and breaking context summarization or model calls. NousResearch/hermes-agent#34412

teknium1 force-pushed the hermes/hermes-5bf34d29 branch from 4f3143d to 9b183d5 Compare May 29, 2026 06:27

alt-glitch added the type/bug (Something isn't working), comp/agent (Core agent loop, run_agent.py, prompt builder), area/auth (Authentication, OAuth, credential pools), and P2 (Medium: degraded but workaround exists) labels May 29, 2026

github-advanced-security AI found potential problems May 29, 2026

Comment thread agent/credential_pool.py Dismissed

teknium1 merged commit 86a389f into main May 29, 2026
25 checks passed

teknium1 deleted the hermes/hermes-5bf34d29 branch May 29, 2026 06:45

BrewTestBot mentioned this pull request Jun 6, 2026

hermes-agent 2026.6.5 Homebrew/homebrew-core#286569

Merged

1 task

github-actions Bot mentioned this pull request Jun 6, 2026

chore: bump NousResearch/hermes-agent version from v2026.5.29.2 to v2026.6.5 Docker-Hub-sirmark/docker-hermes-agent#9

Merged
