fix(credential-pool): STATUS_DEAD for terminal OAuth failures #34412
Merged
Conversation
Contributor
When OpenAI Codex returns 401 token_invalidated or token_revoked, the credential is broken upstream; retrying after a TTL cooldown cannot fix it. The existing code treated every 401/429 the same way: STATUS_EXHAUSTED with a TTL cooldown (5 min for 401, 1 hour for 429). After the TTL elapsed, the broken credential re-entered rotation and immediately failed again with the same 401, surfacing as 'Failed to generate context summary' on every context-compression cycle. Reporter observed 7 separate 401 token_invalidated failures from the same revoked credential in a single day; the only workaround was removing it manually via 'hermes auth'.

Add a STATUS_DEAD terminal state. Only 401 responses whose error.code/reason matches a known terminal OAuth state (token_invalidated, token_revoked, invalid_token, invalid_grant, unauthorized_client, refresh_token_reused) transition to DEAD. Everything else keeps the existing TTL semantics: 429 rate limits are transient and should recover.

DEAD entries are excluded from rotation unconditionally. They only clear when an explicit write-side re-auth sync rewrites the tokens (the existing _sync_codex_pool_entries / _sync_*_entry_from_auth_store paths already clear last_status to None). The read-side auth.json-sync paths also now fire on DEAD so an in-flight pool entry can adopt fresh tokens written by another process without needing explicit re-auth.

After 24 hours, DEAD manual entries (source='manual:*') are pruned from the pool automatically so dead state doesn't accumulate forever. Singleton-seeded DEAD entries (source='device_code' etc.) are kept because _seed_from_singletons would recreate them on the next load with the same stale tokens, so pruning would be pointless. The audit trail stays visible (label, last_error_reason, timestamps).

Closes #32849.
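The classification rule in the commit message can be illustrated with a minimal sketch. The constant names `STATUS_EXHAUSTED`, `STATUS_DEAD`, and `_TERMINAL_AUTH_REASONS` come from the PR; the string values, the function names `is_terminal_auth_failure` and `classify_failure`, and their signatures are assumptions for illustration and are not taken from `agent/credential_pool.py`.

```python
from typing import Optional

# Status values are assumed; the PR names the constants but not their values.
STATUS_EXHAUSTED = "exhausted"
STATUS_DEAD = "dead"

# Terminal OAuth reasons as listed in the commit message.
_TERMINAL_AUTH_REASONS = frozenset({
    "token_invalidated",
    "token_revoked",
    "invalid_token",
    "invalid_grant",
    "unauthorized_client",
    "refresh_token_reused",
})


def is_terminal_auth_failure(status_code: int, reason: Optional[str]) -> bool:
    """Hypothetical helper: True only for a 401 with a known terminal reason."""
    if status_code != 401 or not reason:
        return False
    return reason.strip().lower() in _TERMINAL_AUTH_REASONS


def classify_failure(status_code: int, reason: Optional[str]) -> str:
    """Hypothetical router: terminal 401s become DEAD, everything else keeps TTL semantics."""
    if is_terminal_auth_failure(status_code, reason):
        return STATUS_DEAD
    return STATUS_EXHAUSTED
```

Under these assumptions, a 429 carrying a terminal-looking reason still stays transient, because the status code is checked before the reason.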
4f3143d to
9b183d5
Collaborator
KKT-OPT
pushed a commit
to KKT-OPT/hermes-agent
that referenced
this pull request
May 31, 2026
1 task
dev-xyz-0-0
added a commit
to dev-xyz-0-0/hermes-agent
that referenced
this pull request
Jun 7, 2026
…tedly coming back into use and breaking context summarization or model calls. NousResearch/hermes-agent#34412
Summary
Add a `STATUS_DEAD` terminal state to the credential pool for permanently-failed OAuth credentials, and auto-prune dead manual entries after a 24-hour quiet window. Closes #32849.

Background
When OpenAI Codex returns a 401 with `token_invalidated` or `token_revoked`, the credential is broken upstream; retrying after a TTL cooldown cannot fix it. The current code marks every 401/429/402 the same way: `STATUS_EXHAUSTED` with a TTL cooldown (5 min for 401, 1 hour for 429/default). After the TTL elapses the credential re-enters rotation and immediately fails again with the same 401, surfacing on long sessions as `Failed to generate context summary` and breaking compaction every hour all day.

Reporter observed 7 separate `401 token_invalidated` failures from the same revoked Codex OAuth credential between 10:27 and 17:50 UTC on a single day. Manual `hermes auth` → remove was the only way to stop it.

Fix
New `STATUS_DEAD` state alongside `STATUS_OK` and `STATUS_EXHAUSTED`:

- Only a 401 whose `error.code`/`error.reason` matches a known terminal OAuth state transitions to DEAD: `token_invalidated`, `token_revoked`, `invalid_token`, `invalid_grant`, `unauthorized_client`, `refresh_token_reused`. Sourced from OpenAI Codex, Anthropic, xAI, and OAuth 2.0 RFCs 6749/6750/7009.
- The existing `_sync_codex_pool_entries` and the `_sync_*_entry_from_auth_store` paths clear `last_status` to None, so a fresh device-code login or other re-auth automatically resurrects the credential.
- When another process writes fresh tokens to `auth.json`, an in-flight DEAD entry adopts them on next pool selection without needing explicit re-auth.

Auto-prune DEAD manual entries after 24h (the addition you asked about):

- After 24 hours, `manual:*` DEAD entries are removed from the pool so dead state doesn't accumulate. The user can always re-add via `hermes auth add`.
- Singleton-seeded entries (`device_code`, `loopback_pkce`, `claude_code`) are NOT pruned because `_seed_from_singletons` would just recreate them on next load with the same stale singleton tokens. Pruning them would be a pointless dance. The DEAD marker stays visible so the audit trail (label, last_error_reason, timestamps) is preserved.

Why mark-dead-then-prune instead of remove-immediately:

- If the backend ever returns `token_invalidated` on a valid token (we have no evidence this happens, but the codex backend has been flaky), the user still has a window to recover without re-adding.
- The user sees `last_error_reason: token_invalidated` in `hermes auth list` instead of an unexplained disappearance.
- `refresh_token_reused` can come from inter-process races; the losing process's pool entry transitions back to OK on next selection via the `auth.json` sync.

Changes
`agent/credential_pool.py`:

- `STATUS_DEAD` constant + `_TERMINAL_AUTH_REASONS` set + `DEAD_MANUAL_PRUNE_TTL_SECONDS` (24h)
- `_is_terminal_auth_failure()` helper
- `_mark_exhausted()` routes terminal reasons to `STATUS_DEAD`
- `_available_entries()` excludes DEAD entries and prunes DEAD manual entries past the 24h TTL
- `_sync_*_entry_from_auth_store` calls now fire on `{EXHAUSTED, DEAD}` so a write-side re-auth clears DEAD too
- `mark_exhausted_and_rotate()` logs DEAD distinctly at warning level

`tests/agent/test_credential_pool.py`: seven new tests

- `token_invalidated` marks DEAD and still rotates to a healthy entry
- … `STATUS_EXHAUSTED` (transient)
- … `STATUS_EXHAUSTED`

Validation

- … `tests/agent/test_credential_pool.py` (was 70; 7 new)

Closes
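The exclusion and pruning rules described for `_available_entries()` can be sketched roughly as below. This is an illustration under stated assumptions, not the code in `agent/credential_pool.py`: the `Entry` dataclass, its field names other than `last_status`, the status string values, and the `available_entries` signature are all hypothetical. Only `STATUS_DEAD`, `STATUS_EXHAUSTED`, `DEAD_MANUAL_PRUNE_TTL_SECONDS`, and the `manual:*` source convention come from the PR.

```python
import time
from dataclasses import dataclass
from typing import List, Optional

# Status values are assumed; the PR names the constants but not their values.
STATUS_EXHAUSTED = "exhausted"
STATUS_DEAD = "dead"
DEAD_MANUAL_PRUNE_TTL_SECONDS = 24 * 60 * 60  # 24h, as stated in the PR


@dataclass
class Entry:
    """Hypothetical pool entry; the real entry type is not shown in the PR text."""
    label: str
    source: str                      # e.g. "manual:..." or "device_code"
    last_status: Optional[str] = None
    last_status_at: float = 0.0      # when last_status was set (epoch seconds)
    cooldown_seconds: float = 0.0    # TTL for EXHAUSTED entries


def available_entries(pool: List[Entry], now: Optional[float] = None) -> List[Entry]:
    """Return usable entries and prune old DEAD manual entries from `pool` in place."""
    now = time.time() if now is None else now
    kept: List[Entry] = []
    usable: List[Entry] = []
    for entry in pool:
        age = now - entry.last_status_at
        if entry.last_status == STATUS_DEAD:
            is_manual = entry.source.startswith("manual:")
            if is_manual and age >= DEAD_MANUAL_PRUNE_TTL_SECONDS:
                continue  # pruned: old DEAD manual entry leaves the pool
            kept.append(entry)  # kept for the audit trail, never selected
            continue
        kept.append(entry)
        if entry.last_status == STATUS_EXHAUSTED and age < entry.cooldown_seconds:
            continue  # still cooling down
        usable.append(entry)
    pool[:] = kept
    return usable
```

In this sketch a DEAD singleton-seeded entry is never pruned and never selected, which mirrors the reasoning above that re-seeding would only recreate it with the same stale tokens.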