feat(credential-pool): mark revoked/invalidated credentials as permanently dead #33904

Open
zccyman wants to merge 1 commit into NousResearch:main from atyou2happy:fix/32849-credential-pool-terminal-failures

zccyman commented May 28, 2026

Contributor

Summary

Fixes #32849: the credential pool treats token_invalidated/token_revoked errors as temporary exhaustion (a 5-minute cooldown, after which the credential re-enters rotation) instead of permanently removing the credential.

Problem

The credential pool has only two states: STATUS_OK and STATUS_EXHAUSTED. All 401 errors get a 5-minute cooldown, after which the credential re-enters rotation. For permanently invalid credentials (revoked keys, deactivated accounts), this creates an infinite retry cycle:

  1. 401 token_revoked → marked exhausted (TTL=5min)
  2. 5 minutes later → cooldown expires → credential back in rotation
  3. GOTO 1
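
The cycle above can be reproduced with a small two-state model. This is only a sketch under stated assumptions: `Entry`, `mark_exhausted`, `available_entries` and `COOLDOWN_SECONDS` are hypothetical stand-ins, not the actual code in credential_pool.py, and timestamps are passed in explicitly so the example does not need to sleep.

```python
from dataclasses import dataclass
from typing import List, Optional

STATUS_OK = "ok"
STATUS_EXHAUSTED = "exhausted"
COOLDOWN_SECONDS = 5 * 60  # the 5-minute cooldown described above


@dataclass
class Entry:
    """Hypothetical pool entry; the real class may look different."""
    name: str
    status: str = STATUS_OK
    reset_at: Optional[float] = None  # time at which the cooldown ends


def mark_exhausted(entry: Entry, now: float) -> None:
    # Pre-fix behaviour: every 401 gets the same cooldown, revoked or not.
    entry.status = STATUS_EXHAUSTED
    entry.reset_at = now + COOLDOWN_SECONDS


def available_entries(entries: List[Entry], now: float) -> List[Entry]:
    usable = []
    for e in entries:
        # An expired cooldown puts the entry back into rotation.
        if e.status == STATUS_EXHAUSTED and e.reset_at is not None and now >= e.reset_at:
            e.status, e.reset_at = STATUS_OK, None
        if e.status == STATUS_OK:
            usable.append(e)
    return usable


revoked = Entry("revoked-key")
pool = [revoked]
mark_exhausted(revoked, now=0.0)                      # step 1: 401 token_revoked
during = available_entries(pool, now=60.0)            # still cooling down
after = available_entries(pool, now=COOLDOWN_SECONDS) # step 2: back in rotation
```

With only these two states, nothing distinguishes a revoked key from a rate-limited one, so the revoked key is offered again as soon as the cooldown ends.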

Solution

Three-layer defense:

| Layer | File | Change |
| --- | --- | --- |
| Classification | error_classifier.py | 401 messages containing token_revoked, invalid_api_key, key has been deactivated, etc. → FailoverReason.auth_permanent (instead of auth) |
| State | credential_pool.py | New STATUS_DEAD = "dead" terminal state. _mark_exhausted() uses STATUS_DEAD for permanent failures (no reset_at). _available_entries() permanently skips DEAD entries. reset_statuses() can still revive (manual user action). |
| Recovery | agent_runtime_helpers.py | auth_permanent path skips the OAuth refresh attempt entirely and goes straight to mark-as-dead + rotate. |
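
As an illustration of the classification layer, a substring-based check might look like the sketch below. The marker list contains only the strings named in this description (the real list is said to be longer), and the enum values, `is_permanent_auth_failure` and `classify_401` are assumptions made for the example rather than the code in error_classifier.py.

```python
from enum import Enum


class FailoverReason(Enum):
    # Only the two members relevant here; the values are assumed.
    auth = "auth"
    auth_permanent = "auth_permanent"


# Markers taken from this PR description; the actual list may contain more.
PERMANENT_AUTH_MARKERS = (
    "token_revoked",
    "token_invalidated",
    "invalid_api_key",
    "key has been deactivated",
)


def is_permanent_auth_failure(message: str) -> bool:
    """Return True if a 401 message indicates the key will never work again."""
    lowered = message.lower()
    return any(marker in lowered for marker in PERMANENT_AUTH_MARKERS)


def classify_401(message: str) -> FailoverReason:
    """Hypothetical helper: split 401s into permanent and transient failures."""
    if is_permanent_auth_failure(message):
        return FailoverReason.auth_permanent
    return FailoverReason.auth


permanent = classify_401('401: {"error": {"code": "token_revoked"}}')
transient = classify_401("401 Unauthorized: access token expired")
```

Matching is case-insensitive in this sketch so that variants such as "Key has been deactivated" are caught; whether the real classifier normalizes case is not stated here.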

Testing

  • 33 new tests in test_credential_pool_dead_state.py
  • 150 existing test_error_classifier.py tests all pass (zero regression)

Test coverage

  • _is_permanent_auth_failure(): pattern matching for 15 message variants
  • _mark_exhausted(): permanent → DEAD, transient → EXHAUSTED, 429 → EXHAUSTED
  • _available_entries(): DEAD entries never returned, never resurrect after time
  • mark_exhausted_and_rotate(): integration test with rotation
  • error_classifier: token_revoked → auth_permanent, transient 401 stays auth
  • reset_statuses(): manual reset can revive DEAD entries
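
The lifecycle these tests describe can be sketched with a toy pool. The class and method names below mirror the ones listed above but are simplified, hypothetical versions (public methods, explicit `now` arguments, a `permanent` flag), not the implementation under test.

```python
from dataclasses import dataclass
from typing import List, Optional

STATUS_OK = "ok"
STATUS_EXHAUSTED = "exhausted"
STATUS_DEAD = "dead"
COOLDOWN_SECONDS = 5 * 60


@dataclass
class Entry:
    name: str
    status: str = STATUS_OK
    reset_at: Optional[float] = None


class Pool:
    """Toy credential pool with a terminal DEAD state (illustrative only)."""

    def __init__(self, entries: List[Entry]) -> None:
        self.entries = list(entries)

    def mark_exhausted(self, entry: Entry, now: float, permanent: bool = False) -> None:
        if permanent:
            # Terminal: no reset_at, so no cooldown can ever expire.
            entry.status, entry.reset_at = STATUS_DEAD, None
        else:
            entry.status, entry.reset_at = STATUS_EXHAUSTED, now + COOLDOWN_SECONDS

    def available_entries(self, now: float) -> List[Entry]:
        usable = []
        for e in self.entries:
            if e.status == STATUS_DEAD:
                continue  # never returned, regardless of elapsed time
            if e.status == STATUS_EXHAUSTED and e.reset_at is not None and now >= e.reset_at:
                e.status, e.reset_at = STATUS_OK, None
            if e.status == STATUS_OK:
                usable.append(e)
        return usable

    def reset_statuses(self) -> None:
        # Manual user action: revives every entry, including DEAD ones.
        for e in self.entries:
            e.status, e.reset_at = STATUS_OK, None


revoked, limited = Entry("revoked"), Entry("rate-limited")
pool = Pool([revoked, limited])
pool.mark_exhausted(revoked, now=0.0, permanent=True)  # e.g. 401 token_revoked
pool.mark_exhausted(limited, now=0.0)                  # e.g. 429
much_later = 10 * COOLDOWN_SECONDS
names_later = [e.name for e in pool.available_entries(much_later)]
pool.reset_statuses()
names_after_reset = [e.name for e in pool.available_entries(much_later)]
```

Long after the cooldown window, only the rate-limited entry comes back; the revoked one returns only after the explicit reset.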

Checklist

  • Code follows project style (ruff check passes)
  • Tests added and passing (33/33)
  • No regression in existing tests (150/150 error_classifier tests pass)
  • Minimal diff — 108 lines added, 0 removed across 3 source files

…ently dead

Credential pool previously treated all 401 errors as transient
exhausted (5min cooldown → re-enter rotation). token_invalidated
and token_revoked errors are permanent — the key will never work
again — but the pool kept cycling them every 5 minutes.

Changes:
- credential_pool.py: Add STATUS_DEAD terminal state. Permanent
  auth failures (revoked, invalidated, deactivated keys) get marked
  as DEAD with no cooldown. _available_entries() permanently skips
  DEAD entries. User can still revive via reset_statuses().
- error_classifier.py: 401 messages containing token_revoked,
  invalid_api_key, etc. now classify as auth_permanent instead of
  auth, giving downstream code the signal to skip refresh attempts.
- agent_runtime_helpers.py: auth_permanent path skips the OAuth
  refresh attempt entirely and goes straight to mark-as-dead +
  rotate.
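
The recovery change can be pictured as an early branch ahead of the refresh attempt. The sketch below is an assumption about the control flow, not the code in agent_runtime_helpers.py: `handle_auth_failure` and its callback parameters are hypothetical names introduced for the example.

```python
from typing import Callable, List


def handle_auth_failure(
    reason: str,
    try_refresh: Callable[[], bool],
    mark_dead_and_rotate: Callable[[], None],
    mark_exhausted_and_rotate: Callable[[], None],
) -> str:
    """Hypothetical dispatcher for a 401, keyed on the classifier's reason."""
    if reason == "auth_permanent":
        # A revoked key cannot be refreshed, so skip the OAuth attempt.
        mark_dead_and_rotate()
        return "rotated_dead"
    if try_refresh():
        return "refreshed"
    # Transient failure and refresh did not help: cool down and rotate.
    mark_exhausted_and_rotate()
    return "rotated_exhausted"


calls: List[str] = []


def refresh_that_fails() -> bool:
    calls.append("refresh")
    return False


outcome = handle_auth_failure(
    "auth_permanent",
    refresh_that_fails,
    lambda: calls.append("dead"),
    lambda: calls.append("exhausted"),
)
```

In this run the refresh callback is never invoked, which is the behaviour the change list describes for the auth_permanent path.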

Tests: 33 new tests covering pattern detection, dead state lifecycle,
exhaustion fallback, error classifier integration, and manual reset.

Fixes: NousResearch#32849
@alt-glitch added labels on May 28, 2026: type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/agent (Core agent loop, run_agent.py, prompt builder), area/auth (Authentication, OAuth, credential pools)
Development

Successfully merging this pull request may close these issues.

Credential pool: token_invalidated / token_revoked should be terminal failures, not 1-hour cooldowns
