fix(pool): rotate pooled credentials immediately on usage_limit or auth failure #10282

Closed
redf0x1 wants to merge 1 commit into NousResearch:main from redf0x1:fix/credential-pool-rotation-clean

Conversation

@redf0x1

@redf0x1 redf0x1 commented Apr 15, 2026

Problem

When a pooled credential (for example openai-codex) becomes exhausted (usage_limit_reached or auth refresh failure), Hermes could keep retrying the same broken credential instead of rotating immediately to another usable pooled entry.

Symptoms:

  • repeated retries on an already exhausted credential
  • no immediate failover to another pooled account
  • user-visible usage limit errors with no forward progress

Root cause

  • CredentialPool._refresh_entry(force=True) could still hand back the current credential instead of definitively failing refresh
  • _recover_with_credential_pool() only rotated when refresh returned None
  • the caller could burn retry budget before moving to another usable pooled credential
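
The first two bullets can be illustrated with a minimal sketch. It is not the Hermes code: `force_refresh_lenient`, `force_refresh_strict`, `should_rotate`, and the `fetch` callback are hypothetical names, and the only behavior assumed is what the bullets state (rotation happens only when refresh returns None).

```python
from typing import Callable, Optional

Token = str


def force_refresh_lenient(current: Token,
                          fetch: Callable[[Token], Optional[Token]]) -> Optional[Token]:
    """Hypothetical pre-fix behavior: a failed refresh hands back the old token."""
    refreshed = fetch(current)
    return refreshed if refreshed is not None else current


def force_refresh_strict(current: Token,
                         fetch: Callable[[Token], Optional[Token]]) -> Optional[Token]:
    """Hypothetical post-fix behavior: a failed refresh is reported as None."""
    return fetch(current)


def should_rotate(refresh_result: Optional[Token]) -> bool:
    """The caller rotates only when refresh definitively failed."""
    return refresh_result is None
```

With a `fetch` that always fails, the lenient variant never triggers rotation because it returns the stale token, while the strict variant returns None and lets the caller move on.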

Fix

Runtime changes

  1. agent/credential_pool.py
    • force-refresh failure now falls through to exhausted state instead of reusing the same credential
  2. run_agent.py
    • when refresh fails, call mark_exhausted_and_rotate() immediately
    • rotate to the next pooled credential without a short retry on the same exhausted credential
    • fail fast with reset hint when the pool has no usable entry left
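
A rough sketch of the recovery flow described above, under stated assumptions: `Entry`, `SketchPool`, `recover`, and the `refresh` callback are illustrative names, and this `mark_exhausted_and_rotate` is a toy stand-in for the method of the same name in the PR, not its actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Entry:
    """Hypothetical pooled credential; not the Hermes data model."""
    name: str
    exhausted: bool = False


@dataclass
class SketchPool:
    """Toy pool that illustrates rotate-on-failure."""
    entries: List[Entry]
    index: int = 0

    def current(self) -> Optional[Entry]:
        entry = self.entries[self.index]
        return None if entry.exhausted else entry

    def mark_exhausted_and_rotate(self) -> Optional[Entry]:
        # Mark the active entry unusable, then select the next usable one.
        self.entries[self.index].exhausted = True
        for offset in range(1, len(self.entries) + 1):
            candidate = (self.index + offset) % len(self.entries)
            if not self.entries[candidate].exhausted:
                self.index = candidate
                return self.entries[candidate]
        return None  # every entry is exhausted


def recover(pool: SketchPool, refresh: Callable[[Entry], bool]) -> Entry:
    """Refresh the active credential; rotate at once if refresh fails."""
    entry = pool.current()
    if entry is not None and refresh(entry):
        return entry
    replacement = pool.mark_exhausted_and_rotate()
    if replacement is None:
        # Fail fast instead of spending retry budget on known-bad credentials.
        raise RuntimeError(
            "no usable pooled credential; wait for the usage limit to reset"
        )
    return replacement
```

In this sketch a failed refresh never returns the same entry to the caller: it either yields a different usable entry or raises.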

Scope

Included:

  • credential-pool rotation on exhaustion
  • immediate failover instead of retrying a known-bad credential
  • exhaustion state persistence

Excluded:

Testing

python -m pytest tests/agent/test_credential_pool.py -q
# 31 passed

python -m pytest tests/run_agent/test_run_agent.py::TestCredentialPoolRecovery -q
# 10 passed

Related

Copilot AI review requested due to automatic review settings April 15, 2026 11:49

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

…th failure

Core credential pool failover fix:
- prevent infinite retry loop when pooled credential becomes exhausted (usage_limit, rate_limit)
- rotate to next available credential immediately on usage_limit or auth failure
- persist reset window timestamps to prevent retry scans before reset completes
- mark exhausted credentials properly so pool selection respects state

Implementation:
- credential_pool: fallback token validation (when Codex hardening module unavailable)
- run_agent: immediate rotation via mark_exhausted_and_rotate when refresh fails
- failure modes: fail-fast with reset hint when no pooled credentials available

Test coverage:
- credential pool rotation and exhaustion
- force refresh on auth failure returns None (enabling pool rotation)
- selected entry usage tracking

This fix addresses the core failover incident. Codex auth status normalization
is a separate concern and will be in a follow-up PR.

Scope: core failover logic only (Codex auth hardening cut for PR NousResearch#3)
@redf0x1 redf0x1 force-pushed the fix/credential-pool-rotation-clean branch from 0252c54 to a158ddb Compare April 15, 2026 13:48
@alt-glitch alt-glitch added labels type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/agent (Core agent loop, run_agent.py, prompt builder), area/auth (Authentication, OAuth, credential pools) Apr 26, 2026
@alt-glitch

Collaborator

Likely superseded by merged #15120 (consolidated credential-pool fix: rotation, cross-process sync, least_used counter, 403 triggering rotation). Please verify whether the specific exhaustion-state persistence in this PR is covered by #15120.

@teknium1

Contributor

Thanks for the contribution @redf0x1 — this automated hermes-sweeper review found that the changes in this PR are fully covered by the already-merged #15120 ("fix(credential-pool): correctness + rotation + cross-process sync", merged 2026-04-24, commit 785d168d5).

Specifically, main now has:

  • mark_exhausted_and_rotate() in agent/credential_pool.py:857 — immediate exhaustion + rotation on a bad credential
  • _recover_with_credential_pool() in run_agent.py:5678/5693/5713 calls mark_exhausted_and_rotate() directly for billing, rate-limit, and auth-refresh-failed paths — no short-retry on a known-bad credential
  • _refresh_entry(force=True) falls through to _mark_exhausted() on failure (agent/credential_pool.py:577, 716) — the specific force-refresh regression this PR identified
  • Full exhaustion-state persistence via STATUS_EXHAUSTED, _mark_exhausted(), _exhausted_ttl(), and _exhausted_until() with TTL-based cooldown
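
For readers unfamiliar with TTL-based cooldown, the idea in the last bullet can be sketched as follows. This is an illustration only: `EntryState`, `mark_exhausted`, `is_usable`, the status string values, and the one-hour default are assumptions, not the code at the cited locations on main.

```python
from dataclasses import dataclass
from typing import Optional

STATUS_OK = "ok"                # assumed value
STATUS_EXHAUSTED = "exhausted"  # assumed value
ASSUMED_TTL_SECONDS = 3600.0    # assumed default; the real TTL is not stated here


@dataclass
class EntryState:
    """Hypothetical persisted state for one pooled credential."""
    status: str = STATUS_OK
    exhausted_until: Optional[float] = None


def mark_exhausted(state: EntryState, now: float,
                   reset_at: Optional[float] = None,
                   ttl: float = ASSUMED_TTL_SECONDS) -> None:
    """Record exhaustion, preferring a provider-supplied reset time if known."""
    state.status = STATUS_EXHAUSTED
    state.exhausted_until = reset_at if reset_at is not None else now + ttl


def is_usable(state: EntryState, now: float) -> bool:
    """Skip an exhausted entry until its cooldown has elapsed."""
    if state.status != STATUS_EXHAUSTED:
        return True
    if state.exhausted_until is not None and now >= state.exhausted_until:
        # Cooldown over: clear the exhausted marker so selection can reuse it.
        state.status = STATUS_OK
        state.exhausted_until = None
        return True
    return False
```

Passing `now` explicitly keeps the sketch deterministic; a caller would typically supply `time.time()`.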

@alt-glitch's 2026-04-26 comment correctly identified the supersession. Closing as implemented on main.

@teknium1 teknium1 closed this Apr 27, 2026

Labels

area/auth Authentication, OAuth, credential pools comp/agent Core agent loop, run_agent.py, prompt builder P2 Medium — degraded but workaround exists type/bug Something isn't working

4 participants