fix(credential-pool): pool rotation should not count toward api_max_retries (#16830) by briandevans · Pull Request #16902 · NousResearch/hermes-agent

briandevans · 2026-04-28T08:18:07Z

Summary

429 on a credential with multi-key pool now rotates immediately instead of first burning a retry slot retrying the same key.
Single-credential pools keep the original retry-same-then-rotate behavior so a transient 429 on the only available key isn't doubly punished.
Pool walking is transparent to api_max_retries — only once every credential is exhausted does retry_count start incrementing.

The bug

Fixes #16830. With a 10-key credential pool and the default api_max_retries: 3, only 2-3 keys ever get tried before the agent gives up. The remaining 7+ keys go unused even though they would have served the request.

Tracing _recover_with_credential_pool (in run_agent.py):

Iter 1: 429 on key #1 → returns (False, True)              → retry_count=1
Iter 2: 429 on key #1 → rotates to key #2, returns (True,…) → retry_count=1
Iter 3: 429 on key #2 → returns (False, True)              → retry_count=2
Iter 4: 429 on key #2 → rotates to key #3, returns (True,…) → retry_count=2
Iter 5: 429 on key #3 → returns (False, True)              → retry_count=3 → exit

Each credential consumed at least one retry slot before being exchanged, because the first 429 returned (False, True) and the outer loop did retry_count += 1 (line 11315) before the second attempt that actually rotates.

The fix

agent/credential_pool.py — new has_unexhausted_alternates(): a read-only query (no _persist, no refresh) that reports whether the pool has at least one entry besides the current one that is either unexhausted or whose cooldown has elapsed.

run_agent.py — rate-limit branch of _recover_with_credential_pool:

has_alternates = False
try:
    has_alternates = bool(pool.has_unexhausted_alternates())
except AttributeError:
    has_alternates = False  # older / hand-rolled mock pools
if not has_retried_429 and not has_alternates:
    return False, True  # single credential — keep original behavior
next_entry = pool.mark_exhausted_and_rotate(...)
...

Multi-credential pool: first 429 → rotate immediately (no retry_count cost, fresh key likely has fresh quota).
Single-credential pool: first 429 → retry once before counting, preserving the safety net for transient blips on a lone key.

The AttributeError guard keeps existing tests with hand-rolled fake pools working without modification.

Test plan

tests/agent/test_credential_pool_routing.py — TestPoolRotationCycle updated:
- test_first_429_with_alternates_rotates_immediately (NEW): multi-credential → rotation on first 429
- test_first_429_no_alternates_retries_same_credential (NEW): single-credential → original behavior
- test_full_pool_walk_does_not_burn_retry_slots (NEW): 10-credential pool walks ~all keys before recovery returns False
tests/run_agent/test_run_agent.py::TestCredentialPoolRecovery::test_recover_with_pool_rotates_immediately_when_alternates_available (NEW): direct repro of the issue scenario.
tests/agent/test_credential_pool.py — four new direct unit tests for has_unexhausted_alternates: two ok entries, single entry, other-in-cooldown, other-cooldown-elapsed.
Regression guard: reverted run_agent.py + agent/credential_pool.py, the 3 new behavior tests failed with recovered=False where they now expect True. Restored fix → all 60 tests in tests/agent/test_credential_pool* and TestCredentialPoolRecovery pass.
CI's pre-existing baseline failures in test_anthropic_adapter.py, test_bedrock_adapter.py, test_copilot_acp_client.py reproduce on clean origin/main (8269f90) and are unrelated to this change.

Failure set is the same one observable on every other PR opened today against 8081425a1 (e.g. #16851, #16786, #16770, the now-merged #16868) — test_anthropic_adapter::test_custom_base_url, test_copilot_acp_client::test_read_text_file_redacts_sensitive_content, test_config_env_expansion::*, test_session_split_brain_11016::* (timeouts), test_make_agent_provider::test_make_agent_passes_resolved_provider, test_protocol::test_session_resume_returns_hydrated_messages, etc.

Locally the full pool/retry suite is green:

tests/agent/test_credential_pool.py + test_credential_pool_routing.py + tests/run_agent/test_run_agent.py
57 passed

including the three new regression tests for this fix:

TestPoolRotationCycle::test_first_429_with_alternates_rotates_immediately
TestPoolRotationCycle::test_first_429_no_alternates_retries_same_credential
TestPoolRotationCycle::test_full_pool_walk_does_not_burn_retry_slots

…g does not burn retry slots (NousResearch#16830) Issue NousResearch#16830: with a 10-key credential pool and the default api_max_retries=3, only 2-3 keys ever get tried before the agent gives up, even though 7+ keys remain unused. Root cause ---------- On a 429, _recover_with_credential_pool used a "retry same credential once, then rotate" pattern. The first 429 returned (False, True), which caused the outer retry loop in run_agent.py to do: retry_count += 1 before the second attempt — and only the second 429 actually rotated to the next pool entry. Each credential therefore consumed at least one retry slot before being exchanged. With 10 keys and 3 retries, the pool was already over before half its keys had been tried. The fix ------- When the pool reports unexhausted alternates available, rotate immediately on the first 429 instead of burning a retry slot retrying the same key. The rate-limit window won't clear inside our backoff interval anyway, but a different key in the pool likely has fresh quota. Single-credential pools (or pools where every alternate is in cooldown) keep the original retry-same-then-rotate behavior so a transient 429 on the only available key isn't doubly punished. This makes pool walking transparent to api_max_retries — only once every credential has been exhausted does retry_count start incrementing. Changes ------- * agent/credential_pool.py: new `has_unexhausted_alternates()` — read-only check (no _persist, no refresh) for "is there at least one entry besides current that is unexhausted or has cleared its cooldown?" * run_agent.py: rate_limit branch of `_recover_with_credential_pool` consults `has_unexhausted_alternates()` and rotates immediately when alternates exist. Wrapped in try/except AttributeError so older or hand-rolled mock pools without the method keep prior behavior. * tests: extend TestPoolRotationCycle with multi- vs single-credential cases, add a 10-credential pool walk regression test, and add direct unit tests for `has_unexhausted_alternates` covering ok / cooldown / cooldown-elapsed / single-entry shapes. Regression guard ---------------- Reverted the run_agent.py + credential_pool.py changes locally and the 3 new behavior tests fail with `recovered=False` where they now expect `True`. Restored the fix and all 60 tests in the credential pool suites pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

briandevans · 2026-05-07T03:16:44Z

Rebased onto current origin/main (49c3c2e0d) — clean rebase, no conflicts. Re-ran focused tests (tests/agent/test_credential_pool.py, tests/agent/test_credential_pool_routing.py, tests/run_agent/test_run_agent.py) on the rebased head: all 373 pass. The fix continues to apply cleanly across the run_agent / credential_pool / fallback churn since the original open (notably #17929 fallback-at-init, the IterationBudget lock, OpenRouter routing changes, and the provider-modules ProviderProfile rewrite). Issue #16830 is still open and StreakingJerry echoed the same complaint last week. Happy to address any review feedback whenever convenient.

briandevans · 2026-05-18T09:25:17Z

Closing to keep the queue clean — credential_pool.py has been substantially refactored on main since this opened (Nous JWT rewrite, key mix-up fix #25516, cooldown shortening, etc.) and the PR no longer applies cleanly. Happy to reopen with a fresh take if the pool-rotation-vs-retry-count distinction is still useful.

Copilot AI review requested due to automatic review settings April 28, 2026 08:18

alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/agent Core agent loop, run_agent.py, prompt builder labels Apr 28, 2026

briandevans mentioned this pull request Apr 28, 2026

fix(qqbot): surface non-image attachments to agent when download fails (#16979) #16990

Closed

3 tasks

briandevans force-pushed the fix/credential-pool-rotation-not-counting-retries-16830 branch from 56d718f to fb65542 Compare April 28, 2026 12:47

briandevans mentioned this pull request Apr 28, 2026

fix(aux): default Codex reasoning effort to medium when extra_body.reasoning.effort is falsy #17012

Closed

4 tasks

briandevans force-pushed the fix/credential-pool-rotation-not-counting-retries-16830 branch from fb65542 to 5ababbb Compare May 7, 2026 03:16

briandevans closed this May 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(credential-pool): pool rotation should not count toward api_max_retries (#16830)#16902

fix(credential-pool): pool rotation should not count toward api_max_retries (#16830)#16902
briandevans wants to merge 1 commit into
NousResearch:mainfrom
briandevans:fix/credential-pool-rotation-not-counting-retries-16830

briandevans commented Apr 28, 2026

Uh oh!

briandevans commented Apr 28, 2026

Uh oh!

briandevans commented May 7, 2026

Uh oh!

briandevans commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

briandevans commented Apr 28, 2026

Summary

The bug

The fix

Test plan

Related

Uh oh!

briandevans commented Apr 28, 2026

Uh oh!

briandevans commented May 7, 2026

Uh oh!

briandevans commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants