Skip to content

fix(credential-pool): pool rotation should not count toward api_max_retries (#16830)#16902

Closed
briandevans wants to merge 1 commit into
NousResearch:mainfrom
briandevans:fix/credential-pool-rotation-not-counting-retries-16830
Closed

fix(credential-pool): pool rotation should not count toward api_max_retries (#16830)#16902
briandevans wants to merge 1 commit into
NousResearch:mainfrom
briandevans:fix/credential-pool-rotation-not-counting-retries-16830

Conversation

@briandevans

Copy link
Copy Markdown
Contributor

Summary

  • 429 on a credential with multi-key pool now rotates immediately instead of first burning a retry slot retrying the same key.
  • Single-credential pools keep the original retry-same-then-rotate behavior so a transient 429 on the only available key isn't doubly punished.
  • Pool walking is transparent to api_max_retries — only once every credential is exhausted does retry_count start incrementing.

The bug

Fixes #16830. With a 10-key credential pool and the default api_max_retries: 3, only 2-3 keys ever get tried before the agent gives up. The remaining 7+ keys go unused even though they would have served the request.

Tracing _recover_with_credential_pool (in run_agent.py):

Iter 1: 429 on key #1 → returns (False, True)              → retry_count=1
Iter 2: 429 on key #1 → rotates to key #2, returns (True,…) → retry_count=1
Iter 3: 429 on key #2 → returns (False, True)              → retry_count=2
Iter 4: 429 on key #2 → rotates to key #3, returns (True,…) → retry_count=2
Iter 5: 429 on key #3 → returns (False, True)              → retry_count=3 → exit

Each credential consumed at least one retry slot before being exchanged, because the first 429 returned (False, True) and the outer loop did retry_count += 1 (line 11315) before the second attempt that actually rotates.

The fix

agent/credential_pool.py — new has_unexhausted_alternates(): a read-only query (no _persist, no refresh) that reports whether the pool has at least one entry besides the current one that is either unexhausted or whose cooldown has elapsed.

run_agent.py — rate-limit branch of _recover_with_credential_pool:

has_alternates = False
try:
    has_alternates = bool(pool.has_unexhausted_alternates())
except AttributeError:
    has_alternates = False  # older / hand-rolled mock pools
if not has_retried_429 and not has_alternates:
    return False, True  # single credential — keep original behavior
next_entry = pool.mark_exhausted_and_rotate(...)
...

Multi-credential pool: first 429 → rotate immediately (no retry_count cost, fresh key likely has fresh quota).
Single-credential pool: first 429 → retry once before counting, preserving the safety net for transient blips on a lone key.

The AttributeError guard keeps existing tests with hand-rolled fake pools working without modification.

Test plan

  • tests/agent/test_credential_pool_routing.pyTestPoolRotationCycle updated:
    • test_first_429_with_alternates_rotates_immediately (NEW): multi-credential → rotation on first 429
    • test_first_429_no_alternates_retries_same_credential (NEW): single-credential → original behavior
    • test_full_pool_walk_does_not_burn_retry_slots (NEW): 10-credential pool walks ~all keys before recovery returns False
  • tests/run_agent/test_run_agent.py::TestCredentialPoolRecovery::test_recover_with_pool_rotates_immediately_when_alternates_available (NEW): direct repro of the issue scenario.
  • tests/agent/test_credential_pool.py — four new direct unit tests for has_unexhausted_alternates: two ok entries, single entry, other-in-cooldown, other-cooldown-elapsed.
  • Regression guard: reverted run_agent.py + agent/credential_pool.py, the 3 new behavior tests failed with recovered=False where they now expect True. Restored fix → all 60 tests in tests/agent/test_credential_pool* and TestCredentialPoolRecovery pass.
  • CI's pre-existing baseline failures in test_anthropic_adapter.py, test_bedrock_adapter.py, test_copilot_acp_client.py reproduce on clean origin/main (8269f90) and are unrelated to this change.

Related

Copilot AI review requested due to automatic review settings April 28, 2026 08:18
@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/agent Core agent loop, run_agent.py, prompt builder labels Apr 28, 2026
@briandevans

Copy link
Copy Markdown
Contributor Author

CI audit — all 21 test job failures on commit 56d718fce are pre-existing baselines on clean origin/main (8081425a1). Zero failures intersect with touched code (agent/credential_pool.py, run_agent.py, or the three new test files).

Failure set is the same one observable on every other PR opened today against 8081425a1 (e.g. #16851, #16786, #16770, the now-merged #16868) — test_anthropic_adapter::test_custom_base_url, test_copilot_acp_client::test_read_text_file_redacts_sensitive_content, test_config_env_expansion::*, test_session_split_brain_11016::* (timeouts), test_make_agent_provider::test_make_agent_passes_resolved_provider, test_protocol::test_session_resume_returns_hydrated_messages, etc.

Locally the full pool/retry suite is green:

tests/agent/test_credential_pool.py + test_credential_pool_routing.py + tests/run_agent/test_run_agent.py
57 passed

including the three new regression tests for this fix:

  • TestPoolRotationCycle::test_first_429_with_alternates_rotates_immediately
  • TestPoolRotationCycle::test_first_429_no_alternates_retries_same_credential
  • TestPoolRotationCycle::test_full_pool_walk_does_not_burn_retry_slots

…g does not burn retry slots (NousResearch#16830)

Issue NousResearch#16830: with a 10-key credential pool and the default
api_max_retries=3, only 2-3 keys ever get tried before the agent gives
up, even though 7+ keys remain unused.

Root cause
----------
On a 429, _recover_with_credential_pool used a "retry same credential
once, then rotate" pattern.  The first 429 returned (False, True),
which caused the outer retry loop in run_agent.py to do:

    retry_count += 1

before the second attempt — and only the second 429 actually rotated
to the next pool entry.  Each credential therefore consumed at least
one retry slot before being exchanged.  With 10 keys and 3 retries,
the pool was already over before half its keys had been tried.

The fix
-------
When the pool reports unexhausted alternates available, rotate
immediately on the first 429 instead of burning a retry slot retrying
the same key.  The rate-limit window won't clear inside our backoff
interval anyway, but a different key in the pool likely has fresh
quota.  Single-credential pools (or pools where every alternate is in
cooldown) keep the original retry-same-then-rotate behavior so a
transient 429 on the only available key isn't doubly punished.

This makes pool walking transparent to api_max_retries — only once
every credential has been exhausted does retry_count start
incrementing.

Changes
-------
* agent/credential_pool.py: new `has_unexhausted_alternates()` —
  read-only check (no _persist, no refresh) for "is there at least one
  entry besides current that is unexhausted or has cleared its
  cooldown?"
* run_agent.py: rate_limit branch of `_recover_with_credential_pool`
  consults `has_unexhausted_alternates()` and rotates immediately when
  alternates exist.  Wrapped in try/except AttributeError so older or
  hand-rolled mock pools without the method keep prior behavior.
* tests: extend TestPoolRotationCycle with multi- vs single-credential
  cases, add a 10-credential pool walk regression test, and add direct
  unit tests for `has_unexhausted_alternates` covering ok / cooldown /
  cooldown-elapsed / single-entry shapes.

Regression guard
----------------
Reverted the run_agent.py + credential_pool.py changes locally and the
3 new behavior tests fail with `recovered=False` where they now expect
`True`.  Restored the fix and all 60 tests in the credential pool
suites pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@briandevans briandevans force-pushed the fix/credential-pool-rotation-not-counting-retries-16830 branch from fb65542 to 5ababbb Compare May 7, 2026 03:16
@briandevans

Copy link
Copy Markdown
Contributor Author

Rebased onto current origin/main (49c3c2e0d) — clean rebase, no conflicts. Re-ran focused tests (tests/agent/test_credential_pool.py, tests/agent/test_credential_pool_routing.py, tests/run_agent/test_run_agent.py) on the rebased head: all 373 pass. The fix continues to apply cleanly across the run_agent / credential_pool / fallback churn since the original open (notably #17929 fallback-at-init, the IterationBudget lock, OpenRouter routing changes, and the provider-modules ProviderProfile rewrite). Issue #16830 is still open and StreakingJerry echoed the same complaint last week. Happy to address any review feedback whenever convenient.

@briandevans

Copy link
Copy Markdown
Contributor Author

Closing to keep the queue clean — credential_pool.py has been substantially refactored on main since this opened (Nous JWT rewrite, key mix-up fix #25516, cooldown shortening, etc.) and the PR no longer applies cleanly. Happy to reopen with a fresh take if the pool-rotation-vs-retry-count distinction is still useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Credential pool rotation should not count toward api_max_retries

2 participants