fix(credential-pool): pool rotation should not count toward api_max_retries (#16830)#16902
Conversation
|
CI audit — all 21 Failure set is the same one observable on every other PR opened today against Locally the full pool/retry suite is green: including the three new regression tests for this fix:
|
56d718f to
fb65542
Compare
…g does not burn retry slots (NousResearch#16830) Issue NousResearch#16830: with a 10-key credential pool and the default api_max_retries=3, only 2-3 keys ever get tried before the agent gives up, even though 7+ keys remain unused. Root cause ---------- On a 429, _recover_with_credential_pool used a "retry same credential once, then rotate" pattern. The first 429 returned (False, True), which caused the outer retry loop in run_agent.py to do: retry_count += 1 before the second attempt — and only the second 429 actually rotated to the next pool entry. Each credential therefore consumed at least one retry slot before being exchanged. With 10 keys and 3 retries, the pool was already over before half its keys had been tried. The fix ------- When the pool reports unexhausted alternates available, rotate immediately on the first 429 instead of burning a retry slot retrying the same key. The rate-limit window won't clear inside our backoff interval anyway, but a different key in the pool likely has fresh quota. Single-credential pools (or pools where every alternate is in cooldown) keep the original retry-same-then-rotate behavior so a transient 429 on the only available key isn't doubly punished. This makes pool walking transparent to api_max_retries — only once every credential has been exhausted does retry_count start incrementing. Changes ------- * agent/credential_pool.py: new `has_unexhausted_alternates()` — read-only check (no _persist, no refresh) for "is there at least one entry besides current that is unexhausted or has cleared its cooldown?" * run_agent.py: rate_limit branch of `_recover_with_credential_pool` consults `has_unexhausted_alternates()` and rotates immediately when alternates exist. Wrapped in try/except AttributeError so older or hand-rolled mock pools without the method keep prior behavior. * tests: extend TestPoolRotationCycle with multi- vs single-credential cases, add a 10-credential pool walk regression test, and add direct unit tests for `has_unexhausted_alternates` covering ok / cooldown / cooldown-elapsed / single-entry shapes. Regression guard ---------------- Reverted the run_agent.py + credential_pool.py changes locally and the 3 new behavior tests fail with `recovered=False` where they now expect `True`. Restored the fix and all 60 tests in the credential pool suites pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fb65542 to
5ababbb
Compare
|
Rebased onto current |
|
Closing to keep the queue clean — credential_pool.py has been substantially refactored on main since this opened (Nous JWT rewrite, key mix-up fix #25516, cooldown shortening, etc.) and the PR no longer applies cleanly. Happy to reopen with a fresh take if the pool-rotation-vs-retry-count distinction is still useful. |
Summary
api_max_retries— only once every credential is exhausted doesretry_countstart incrementing.The bug
Fixes #16830. With a 10-key credential pool and the default
api_max_retries: 3, only 2-3 keys ever get tried before the agent gives up. The remaining 7+ keys go unused even though they would have served the request.Tracing
_recover_with_credential_pool(inrun_agent.py):Each credential consumed at least one retry slot before being exchanged, because the first 429 returned
(False, True)and the outer loop didretry_count += 1(line 11315) before the second attempt that actually rotates.The fix
agent/credential_pool.py— newhas_unexhausted_alternates(): a read-only query (no_persist, no refresh) that reports whether the pool has at least one entry besides the current one that is either unexhausted or whose cooldown has elapsed.run_agent.py— rate-limit branch of_recover_with_credential_pool:Multi-credential pool: first 429 → rotate immediately (no retry_count cost, fresh key likely has fresh quota).
Single-credential pool: first 429 → retry once before counting, preserving the safety net for transient blips on a lone key.
The
AttributeErrorguard keeps existing tests with hand-rolled fake pools working without modification.Test plan
tests/agent/test_credential_pool_routing.py—TestPoolRotationCycleupdated:test_first_429_with_alternates_rotates_immediately(NEW): multi-credential → rotation on first 429test_first_429_no_alternates_retries_same_credential(NEW): single-credential → original behaviortest_full_pool_walk_does_not_burn_retry_slots(NEW): 10-credential pool walks ~all keys before recovery returns Falsetests/run_agent/test_run_agent.py::TestCredentialPoolRecovery::test_recover_with_pool_rotates_immediately_when_alternates_available(NEW): direct repro of the issue scenario.tests/agent/test_credential_pool.py— four new direct unit tests forhas_unexhausted_alternates: two ok entries, single entry, other-in-cooldown, other-cooldown-elapsed.run_agent.py+agent/credential_pool.py, the 3 new behavior tests failed withrecovered=Falsewhere they now expectTrue. Restored fix → all 60 tests intests/agent/test_credential_pool*andTestCredentialPoolRecoverypass.test_anthropic_adapter.py,test_bedrock_adapter.py,test_copilot_acp_client.pyreproduce on cleanorigin/main(8269f90) and are unrelated to this change.Related