fix(credential-pool): exponential backoff on repeated exhaustion (#15296)#15455
fix(credential-pool): exponential backoff on repeated exhaustion (#15296)#15455briandevans wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Adds exponential backoff to the credential pool’s exhaustion cooldown by tracking per-credential consecutive failure streaks, reducing repeated retry waste during sustained provider outages.
Changes:
- Introduces
consecutive_failuresonPooledCredentialand increments it on_mark_exhausted. - Updates
_exhausted_ttl/_exhausted_untilto apply capped exponential backoff based on consecutive failures. - Resets the streak on explicit success/operator-reset paths, and adds a comprehensive regression test suite.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
agent/credential_pool.py |
Adds the failure-streak counter and applies capped exponential backoff to exhaustion cooldowns, with resets on success/operator actions. |
tests/agent/test_credential_pool_backoff.py |
New regression tests covering TTL progression, persistence, legacy entries, and reset semantics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # exhaustion, so the first failure should still apply 1× the base | ||
| # (i.e. multiplier exponent ``failures - 1``, clamped at 0). | ||
| exponent = max(0, int(consecutive_failures) - 1) | ||
| multiplier = min(2 ** exponent, EXHAUSTED_TTL_BACKOFF_CAP) |
| last_error_reason=normalized_error.get("reason"), | ||
| last_error_message=normalized_error.get("message"), | ||
| last_error_reset_at=normalized_error.get("reset_at"), | ||
| consecutive_failures=int(prior_failures) + 1, | ||
| ) |
| from dataclasses import replace | ||
| from unittest.mock import patch |
| # ``consecutive_failures`` is the count *including* the current | ||
| # exhaustion, so the first failure should still apply 1× the base | ||
| # (i.e. multiplier exponent ``failures - 1``, clamped at 0). | ||
| exponent = max(0, int(consecutive_failures) - 1) |
|
Thanks @copilot — all four findings addressed in 1. 2 & 3. 4. Unused imports in test file — Two new defensive tests added:
72/72 credential-pool tests pass (44 pre-existing + 28 new). Zero behaviour change for valid inputs. |
…sResearch#15296) The pool's ``_exhausted_ttl`` returned a flat ``EXHAUSTED_TTL_*`` constant regardless of how many consecutive times the same credential had failed. When a provider was overloaded for hours, the pool cycled through: TTL expires → cleared back to ``ok`` → retried → fails again with same error (3 retries wasted) → marked exhausted with same flat TTL → wait → repeat For cron jobs (each run = one turn), every execution burned its retry budget on the same dead credential until the upstream actually recovered. ### Fix * New ``consecutive_failures: int = 0`` field on ``PooledCredential``. * ``_mark_exhausted`` increments it from the prior value (defensive ``getattr`` + ``or 0`` so legacy on-disk entries that predate this field don't break — they default to 0 via the dataclass and get promoted to 1 on the first new failure). * ``_exhausted_ttl(error_code, consecutive_failures=0)`` now applies ``base_ttl * min(2 ** (failures - 1), 8)`` — capped at 8× so a long-running outage doesn't push the cooldown into days. The default value preserves backward compatibility for any caller still using the old single-arg signature. * ``_exhausted_until`` passes ``entry.consecutive_failures`` through. * Resets to 0 happen on real success markers — refresh produces fresh tokens, anthropic claude_code sync from credentials file, nous adopts newer tokens from auth.json — and on operator-driven ``reset_statuses()``. * The cooldown auto-clear path (``_available_entries(clear_expired= True)`` flipping ``last_status`` back to ``STATUS_OK`` when the TTL elapses) deliberately does NOT reset the counter — that's the entire point. If the same credential exhausts again right after the cooldown expires, the upstream is still down and the next cooldown should be longer. ### Backoff progression | failures | multiplier | cooldown (default 1h base) | |---:|---:|---| | 0 (default) | 1× | 1h | | 1 (first) | 1× | 1h | | 2 | 2× | 2h | | 3 | 4× | 4h | | 4 | 8× | 8h | | 5+ | 8× | 8h (cap) | ### Tests (21 new in ``tests/agent/test_credential_pool_backoff.py``) * ``TestExhaustedTtl`` (8 cases): formula correctness — backward-compat default, first-failure 1× (not 2×), 8-row parametrised matrix proving the cap holds at 8×, zero/negative defensive clamps, plus a pin on the cap constant value. * ``TestExhaustedUntil`` (4 cases): integration — first failure ×1, fourth failure ×8, capped after 20 failures, and an explicit ``last_error_reset_at`` from the upstream wins over backoff. * ``TestMarkExhaustedIncrement`` (4 cases): increment from 0→1, from 2→3, persistence round-trip across pool reload, and a legacy on-disk entry missing the field still works. * ``TestCounterResetSemantics`` (3 cases): ``reset_statuses()`` zeroes the counter (operator path); ``clear_expired`` does NOT zero it (the bug-fix invariant); and the field round-trips through ``to_dict`` / ``from_dict``. **Verified regression guards**: temporarily reverted the ``_mark_exhausted`` increment (replaced ``+ 1`` with ``= 0``); 4 tests in ``TestMarkExhaustedIncrement`` correctly failed with clear assertion messages. Restored fix → 30/30 pass. All 44 pre-existing ``test_credential_pool.py`` tests stay green — no behavior change for callers that don't pass ``consecutive_failures`` (the default). Closes NousResearch#15296 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NousResearch#15455) Copilot's review on NousResearch#15455 flagged four real issues: 1. ``2 ** exponent`` was computed BEFORE ``min(..., CAP)``, so a credential with ``consecutive_failures = 10000`` (sustained outage) would materialise a ~3000-digit integer just to throw it away. 2. ``int(prior_failures)`` in ``_mark_exhausted`` raises ``ValueError`` if ``auth.json`` carries a non-numeric value (string, list, ...) — breaks failover for users with hand-edited or migrated pool stores. 3. Same issue in ``_exhausted_ttl(consecutive_failures)`` and in the ``_exhausted_until`` call site. 4. Test file imported ``replace`` and ``patch`` but used neither. ### Fix * New ``_coerce_failure_count(value)`` helper that tolerates ``None``, strings, floats, lists, and falls back to 0 on any failure. Real numeric values (including ``"5"`` and ``3.7``) coerce as expected. * New ``_EXHAUSTED_TTL_BACKOFF_MAX_EXPONENT = 3`` constant (= log2 of the cap) used to clamp the exponent before the shift. Replaced ``2 ** exponent`` with ``1 << exponent`` since they're equivalent for non-negative ints and the bit-shift form makes the clamp intent unambiguous. * ``_mark_exhausted`` and ``_exhausted_until`` now route the prior- count read through ``_coerce_failure_count`` so a corrupted on-disk entry can never blow up failover. * Removed unused ``from dataclasses import replace`` and ``from unittest.mock import patch`` from the test module. ### New tests (7 cases) * ``test_corrupted_failure_value_falls_back_to_zero`` — parametrised over ``None``, ``"not-a-number"``, ``[]``, ``{}``, plus the two numeric-string / float cases that DO coerce cleanly (so the contract is uniform). * ``test_runaway_failure_count_does_not_compute_huge_int`` — passes ``consecutive_failures = 10_000`` and asserts both correctness AND ``elapsed < 50 ms``. If the exponent clamp regresses to ``2 ** 10000``, the wall-time bound flags it instantly. 72/72 credential-pool tests pass (44 pre-existing + 28 new). No behaviour change for valid inputs — the clamp is unreachable for ``failures <= 4``, the safe-cast is a no-op for proper integers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
f421ad4 to
df3e2ce
Compare
|
Closing to keep the queue clean — 17 days idle and now conflicting on agent/credential_pool.py. Happy to reopen if this is still useful. |
What does this PR do?
Fixes `#15296`. The pool's `exhausted_ttl` returned a flat `EXHAUSTED_TTL*` constant regardless of how many consecutive times the same credential had failed. When a provider was overloaded for hours, the pool cycled through:
For cron jobs (each run = one turn), every execution burned its retry budget on the same dead credential until the upstream actually recovered.
Fix
Backoff progression
The 8h cap matters: without it, a credential that's been failing for a week could end up on a multi-day cooldown that survives operator intervention (refreshing the upstream provider, swapping API keys, etc.).
Related Issue
Fixes #15296
Type of Change
Test plan
Test coverage detail
`TestExhaustedTtl` (8 cases):
`TestExhaustedUntil` (4 cases):
`TestMarkExhaustedIncrement` (4 cases):
`TestCounterResetSemantics` (3 cases):
Companion PRs
This is one of three related credential-pool resilience fixes filed by @mordekai-lab:
The three are independent and can land in any order.
Not in scope