Skip to content

fix(credential-pool): exponential backoff on repeated exhaustion (#15296)#15455

Closed
briandevans wants to merge 2 commits into
NousResearch:mainfrom
briandevans:fix/credential-pool-exponential-backoff
Closed

fix(credential-pool): exponential backoff on repeated exhaustion (#15296)#15455
briandevans wants to merge 2 commits into
NousResearch:mainfrom
briandevans:fix/credential-pool-exponential-backoff

Conversation

@briandevans

Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes `#15296`. The pool's `exhausted_ttl` returned a flat `EXHAUSTED_TTL*` constant regardless of how many consecutive times the same credential had failed. When a provider was overloaded for hours, the pool cycled through:

  1. TTL expires → credential cleared back to `ok`
  2. Provider tried again → fails with the same error (3 retries wasted)
  3. Credential re-exhausted with the same flat TTL
  4. Wait for TTL to expire again → repeat

For cron jobs (each run = one turn), every execution burned its retry budget on the same dead credential until the upstream actually recovered.

Fix

  • New `consecutive_failures: int = 0` field on `PooledCredential`
  • `_mark_exhausted` increments it from the prior value (defensive `getattr` + `or 0` so legacy on-disk entries that predate the field default to 0 via the dataclass and get promoted to 1 on first failure)
  • `_exhausted_ttl(error_code, consecutive_failures=0)` now applies `base_ttl * min(2 ** (failures - 1), 8)` — capped at 8× so a long-running outage doesn't push the cooldown into days. The default-arg keeps backward compat with any caller still using the old single-arg signature.
  • `_exhausted_until` passes `entry.consecutive_failures` through
  • Resets to 0 happen on real success markers (refresh produces fresh tokens, anthropic claude_code sync from credentials file, nous adopts newer tokens from auth.json) and on operator-driven `reset_statuses()`
  • The cooldown auto-clear path (`_available_entries(clear_expired=True)` flipping `last_status` back to `STATUS_OK` when the TTL elapses) deliberately does NOT reset the counter — that's the entire point of the fix. If the same credential exhausts again right after the cooldown expires, the upstream is still down and the next cooldown should be longer.

Backoff progression

consecutive_failures multiplier cooldown (default 1h base)
0 (default) 1h
1 (first) 1h
2 2h
3 4h
4 8h
5+ 8h (cap)

The 8h cap matters: without it, a credential that's been failing for a week could end up on a multi-day cooldown that survives operator intervention (refreshing the upstream provider, swapping API keys, etc.).

Related Issue

Fixes #15296

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✅ Tests (adding or improving test coverage)

Test plan

  • 21 new tests in `tests/agent/test_credential_pool_backoff.py` — all green on py3.11 venv
  • All 44 pre-existing `test_credential_pool.py` tests still pass — backward compat preserved (the default-argument keeps the single-arg signature working)
  • Verified regression guards: temporarily reverted the `_mark_exhausted` increment to a no-op; 4 tests in `TestMarkExhaustedIncrement` correctly failed with clear assertion messages pointing at the regressed invariant. Restored fix → all 30 backoff tests pass.

Test coverage detail

`TestExhaustedTtl` (8 cases):

  • Backward-compat default (no `consecutive_failures` arg → flat base TTL)
  • `consecutive_failures=1` is 1× (not 2×) — prevents off-by-one on the very first failure
  • 8-row parametrised matrix on 429 progression: 1× → 2× → 4× → 8× → cap
  • Same matrix on 402 (non-429 cooldowns also back off)
  • `consecutive_failures=0` defensive default
  • Negative count clamped to base (corrupted on-disk entry safety)
  • Cap constant pinned to 8 so a refactor flagging this test prompts a docstring update

`TestExhaustedUntil` (4 cases):

  • First failure → 1× base TTL after `last_status_at`
  • Fourth failure → 8× base TTL
  • 20 failures → still 8× (cap holds)
  • Explicit `last_error_reset_at` from upstream wins over backoff (provider-supplied resets are absolute truth)

`TestMarkExhaustedIncrement` (4 cases):

  • 0 → 1 on first exhaustion
  • 2 → 3 on consecutive (the streak persists)
  • Round-trips through pool reload (persistence)
  • Legacy on-disk entry without the field → loads as 0, increments to 1

`TestCounterResetSemantics` (3 cases):

  • `reset_statuses()` zeroes the counter (operator path)
  • `clear_expired` does NOT zero the counter — the bug-fix invariant; without this assertion a future refactor could silently break the backoff
  • `to_dict`/`from_dict` round-trips the field

Companion PRs

This is one of three related credential-pool resilience fixes filed by @mordekai-lab:

The three are independent and can land in any order.

Not in scope

  • Per-provider tuning of the cap or base TTL — currently uses the same 8× cap for every provider. A future enhancement could let providers configure their own backoff curve, but that's a bigger surface than this user-facing regression needs to unblock.

Copilot AI review requested due to automatic review settings April 25, 2026 00:55

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds exponential backoff to the credential pool’s exhaustion cooldown by tracking per-credential consecutive failure streaks, reducing repeated retry waste during sustained provider outages.

Changes:

  • Introduces consecutive_failures on PooledCredential and increments it on _mark_exhausted.
  • Updates _exhausted_ttl / _exhausted_until to apply capped exponential backoff based on consecutive failures.
  • Resets the streak on explicit success/operator-reset paths, and adds a comprehensive regression test suite.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
agent/credential_pool.py Adds the failure-streak counter and applies capped exponential backoff to exhaustion cooldowns, with resets on success/operator actions.
tests/agent/test_credential_pool_backoff.py New regression tests covering TTL progression, persistence, legacy entries, and reset semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread agent/credential_pool.py Outdated
# exhaustion, so the first failure should still apply 1× the base
# (i.e. multiplier exponent ``failures - 1``, clamped at 0).
exponent = max(0, int(consecutive_failures) - 1)
multiplier = min(2 ** exponent, EXHAUSTED_TTL_BACKOFF_CAP)
Comment thread agent/credential_pool.py
Comment on lines 465 to 469
last_error_reason=normalized_error.get("reason"),
last_error_message=normalized_error.get("message"),
last_error_reset_at=normalized_error.get("reset_at"),
consecutive_failures=int(prior_failures) + 1,
)
Comment on lines +30 to +31
from dataclasses import replace
from unittest.mock import patch
Comment thread agent/credential_pool.py Outdated
Comment on lines +232 to +235
# ``consecutive_failures`` is the count *including* the current
# exhaustion, so the first failure should still apply 1× the base
# (i.e. multiplier exponent ``failures - 1``, clamped at 0).
exponent = max(0, int(consecutive_failures) - 1)
@briandevans

Copy link
Copy Markdown
Contributor Author

Thanks @copilot — all four findings addressed in f421ad48:

1. 2 ** exponent could compute huge ints before capping — fixed by clamping the exponent first. Added _EXHAUSTED_TTL_BACKOFF_MAX_EXPONENT = 3 (= log2 of the cap) and use it as a ceiling on the exponent BEFORE the shift. Also switched to 1 << exponent since it's equivalent for non-negative ints and makes the clamp intent unambiguous.

2 & 3. int(prior_failures) raising ValueError on corrupt auth.json — fixed at all three sites with a new _coerce_failure_count(value) helper that tolerates None, strings, floats, lists, and falls back to 0. Real numeric values (including "5" and 3.7) coerce as expected.

4. Unused imports in test filereplace and patch were never referenced; removed.

Two new defensive tests added:

  • test_corrupted_failure_value_falls_back_to_zero — parametrised over None, "not-a-number", [], {} plus the numeric-string / float cases
  • test_runaway_failure_count_does_not_compute_huge_int — passes consecutive_failures = 10_000 and asserts both correctness AND elapsed < 50 ms. If the exponent clamp regresses to 2 ** 10000, the wall-time bound flags it instantly.

72/72 credential-pool tests pass (44 pre-existing + 28 new). Zero behaviour change for valid inputs.

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder area/auth Authentication, OAuth, credential pools labels Apr 25, 2026
briandevans and others added 2 commits April 30, 2026 09:14
…sResearch#15296)

The pool's ``_exhausted_ttl`` returned a flat ``EXHAUSTED_TTL_*``
constant regardless of how many consecutive times the same credential
had failed.  When a provider was overloaded for hours, the pool cycled
through:

    TTL expires → cleared back to ``ok`` → retried → fails again with
    same error (3 retries wasted) → marked exhausted with same flat
    TTL → wait → repeat

For cron jobs (each run = one turn), every execution burned its retry
budget on the same dead credential until the upstream actually
recovered.

### Fix

* New ``consecutive_failures: int = 0`` field on ``PooledCredential``.
* ``_mark_exhausted`` increments it from the prior value (defensive
  ``getattr`` + ``or 0`` so legacy on-disk entries that predate this
  field don't break — they default to 0 via the dataclass and get
  promoted to 1 on the first new failure).
* ``_exhausted_ttl(error_code, consecutive_failures=0)`` now applies
  ``base_ttl * min(2 ** (failures - 1), 8)`` — capped at 8× so a
  long-running outage doesn't push the cooldown into days.  The default
  value preserves backward compatibility for any caller still using
  the old single-arg signature.
* ``_exhausted_until`` passes ``entry.consecutive_failures`` through.
* Resets to 0 happen on real success markers — refresh produces fresh
  tokens, anthropic claude_code sync from credentials file, nous adopts
  newer tokens from auth.json — and on operator-driven
  ``reset_statuses()``.
* The cooldown auto-clear path (``_available_entries(clear_expired=
  True)`` flipping ``last_status`` back to ``STATUS_OK`` when the TTL
  elapses) deliberately does NOT reset the counter — that's the entire
  point.  If the same credential exhausts again right after the
  cooldown expires, the upstream is still down and the next cooldown
  should be longer.

### Backoff progression

| failures | multiplier | cooldown (default 1h base) |
|---:|---:|---|
| 0 (default) | 1× | 1h |
| 1 (first) | 1× | 1h |
| 2 | 2× | 2h |
| 3 | 4× | 4h |
| 4 | 8× | 8h |
| 5+ | 8× | 8h (cap) |

### Tests (21 new in ``tests/agent/test_credential_pool_backoff.py``)

* ``TestExhaustedTtl`` (8 cases): formula correctness — backward-compat
  default, first-failure 1× (not 2×), 8-row parametrised matrix
  proving the cap holds at 8×, zero/negative defensive clamps, plus a
  pin on the cap constant value.
* ``TestExhaustedUntil`` (4 cases): integration — first failure ×1,
  fourth failure ×8, capped after 20 failures, and an explicit
  ``last_error_reset_at`` from the upstream wins over backoff.
* ``TestMarkExhaustedIncrement`` (4 cases): increment from 0→1, from
  2→3, persistence round-trip across pool reload, and a legacy on-disk
  entry missing the field still works.
* ``TestCounterResetSemantics`` (3 cases): ``reset_statuses()`` zeroes
  the counter (operator path); ``clear_expired`` does NOT zero it (the
  bug-fix invariant); and the field round-trips through ``to_dict`` /
  ``from_dict``.

**Verified regression guards**: temporarily reverted the
``_mark_exhausted`` increment (replaced ``+ 1`` with ``= 0``); 4 tests
in ``TestMarkExhaustedIncrement`` correctly failed with clear
assertion messages.  Restored fix → 30/30 pass.  All 44 pre-existing
``test_credential_pool.py`` tests stay green — no behavior change for
callers that don't pass ``consecutive_failures`` (the default).

Closes NousResearch#15296

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NousResearch#15455)

Copilot's review on NousResearch#15455 flagged four real issues:

1. ``2 ** exponent`` was computed BEFORE ``min(..., CAP)``, so a
   credential with ``consecutive_failures = 10000`` (sustained outage)
   would materialise a ~3000-digit integer just to throw it away.

2. ``int(prior_failures)`` in ``_mark_exhausted`` raises ``ValueError``
   if ``auth.json`` carries a non-numeric value (string, list, ...) —
   breaks failover for users with hand-edited or migrated pool stores.

3. Same issue in ``_exhausted_ttl(consecutive_failures)`` and in the
   ``_exhausted_until`` call site.

4. Test file imported ``replace`` and ``patch`` but used neither.

### Fix

* New ``_coerce_failure_count(value)`` helper that tolerates ``None``,
  strings, floats, lists, and falls back to 0 on any failure.  Real
  numeric values (including ``"5"`` and ``3.7``) coerce as expected.
* New ``_EXHAUSTED_TTL_BACKOFF_MAX_EXPONENT = 3`` constant (= log2 of
  the cap) used to clamp the exponent before the shift.  Replaced
  ``2 ** exponent`` with ``1 << exponent`` since they're equivalent
  for non-negative ints and the bit-shift form makes the clamp
  intent unambiguous.
* ``_mark_exhausted`` and ``_exhausted_until`` now route the prior-
  count read through ``_coerce_failure_count`` so a corrupted on-disk
  entry can never blow up failover.
* Removed unused ``from dataclasses import replace`` and
  ``from unittest.mock import patch`` from the test module.

### New tests (7 cases)

* ``test_corrupted_failure_value_falls_back_to_zero`` — parametrised
  over ``None``, ``"not-a-number"``, ``[]``, ``{}``, plus the two
  numeric-string / float cases that DO coerce cleanly (so the contract
  is uniform).
* ``test_runaway_failure_count_does_not_compute_huge_int`` — passes
  ``consecutive_failures = 10_000`` and asserts both correctness AND
  ``elapsed < 50 ms``.  If the exponent clamp regresses to ``2 **
  10000``, the wall-time bound flags it instantly.

72/72 credential-pool tests pass (44 pre-existing + 28 new).  No
behaviour change for valid inputs — the clamp is unreachable for
``failures <= 4``, the safe-cast is a no-op for proper integers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@briandevans briandevans force-pushed the fix/credential-pool-exponential-backoff branch from f421ad4 to df3e2ce Compare April 30, 2026 16:15
@briandevans

Copy link
Copy Markdown
Contributor Author

Closing to keep the queue clean — 17 days idle and now conflicting on agent/credential_pool.py. Happy to reopen if this is still useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/auth Authentication, OAuth, credential pools comp/agent Core agent loop, run_agent.py, prompt builder P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Credential pool: no exponential backoff on repeated exhaustion — flat TTL causes 429 retry loops

3 participants