fix(credential-pool): exponential backoff on repeated exhaustion (#15296) by briandevans · Pull Request #15455 · NousResearch/hermes-agent

briandevans · 2026-04-25T00:55:40Z

What does this PR do?

Fixes `#15296`. The pool's `exhausted_ttl` returned a flat `EXHAUSTED_TTL*` constant regardless of how many consecutive times the same credential had failed. When a provider was overloaded for hours, the pool cycled through:

TTL expires → credential cleared back to `ok`
Provider tried again → fails with the same error (3 retries wasted)
Credential re-exhausted with the same flat TTL
Wait for TTL to expire again → repeat

For cron jobs (each run = one turn), every execution burned its retry budget on the same dead credential until the upstream actually recovered.

Fix

New `consecutive_failures: int = 0` field on `PooledCredential`
`_mark_exhausted` increments it from the prior value (defensive `getattr` + `or 0` so legacy on-disk entries that predate the field default to 0 via the dataclass and get promoted to 1 on first failure)
`_exhausted_ttl(error_code, consecutive_failures=0)` now applies `base_ttl * min(2 ** (failures - 1), 8)` — capped at 8× so a long-running outage doesn't push the cooldown into days. The default-arg keeps backward compat with any caller still using the old single-arg signature.
`_exhausted_until` passes `entry.consecutive_failures` through
Resets to 0 happen on real success markers (refresh produces fresh tokens, anthropic claude_code sync from credentials file, nous adopts newer tokens from auth.json) and on operator-driven `reset_statuses()`
The cooldown auto-clear path (`_available_entries(clear_expired=True)` flipping `last_status` back to `STATUS_OK` when the TTL elapses) deliberately does NOT reset the counter — that's the entire point of the fix. If the same credential exhausts again right after the cooldown expires, the upstream is still down and the next cooldown should be longer.

Backoff progression

consecutive_failures	multiplier	cooldown (default 1h base)
0 (default)	1×	1h
1 (first)	1×	1h
2	2×	2h
3	4×	4h
4	8×	8h
5+	8×	8h (cap)

The 8h cap matters: without it, a credential that's been failing for a week could end up on a multi-day cooldown that survives operator intervention (refreshing the upstream provider, swapping API keys, etc.).

Related Issue

Fixes #15296

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✅ Tests (adding or improving test coverage)

Test plan

21 new tests in `tests/agent/test_credential_pool_backoff.py` — all green on py3.11 venv
All 44 pre-existing `test_credential_pool.py` tests still pass — backward compat preserved (the default-argument keeps the single-arg signature working)
Verified regression guards: temporarily reverted the `_mark_exhausted` increment to a no-op; 4 tests in `TestMarkExhaustedIncrement` correctly failed with clear assertion messages pointing at the regressed invariant. Restored fix → all 30 backoff tests pass.

Test coverage detail

`TestExhaustedTtl` (8 cases):

Backward-compat default (no `consecutive_failures` arg → flat base TTL)
`consecutive_failures=1` is 1× (not 2×) — prevents off-by-one on the very first failure
8-row parametrised matrix on 429 progression: 1× → 2× → 4× → 8× → cap
Same matrix on 402 (non-429 cooldowns also back off)
`consecutive_failures=0` defensive default
Negative count clamped to base (corrupted on-disk entry safety)
Cap constant pinned to 8 so a refactor flagging this test prompts a docstring update

`TestExhaustedUntil` (4 cases):

First failure → 1× base TTL after `last_status_at`
Fourth failure → 8× base TTL
20 failures → still 8× (cap holds)
Explicit `last_error_reset_at` from upstream wins over backoff (provider-supplied resets are absolute truth)

`TestMarkExhaustedIncrement` (4 cases):

0 → 1 on first exhaustion
2 → 3 on consecutive (the streak persists)
Round-trips through pool reload (persistence)
Legacy on-disk entry without the field → loads as 0, increments to 1

`TestCounterResetSemantics` (3 cases):

`reset_statuses()` zeroes the counter (operator path)
`clear_expired` does NOT zero the counter — the bug-fix invariant; without this assertion a future refactor could silently break the backoff
`to_dict`/`from_dict` round-trips the field

Companion PRs

This is one of three related credential-pool resilience fixes filed by @mordekai-lab:

Error classifier: 429 'temporarily overloaded' misclassified as rate_limit — triggers wrong recovery path #15297 — error classifier disambiguates 429 "overloaded" from rate_limit (my open PR fix(error-classifier): route 429 'overloaded' messages to FailoverReason.overloaded (#15297) #15414)
_restore_primary_runtime() doesn't check credential cooldown — burns retries every turn while provider is exhausted #15298 — `_restore_primary_runtime` cooldown check (Tranquil-Flow's fix(agent): check credential pool exhaustion before restoring primary runtime (#15298) #15434)
Credential pool: no exponential backoff on repeated exhaustion — flat TTL causes 429 retry loops #15296 — this PR

The three are independent and can land in any order.

Not in scope

Per-provider tuning of the cap or base TTL — currently uses the same 8× cap for every provider. A future enhancement could let providers configure their own backoff curve, but that's a bigger surface than this user-facing regression needs to unblock.

Copilot

Pull request overview

Adds exponential backoff to the credential pool’s exhaustion cooldown by tracking per-credential consecutive failure streaks, reducing repeated retry waste during sustained provider outages.

Changes:

Introduces consecutive_failures on PooledCredential and increments it on _mark_exhausted.
Updates _exhausted_ttl / _exhausted_until to apply capped exponential backoff based on consecutive failures.
Resets the streak on explicit success/operator-reset paths, and adds a comprehensive regression test suite.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
`agent/credential_pool.py`	Adds the failure-streak counter and applies capped exponential backoff to exhaustion cooldowns, with resets on success/operator actions.
`tests/agent/test_credential_pool_backoff.py`	New regression tests covering TTL progression, persistence, legacy entries, and reset semantics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    # exhaustion, so the first failure should still apply 1× the base
+    # (i.e. multiplier exponent ``failures - 1``, clamped at 0).
+    exponent = max(0, int(consecutive_failures) - 1)
+    multiplier = min(2 ** exponent, EXHAUSTED_TTL_BACKOFF_CAP)


            last_error_reason=normalized_error.get("reason"),
            last_error_message=normalized_error.get("message"),
            last_error_reset_at=normalized_error.get("reset_at"),
+            consecutive_failures=int(prior_failures) + 1,
        )


+from dataclasses import replace
+from unittest.mock import patch


+    # ``consecutive_failures`` is the count *including* the current
+    # exhaustion, so the first failure should still apply 1× the base
+    # (i.e. multiplier exponent ``failures - 1``, clamped at 0).
+    exponent = max(0, int(consecutive_failures) - 1)


briandevans · 2026-04-25T01:05:40Z

Thanks @copilot — all four findings addressed in f421ad48:

1. 2 ** exponent could compute huge ints before capping — fixed by clamping the exponent first. Added _EXHAUSTED_TTL_BACKOFF_MAX_EXPONENT = 3 (= log2 of the cap) and use it as a ceiling on the exponent BEFORE the shift. Also switched to 1 << exponent since it's equivalent for non-negative ints and makes the clamp intent unambiguous.

2 & 3. int(prior_failures) raising ValueError on corrupt auth.json — fixed at all three sites with a new _coerce_failure_count(value) helper that tolerates None, strings, floats, lists, and falls back to 0. Real numeric values (including "5" and 3.7) coerce as expected.

4. Unused imports in test file — replace and patch were never referenced; removed.

Two new defensive tests added:

test_corrupted_failure_value_falls_back_to_zero — parametrised over None, "not-a-number", [], {} plus the numeric-string / float cases
test_runaway_failure_count_does_not_compute_huge_int — passes consecutive_failures = 10_000 and asserts both correctness AND elapsed < 50 ms. If the exponent clamp regresses to 2 ** 10000, the wall-time bound flags it instantly.

72/72 credential-pool tests pass (44 pre-existing + 28 new). Zero behaviour change for valid inputs.

…sResearch#15296) The pool's ``_exhausted_ttl`` returned a flat ``EXHAUSTED_TTL_*`` constant regardless of how many consecutive times the same credential had failed. When a provider was overloaded for hours, the pool cycled through: TTL expires → cleared back to ``ok`` → retried → fails again with same error (3 retries wasted) → marked exhausted with same flat TTL → wait → repeat For cron jobs (each run = one turn), every execution burned its retry budget on the same dead credential until the upstream actually recovered. ### Fix * New ``consecutive_failures: int = 0`` field on ``PooledCredential``. * ``_mark_exhausted`` increments it from the prior value (defensive ``getattr`` + ``or 0`` so legacy on-disk entries that predate this field don't break — they default to 0 via the dataclass and get promoted to 1 on the first new failure). * ``_exhausted_ttl(error_code, consecutive_failures=0)`` now applies ``base_ttl * min(2 ** (failures - 1), 8)`` — capped at 8× so a long-running outage doesn't push the cooldown into days. The default value preserves backward compatibility for any caller still using the old single-arg signature. * ``_exhausted_until`` passes ``entry.consecutive_failures`` through. * Resets to 0 happen on real success markers — refresh produces fresh tokens, anthropic claude_code sync from credentials file, nous adopts newer tokens from auth.json — and on operator-driven ``reset_statuses()``. * The cooldown auto-clear path (``_available_entries(clear_expired= True)`` flipping ``last_status`` back to ``STATUS_OK`` when the TTL elapses) deliberately does NOT reset the counter — that's the entire point. If the same credential exhausts again right after the cooldown expires, the upstream is still down and the next cooldown should be longer. ### Backoff progression | failures | multiplier | cooldown (default 1h base) | |---:|---:|---| | 0 (default) | 1× | 1h | | 1 (first) | 1× | 1h | | 2 | 2× | 2h | | 3 | 4× | 4h | | 4 | 8× | 8h | | 5+ | 8× | 8h (cap) | ### Tests (21 new in ``tests/agent/test_credential_pool_backoff.py``) * ``TestExhaustedTtl`` (8 cases): formula correctness — backward-compat default, first-failure 1× (not 2×), 8-row parametrised matrix proving the cap holds at 8×, zero/negative defensive clamps, plus a pin on the cap constant value. * ``TestExhaustedUntil`` (4 cases): integration — first failure ×1, fourth failure ×8, capped after 20 failures, and an explicit ``last_error_reset_at`` from the upstream wins over backoff. * ``TestMarkExhaustedIncrement`` (4 cases): increment from 0→1, from 2→3, persistence round-trip across pool reload, and a legacy on-disk entry missing the field still works. * ``TestCounterResetSemantics`` (3 cases): ``reset_statuses()`` zeroes the counter (operator path); ``clear_expired`` does NOT zero it (the bug-fix invariant); and the field round-trips through ``to_dict`` / ``from_dict``. **Verified regression guards**: temporarily reverted the ``_mark_exhausted`` increment (replaced ``+ 1`` with ``= 0``); 4 tests in ``TestMarkExhaustedIncrement`` correctly failed with clear assertion messages. Restored fix → 30/30 pass. All 44 pre-existing ``test_credential_pool.py`` tests stay green — no behavior change for callers that don't pass ``consecutive_failures`` (the default). Closes NousResearch#15296 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

NousResearch#15455) Copilot's review on NousResearch#15455 flagged four real issues: 1. ``2 ** exponent`` was computed BEFORE ``min(..., CAP)``, so a credential with ``consecutive_failures = 10000`` (sustained outage) would materialise a ~3000-digit integer just to throw it away. 2. ``int(prior_failures)`` in ``_mark_exhausted`` raises ``ValueError`` if ``auth.json`` carries a non-numeric value (string, list, ...) — breaks failover for users with hand-edited or migrated pool stores. 3. Same issue in ``_exhausted_ttl(consecutive_failures)`` and in the ``_exhausted_until`` call site. 4. Test file imported ``replace`` and ``patch`` but used neither. ### Fix * New ``_coerce_failure_count(value)`` helper that tolerates ``None``, strings, floats, lists, and falls back to 0 on any failure. Real numeric values (including ``"5"`` and ``3.7``) coerce as expected. * New ``_EXHAUSTED_TTL_BACKOFF_MAX_EXPONENT = 3`` constant (= log2 of the cap) used to clamp the exponent before the shift. Replaced ``2 ** exponent`` with ``1 << exponent`` since they're equivalent for non-negative ints and the bit-shift form makes the clamp intent unambiguous. * ``_mark_exhausted`` and ``_exhausted_until`` now route the prior- count read through ``_coerce_failure_count`` so a corrupted on-disk entry can never blow up failover. * Removed unused ``from dataclasses import replace`` and ``from unittest.mock import patch`` from the test module. ### New tests (7 cases) * ``test_corrupted_failure_value_falls_back_to_zero`` — parametrised over ``None``, ``"not-a-number"``, ``[]``, ``{}``, plus the two numeric-string / float cases that DO coerce cleanly (so the contract is uniform). * ``test_runaway_failure_count_does_not_compute_huge_int`` — passes ``consecutive_failures = 10_000`` and asserts both correctness AND ``elapsed < 50 ms``. If the exponent clamp regresses to ``2 ** 10000``, the wall-time bound flags it instantly. 72/72 credential-pool tests pass (44 pre-existing + 28 new). No behaviour change for valid inputs — the clamp is unreachable for ``failures <= 4``, the safe-cast is a no-op for proper integers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

briandevans · 2026-05-18T08:15:15Z

Closing to keep the queue clean — 17 days idle and now conflicting on agent/credential_pool.py. Happy to reopen if this is still useful.

Copilot AI review requested due to automatic review settings April 25, 2026 00:55

Copilot started reviewing on behalf of briandevans April 25, 2026 00:56 View session

Copilot AI reviewed Apr 25, 2026

View reviewed changes

alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder area/auth Authentication, OAuth, credential pools labels Apr 25, 2026

briandevans mentioned this pull request Apr 27, 2026

fix(gateway): propagate credential_pool through /model session overrides (#16678) #16701

Closed

4 tasks

briandevans and others added 2 commits April 30, 2026 09:14

briandevans force-pushed the fix/credential-pool-exponential-backoff branch from f421ad4 to df3e2ce Compare April 30, 2026 16:15

briandevans closed this May 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(credential-pool): exponential backoff on repeated exhaustion (#15296)#15455

fix(credential-pool): exponential backoff on repeated exhaustion (#15296)#15455
briandevans wants to merge 2 commits into
NousResearch:mainfrom
briandevans:fix/credential-pool-exponential-backoff

briandevans commented Apr 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

briandevans commented Apr 25, 2026

Uh oh!

briandevans commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		from dataclasses import replace
		from unittest.mock import patch

Conversation

briandevans commented Apr 25, 2026

What does this PR do?

Fix

Backoff progression

Related Issue

Type of Change

Test plan

Test coverage detail

Companion PRs

Not in scope

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

briandevans commented Apr 25, 2026

Uh oh!

briandevans commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants