Summary
_restore_primary_runtime() unconditionally restores the primary provider at the start of every new turn, even when the primary provider's credential is still in exhaustion cooldown. This causes wasted retries on every turn until the cooldown expires.
How It Happens
- Turn N: Primary provider returns 429 → credential marked exhausted → fallback activated
- Turn N+1: _restore_primary_runtime() runs at the top of run_conversation() → restores primary provider
- Turn N+1: Primary tried again → 429 (3 retries burned) → fallback activated again
- Repeat every turn until cooldown expires
In long-lived gateway sessions (agent caching), this means every single message burns 3 retries on the exhausted provider before falling back. For cron jobs it's worse — each run only gets one turn, so the retry burns the entire execution.
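The cost of the loop above can be estimated with a small standalone simulation. This is only a sketch and not code from run_agent.py: the function name, the `cooldown_turns` parameter, and the turn-based model of the cooldown are assumptions made for illustration. The figure of 3 retries per turn comes from the report above.

```python
RETRIES_PER_TURN = 3  # from the report: 3 retries burned per turn


def wasted_primary_calls(turns: int, cooldown_turns: int, gate_on_cooldown: bool) -> int:
    """Count calls sent to the exhausted primary over `turns` turns.

    cooldown_turns is a hypothetical stand-in for the cooldown window:
    the primary returns 429 for every turn index below it.
    """
    wasted = 0
    fallback_activated = False
    for turn in range(turns):
        primary_down = turn < cooldown_turns

        # Start of turn: the restore step. Without the gate it always
        # switches back to the primary; with the gate it stays on the
        # fallback while the primary is still cooling down.
        if fallback_activated and not (gate_on_cooldown and primary_down):
            fallback_activated = False

        # Body of the turn: if we are on the primary and it is down,
        # the retries are burned and the fallback is activated again.
        if not fallback_activated and primary_down:
            wasted += RETRIES_PER_TURN
            fallback_activated = True
    return wasted


if __name__ == "__main__":
    print("ungated:", wasted_primary_calls(10, 10, gate_on_cooldown=False))
    print("gated:  ", wasted_primary_calls(10, 10, gate_on_cooldown=True))
```

Under these assumptions, a 10-turn session inside the cooldown window wastes 30 calls without the gate and 3 with it, since only the first failure is paid for.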
Affected Code
run_agent.py — _restore_primary_runtime() (line ~6490):
def _restore_primary_runtime(self) -> bool:
    if not self._fallback_activated:
        return False
    # ← No cooldown check here: always restores
    rt = self._primary_runtime
    # ... restores model, provider, client, etc.
The docstring correctly explains why restoration exists (transient failures shouldn't permanently pin to fallback), but it doesn't account for sustained cooldown periods.
Proposed Fix
Before restoring, check if the primary provider's credential pool entry is still in cooldown:
def _restore_primary_runtime(self) -> bool:
    if not self._fallback_activated:
        return False
    # Check if primary credential is still in cooldown
    pool = self._credential_pool
    if pool is not None:
        rt_provider = self._primary_runtime.get("provider", "")
        if rt_provider and pool.provider == rt_provider:
            current_entry = pool.peek()
            if current_entry and current_entry.last_status == "exhausted":
                cooldown_until = _exhausted_until(current_entry)
                if cooldown_until is not None and time.time() < cooldown_until:
                    # Stay on fallback: primary still cooling down
                    return False
    rt = self._primary_runtime
    # ... existing restore logic
This preserves the existing behavior for transient failures (cooldown expired → restore works as before) while avoiding the retry burn during sustained outages.
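The gate can also be exercised in isolation. The sketch below is a standalone approximation and not the project's code: `FakeEntry`, `FakePool`, the `exhausted_until` field, and `primary_in_cooldown()` are hypothetical stand-ins for the real credential pool, its entries, and `_exhausted_until()`. Only `provider`, `peek()`, and `last_status` are taken from the snippet above. Passing `now` explicitly is a choice made here to keep the check deterministic in tests.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeEntry:
    """Hypothetical stand-in for a credential pool entry."""
    last_status: str
    exhausted_until: Optional[float] = None  # assumed field, epoch seconds


@dataclass
class FakePool:
    """Hypothetical stand-in for the credential pool."""
    provider: str
    entry: Optional[FakeEntry] = None

    def peek(self) -> Optional[FakeEntry]:
        return self.entry


def primary_in_cooldown(pool: Optional[FakePool], primary_provider: str,
                        now: Optional[float] = None) -> bool:
    """Return True when restoring the primary should be skipped."""
    if pool is None or not primary_provider or pool.provider != primary_provider:
        return False
    entry = pool.peek()
    if entry is None or entry.last_status != "exhausted":
        return False
    if entry.exhausted_until is None:
        return False
    current = time.time() if now is None else now
    return current < entry.exhausted_until


if __name__ == "__main__":
    pool = FakePool("primary", FakeEntry("exhausted", exhausted_until=100.0))
    print(primary_in_cooldown(pool, "primary", now=50.0))   # inside the window
    print(primary_in_cooldown(pool, "primary", now=150.0))  # window expired
```

With a helper shaped like this, `_restore_primary_runtime()` would return False early whenever the helper returns True, and fall through to the existing restore logic otherwise.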
Environment
- Hermes-agent latest, gateway mode
- Credential pool enabled with fill_first strategy
- Observed during extended provider overload (429 errors lasting 1+ hours)