fix: apply fallback cooldown for all failover reasons, make duration configurable #19839

Open
dragonforce2010 wants to merge 1 commit into NousResearch:main from dragonforce2010:fix/fallback-cooldown-configurable

@dragonforce2010

Problem

The rate-limit cooldown (_rate_limited_until) in _try_activate_fallback() was gated behind reason in (FailoverReason.rate_limit, FailoverReason.billing). However, the majority of callsites (11 out of 12) invoke the method without a reason argument, so the cooldown was effectively never applied.
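The gating described above can be reproduced with a toy model. This is a minimal sketch, not the actual Hermes code: the `GatedAgent` class, the `timeout` enum member, and the method body are assumptions made for illustration. Only the names `FailoverReason`, `_try_activate_fallback`, `_rate_limited_until`, and the 60-second value come from this PR.

```python
import enum
import time
from typing import Optional


class FailoverReason(enum.Enum):
    rate_limit = "rate_limit"
    billing = "billing"
    timeout = "timeout"  # hypothetical extra member, for illustration only


class GatedAgent:
    """Toy stand-in for the agent; not the real Hermes class."""

    def __init__(self) -> None:
        self._rate_limited_until = 0.0

    def _try_activate_fallback(self, reason: Optional[FailoverReason] = None) -> None:
        # Pre-fix behavior as described: the cooldown is set only when the
        # caller passes one of two explicit reasons.
        if reason in (FailoverReason.rate_limit, FailoverReason.billing):
            self._rate_limited_until = time.monotonic() + 60


agent = GatedAgent()

# Typical callsite: no reason passed, so the gate is never entered.
agent._try_activate_fallback()
cooldown_set_without_reason = agent._rate_limited_until > 0.0

# Only a callsite that passes an explicit reason starts the cooldown.
agent._try_activate_fallback(FailoverReason.rate_limit)
cooldown_set_with_reason = agent._rate_limited_until > 0.0
```

Because `reason` defaults to `None`, every callsite that omits it skips the cooldown silently, which matches the behavior reported here.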

This causes _restore_primary_runtime() to attempt the primary provider on every new turn — even when it's known to be unavailable. For subscription-based providers like OpenAI Codex OAuth where quota resets can take hours, this means:

  1. Every message hits the primary → gets 429 → waits for retry timeout → falls back
  2. User sees "Rate limited — switching to fallback provider..." on every single message
  3. Unnecessary latency on every turn (retry delays before fallback activates)

Fix

  • Remove the reason gate so cooldown fires on any fallback activation
  • Make cooldown duration configurable via HERMES_RATE_LIMIT_COOLDOWN env var (default: 3600s = 1 hour, up from hardcoded 60s)
  • Preserve the existing guard that only starts cooldown when leaving the primary provider (chain-switching between fallbacks is unaffected)
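The three changes above might look roughly like the following. This is a sketch under stated assumptions rather than the patch itself: `FallbackState`, `read_cooldown_seconds`, `on_primary`, and the explicit `now` parameter are hypothetical names chosen to keep the example testable, and falling back to the default on an unparsable or negative value is an assumption, not something this PR specifies.

```python
import os
from typing import Mapping, Optional

DEFAULT_COOLDOWN_SECONDS = 3600.0  # new default named in this PR (1 hour)


def read_cooldown_seconds(env: Optional[Mapping[str, str]] = None) -> float:
    """Read HERMES_RATE_LIMIT_COOLDOWN; use the default if unset or invalid."""
    source = os.environ if env is None else env
    raw = source.get("HERMES_RATE_LIMIT_COOLDOWN")
    if raw is None:
        return DEFAULT_COOLDOWN_SECONDS
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_COOLDOWN_SECONDS  # assumed handling of bad input
    return value if value >= 0 else DEFAULT_COOLDOWN_SECONDS


class FallbackState:
    """Hypothetical container for the failover bookkeeping."""

    def __init__(self, cooldown_seconds: float) -> None:
        self.cooldown_seconds = cooldown_seconds
        self.on_primary = True
        self.rate_limited_until = 0.0

    def activate_fallback(self, now: float, reason: Optional[str] = None) -> None:
        # The reason no longer gates the cooldown. The cooldown starts only
        # when leaving the primary, so switching between fallbacks does not
        # extend it.
        if self.on_primary:
            self.rate_limited_until = now + self.cooldown_seconds
            self.on_primary = False

    def may_restore_primary(self, now: float) -> bool:
        return now >= self.rate_limited_until


state = FallbackState(read_cooldown_seconds({"HERMES_RATE_LIMIT_COOLDOWN": "60"}))
state.activate_fallback(now=1000.0)   # leaving primary: cooldown starts
deadline_after_first = state.rate_limited_until
state.activate_fallback(now=1030.0)   # fallback-to-fallback: deadline unchanged
deadline_after_second = state.rate_limited_until
```

Passing `now` explicitly instead of reading a clock inside the method is only a convenience for the example; it makes the cooldown deadline deterministic.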

Rationale for default change (60s → 3600s)

The original 60-second cooldown was reasonable for transient API outages, but subscription-based providers (Codex OAuth, Anthropic Pro) have quota reset windows measured in hours, not seconds. A 60s cooldown means the agent retries the exhausted provider every minute — adding latency with zero chance of success.
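A back-of-the-envelope calculation illustrates the difference. The 5-hour reset window is an assumed figure for illustration, not a documented quota for any provider, and the helper `max_primary_attempts` is hypothetical. The bound assumes the user keeps sending messages throughout the window and ignores the time each failed attempt takes.

```python
import math


def max_primary_attempts(reset_window_s: float, cooldown_s: float) -> int:
    """Upper bound on doomed primary retries while the quota is exhausted.

    At most one retry can happen each time the cooldown expires.
    """
    return math.floor(reset_window_s / cooldown_s)


RESET_WINDOW_S = 5 * 3600  # assumed 5-hour quota reset window

attempts_with_60s = max_primary_attempts(RESET_WINDOW_S, 60)
attempts_with_3600s = max_primary_attempts(RESET_WINDOW_S, 3600)
```

Under these assumptions the 60-second cooldown allows up to 300 failed primary attempts before the quota resets, while the 3600-second cooldown allows at most 5.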

A one-hour default balances two goals:

  • Not wasting time on known-exhausted providers
  • Recovering reasonably quickly when quota does reset

Users who prefer the old behavior can set HERMES_RATE_LIMIT_COOLDOWN=60.

Testing

Verified on a live Hermes Gateway with OpenAI Codex as primary and MiniMax as fallback:

  • Before fix: every message showed "Rate limited — switching to fallback provider..."
  • After fix: first message triggers fallback, subsequent messages go directly to MiniMax for 1 hour

…limit/billing

Previously, the rate-limit cooldown (_rate_limited_until) was only set
when _try_activate_fallback() received reason=rate_limit or
reason=billing.  However, the majority of callsites invoke the method
without a reason argument, so the cooldown was never applied in
practice.  This caused _restore_primary_runtime() to attempt the
primary provider on every new turn — even when it's known to be
unavailable (e.g. ChatGPT subscription quota exhausted for hours).

Changes:
- Remove the reason gate so cooldown fires on any fallback activation
- Make cooldown duration configurable via HERMES_RATE_LIMIT_COOLDOWN
  env var (default: 3600s = 1 hour, up from hardcoded 60s)
- Preserve the guard that only starts cooldown when leaving the
  primary provider (chain-switching between fallbacks is unaffected)

This is particularly important for subscription-based providers like
OpenAI Codex OAuth where quota resets can take hours.
@alt-glitch added the labels type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/agent (Core agent loop, run_agent.py, prompt builder) on May 4, 2026
