fix: apply fallback cooldown for all failover reasons, make duration configurable #19839
Open
dragonforce2010 wants to merge 1 commit into
…limit/billing
Previously, the rate-limit cooldown (_rate_limited_until) was only set when _try_activate_fallback() received reason=rate_limit or reason=billing. However, the majority of callsites invoke the method without a reason argument, so the cooldown was never applied in practice. This caused _restore_primary_runtime() to attempt the primary provider on every new turn, even when it's known to be unavailable (e.g. ChatGPT subscription quota exhausted for hours).
Changes:
- Remove the reason gate so cooldown fires on any fallback activation
- Make cooldown duration configurable via HERMES_RATE_LIMIT_COOLDOWN env var (default: 3600s = 1 hour, up from hardcoded 60s)
- Preserve the guard that only starts cooldown when leaving the primary provider (chain-switching between fallbacks is unaffected)
This is particularly important for subscription-based providers like OpenAI Codex OAuth where quota resets can take hours.
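The behaviour described in this commit message can be illustrated with a minimal sketch. It is not the actual Hermes code: the class `FallbackState`, its `on_primary` flag, the injectable `clock`, and the un-prefixed method names are hypothetical stand-ins, and only the `_rate_limited_until` field name is taken from the PR. The sketch assumes the cooldown starts only when leaving the primary provider, regardless of the reason passed.

```python
import time
from typing import Callable, Optional


class FallbackState:
    """Hypothetical stand-in for the agent state touched by this PR."""

    def __init__(
        self,
        cooldown_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self.on_primary = True
        self._rate_limited_until = 0.0

    def try_activate_fallback(self, reason: Optional[str] = None) -> None:
        # No reason gate: any failover away from the primary starts the cooldown.
        if self.on_primary:
            self._rate_limited_until = self._clock() + self.cooldown_seconds
            self.on_primary = False
        # Switching between fallbacks leaves the existing cooldown untouched.

    def restore_primary_runtime(self) -> bool:
        """Return True if the primary provider is active after this call."""
        if self.on_primary:
            return True
        if self._clock() < self._rate_limited_until:
            return False  # still cooling down, so stay on the fallback
        self.on_primary = True
        return True
```

With a reason gate in place of the unconditional assignment, a call such as `try_activate_fallback()` without a reason would leave `_rate_limited_until` at `0.0`, so `restore_primary_runtime()` would switch back to the primary on the very next turn.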
Problem
The rate-limit cooldown (_rate_limited_until) in _try_activate_fallback() was gated behind reason in (FailoverReason.rate_limit, FailoverReason.billing). However, the majority of callsites (11 out of 12) invoke the method without a reason argument, so the cooldown was effectively never applied.
This causes _restore_primary_runtime() to attempt the primary provider on every new turn, even when it's known to be unavailable. For subscription-based providers like OpenAI Codex OAuth where quota resets can take hours, this means:
Fix
HERMES_RATE_LIMIT_COOLDOWN env var (default: 3600s = 1 hour, up from hardcoded 60s)
Rationale for default change (60s → 3600s)
The original 60-second cooldown was reasonable for transient API outages, but subscription-based providers (Codex OAuth, Anthropic Pro) have quota reset windows measured in hours, not seconds. A 60s cooldown means the agent retries the exhausted provider every minute — adding latency with zero chance of success.
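As a rough illustration of the configurable duration, the sketch below reads HERMES_RATE_LIMIT_COOLDOWN and falls back to the 3600-second default. The helper name `read_cooldown_seconds` and the handling of malformed, negative, or non-finite values are assumptions; the PR description does not say how invalid input is treated.

```python
import math
import os
from typing import Mapping, Optional

DEFAULT_COOLDOWN_SECONDS = 3600.0  # 1 hour, the new default named in this PR


def read_cooldown_seconds(env: Optional[Mapping[str, str]] = None) -> float:
    """Hypothetical helper: resolve the cooldown duration in seconds."""
    env = os.environ if env is None else env
    raw = env.get("HERMES_RATE_LIMIT_COOLDOWN")
    if raw is None:
        return DEFAULT_COOLDOWN_SECONDS
    try:
        value = float(raw)
    except ValueError:
        # Assumption: unparseable values fall back to the default.
        return DEFAULT_COOLDOWN_SECONDS
    # Assumption: negative or non-finite durations are also rejected.
    if not math.isfinite(value) or value < 0:
        return DEFAULT_COOLDOWN_SECONDS
    return value
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.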
1 hour balances between:
Users who prefer the old behavior can set HERMES_RATE_LIMIT_COOLDOWN=60.
Testing
Verified on a live Hermes Gateway with OpenAI Codex as primary and MiniMax as fallback: