You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
fallback_providers does not activate when the primary provider is hung on a stream that the stale-stream/stale-call detector eventually kills. The kill produces a classified FailoverReason.timeout error with retryable=True, and the retry loop simply re-hits the same broken primary — burning the full retry budget (3–5 minutes per stale kill × N retries = 15+ minute total hang) before bailing. The configured fallback chain is never tried.
Reproduction
Configure primary as Anthropic (or any cloud provider) and a fallback chain pointing at a different provider:
Send a turn while the primary provider is degraded (slow stream with heartbeats, gateway issues, etc.).
Observe: the stream stalls, HERMES_STREAM_STALE_TIMEOUT fires (default 180s, scaled up to 240–300s for large context), connection is killed, error classifies as FailoverReason.timeout. The retry loop attempts the same provider again, hangs again, kills again — repeated until retries exhausted. Fallback is never activated. Total hang: 15+ minutes.
Root cause
In run_agent.py:13035-13059, eager fallback is gated on rate-limit/billing only:
Stale-detected timeouts (FailoverReason.timeout) and connection drops never activate fallback eagerly — they retry-with-backoff against the same broken provider. The stale detector itself works correctly; it's the retry loop that ignores the configured fallback chain in this failure mode.
Issue #21444 documents the inverse direction (Codex/gpt-5.5 primary stalling → fallback activates after one stale kill on the non-streaming path). That works because non-streaming fully exhausts retries faster, and because that issue's reporter saw fallback fire after one ~300s wait — but for streaming primaries with longer per-attempt budget and exponential-backoff between attempts, multiple full stale-kill cycles compound into 15+ min observed hangs.
Proposed fix
Extend the eager-fallback trigger to include FailoverReason.timeout (and arguably FailoverReason.connection) after the first stale-detected timeout. Gate behind a new opt-out config key for back-compat:
When true, the same eager-fallback block activates on the first classified timeout if a fallback chain is configured AND the credential pool can't recover. This converts a 15+ min silent hang into a ~5 min single-stale-kill + immediate fallback.
Use case
Paid Anthropic primary + OAuth-backed Codex/GPT-5.5 fallback. When Anthropic stream is degraded but not erroring cleanly (heartbeats keep socket alive), the user has no automatic recovery today.
Summary
fallback_providersdoes not activate when the primary provider is hung on a stream that the stale-stream/stale-call detector eventually kills. The kill produces a classifiedFailoverReason.timeouterror withretryable=True, and the retry loop simply re-hits the same broken primary — burning the full retry budget (3–5 minutes per stale kill × N retries = 15+ minute total hang) before bailing. The configured fallback chain is never tried.Reproduction
HERMES_STREAM_STALE_TIMEOUTfires (default 180s, scaled up to 240–300s for large context), connection is killed, error classifies asFailoverReason.timeout. The retry loop attempts the same provider again, hangs again, kills again — repeated until retries exhausted. Fallback is never activated. Total hang: 15+ minutes.Root cause
In
run_agent.py:13035-13059, eager fallback is gated on rate-limit/billing only:Stale-detected timeouts (
FailoverReason.timeout) and connection drops never activate fallback eagerly — they retry-with-backoff against the same broken provider. The stale detector itself works correctly; it's the retry loop that ignores the configured fallback chain in this failure mode.Issue #21444 documents the inverse direction (Codex/gpt-5.5 primary stalling → fallback activates after one stale kill on the non-streaming path). That works because non-streaming fully exhausts retries faster, and because that issue's reporter saw fallback fire after one ~300s wait — but for streaming primaries with longer per-attempt budget and exponential-backoff between attempts, multiple full stale-kill cycles compound into 15+ min observed hangs.
Proposed fix
Extend the eager-fallback trigger to include
FailoverReason.timeout(and arguablyFailoverReason.connection) after the first stale-detected timeout. Gate behind a new opt-out config key for back-compat:When true, the same eager-fallback block activates on the first classified timeout if a fallback chain is configured AND the credential pool can't recover. This converts a 15+ min silent hang into a ~5 min single-stale-kill + immediate fallback.
Use case
Are you willing to submit a PR for this?
Yes — drafting now.