Skip to content

fallback chain not activated on stream-stall timeouts (15+ min silent hang on degraded primary) #22277

@yonefive71

Description

@yonefive71

Summary

fallback_providers does not activate when the primary provider is hung on a stream that the stale-stream/stale-call detector eventually kills. The kill produces a classified FailoverReason.timeout error with retryable=True, and the retry loop simply re-hits the same broken primary — burning the full retry budget (3–5 minutes per stale kill × N retries = 15+ minute total hang) before bailing. The configured fallback chain is never tried.

Reproduction

  1. Configure primary as Anthropic (or any cloud provider) and a fallback chain pointing at a different provider:
    model:
      default: claude-opus-4-7
      provider: anthropic
    fallback_providers:
      - provider: openai-codex
        model: gpt-5.5
  2. Send a turn while the primary provider is degraded (slow stream with heartbeats, gateway issues, etc.).
  3. Observe: the stream stalls, HERMES_STREAM_STALE_TIMEOUT fires (default 180s, scaled up to 240–300s for large context), connection is killed, error classifies as FailoverReason.timeout. The retry loop attempts the same provider again, hangs again, kills again — repeated until retries exhausted. Fallback is never activated. Total hang: 15+ minutes.

Root cause

In run_agent.py:13035-13059, eager fallback is gated on rate-limit/billing only:

is_rate_limited = classified.reason in (
    FailoverReason.rate_limit,
    FailoverReason.billing,
)
if is_rate_limited and self._fallback_index < len(self._fallback_chain):
    ...
    if self._try_activate_fallback(reason=classified.reason):

Stale-detected timeouts (FailoverReason.timeout) and connection drops never activate fallback eagerly — they retry-with-backoff against the same broken provider. The stale detector itself works correctly; it's the retry loop that ignores the configured fallback chain in this failure mode.

Issue #21444 documents the inverse direction (Codex/gpt-5.5 primary stalling → fallback activates after one stale kill on the non-streaming path). That works because non-streaming fully exhausts retries faster, and because that issue's reporter saw fallback fire after one ~300s wait — but for streaming primaries with longer per-attempt budget and exponential-backoff between attempts, multiple full stale-kill cycles compound into 15+ min observed hangs.

Proposed fix

Extend the eager-fallback trigger to include FailoverReason.timeout (and arguably FailoverReason.connection) after the first stale-detected timeout. Gate behind a new opt-out config key for back-compat:

fallback:
  eager_on_timeout: true   # new, default true

When true, the same eager-fallback block activates on the first classified timeout if a fallback chain is configured AND the credential pool can't recover. This converts a 15+ min silent hang into a ~5 min single-stale-kill + immediate fallback.

Use case

  • Paid Anthropic primary + OAuth-backed Codex/GPT-5.5 fallback. When Anthropic stream is degraded but not erroring cleanly (heartbeats keep socket alive), the user has no automatic recovery today.
  • Symmetric to [Bug]: All openai-codex / gpt-5.5 primary calls hang silently for full stale timeout #21444 — that fixes silent-hang on Codex primary; this fixes silent-hang on Anthropic primary (or any provider whose stream-stalls are eventually caught by the stale detector).

Are you willing to submit a PR for this?

Yes — drafting now.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1High — major feature broken, no workaroundcomp/agentCore agent loop, run_agent.py, prompt buildertype/bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions