fallback chain not activated on stream-stall timeouts (15+ min silent hang on degraded primary) #22277

## Summary

`fallback_providers` does not activate when the primary provider hangs on a stream that the stale-stream/stale-call detector eventually kills. The kill produces a classified `FailoverReason.timeout` error with `retryable=True`, and the retry loop simply re-hits the same broken primary, burning the full retry budget (3–5 minutes per stale kill × N retries = 15+ minutes of total hang) before bailing. The configured fallback chain is never tried.

## Reproduction

1. Configure primary as Anthropic (or any cloud provider) and a fallback chain pointing at a different provider:
   ```yaml
   model:
     default: claude-opus-4-7
     provider: anthropic
   fallback_providers:
     - provider: openai-codex
       model: gpt-5.5
   ```
2. Send a turn while the primary provider is degraded (slow stream with heartbeats, gateway issues, etc.).
3. Observe: the stream stalls, `HERMES_STREAM_STALE_TIMEOUT` fires (default 180s, scaled up to 240–300s for large context), the connection is killed, and the error classifies as `FailoverReason.timeout`. The retry loop then attempts the same provider again, hangs again, and is killed again, repeating until retries are exhausted. Fallback is never activated. Total hang: 15+ minutes.
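
To make the 15+ minute figure concrete, here is a rough back-of-the-envelope sketch, not code from the project. The function name `estimate_hang_seconds`, the attempt counts, and the backoff parameters are assumptions for illustration; only the 180s and 300s stale timeouts come from the description above.

```python
def estimate_hang_seconds(
    stale_timeout_s: float,
    attempts: int,
    backoff_base_s: float = 0.0,
    backoff_factor: float = 2.0,
) -> float:
    """Wall-clock time if every attempt stalls until the stale detector kills it.

    Hypothetical model: each attempt costs one full stale timeout, and an
    exponentially growing backoff wait sits between consecutive attempts.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    stalled = stale_timeout_s * attempts
    # Backoff waits happen between attempts, so there are attempts - 1 of them.
    backoff = sum(backoff_base_s * backoff_factor**i for i in range(attempts - 1))
    return stalled + backoff


if __name__ == "__main__":
    # Assumed scenarios: five attempts at the default timeout, three at the scaled one.
    print(estimate_hang_seconds(180, 5) / 60)
    print(estimate_hang_seconds(300, 3) / 60)
```

Under these assumptions, five attempts at the 180s default or three attempts at the scaled 300s timeout both come to 900 seconds (15 minutes) before any backoff is added, which lines up with the observed hang.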

## Root cause

In `run_agent.py:13035-13059`, eager fallback is gated on rate-limit/billing only:

```python
is_rate_limited = classified.reason in (
    FailoverReason.rate_limit,
    FailoverReason.billing,
)
if is_rate_limited and self._fallback_index < len(self._fallback_chain):
    ...
    if self._try_activate_fallback(reason=classified.reason):
```

Stale-detected timeouts (`FailoverReason.timeout`) and connection drops never activate fallback eagerly; instead they retry with backoff against the same broken provider. The stale detector itself works correctly. It is the retry loop that ignores the configured fallback chain in this failure mode.

Issue #21444 documents the inverse direction (Codex/gpt-5.5 primary stalling, with fallback activating after one stale kill on the non-streaming path). Fallback fires there because the non-streaming path exhausts its retries faster: that issue's reporter saw fallback fire after a single ~300s wait. For streaming primaries, which have a longer per-attempt budget and exponential backoff between attempts, multiple full stale-kill cycles compound into the observed 15+ min hangs.

## Proposed fix

Extend the eager-fallback trigger to include `FailoverReason.timeout` (and arguably `FailoverReason.connection`) after the first stale-detected timeout. Gate behind a new opt-out config key for back-compat:

```yaml
fallback:
  eager_on_timeout: true   # new, default true
```

When true, the same eager-fallback block activates on the first classified timeout if a fallback chain is configured and the credential pool can't recover. This converts a 15+ min silent hang into a single stale kill of ~5 min followed by immediate fallback.
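
A minimal sketch of how the extended gate could look, assuming the conditions described above. This is not the code in `run_agent.py`: the `FailoverReason` enum below is a stand-in containing only the members named in this issue, and `should_fallback_eagerly`, `include_connection`, and `pool_can_recover` are hypothetical names introduced for illustration.

```python
from enum import Enum, auto


class FailoverReason(Enum):
    # Stand-in for the project's enum; only members named in this issue.
    rate_limit = auto()
    billing = auto()
    timeout = auto()
    connection = auto()


def should_fallback_eagerly(
    reason: FailoverReason,
    fallback_index: int,
    chain_len: int,
    *,
    eager_on_timeout: bool = True,     # mirrors the proposed fallback.eager_on_timeout key
    include_connection: bool = False,  # hypothetical switch for the "arguably" case
    pool_can_recover: bool = False,    # hypothetical: credential pool has another option
) -> bool:
    """Decide whether to skip further retries and activate the next fallback."""
    # No fallback left in the chain: nothing to activate.
    if fallback_index >= chain_len:
        return False
    # Existing behaviour: rate-limit and billing errors fall back eagerly.
    if reason in (FailoverReason.rate_limit, FailoverReason.billing):
        return True
    # Proposed behaviour: stale-detected timeouts also fall back eagerly.
    stall_reasons = {FailoverReason.timeout}
    if include_connection:
        stall_reasons.add(FailoverReason.connection)
    return eager_on_timeout and reason in stall_reasons and not pool_can_recover
```

With `eager_on_timeout` set to false the function reduces to the current rate-limit/billing gate, which is the back-compat path the opt-out key is meant to preserve.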

## Use case

- Paid Anthropic primary + OAuth-backed Codex/GPT-5.5 fallback. When the Anthropic stream is degraded but not erroring cleanly (heartbeats keep the socket alive), the user has no automatic recovery today.
- Symmetric to #21444: that issue fixes the silent hang on a Codex primary; this one fixes the silent hang on an Anthropic primary (or any provider whose stream stalls are eventually caught by the stale detector).

## Are you willing to submit a PR for this?

Yes — drafting now.

