Stale stream timeout does not trigger fallback_providers chain #25689

@vibegin

Problem

When the primary provider becomes unresponsive during streaming (no chunks delivered within the stale timeout), Hermes kills the connection and retries with the same provider. It does not activate the fallback_providers chain.

This means that if a provider is alive but unresponsive (not returning 429/500, just holding the connection open without delivering tokens), the agent will retry the same provider repeatedly until max_retries is exhausted — even though working fallback providers are configured and available.

Reproduction

  1. Configure a primary provider and a fallback_providers chain in config.yaml
  2. Send a request with a large context (~80K tokens) that causes the primary to stall (no chunks, but TCP connection stays alive via SSE keepalive)
  3. Observe: stale detector fires at 240s, kills connection, retries same provider
  4. After max_retries, the turn fails — fallback chain is never activated
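
To illustrate why the stall in step 2 is invisible at the TCP level, here is a minimal sketch of a stale detector that counts only content-bearing SSE lines as progress. It is not Hermes code: the `StaleDetector` class, its method names, and the injectable `clock` parameter are hypothetical. It relies only on the fact that SSE lines beginning with `:` are comments, which servers commonly send as keepalives.

```python
import time


class StaleDetector:
    """Hypothetical helper: tracks time since the last content-bearing SSE line."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_chunk = clock()

    def on_line(self, line: str) -> None:
        # SSE comment lines start with ":" and carry no data, so a keepalive
        # (or a blank separator line) must not reset the stale timer.
        if line.startswith(":") or not line.strip():
            return
        self._last_chunk = self._clock()

    def is_stale(self) -> bool:
        # True once more than timeout_s seconds have passed without content.
        return self._clock() - self._last_chunk > self.timeout_s
```

With a 240s timeout, a provider that keeps sending `: keepalive` lines but no `data:` lines is reported stale even though the socket never goes idle.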

Expected behavior

After the first stale-stream kill (or after N stale kills), _try_activate_fallback() should be called to switch to the next provider in the chain, similar to how empty/malformed responses trigger eager fallback at lines 11513-11519 of run_agent.py.
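
The "after N stale kills" policy can be sketched in isolation as follows. This is an illustrative model under stated assumptions, not the Hermes implementation: `FallbackChain`, `max_stale_kills`, and `on_stale_kill` are hypothetical names, and the real agent tracks its position in the chain through `_fallback_index` and `_try_activate_fallback()`.

```python
from dataclasses import dataclass


@dataclass
class FallbackChain:
    """Hypothetical stand-in for the agent's provider state (not Hermes code)."""

    providers: list[str]
    max_stale_kills: int = 1  # N: stale kills tolerated before switching
    index: int = 0            # position of the active provider in the chain
    stale_kills: int = 0      # stale kills seen on the active provider

    @property
    def active(self) -> str:
        return self.providers[self.index]

    def on_stale_kill(self) -> str:
        """Record one stale-stream kill and return the provider to retry with."""
        self.stale_kills += 1
        has_fallback = self.index + 1 < len(self.providers)
        if self.stale_kills >= self.max_stale_kills and has_fallback:
            self.index += 1       # advance to the next provider in the chain
            self.stale_kills = 0  # the new provider starts with a clean count
        return self.active
```

For example, with `FallbackChain(["zai", "ollama-cloud"], max_stale_kills=2)`, the first call to `on_stale_kill()` stays on `zai` and the second switches to `ollama-cloud`. Setting `max_stale_kills=1` gives the eager variant that switches on the first stale kill.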

Relevant code

Stale stream detection (run_agent.py:7514-7546):

_stale_elapsed = time.time() - last_chunk_time["t"]
if _stale_elapsed > _stream_stale_timeout:
    # ... logs warning, emits status, kills client
    self._emit_status(
        f"⚠️ No response from provider for {int(_stale_elapsed)}s "
        f"(model: {api_kwargs.get('model', 'unknown')}, "
        f"context: ~{_est_ctx:,} tokens). "
        f"Reconnecting..."
    )
    # Closes client, resets timer, continues while-loop — does NOT call _try_activate_fallback()

Non-streaming stale detection (run_agent.py:6550-6587) — same gap: kills connection, sets TimeoutError, but no fallback activation.
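
For the non-streaming case, the desired behavior amounts to treating a timeout as a reason to move down the chain rather than to retry in place. The sketch below shows that idea with the standard library only. It is an assumption-laden illustration, not the code in run_agent.py: `call_with_fallback` and the provider callables are hypothetical, and because a worker thread cannot be forcibly stopped, the sketch abandons the stalled call instead of killing a connection.

```python
import concurrent.futures as cf
from typing import Callable, Sequence


def call_with_fallback(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    timeout_s: float,
) -> str:
    """Try each provider in order, moving on when one exceeds timeout_s."""
    for call in providers:
        pool = cf.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(call, prompt)
        try:
            return future.result(timeout=timeout_s)
        except cf.TimeoutError:
            # Stalled provider: give up on it and try the next one in the chain.
            continue
        finally:
            # Do not block on the stalled worker; it finishes on its own.
            pool.shutdown(wait=False)
    raise TimeoutError("all providers timed out")
```

Errors other than a timeout propagate unchanged, so this sketch only covers the stall case described in this issue.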

Where fallback IS activated (run_agent.py:11510-11519):

# Empty/malformed responses — correctly triggers fallback
if self._fallback_index < len(self._fallback_chain):
    self._emit_status("⚠️ Empty/malformed response — switching to fallback...")
if self._try_activate_fallback():
    retry_count = 0
    continue

_try_activate_fallback signature (run_agent.py:7629):

def _try_activate_fallback(self, reason: "FailoverReason | None" = None) -> bool:

FailoverReason.timeout exists in agent/error_classifier.py:40 — can be used as the reason parameter.

Environment

  • Hermes Agent v0.12.0
  • Provider: zai (glm-5-turbo) as primary, ollama-cloud (minimax-m2.7) as first fallback
  • Context: ~81K tokens, stale timeout hit 240s (correct per scaling logic at line 7484)
  • Provider was alive (quota not exhausted, non-streaming requests worked) — just not delivering stream chunks

Suggested fix

In _make_streaming_api_call() around line 7546, after the stale-stream kill and timer reset, add:

if self._fallback_index < len(self._fallback_chain):
    self._emit_status("⚠️ Stale stream — switching to fallback...")
if self._try_activate_fallback(reason=FailoverReason.timeout):
    retry_count = 0
    compression_attempts = 0
    primary_recovery_attempted = False
    break  # exit stale-detection while-loop, retry with fallback

The same logic should be added for the non-streaming stale timeout at lines 6582-6586.

Note: FailoverReason.timeout is already defined in agent/error_classifier.py:40 — no new enum value needed.

Labels

  • P1: High (major feature broken, no workaround)
  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • type/bug: Something isn't working
