Skip to content

feat(agent): eager fallback on stream-stall timeouts (#22277)#22278

Open
yonefive71 wants to merge 1 commit into
NousResearch:mainfrom
yonefive71:fix/eager-fallback-on-timeout
Open

feat(agent): eager fallback on stream-stall timeouts (#22277)#22278
yonefive71 wants to merge 1 commit into
NousResearch:mainfrom
yonefive71:fix/eager-fallback-on-timeout

Conversation

@yonefive71

Copy link
Copy Markdown

Summary

Fixes #22277. Adds agent.eager_fallback_on_timeout config flag (default false). When true AND fallback_providers is configured, a classified FailoverReason.timeout immediately activates the next fallback provider instead of retrying the same broken primary.

The bug

run_agent.py:13035-13059 already implements eager fallback for rate-limit and billing errors. But timeouts (typically a stale-detector-killed hung stream) classify as FailoverReason.timeout with retryable=True, and the retry loop just re-hits the same broken provider. With ~5 min stale-kill threshold × N retries, this compounds into the 15+ min silent hang documented in the issue.

The configured fallback chain sits idle the entire time.

Fix

Mirror the existing rate-limit eager-fallback block, gated behind a new opt-in flag:

is_timeout_eager = (
    classified.reason == FailoverReason.timeout
    and self._eager_fallback_on_timeout
    and self._fallback_index < len(self._fallback_chain)
)
if is_timeout_eager:
    self._emit_status("⚠️ Provider stalled (stream/call timeout) — switching to fallback provider...")
    if self._try_activate_fallback(reason=classified.reason):
        retry_count = 0
        compression_attempts = 0
        primary_recovery_attempted = False
        continue

Default false to preserve historical behavior; users with paid primaries + OAuth-backed fallbacks (the motivating case: Anthropic + openai-codex) opt in.

Config loaded via hermes_cli.config.load_config() rather than the cached cli.CLI_CONFIG to avoid the staleness issue addressed by #18947.

Files

  • run_agent.py_eager_fallback_on_timeout instance flag + new eager-fallback block
  • hermes_cli/config.pyagent.eager_fallback_on_timeout: False schema entry with doc comment
  • tests/run_agent/test_eager_fallback_on_timeout.py — 9 new tests

Tests

pytest tests/run_agent/test_eager_fallback_on_timeout.py — 9 passed:

  • TestEagerFallbackConfigDefaults — default false, explicit false/true, truthy-int coercion
  • TestEagerFallbackOnTimeoutGate — gate predicate fires iff reason==timeout AND flag AND chain has room; does not fire for unknown/server_error/format_error/thinking_signature

Existing fallback suite still green:

pytest tests/run_agent/test_provider_fallback.py tests/run_agent/test_fallback_model.py — 47 passed.

Notes

  • The full retry-loop path is a 700-line inline block that's hard to drive end-to-end in unit tests; the gate predicate is asserted directly. Happy to add an integration test if reviewers want one.
  • I did NOT change behavior on FailoverReason.connection (clean socket drops). Those still go through the normal retry-with-backoff path. Could be a follow-up.
  • Symmetric to [Bug]: All openai-codex / gpt-5.5 primary calls hang silently for full stale timeout #21444's territory but the inverse direction: that issue is about Codex-as-primary stalling; this is about any-provider-as-primary stalling while a fallback exists.

Fixes #22277

)

Add agent.eager_fallback_on_timeout config flag (default false). When
true AND fallback_providers is configured, a classified
FailoverReason.timeout immediately activates the next fallback provider
instead of retrying the same broken primary.

Without this, a hung stream that the stale-detector kills produces a
retryable timeout error, and the retry loop hammers the same broken
primary repeatedly — burning the full retry budget (3+ stale kills × 5
min each = 15+ min observed silent hangs) before bailing.

The eager-fallback block at run_agent.py:13039 already exists for
rate_limit/billing; this extends it (under the same _try_activate_fallback
+ pool-recovery semantics) to timeout, gated behind the new opt-in flag
so historical behavior is preserved.

Tests: tests/run_agent/test_eager_fallback_on_timeout.py
- Default is false when unset
- Explicit true/false/truthy-int wiring
- Gate predicate fires only when reason==timeout AND flag AND chain has room
- Does not fire for non-timeout reasons (unknown, server_error, etc.)

Existing 47 fallback tests still pass.

Fixes NousResearch#22277
@alt-glitch alt-glitch added type/bug Something isn't working comp/agent Core agent loop, run_agent.py, prompt builder area/config Config system, migrations, profiles P1 High — major feature broken, no workaround labels May 9, 2026
ryonakae added a commit to ryonakae/hermes-agent that referenced this pull request Jun 7, 2026
)

PR NousResearch#22278 (yonefive71) を現行コード構造へ移植。retry loop は
run_agent.py から conversation_loop.py へ、fallback 初期化は
agent_init.py へリファクタ済みのため、該当箇所に配置し直した。

agent.eager_fallback_on_timeout フラグ(デフォルト false)が true かつ
fallback_providers 設定時、FailoverReason.timeout で同一 primary を
リトライせず即座に次の fallback provider へ切り替える。stale 検出が
ハングしたストリームを kill した際の 15分超の silent hang を防ぐ。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/config Config system, migrations, profiles comp/agent Core agent loop, run_agent.py, prompt builder P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fallback chain not activated on stream-stall timeouts (15+ min silent hang on degraded primary)

2 participants