feat(agent): eager fallback on stream-stall timeouts (#22277) by yonefive71 · Pull Request #22278 · NousResearch/hermes-agent

yonefive71 · 2026-05-09T04:41:18Z

Summary

Fixes #22277. Adds agent.eager_fallback_on_timeout config flag (default false). When true AND fallback_providers is configured, a classified FailoverReason.timeout immediately activates the next fallback provider instead of retrying the same broken primary.

The bug

run_agent.py:13035-13059 already implements eager fallback for rate-limit and billing errors. But timeouts (typically a stale-detector-killed hung stream) classify as FailoverReason.timeout with retryable=True, and the retry loop just re-hits the same broken provider. With ~5 min stale-kill threshold × N retries, this compounds into the 15+ min silent hang documented in the issue.

The configured fallback chain sits idle the entire time.

Fix

Mirror the existing rate-limit eager-fallback block, gated behind a new opt-in flag:

is_timeout_eager = (
    classified.reason == FailoverReason.timeout
    and self._eager_fallback_on_timeout
    and self._fallback_index < len(self._fallback_chain)
)
if is_timeout_eager:
    self._emit_status("⚠️ Provider stalled (stream/call timeout) — switching to fallback provider...")
    if self._try_activate_fallback(reason=classified.reason):
        retry_count = 0
        compression_attempts = 0
        primary_recovery_attempted = False
        continue

Default false to preserve historical behavior; users with paid primaries + OAuth-backed fallbacks (the motivating case: Anthropic + openai-codex) opt in.

Config loaded via hermes_cli.config.load_config() rather than the cached cli.CLI_CONFIG to avoid the staleness issue addressed by #18947.

Files

run_agent.py — _eager_fallback_on_timeout instance flag + new eager-fallback block
hermes_cli/config.py — agent.eager_fallback_on_timeout: False schema entry with doc comment
tests/run_agent/test_eager_fallback_on_timeout.py — 9 new tests

Tests

pytest tests/run_agent/test_eager_fallback_on_timeout.py — 9 passed:

TestEagerFallbackConfigDefaults — default false, explicit false/true, truthy-int coercion
TestEagerFallbackOnTimeoutGate — gate predicate fires iff reason==timeout AND flag AND chain has room; does not fire for unknown/server_error/format_error/thinking_signature

Existing fallback suite still green:

pytest tests/run_agent/test_provider_fallback.py tests/run_agent/test_fallback_model.py — 47 passed.

Notes

The full retry-loop path is a 700-line inline block that's hard to drive end-to-end in unit tests; the gate predicate is asserted directly. Happy to add an integration test if reviewers want one.
I did NOT change behavior on FailoverReason.connection (clean socket drops). Those still go through the normal retry-with-backoff path. Could be a follow-up.
Symmetric to [Bug]: All openai-codex / gpt-5.5 primary calls hang silently for full stale timeout #21444's territory but the inverse direction: that issue is about Codex-as-primary stalling; this is about any-provider-as-primary stalling while a fallback exists.

Fixes #22277

) Add agent.eager_fallback_on_timeout config flag (default false). When true AND fallback_providers is configured, a classified FailoverReason.timeout immediately activates the next fallback provider instead of retrying the same broken primary. Without this, a hung stream that the stale-detector kills produces a retryable timeout error, and the retry loop hammers the same broken primary repeatedly — burning the full retry budget (3+ stale kills × 5 min each = 15+ min observed silent hangs) before bailing. The eager-fallback block at run_agent.py:13039 already exists for rate_limit/billing; this extends it (under the same _try_activate_fallback + pool-recovery semantics) to timeout, gated behind the new opt-in flag so historical behavior is preserved. Tests: tests/run_agent/test_eager_fallback_on_timeout.py - Default is false when unset - Explicit true/false/truthy-int wiring - Gate predicate fires only when reason==timeout AND flag AND chain has room - Does not fire for non-timeout reasons (unknown, server_error, etc.) Existing 47 fallback tests still pass. Fixes NousResearch#22277

) PR NousResearch#22278 (yonefive71) を現行コード構造へ移植。retry loop は run_agent.py から conversation_loop.py へ、fallback 初期化は agent_init.py へリファクタ済みのため、該当箇所に配置し直した。 agent.eager_fallback_on_timeout フラグ(デフォルト false)が true かつ fallback_providers 設定時、FailoverReason.timeout で同一 primary をリトライせず即座に次の fallback provider へ切り替える。stale 検出がハングしたストリームを kill した際の 15分超の silent hang を防ぐ。 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

alt-glitch added type/bug Something isn't working comp/agent Core agent loop, run_agent.py, prompt builder area/config Config system, migrations, profiles P1 High — major feature broken, no workaround labels May 9, 2026

alt-glitch mentioned this pull request May 14, 2026

Stale stream timeout does not trigger fallback_providers chain #25689

Open

adam91holt mentioned this pull request May 25, 2026

fix(codex): add time-to-first-byte watchdog for stalled Codex Responses streams #31984

Closed

12 tasks

alt-glitch mentioned this pull request May 26, 2026

[Bug]: Content-filter triggered stream stall (output new_sensitive) does not trigger fallback_providers #32421

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(agent): eager fallback on stream-stall timeouts (#22277)#22278

feat(agent): eager fallback on stream-stall timeouts (#22277)#22278
yonefive71 wants to merge 1 commit into
NousResearch:mainfrom
yonefive71:fix/eager-fallback-on-timeout

yonefive71 commented May 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yonefive71 commented May 9, 2026

Summary

The bug

Fix

Files

Tests

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants