feat(agent): eager fallback on stream-stall timeouts (#22277) #22278
Open
yonefive71 wants to merge 1 commit into
Add agent.eager_fallback_on_timeout config flag (default false). When true AND fallback_providers is configured, a classified FailoverReason.timeout immediately activates the next fallback provider instead of retrying the same broken primary.

Without this, a hung stream that the stale-detector kills produces a retryable timeout error, and the retry loop hammers the same broken primary repeatedly, burning the full retry budget (3+ stale kills × 5 min each = 15+ min observed silent hangs) before bailing.

The eager-fallback block at run_agent.py:13039 already exists for rate_limit/billing; this extends it (under the same _try_activate_fallback + pool-recovery semantics) to timeout, gated behind the new opt-in flag so historical behavior is preserved.

Tests: tests/run_agent/test_eager_fallback_on_timeout.py
- Default is false when unset
- Explicit true/false/truthy-int wiring
- Gate predicate fires only when reason==timeout AND flag AND chain has room
- Does not fire for non-timeout reasons (unknown, server_error, etc.)

Existing 47 fallback tests still pass.

Fixes NousResearch#22277
ryonakae added a commit to ryonakae/hermes-agent that referenced this pull request on Jun 7, 2026
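Port PR NousResearch#22278 (yonefive71) to the current code structure. The retry loop has since been refactored out of run_agent.py into conversation_loop.py, and fallback initialization into agent_init.py, so the change was relocated to the corresponding places.

When the agent.eager_fallback_on_timeout flag (default false) is true and fallback_providers is configured, a FailoverReason.timeout switches immediately to the next fallback provider instead of retrying the same primary. This prevents the 15+ minute silent hang that occurs when the stale detector kills a hung stream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>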
Summary
Fixes #22277. Adds `agent.eager_fallback_on_timeout` config flag (default `false`). When `true` AND `fallback_providers` is configured, a classified `FailoverReason.timeout` immediately activates the next fallback provider instead of retrying the same broken primary.

The bug

`run_agent.py:13035-13059` already implements eager fallback for rate-limit and billing errors. But timeouts (typically a stale-detector-killed hung stream) classify as `FailoverReason.timeout` with `retryable=True`, and the retry loop just re-hits the same broken provider. With a ~5 min stale-kill threshold × N retries, this compounds into the 15+ min silent hang documented in the issue.

The configured fallback chain sits idle the entire time.
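To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. The function name and parameters are hypothetical (not part of the codebase), and it ignores retry backoff and any time spent on the fallback provider itself. It only counts the stale-kill windows burned on the broken primary.

```python
def worst_case_stall_minutes(stale_kill_minutes, retries, eager_fallback):
    """Minutes spent waiting on a hung primary before giving up on it.

    Hypothetical helper for illustration only; backoff delays are ignored.
    """
    # With eager fallback, the first stale kill hands off to the next provider,
    # so only one stale-kill window is spent on the broken primary.
    attempts_on_primary = 1 if eager_fallback else retries
    return stale_kill_minutes * attempts_on_primary


print(worst_case_stall_minutes(5, 3, eager_fallback=False))  # 15
print(worst_case_stall_minutes(5, 3, eager_fallback=True))   # 5
```

Under these assumptions the opt-in flag caps the silent wait at a single stale-kill window instead of one window per retry.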
Fix
Mirror the existing rate-limit eager-fallback block, gated behind a new opt-in flag:
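A minimal sketch of such a gate, for illustration only. It is not the code from `run_agent.py`: the `FailoverReason` enum here is a stand-in containing only members named in this PR, and `should_eager_fallback_on_timeout`, `fallback_index`, and `fallback_chain` are hypothetical names chosen to express "reason is timeout AND flag is on AND the chain has room".

```python
from enum import Enum


class FailoverReason(Enum):
    # Stand-in for the project's enum; only members mentioned in this PR.
    timeout = "timeout"
    rate_limit = "rate_limit"
    billing = "billing"
    server_error = "server_error"
    unknown = "unknown"


def should_eager_fallback_on_timeout(reason, flag, fallback_index, fallback_chain):
    """Return True only for a timeout, with the flag on, and a fallback left.

    Hypothetical predicate: the real agent may track the chain differently.
    """
    # "Chain has room" means at least one fallback provider is still unused.
    chain_has_room = fallback_index < len(fallback_chain)
    # bool() mirrors the truthy-int coercion covered by the tests (1 -> True).
    return reason is FailoverReason.timeout and bool(flag) and chain_has_room


chain = ["openai-codex"]
print(should_eager_fallback_on_timeout(FailoverReason.timeout, True, 0, chain))       # True
print(should_eager_fallback_on_timeout(FailoverReason.timeout, False, 0, chain))      # False
print(should_eager_fallback_on_timeout(FailoverReason.server_error, True, 0, chain))  # False
print(should_eager_fallback_on_timeout(FailoverReason.timeout, True, 1, chain))       # False
```

When the predicate is true, the agent would hand off via `_try_activate_fallback` rather than retrying the primary. When it is false, the existing retry path runs unchanged.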
Default `false` to preserve historical behavior; users with paid primaries + OAuth-backed fallbacks (the motivating case: Anthropic + `openai-codex`) opt in.

Config loaded via `hermes_cli.config.load_config()` rather than the cached `cli.CLI_CONFIG` to avoid the staleness issue addressed by #18947.

Files
- `run_agent.py`: `_eager_fallback_on_timeout` instance flag + new eager-fallback block
- `hermes_cli/config.py`: `agent.eager_fallback_on_timeout: False` schema entry with doc comment
- `tests/run_agent/test_eager_fallback_on_timeout.py`: 9 new tests

Tests
`pytest tests/run_agent/test_eager_fallback_on_timeout.py` (9 passed):

- `TestEagerFallbackConfigDefaults`: default false, explicit false/true, truthy-int coercion
- `TestEagerFallbackOnTimeoutGate`: gate predicate fires iff `reason==timeout AND flag AND chain has room`; does not fire for `unknown` / `server_error` / `format_error` / `thinking_signature`

Existing fallback suite still green: `pytest tests/run_agent/test_provider_fallback.py tests/run_agent/test_fallback_model.py` (47 passed).

Notes
`FailoverReason.connection` (clean socket drops). Those still go through the normal retry-with-backoff path. Could be a follow-up.

Fixes #22277