fix(gateway): don't give up on retryable platform reconnect (#17063)#17219
Closed
Tranquil-Flow wants to merge 1 commit into
Closed
fix(gateway): don't give up on retryable platform reconnect (#17063)#17219Tranquil-Flow wants to merge 1 commit into
Tranquil-Flow wants to merge 1 commit into
Conversation
…arch#17063) ``_platform_reconnect_watcher`` had a fixed ``_MAX_ATTEMPTS = 20`` and deleted the platform from ``_failed_platforms`` once that count was hit — even when the underlying error was a retryable network/proxy outage. For long-running messaging adapters (Telegram, Slack, Discord), the cap plus the 5 min capped backoff means a multi-hour proxy interruption silently converts to a permanent outage that only ``hermes gateway restart`` can recover from. The reporter observed exactly this: 21:11 → 21:19 transient ``httpx.ConnectError`` against the Telegram Bot API proxy → reconnect queue retried 20 times → 22:22 ``Giving up reconnecting telegram after 20 attempts`` → Telegram stayed offline despite the proxy coming back later. Distinct from NousResearch#11614 (which is about the gateway exiting when all platforms fail at startup); here the gateway itself stays alive but silently loses one platform. Drop the give-up branch entirely and let retryable failures keep retrying at the 300 s capped backoff indefinitely. The non-retryable fast-path (``adapter.has_fatal_error and not adapter.fatal_error_retryable``) is the correct "stop trying" gate and is unchanged — bad auth tokens and revoked credentials still drop out of the queue immediately. The log line drops the now-meaningless ``/<MAX_ATTEMPTS>`` suffix; the attempt counter still increments forever per platform so attempt-count telemetry is preserved. The pre-existing ``test_reconnect_gives_up_after_max_attempts`` codified the buggy behavior; it is replaced with two tests that lock in the new contract: (1) a retryable failure at attempts=20 stays queued and attempts becomes 21, and (2) a non-retryable failure at attempts=25 still gets removed (the fix must not soften that path).
Contributor
Author
|
Closing — the fix is now on main via a broader implementation. On current
The original 20-attempts give-up branch this PR was deleting is gone from main — replaced by the circuit-breaker + manual resume pattern. Same goal (don't give up on retryable platforms), broader UX. No further action needed on this PR. Thanks for the original diagnosis. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
_platform_reconnect_watcherhad a fixed_MAX_ATTEMPTS = 20and deleted the platform from_failed_platformsonce that count was hit — even when the underlying error was a retryable network/proxy outage. For long-running messaging adapters (Telegram, Slack, Discord) the cap plus the 5-minute capped backoff means a multi-hour proxy interruption silently converted to a permanent outage that onlyhermes gateway restartcould recover from.The reporter observed exactly this on Telegram (#17063):
httpx.ConnectErroragainst the Bot API proxy → reconnect queue retried 20 times →Giving up reconnecting telegram after 20 attempts→ Telegram stayed offline despite the proxy coming back later. Distinct from #11614, which is about the gateway exiting when all platforms fail at startup; here the gateway itself stays alive but silently loses one platform.The fix drops the give-up branch entirely and lets retryable failures keep retrying at the 300s capped backoff indefinitely. The non-retryable fast-path (
adapter.has_fatal_error and not adapter.fatal_error_retryable) is the correct "stop trying" gate and is unchanged — bad auth tokens and revoked credentials still drop out of the queue immediately.Related Issue
Fixes #17063
Type of Change
Changes Made
gateway/run.py—_platform_reconnect_watcher: removed the_MAX_ATTEMPTSconstant + theattempts >= _MAX_ATTEMPTSgive-up branch. Updated the docstring to describe the two real exit conditions (successful reconnect, non-retryable fatal error). Dropped the now-meaningless/<MAX_ATTEMPTS>suffix from the per-attempt info log; the attempt counter still increments forever per platform so attempt-count telemetry is preserved.tests/gateway/test_platform_reconnect.py— replaced the pre-existingtest_reconnect_gives_up_after_max_attempts(which codified the buggy behavior) with two regression tests: (1) a retryable failure atattempts=20stays queued andattemptsbecomes 21, and (2) a non-retryable failure atattempts=25is still removed (the fix must not soften that path).How to Test
30 + 60 + 120 + 240 + 16 * 300≈ 90 minutes worth of attempts.Giving up reconnecting telegram after 20 attempts; even after the proxy comes back, Telegram stays disconnected untilhermes gateway restart.Reconnecting telegram (attempt N)...at the 300s capped interval; once the proxy is reachable the next attempt succeeds and Telegram comes back online without operator intervention.Automated:
Result on macOS 15.6 / Python 3.14:
15 + 87 = 102 passed, 2 skipped. The new test fails onorigin/main(the buggyGiving up reconnecting telegram after 20 attemptslog line shows up in the failure).Checklist
Code
fix(scope):,feat(scope):, etc.)pytest tests/gateway/test_platform_reconnect.py tests/gateway/test_telegram_network_reconnect.py tests/gateway/test_platform_base.py -qand the touched surface (102 tests) all passes. Fullpytest tests/ -qnot run; the gateway-platform reconnect path has dedicated test files which are exercised in full above.Documentation & Housekeeping
docs/, docstrings) — or N/A (updated_platform_reconnect_watcher's docstring to describe the two real exit conditions; no user-facing docs reference the old 20-attempt limit)cli-config.yaml.exampleif I added/changed config keys — or N/A (N/A — no config keys touched; the watcher remains a fixed-policy background task)CONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — or N/A (N/A)Screenshots / Logs
Without the fix,
test_reconnect_retryable_keeps_trying_past_old_max_capfails with the exact symptom the reporter described: