fix: gateway reconnect watcher retries indefinitely instead of giving up after 20 attempts #17216
Open
vominh1919 wants to merge 1 commit into
… up after 20 attempts

The platform reconnect watcher in gateway/run.py permanently removed retryable platforms from _failed_platforms after 20 failed attempts. For long-running gateways, this converted transient network outages into permanent disconnections requiring manual restart.

Fix: reset the attempt counter and continue at the backoff cap (5 min) instead of deleting the platform from the retry queue.

Fixes NousResearch#17063
Problem
The platform reconnect watcher in `gateway/run.py` permanently removes retryable platforms from `_failed_platforms` after 20 failed attempts (`_MAX_ATTEMPTS = 20`). For long-running gateways (days/weeks), this converts transient network/proxy outages into permanent disconnections requiring manual `hermes gateway restart`.

Observed timeline (from a real gateway):
- `httpx.ConnectError` during proxy-backed Bot API calls
- `Giving up reconnecting telegram after 20 attempts` and removed Telegram from `_failed_platforms`

Fixes #17063
Fix
Instead of deleting the platform from the retry queue after 20 attempts, reset the attempt counter and continue retrying at the backoff cap (5 minutes). This ensures long-running gateways eventually recover from transient outages.
Before: Platform permanently abandoned after 20 failed attempts
After: Platform retries every 5 minutes indefinitely (until gateway restart or successful reconnect)
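The change in behavior can be sketched roughly as follows. This is a minimal illustration, not the actual code in `gateway/run.py`: `_MAX_ATTEMPTS`, `_failed_platforms`, and `info["attempts"]` come from the description above, while the function name `handle_exhausted`, the `_BACKOFF_CAP` constant, and the `next_retry` key are hypothetical names chosen for the example.

```python
_MAX_ATTEMPTS = 20     # from the PR description
_BACKOFF_CAP = 300.0   # 5 minutes per the PR; the constant name is assumed


def handle_exhausted(failed_platforms, platform, now):
    """Handle a platform that may have used up its retry budget.

    Returns True if the budget was exhausted and the platform was
    rescheduled, False if it still has attempts left.
    `failed_platforms` stands in for the watcher's `_failed_platforms` dict.
    """
    info = failed_platforms[platform]
    if info["attempts"] < _MAX_ATTEMPTS:
        return False

    # Before the fix (per the description): del failed_platforms[platform]
    # After the fix: keep the entry, reset the counter, retry at the cap.
    info["attempts"] = 0
    info["next_retry"] = now + _BACKOFF_CAP  # hypothetical scheduling field
    return True


# Usage: a platform that has just failed its 20th attempt stays queued.
queue = {"telegram": {"attempts": 20, "next_retry": 0.0}}
handle_exhausted(queue, "telegram", now=1000.0)
```

Because the entry is never deleted, the platform remains visible to the watcher loop, which is what allows recovery once the outage ends.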
Changes
- `gateway/run.py`: Replace `del self._failed_platforms[platform]` with `info["attempts"] = 0` and schedule the next retry at the backoff cap
- Log level changed from `WARNING` to `INFO` (this is expected behavior, not an error)

Tests