fix: gateway reconnect watcher retries indefinitely instead of giving up after 20 attempts #17216

Open
vominh1919 wants to merge 1 commit into NousResearch:main from vominh1919:fix/gateway-reconnect-max-attempts

@vominh1919

Contributor

Problem

The platform reconnect watcher in gateway/run.py permanently removes retryable platforms from _failed_platforms after 20 failed attempts (_MAX_ATTEMPTS = 20). For long-running gateways (up for days or weeks), this turns transient network or proxy outages into permanent disconnections that can only be cleared by a manual hermes gateway restart.

Observed timeline (from a real gateway):

  • Telegram hit repeated httpx.ConnectError during proxy-backed Bot API calls
  • Gateway retried for ~2 hours (20 attempts with exponential backoff)
  • Gateway logged Giving up reconnecting telegram after 20 attempts and removed Telegram from _failed_platforms
  • Telegram remained disconnected until manual restart, even after network recovered

Fixes #17063

Fix

Instead of deleting the platform from the retry queue after 20 attempts, reset the attempt counter and continue retrying at the backoff cap (5 minutes). This ensures long-running gateways eventually recover from transient outages.

Before: Platform permanently abandoned after 20 failed attempts
After: Platform retries every 5 minutes indefinitely (until gateway restart or successful reconnect)
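
The before/after behavior can be sketched roughly as follows. This is a minimal illustration, not the actual code in gateway/run.py: only _failed_platforms, _MAX_ATTEMPTS, and info["attempts"] come from this PR, while the helper name, the _BACKOFF_CAP_SECONDS constant, and the "next_retry" key are assumptions made for the example.

```python
import time

_MAX_ATTEMPTS = 20          # value stated in the PR description
_BACKOFF_CAP_SECONDS = 300  # 5-minute cap; the constant name is an assumption


def handle_attempt_limit(failed_platforms, platform, now=None):
    """Apply the attempt-limit branch to one platform in the retry queue.

    Hypothetical helper: `failed_platforms` stands in for
    self._failed_platforms, and "next_retry" is an assumed key name.
    Returns True if the attempt limit was reached.
    """
    now = time.monotonic() if now is None else now
    info = failed_platforms[platform]
    if info["attempts"] >= _MAX_ATTEMPTS:
        # Before this PR: del failed_platforms[platform]  (platform abandoned)
        # After this PR: reset the counter and keep retrying at the cap.
        info["attempts"] = 0
        info["next_retry"] = now + _BACKOFF_CAP_SECONDS
        return True
    # Below the limit the entry is left untouched in this sketch.
    return False
```

With this shape, a platform that reaches the limit stays in the retry queue, so a later successful reconnect can still clear it without a gateway restart.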

Changes

  • gateway/run.py: Replace del self._failed_platforms[platform] with info["attempts"] = 0 and schedule next retry at backoff cap
  • Change log level from WARNING to INFO (this is expected behavior, not an error)

Tests

  • Existing gateway reconnect tests should still pass
  • The fix is minimal (4 lines changed) and preserves all existing behavior except the permanent abandonment
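
A regression test for the new behavior might look something like the sketch below. It runs against a small stand-in class rather than the real watcher, so FakeWatcher, tick, and the "next_retry" key are all hypothetical; the stand-in also simplifies the schedule by always waiting the 5-minute cap between attempts.

```python
_MAX_ATTEMPTS = 20
_BACKOFF_CAP_SECONDS = 300


class FakeWatcher:
    """Stand-in for the gateway's reconnect watcher (not the real class)."""

    def __init__(self, connect):
        self._connect = connect  # callable returning True on success
        self._failed_platforms = {"telegram": {"attempts": 0, "next_retry": 0.0}}

    def tick(self, platform, now):
        info = self._failed_platforms.get(platform)
        if info is None or now < info["next_retry"]:
            return  # nothing to retry yet
        if self._connect():
            del self._failed_platforms[platform]  # reconnected, leave the queue
            return
        info["attempts"] += 1
        if info["attempts"] >= _MAX_ATTEMPTS:
            info["attempts"] = 0  # reset instead of abandoning the platform
        # Simplification for this sketch: always wait the full cap.
        info["next_retry"] = now + _BACKOFF_CAP_SECONDS


def test_platform_survives_more_than_max_attempts():
    # 50 consecutive failures (more than _MAX_ATTEMPTS), then recovery.
    results = iter([False] * 50 + [True])
    watcher = FakeWatcher(lambda: next(results))
    now = 0.0
    for _ in range(50):
        watcher.tick("telegram", now)
        assert "telegram" in watcher._failed_platforms
        now += _BACKOFF_CAP_SECONDS
    watcher.tick("telegram", now)
    assert "telegram" not in watcher._failed_platforms
```

The point of the test is that the platform is still queued after more than 20 failures and is only removed once a reconnect succeeds.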

… up after 20 attempts

The platform reconnect watcher in gateway/run.py permanently removed
retryable platforms from _failed_platforms after 20 failed attempts.
For long-running gateways, this converted transient network outages
into permanent disconnections requiring manual restart.

Fix: reset the attempt counter and continue at the backoff cap (5 min)
instead of deleting the platform from the retry queue.

Fixes NousResearch#17063

alt-glitch added the labels type/bug (Something isn't working), P1 (High: major feature broken, no workaround), and comp/gateway (Gateway runner, session dispatch, delivery) on Apr 29, 2026
@alt-glitch

Collaborator

Likely a duplicate of #17219: both fix #17063 by removing the MAX_ATTEMPTS give-up branch in the reconnect watcher. #17219 makes more comprehensive changes (it drops the constant entirely, whereas this PR resets the counter).

Development

Successfully merging this pull request may close these issues.

Gateway reconnect watcher permanently stops retryable platforms after 20 failed attempts

2 participants