Skip to content

fix(gateway): don't give up on retryable platform reconnect (#17063)#17219

Closed
Tranquil-Flow wants to merge 1 commit into
NousResearch:mainfrom
Tranquil-Flow:fix/17063-reconnect-watcher-no-cap-on-retryable
Closed

fix(gateway): don't give up on retryable platform reconnect (#17063)#17219
Tranquil-Flow wants to merge 1 commit into
NousResearch:mainfrom
Tranquil-Flow:fix/17063-reconnect-watcher-no-cap-on-retryable

Conversation

@Tranquil-Flow

Copy link
Copy Markdown
Contributor

What does this PR do?

_platform_reconnect_watcher had a fixed _MAX_ATTEMPTS = 20 and deleted the platform from _failed_platforms once that count was hit — even when the underlying error was a retryable network/proxy outage. For long-running messaging adapters (Telegram, Slack, Discord) the cap plus the 5-minute capped backoff means a multi-hour proxy interruption silently converted to a permanent outage that only hermes gateway restart could recover from.

The reporter observed exactly this on Telegram (#17063): httpx.ConnectError against the Bot API proxy → reconnect queue retried 20 times → Giving up reconnecting telegram after 20 attempts → Telegram stayed offline despite the proxy coming back later. Distinct from #11614, which is about the gateway exiting when all platforms fail at startup; here the gateway itself stays alive but silently loses one platform.

The fix drops the give-up branch entirely and lets retryable failures keep retrying at the 300s capped backoff indefinitely. The non-retryable fast-path (adapter.has_fatal_error and not adapter.fatal_error_retryable) is the correct "stop trying" gate and is unchanged — bad auth tokens and revoked credentials still drop out of the queue immediately.

Related Issue

Fixes #17063

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • gateway/run.py_platform_reconnect_watcher: removed the _MAX_ATTEMPTS constant + the attempts >= _MAX_ATTEMPTS give-up branch. Updated the docstring to describe the two real exit conditions (successful reconnect, non-retryable fatal error). Dropped the now-meaningless /<MAX_ATTEMPTS> suffix from the per-attempt info log; the attempt counter still increments forever per platform so attempt-count telemetry is preserved.
  • tests/gateway/test_platform_reconnect.py — replaced the pre-existing test_reconnect_gives_up_after_max_attempts (which codified the buggy behavior) with two regression tests: (1) a retryable failure at attempts=20 stays queued and attempts becomes 21, and (2) a non-retryable failure at attempts=25 is still removed (the fix must not soften that path).

How to Test

  1. Reproduce the issue manually: configure Telegram with a proxy you can take offline; wait for the gateway to mark Telegram retryable; hold the proxy down for 30 + 60 + 120 + 240 + 16 * 300 ≈ 90 minutes worth of attempts.
  2. Before this fix: gateway logs Giving up reconnecting telegram after 20 attempts; even after the proxy comes back, Telegram stays disconnected until hermes gateway restart.
  3. After this fix: gateway keeps logging Reconnecting telegram (attempt N)... at the 300s capped interval; once the proxy is reachable the next attempt succeeds and Telegram comes back online without operator intervention.

Automated:

pytest tests/gateway/test_platform_reconnect.py tests/gateway/test_telegram_network_reconnect.py tests/gateway/test_platform_base.py -q

Result on macOS 15.6 / Python 3.14: 15 + 87 = 102 passed, 2 skipped. The new test fails on origin/main (the buggy Giving up reconnecting telegram after 20 attempts log line shows up in the failure).

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/gateway/test_platform_reconnect.py tests/gateway/test_telegram_network_reconnect.py tests/gateway/test_platform_base.py -q and the touched surface (102 tests) all passes. Full pytest tests/ -q not run; the gateway-platform reconnect path has dedicated test files which are exercised in full above.
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 15.6 (Python 3.14)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A (updated _platform_reconnect_watcher's docstring to describe the two real exit conditions; no user-facing docs reference the old 20-attempt limit)
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A (N/A — no config keys touched; the watcher remains a fixed-policy background task)
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A (N/A)
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A (N/A — pure asyncio/logging, identical across platforms)
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A (N/A — gateway internal, not a tool)

Screenshots / Logs

$ pytest tests/gateway/test_platform_reconnect.py -q
...............                                                          [100%]
15 passed, 2 warnings in 3.18s

Without the fix, test_reconnect_retryable_keeps_trying_past_old_max_cap fails with the exact symptom the reporter described:

WARNING  gateway.run:run.py:2819 Giving up reconnecting telegram after 20 attempts
AssertionError: assert <Platform.TELEGRAM: 'telegram'> in {}

…arch#17063)

``_platform_reconnect_watcher`` had a fixed ``_MAX_ATTEMPTS = 20`` and
deleted the platform from ``_failed_platforms`` once that count was hit
— even when the underlying error was a retryable network/proxy outage.
For long-running messaging adapters (Telegram, Slack, Discord), the cap
plus the 5 min capped backoff means a multi-hour proxy interruption
silently converts to a permanent outage that only ``hermes gateway
restart`` can recover from.

The reporter observed exactly this: 21:11 → 21:19 transient
``httpx.ConnectError`` against the Telegram Bot API proxy → reconnect
queue retried 20 times → 22:22 ``Giving up reconnecting telegram after
20 attempts`` → Telegram stayed offline despite the proxy coming back
later. Distinct from NousResearch#11614 (which is about the gateway exiting when
all platforms fail at startup); here the gateway itself stays alive but
silently loses one platform.

Drop the give-up branch entirely and let retryable failures keep
retrying at the 300 s capped backoff indefinitely. The non-retryable
fast-path (``adapter.has_fatal_error and not adapter.fatal_error_retryable``)
is the correct "stop trying" gate and is unchanged — bad auth tokens
and revoked credentials still drop out of the queue immediately.

The log line drops the now-meaningless ``/<MAX_ATTEMPTS>`` suffix; the
attempt counter still increments forever per platform so attempt-count
telemetry is preserved.

The pre-existing ``test_reconnect_gives_up_after_max_attempts`` codified
the buggy behavior; it is replaced with two tests that lock in the new
contract: (1) a retryable failure at attempts=20 stays queued and
attempts becomes 21, and (2) a non-retryable failure at attempts=25
still gets removed (the fix must not soften that path).
@Tranquil-Flow

Copy link
Copy Markdown
Contributor Author

Closing — the fix is now on main via a broader implementation.

On current origin/main:

The original 20-attempts give-up branch this PR was deleting is gone from main — replaced by the circuit-breaker + manual resume pattern. Same goal (don't give up on retryable platforms), broader UX. No further action needed on this PR. Thanks for the original diagnosis.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gateway reconnect watcher permanently stops retryable platforms after 20 failed attempts

2 participants