fix(gateway): don't give up on retryable platform reconnect (#17063) by Tranquil-Flow · Pull Request #17219 · NousResearch/hermes-agent

Tranquil-Flow · 2026-04-29T02:17:35Z

What does this PR do?

_platform_reconnect_watcher had a fixed _MAX_ATTEMPTS = 20 and deleted the platform from _failed_platforms once that count was hit, even when the underlying error was a retryable network/proxy outage. For long-running messaging adapters (Telegram, Slack, Discord), the cap combined with the 5-minute capped backoff meant that a multi-hour proxy interruption silently became a permanent outage that only hermes gateway restart could recover from.

The reporter observed exactly this on Telegram (#17063): httpx.ConnectError against the Bot API proxy → reconnect queue retried 20 times → Giving up reconnecting telegram after 20 attempts → Telegram stayed offline despite the proxy coming back later. Distinct from #11614, which is about the gateway exiting when all platforms fail at startup; here the gateway itself stays alive but silently loses one platform.

The fix drops the give-up branch entirely and lets retryable failures keep retrying at the 300s capped backoff indefinitely. The non-retryable fast-path (adapter.has_fatal_error and not adapter.fatal_error_retryable) is the correct "stop trying" gate and is unchanged — bad auth tokens and revoked credentials still drop out of the queue immediately.
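
The two exit conditions described above can be sketched as a small decision function. This is a minimal illustration, not the code in gateway/run.py: `AdapterState`, `Outcome`, and `decide` are hypothetical names, and only the two flag names (`has_fatal_error`, `fatal_error_retryable`) come from this PR.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    RECONNECTED = "reconnected"  # leave the queue: platform is back
    DROPPED = "dropped"          # leave the queue: non-retryable fatal error
    REQUEUED = "requeued"        # stay in the queue and retry after backoff


@dataclass
class AdapterState:
    """Hypothetical stand-in exposing the two adapter flags named in the PR."""
    has_fatal_error: bool = False
    fatal_error_retryable: bool = False


def decide(adapter: AdapterState, connected: bool) -> Outcome:
    """Classify one reconnect attempt. There is deliberately no attempt-count check."""
    if connected:
        return Outcome.RECONNECTED
    if adapter.has_fatal_error and not adapter.fatal_error_retryable:
        return Outcome.DROPPED
    # Everything else (e.g. a network or proxy outage) keeps retrying.
    return Outcome.REQUEUED
```

Under this sketch, a retryable failure is requeued no matter how many attempts have already been made, while a revoked credential is dropped on the first check.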

Related Issue

Fixes #17063

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

gateway/run.py — _platform_reconnect_watcher: removed the _MAX_ATTEMPTS constant + the attempts >= _MAX_ATTEMPTS give-up branch. Updated the docstring to describe the two real exit conditions (successful reconnect, non-retryable fatal error). Dropped the now-meaningless /<MAX_ATTEMPTS> suffix from the per-attempt info log; the attempt counter still increments forever per platform so attempt-count telemetry is preserved.
tests/gateway/test_platform_reconnect.py — replaced the pre-existing test_reconnect_gives_up_after_max_attempts (which codified the buggy behavior) with two regression tests: (1) a retryable failure at attempts=20 stays queued and attempts becomes 21, and (2) a non-retryable failure at attempts=25 is still removed (the fix must not soften that path).

How to Test

Reproduce the issue manually: configure Telegram with a proxy you can take offline; wait for the gateway to mark Telegram retryable; hold the proxy down for 30 + 60 + 120 + 240 + 16 × 300 = 5250 seconds (≈ 90 minutes), long enough to exhaust all 20 attempts.
Before this fix: gateway logs Giving up reconnecting telegram after 20 attempts; even after the proxy comes back, Telegram stays disconnected until hermes gateway restart.
After this fix: gateway keeps logging Reconnecting telegram (attempt N)... at the 300s capped interval; once the proxy is reachable the next attempt succeeds and Telegram comes back online without operator intervention.
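
The ≈ 90 minute figure in the manual steps can be checked with a few lines of Python. The doubling formula below is an assumption inferred from the 30, 60, 120, 240 sequence and the 300 s cap quoted in this PR; `backoff_schedule` is a hypothetical helper, not a function in the gateway.

```python
def backoff_schedule(attempts: int, base: int = 30, cap: int = 300) -> list[int]:
    """Delay in seconds before each attempt: doubles from `base`, clamped at `cap`."""
    return [min(base * 2 ** i, cap) for i in range(attempts)]


delays = backoff_schedule(20)
total_seconds = sum(delays)

print(delays[:6])            # [30, 60, 120, 240, 300, 300]
print(total_seconds)         # 5250
print(total_seconds / 60)    # 87.5
```

Four uncapped delays plus sixteen capped ones add up to 5250 seconds, so the old 20-attempt cap was reached after roughly an hour and a half of downtime.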

Automated:

pytest tests/gateway/test_platform_reconnect.py tests/gateway/test_telegram_network_reconnect.py tests/gateway/test_platform_base.py -q

Result on macOS 15.6 / Python 3.14: 15 + 87 = 102 passed, 2 skipped. The new test fails on origin/main (the buggy Giving up reconnecting telegram after 20 attempts log line shows up in the failure).

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature (no unrelated commits)
I've run pytest tests/gateway/test_platform_reconnect.py tests/gateway/test_telegram_network_reconnect.py tests/gateway/test_platform_base.py -q and all 102 tests on the touched surface pass. I did not run the full pytest tests/ -q; the gateway-platform reconnect path has dedicated test files, which are exercised in full above.
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform: macOS 15.6 (Python 3.14)

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — or N/A (updated _platform_reconnect_watcher's docstring to describe the two real exit conditions; no user-facing docs reference the old 20-attempt limit)
I've updated cli-config.yaml.example if I added/changed config keys — or N/A (N/A — no config keys touched; the watcher remains a fixed-policy background task)
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A (N/A)
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A (N/A — pure asyncio/logging, identical across platforms)
I've updated tool descriptions/schemas if I changed tool behavior — or N/A (N/A — gateway internal, not a tool)

Screenshots / Logs

$ pytest tests/gateway/test_platform_reconnect.py -q
...............                                                          [100%]
15 passed, 2 warnings in 3.18s

Without the fix, test_reconnect_retryable_keeps_trying_past_old_max_cap fails with the exact symptom the reporter described:

WARNING  gateway.run:run.py:2819 Giving up reconnecting telegram after 20 attempts
AssertionError: assert <Platform.TELEGRAM: 'telegram'> in {}

…arch#17063)

``_platform_reconnect_watcher`` had a fixed ``_MAX_ATTEMPTS = 20`` and deleted the platform from ``_failed_platforms`` once that count was hit, even when the underlying error was a retryable network/proxy outage. For long-running messaging adapters (Telegram, Slack, Discord), the cap plus the 5 min capped backoff means a multi-hour proxy interruption silently converts to a permanent outage that only ``hermes gateway restart`` can recover from.

The reporter observed exactly this: 21:11 → 21:19 transient ``httpx.ConnectError`` against the Telegram Bot API proxy → reconnect queue retried 20 times → 22:22 ``Giving up reconnecting telegram after 20 attempts`` → Telegram stayed offline despite the proxy coming back later. Distinct from NousResearch#11614 (which is about the gateway exiting when all platforms fail at startup); here the gateway itself stays alive but silently loses one platform.

Drop the give-up branch entirely and let retryable failures keep retrying at the 300 s capped backoff indefinitely. The non-retryable fast-path (``adapter.has_fatal_error and not adapter.fatal_error_retryable``) is the correct "stop trying" gate and is unchanged: bad auth tokens and revoked credentials still drop out of the queue immediately.

The log line drops the now-meaningless ``/<MAX_ATTEMPTS>`` suffix; the attempt counter still increments forever per platform so attempt-count telemetry is preserved.

The pre-existing ``test_reconnect_gives_up_after_max_attempts`` codified the buggy behavior; it is replaced with two tests that lock in the new contract: (1) a retryable failure at attempts=20 stays queued and attempts becomes 21, and (2) a non-retryable failure at attempts=25 still gets removed (the fix must not soften that path).

Tranquil-Flow · 2026-05-19T10:21:20Z

Closing — the fix is now on main via a broader implementation.

On current origin/main:

gateway/run.py:5381: _PAUSE_AFTER_FAILURES = 10 — a per-platform circuit breaker. Retryable failures keep retrying at the 5-minute backoff cap indefinitely; only after 10 consecutive failures is a platform paused (kept in the queue, not hammered).
gateway/run.py:9529, 9591, 9612: the /platform resume <name> slash command surfaces and resumes paused platforms.
Issue #17063 (Gateway reconnect watcher permanently stops retryable platforms after 20 failed attempts) was closed by PR #26600 (fix(gateway): keep running when platforms fail; add per-platform circuit breaker + /platform).

The original 20-attempts give-up branch this PR was deleting is gone from main — replaced by the circuit-breaker + manual resume pattern. Same goal (don't give up on retryable platforms), broader UX. No further action needed on this PR. Thanks for the original diagnosis.
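
The circuit-breaker plus manual-resume pattern described in this comment can be sketched roughly as follows. Only the threshold value (10) and the existence of a `/platform resume <name>` command come from the comment above; `PlatformEntry`, its methods, the `>=` comparison, and the counter reset on resume are all assumptions made for illustration, not the implementation on main.

```python
from dataclasses import dataclass

PAUSE_AFTER_FAILURES = 10  # mirrors the _PAUSE_AFTER_FAILURES value quoted above


@dataclass
class PlatformEntry:
    """Hypothetical per-platform record kept in the reconnect queue."""
    consecutive_failures: int = 0
    paused: bool = False

    def record_failure(self) -> None:
        # The entry is never deleted; it is only marked paused.
        self.consecutive_failures += 1
        if self.consecutive_failures >= PAUSE_AFTER_FAILURES:
            self.paused = True

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.paused = False

    def resume(self) -> None:
        """Assumed effect of a manual `/platform resume <name>`."""
        self.consecutive_failures = 0
        self.paused = False


entry = PlatformEntry()
for _ in range(PAUSE_AFTER_FAILURES):
    entry.record_failure()
print(entry.paused)  # True
entry.resume()
print(entry.paused)  # False
```

The key difference from the old give-up branch is that a paused platform stays in the queue, so an operator can bring it back without restarting the gateway.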

alt-glitch added the type/bug (Something isn't working), P1 (High: major feature broken, no workaround), and comp/gateway (Gateway runner, session dispatch, delivery) labels Apr 29, 2026

alt-glitch mentioned this pull request Apr 29, 2026

fix: gateway reconnect watcher retries indefinitely instead of giving up after 20 attempts #17216

Open

teknium1 mentioned this pull request May 15, 2026

fix(gateway): keep running when platforms fail; add per-platform circuit breaker + /platform #26600

Merged

Tranquil-Flow closed this May 19, 2026

Tranquil-Flow wants to merge 1 commit into NousResearch:main from Tranquil-Flow:fix/17063-reconnect-watcher-no-cap-on-retryable

2 participants
