
fix(gateway): keep startup retry failures alive for reconnect recovery #17984

Closed
drkolesnikov wants to merge 1 commit into NousResearch:main from drkolesnikov:main

@drkolesnikov


Treat all-retryable startup failures (e.g. telegram.error.TimedOut) as degraded mode instead of fatal exit. This prevents the gateway from suicide-looping into systemd rate-limit lockout when transient network blips occur at boot.

  • Remove early return False on all-retryable startup failures
  • Write runtime status as 'degraded' instead of 'startup_failed'
  • Allow background _platform_reconnect_watcher to recover automatically

Fixes the race where a single Telegram connect timeout causes:
exit(1) -> systemd restart -> 5 rapid restarts -> StartLimitHit -> dead
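
The intended startup behavior can be sketched as a small decision function. This is an illustrative sketch only, not the code in gateway/run.py: `StartupFailure` and `decide_startup_state` are hypothetical names, and the `"running"` status is an assumption. Only the `'degraded'` and `'startup_failed'` status strings come from this PR.

```python
from dataclasses import dataclass


@dataclass
class StartupFailure:
    """One platform's startup error (hypothetical structure)."""
    platform: str
    error: Exception
    retryable: bool  # True for transient errors such as a connect timeout


def decide_startup_state(connected: list[str], failures: list[StartupFailure]) -> str:
    """Pick the runtime status to write after the startup connect pass."""
    if connected:
        # At least one platform is up, so the gateway can serve traffic.
        return "running"  # assumed status name, not taken from the PR
    if failures and all(f.retryable for f in failures):
        # Nothing connected, but every error is transient: stay alive so a
        # background reconnect task can recover instead of exiting.
        return "degraded"
    # Nothing connected and at least one error is permanent (or no platform
    # was configured at all): exiting is still the right call.
    return "startup_failed"
```

Under this sketch, a lone Telegram timeout yields `"degraded"` and the process keeps running, so systemd never sees the rapid exit(1) sequence that triggers its start rate limit.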

@alt-glitch added the type/bug, P1, comp/gateway, and platform/telegram labels on Apr 30, 2026
@alt-glitch

Collaborator

Fixes #9719 — gateway startup exits fatally on transient Telegram timeouts instead of entering degraded mode for reconnect recovery.

@teknium1

Contributor

This has been implemented on current main. Thanks for the focused fix and for linking it to #9719.

This is an automated hermes-sweeper review.

Evidence:

  • gateway/run.py now queues retryable startup failures in _failed_platforms and, when no platforms connect but the errors are retryable, writes gateway_state="degraded" instead of startup_failed and falls through instead of returning False.
  • gateway/run.py then starts _platform_reconnect_watcher(), so queued platforms can recover in the background.
  • The change landed in commit 518f39557b6753a5dc766a05dd14dd5cf2b9edeb (fix(gateway): keep running when platforms fail; add per-platform circuit breaker + /platform (#26600)).
  • tests/gateway/test_runner_startup_failures.py covers the startup case: retryable Telegram startup errors keep the gateway alive, leave Telegram queued for retry, and mark the platform as retrying.
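
The recovery path described in the evidence above can be illustrated with a minimal asyncio sketch. It is not the actual `_platform_reconnect_watcher()` implementation: `reconnect_watcher`, its parameters, the `connect` callback, and the exponential backoff policy are all assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable


async def reconnect_watcher(
    failed: set[str],
    connect: Callable[[str], Awaitable[None]],
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> None:
    """Retry every queued platform until the queue is empty (hypothetical)."""
    delay = base_delay
    while failed:
        # sorted() copies the names, so discarding from the set is safe here.
        for name in sorted(failed):
            try:
                await connect(name)
            except TimeoutError:
                continue  # still unreachable; leave it queued for the next pass
            failed.discard(name)  # connected; stop retrying this platform
        if failed:
            # Back off between passes, doubling the wait up to max_delay.
            await asyncio.sleep(delay)
            delay = min(delay * 2, max_delay)
```

A caller would start this as a background task after startup, passing the set of platforms whose startup errors were retryable. Only `TimeoutError` is treated as retryable in this sketch; a real watcher would need its own classification of transient errors.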

@teknium1 closed this on Jun 10, 2026
@teknium1 added the sweeper:implemented-on-main label on Jun 10, 2026
