fix(telegram): self-reschedule reconnect when start_polling fails after network error #3177

Closed
Mibayy wants to merge 1 commit into NousResearch:main from Mibayy:fix/telegram-reconnect-deadlock-3173

@Mibayy (Contributor) commented Mar 26, 2026

Summary

Closes #3173

Fixes the gateway becoming completely unresponsive after Telegram returns HTTP 502 (Bad Gateway) and the reconnect attempt also fails.

Root cause

When Telegram returns a 502, _handle_polling_network_error does:

  1. await updater.stop() — kills the internal updater loop
  2. await updater.start_polling(...) — tries to reconnect

If step 2 raises (e.g. "Timed out"), the old code logged a warning and returned, with this comment:

# The next network error will trigger another attempt.

That comment was wrong. The polling error callback is only fired by the updater's internal loop. Once stop() has been called, that loop is dead. No further error callbacks ever fire, so the retry chain silently terminates — leaving the gateway alive but completely deaf to messages and unable to run cron jobs.
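The dead retry chain can be reproduced with a toy model. This is a minimal sketch, not python-telegram-bot's API: `ToyUpdater`, `tick`, and `simulate_old_handler` are hypothetical names, and the only assumption carried over from the analysis above is that the error callback fires from the polling loop and nowhere else.

```python
class ToyUpdater:
    """Hypothetical stand-in for a polling updater (not the real library)."""

    def __init__(self):
        self.running = False
        self.on_error = None
        self.callbacks_fired = 0

    def start_polling(self, on_error, network_ok=True):
        if not network_ok:
            raise TimeoutError("Timed out")
        self.on_error = on_error
        self.running = True

    def stop(self):
        self.running = False

    def tick(self, network_ok):
        # Only a running loop can observe a network error and report it.
        if not self.running:
            return
        if not network_ok:
            self.callbacks_fired += 1
            self.on_error(ConnectionError("502 Bad Gateway"))


def simulate_old_handler(ticks_after_failure=5):
    updater = ToyUpdater()

    def old_handler(error):
        updater.stop()
        try:
            updater.start_polling(old_handler, network_ok=False)
        except Exception:
            pass  # old behavior: log a warning and return

    updater.start_polling(old_handler)
    updater.tick(network_ok=False)  # first 502: handler runs, reconnect fails
    for _ in range(ticks_after_failure):
        updater.tick(network_ok=False)  # loop is stopped, so nothing fires
    return updater.running, updater.callbacks_fired
```

In this model the callback fires exactly once, however long the outage lasts afterwards, which matches the "alive but deaf" symptom described above.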

Fix

When start_polling() raises in the except branch, explicitly schedule a new _handle_polling_network_error task on the running event loop so the exponential-backoff retry chain continues without needing the updater loop to be alive.

except Exception as retry_err:
    logger.warning("[%s] Telegram polling reconnect failed: %s", self.name, retry_err)
    # start_polling failed — polling is now dead and no further error
    # callbacks will fire, so we must schedule the next attempt ourselves.
    loop = asyncio.get_event_loop()
    if loop.is_running() and not self.has_fatal_error:
        loop.create_task(self._handle_polling_network_error(retry_err))

Guards: only schedules when the loop is running and no fatal error is already set (avoids creating tasks during shutdown).
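The self-rescheduling shape can be exercised in isolation. The following is a sketch under stated assumptions: `ToyAdapter`, `_start_polling`, `recovered`, and `run_demo` are hypothetical, the backoff delay is replaced by `asyncio.sleep(0)` so the demo finishes instantly, and the loop-running check is omitted because the handler always runs inside a coroutine here.

```python
import asyncio


class ToyAdapter:
    """Hypothetical adapter; only the retry shape mirrors the fix above."""

    def __init__(self, failures_before_success, has_fatal_error=False):
        self.failures_left = failures_before_success
        self.has_fatal_error = has_fatal_error
        self.attempts = 0
        self.polling = False
        self.recovered = asyncio.Event()
        self._retry_task = None  # keeps a reference to the pending retry

    async def _start_polling(self):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise TimeoutError("Timed out")
        self.polling = True
        self.recovered.set()

    async def handle_network_error(self, error):
        self.attempts += 1
        await asyncio.sleep(0)  # placeholder for the real backoff delay
        try:
            await self._start_polling()
        except Exception as retry_err:
            # Polling is dead, so nothing else will call this handler again:
            # schedule the next attempt explicitly unless a fatal error is set.
            if not self.has_fatal_error:
                self._retry_task = asyncio.create_task(
                    self.handle_network_error(retry_err)
                )


async def run_demo(failures, fatal=False):
    adapter = ToyAdapter(failures, fatal)
    await adapter.handle_network_error(ConnectionError("502 Bad Gateway"))
    if not fatal:
        await asyncio.wait_for(adapter.recovered.wait(), timeout=1.0)
    return adapter.attempts, adapter.polling
```

With three simulated failures, `asyncio.run(run_demo(3))` makes four attempts and ends with polling restored; with the fatal flag set, the chain stops after the first attempt.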

What this changes

  • After a failed reconnect, the retry chain continues using the existing exponential backoff (5s → 10s → 20s → 40s → 60s cap, up to 10 attempts)
  • After 10 failed attempts, _set_fatal_error is called as before → _handle_adapter_fatal_error queues the platform for background reconnection via _platform_reconnect_watcher
  • No change to behavior when reconnect succeeds or when a fatal error is already set
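The backoff schedule quoted above can be written as a doubling delay with a cap. This is a sketch: `backoff_delays` and its parameter names are hypothetical, and it assumes one delay per attempt, which the description does not spell out.

```python
def backoff_delays(base=5.0, cap=60.0, max_attempts=10):
    """Return the delay in seconds for each attempt: base * 2**i, capped."""
    return [min(base * 2 ** i, cap) for i in range(max_attempts)]


def total_wait(base=5.0, cap=60.0, max_attempts=10):
    """Sum of all delays before the fatal-error path would be taken."""
    return sum(backoff_delays(base, cap, max_attempts))
```

Under that assumption the ten delays are 5, 10, 20, 40 and then 60 seconds six times, for 435 seconds in total before escalation.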

Tests

4 new tests in tests/gateway/test_telegram_network_reconnect.py:

  • test_reconnect_self_schedules_on_start_polling_failure — regression test for the exact bug
  • test_reconnect_does_not_self_schedule_when_fatal_error_set — no spurious tasks during shutdown
  • test_reconnect_success_resets_error_count — happy path unchanged
  • test_reconnect_triggers_fatal_after_max_retries — escalation path unchanged

Result: 152 passed (148 pre-existing + 4 new)
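For readers without the test file at hand, the first two cases could look roughly like this. It is a sketch only: `reconnect`, `schedule_retry`, and `run_case` are hypothetical stand-ins, not the code in tests/gateway/test_telegram_network_reconnect.py, and the updater is a plain `AsyncMock` rather than the real one.

```python
import asyncio
from unittest.mock import AsyncMock


async def reconnect(updater, schedule_retry, has_fatal_error=False):
    """Hypothetical handler under test, reduced to the stop/start/retry shape."""
    await updater.stop()
    try:
        await updater.start_polling()
    except Exception as retry_err:
        if not has_fatal_error:
            schedule_retry(retry_err)


def run_case(has_fatal_error):
    updater = AsyncMock()
    # Make the reconnect attempt fail the way the bug report describes.
    updater.start_polling.side_effect = TimeoutError("Timed out")
    scheduled = []
    asyncio.run(reconnect(updater, scheduled.append, has_fatal_error))
    updater.stop.assert_awaited_once()
    updater.start_polling.assert_awaited_once()
    return len(scheduled)
```

Injecting `schedule_retry` keeps the assertion simple: the regression case expects exactly one scheduled retry, and the fatal-error case expects none.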

…er 502

When Telegram returns HTTP 502 (Bad Gateway), _handle_polling_network_error
stops the updater then calls start_polling() again. If that second call also
fails (e.g. times out), the old code logged a warning and returned — leaving
polling permanently dead with no further error callbacks to trigger the next
retry. The gateway process stayed alive but handled no messages and stopped
running cron jobs after ~25 minutes.

Root cause: python-telegram-bot's polling error callback is only invoked by
the updater's internal loop. Once stop() is called, the updater loop exits and
no further callbacks ever fire, so the 'next network error will trigger another
attempt' comment was simply wrong.

Fix: when start_polling() raises in the except branch, explicitly schedule
a new _handle_polling_network_error task on the running event loop so the
exponential-backoff retry chain continues even with no updater running.
Guards: only schedules when the loop is running and no fatal error is set
(avoids redundant tasks during shutdown).

Closes NousResearch#3173
teknium1 added a commit that referenced this pull request Mar 26, 2026
After a Telegram 502, _handle_polling_network_error calls updater.stop()
then start_polling(). If start_polling() also raises, the old code logged
a warning and returned — but the comment 'The next network error will
trigger another attempt' was wrong. The updater loop is dead after stop(),
so no further error callbacks ever fire. The gateway stays alive but
permanently deaf to messages.

Fix: when start_polling() fails in the except branch, schedule a new
_handle_polling_network_error task to continue the exponential backoff
retry chain. The task is tracked in _background_tasks (preventing GC).
Guarded by has_fatal_error to avoid spurious retries during shutdown.

Closes #3173.
Salvaged from PR #3177 by Mibayy.
@teknium1 (Contributor)

Merged via PR #3268. Your fix and analysis were spot-on — the dead updater loop meant no callbacks would ever fire again. Salvaged onto current main with a couple tweaks: asyncio.ensure_future() instead of deprecated get_event_loop(), and the retry task is now tracked in _background_tasks (consistent with a task-tracking fix that just landed). Your 4 tests were adapted to match. Thanks for the great bug report and fix!
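The task-tracking tweak mentioned above follows a common asyncio pattern: the event loop keeps only weak references to tasks, so a fire-and-forget task needs a strong reference somewhere until it finishes. A minimal sketch, assuming a module-level set; `background_tasks`, `spawn_tracked`, and `retry_once` are hypothetical names and not the merged code:

```python
import asyncio

background_tasks = set()


def spawn_tracked(coro):
    """Schedule coro and hold a strong reference until it completes."""
    task = asyncio.ensure_future(coro)
    background_tasks.add(task)
    # Drop the reference once the task is done so the set does not grow.
    task.add_done_callback(background_tasks.discard)
    return task


async def retry_once(results):
    await asyncio.sleep(0)  # placeholder for a backoff delay
    results.append("retried")


async def main():
    results = []
    spawn_tracked(retry_once(results))
    tracked_while_pending = len(background_tasks)
    await asyncio.gather(*list(background_tasks))
    await asyncio.sleep(0)  # give done callbacks a chance to run
    return tracked_while_pending, len(background_tasks), results
```

The set holds the task while it is pending and is empty again after completion, so tracking adds no long-term bookkeeping.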

@teknium1 teknium1 closed this Mar 26, 2026
angelburgosrosado pushed a commit to angelburgosrosado/hermes-agent that referenced this pull request Apr 27, 2026
…usResearch#3268)

02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…usResearch#3268)

olympus-terminal pushed a commit to olympus-terminal/hermes-agent that referenced this pull request May 16, 2026
…usResearch#3268)

gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…usResearch#3268)

Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…usResearch#3268)

Development

Successfully merging this pull request may close these issues.

Gateway crashes on Telegram Bad Gateway (502) — reconnect loop fails

2 participants