fix(telegram): self-reschedule reconnect when start_polling fails after network error #3177
Closed
Mibayy wants to merge 1 commit into
…er 502
When Telegram returns HTTP 502 (Bad Gateway), _handle_polling_network_error stops the updater, then calls start_polling() again. If that second call also fails (e.g. times out), the old code logged a warning and returned, leaving polling permanently dead with no further error callbacks to trigger the next retry. The gateway process stayed alive but handled no messages and stopped running cron jobs after ~25 minutes.
Root cause: python-telegram-bot's polling error callback is only invoked by the updater's internal loop. Once stop() is called, the updater loop exits and no further callbacks ever fire, so the 'next network error will trigger another attempt' comment was simply wrong.
Fix: when start_polling() raises in the except branch, explicitly schedule a new _handle_polling_network_error task on the running event loop so the exponential-backoff retry chain continues even with no updater running.
Guards: only schedules when the loop is running and no fatal error is set (avoids redundant tasks during shutdown).
Closes NousResearch#3173
teknium1 added a commit that referenced this pull request Mar 26, 2026
After a Telegram 502, _handle_polling_network_error calls updater.stop() then start_polling(). If start_polling() also raises, the old code logged a warning and returned, but the comment 'The next network error will trigger another attempt' was wrong. The updater loop is dead after stop(), so no further error callbacks ever fire. The gateway stays alive but permanently deaf to messages.
Fix: when start_polling() fails in the except branch, schedule a new _handle_polling_network_error task to continue the exponential backoff retry chain. The task is tracked in _background_tasks (preventing GC). Guarded by has_fatal_error to avoid spurious retries during shutdown.
Closes #3173. Salvaged from PR #3177 by Mibayy.
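The "tracked in _background_tasks (preventing GC)" detail matters because asyncio's event loop keeps only weak references to tasks, so a fire-and-forget task can be garbage-collected before it finishes. Below is a minimal sketch of that tracking pattern, not the project's actual code: `spawn_tracked`, `retry_once`, `demo`, and the module-level `background_tasks` set are hypothetical names chosen for illustration.

```python
import asyncio

# Hypothetical stand-in for the adapter's _background_tasks attribute.
background_tasks: set[asyncio.Task] = set()


def spawn_tracked(coro) -> asyncio.Task:
    """Schedule coro and hold a strong reference until it completes."""
    task = asyncio.get_running_loop().create_task(coro)
    background_tasks.add(task)
    # Drop the reference once the task is done so the set does not grow.
    task.add_done_callback(background_tasks.discard)
    return task


async def retry_once(results: list) -> None:
    """Placeholder for a reconnect attempt."""
    await asyncio.sleep(0)
    results.append("retried")


async def demo() -> list:
    results: list = []
    spawn_tracked(retry_once(results))
    # Give the scheduled task time to run and its done-callback to fire.
    await asyncio.sleep(0.01)
    return results
```

Running `asyncio.run(demo())` schedules one tracked task, lets it finish, and leaves the set empty again.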
Contributor
Merged via PR #3268. Your fix and analysis were spot-on: the dead updater loop meant no callbacks would ever fire again. Salvaged onto current main with a couple tweaks:
Summary
Closes #3173
Fixes the gateway becoming completely unresponsive after Telegram returns HTTP 502 (Bad Gateway) and the reconnect attempt also fails.
Root cause
When Telegram returns a 502, _handle_polling_network_error does:
1. await updater.stop(): kills the internal updater loop
2. await updater.start_polling(...): tries to reconnect
If step 2 raises (e.g. "Timed out"), the old code logged a warning and returned, with this comment:
    # The next network error will trigger another attempt.
That comment was wrong. The polling error callback is only fired by the updater's internal loop. Once stop() has been called, that loop is dead. No further error callbacks ever fire, so the retry chain silently terminates, leaving the gateway alive but completely deaf to messages and unable to run cron jobs.
Fix
When start_polling() raises in the except branch, explicitly schedule a new _handle_polling_network_error task on the running event loop so the exponential-backoff retry chain continues without needing the updater loop to be alive.
Guards: only schedules when the loop is running and no fatal error is already set (avoids creating tasks during shutdown).
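The self-rescheduling idea can be sketched roughly as below. This is an illustrative approximation, not the code from this PR: `FlakyUpdater`, `Adapter`, `run_demo`, `MAX_RETRIES`, `BASE_DELAY`, and `error_count` are assumptions made for the example, and the delays are kept tiny so it runs quickly. Only `_handle_polling_network_error`, `has_fatal_error`, and `_background_tasks` are names taken from the description above.

```python
import asyncio


class FlakyUpdater:
    """Hypothetical stand-in for the updater: start_polling fails N times."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.running = True

    async def stop(self) -> None:
        self.running = False  # from here on, no error callback would fire

    async def start_polling(self) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("Timed out")
        self.running = True


class Adapter:
    MAX_RETRIES = 5      # assumed limit, for illustration only
    BASE_DELAY = 0.001   # assumed base delay in seconds

    def __init__(self, updater: FlakyUpdater) -> None:
        self.updater = updater
        self.error_count = 0
        self.has_fatal_error = False
        self._background_tasks: set = set()

    async def _handle_polling_network_error(self, error: Exception) -> None:
        self.error_count += 1
        if self.error_count > self.MAX_RETRIES:
            self.has_fatal_error = True  # escalate instead of retrying forever
            return
        # Exponential backoff: BASE_DELAY, 2x, 4x, ...
        await asyncio.sleep(self.BASE_DELAY * 2 ** (self.error_count - 1))
        try:
            await self.updater.stop()
            await self.updater.start_polling()
            self.error_count = 0  # reconnected
        except Exception as retry_error:
            # The updater loop is stopped, so nothing else will call this
            # handler again: schedule the next attempt explicitly.
            if not self.has_fatal_error:
                task = asyncio.get_running_loop().create_task(
                    self._handle_polling_network_error(retry_error)
                )
                self._background_tasks.add(task)
                task.add_done_callback(self._background_tasks.discard)


async def run_demo(failures: int) -> Adapter:
    adapter = Adapter(FlakyUpdater(failures))
    await adapter._handle_polling_network_error(RuntimeError("502 Bad Gateway"))
    # Wait for the self-scheduled retry chain to finish.
    while adapter._background_tasks:
        await asyncio.sleep(0.001)
    return adapter
```

With two simulated failures the chain recovers on the third attempt; with more failures than `MAX_RETRIES` it stops and sets `has_fatal_error` rather than going silent.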
What this changes
_set_fatal_error is called as before → _handle_adapter_fatal_error queues the platform for background reconnection via _platform_reconnect_watcher
Tests
4 new tests in tests/gateway/test_telegram_network_reconnect.py:
- test_reconnect_self_schedules_on_start_polling_failure: regression test for the exact bug
- test_reconnect_does_not_self_schedule_when_fatal_error_set: no spurious tasks during shutdown
- test_reconnect_success_resets_error_count: happy path unchanged
- test_reconnect_triggers_fatal_after_max_retries: escalation path unchanged
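For readers without access to the test file, the shape of the first two tests might look something like the sketch below. It is not the actual test code: `reconnect` is a deliberately simplified, hypothetical handler and `scenario` a hypothetical helper. The updater is mocked with `unittest.mock.AsyncMock`, whose `side_effect` list makes the first `start_polling()` call raise and the second succeed.

```python
import asyncio
from unittest.mock import AsyncMock


async def reconnect(updater, state: dict) -> None:
    """Hypothetical, simplified handler under test."""
    try:
        await updater.stop()
        await updater.start_polling()
    except Exception:
        # Self-schedule the next attempt unless a fatal error is set.
        if not state["has_fatal_error"]:
            task = asyncio.get_running_loop().create_task(
                reconnect(updater, state)
            )
            state["tasks"].add(task)
            task.add_done_callback(state["tasks"].discard)


async def scenario(has_fatal_error: bool) -> int:
    updater = AsyncMock()
    # First start_polling() raises, second one succeeds.
    updater.start_polling.side_effect = [TimeoutError("Timed out"), None]
    state = {"has_fatal_error": has_fatal_error, "tasks": set()}
    await reconnect(updater, state)
    # Let any self-scheduled retry run to completion.
    while state["tasks"]:
        await asyncio.sleep(0)
    return updater.start_polling.await_count
```

The return value counts how many times `start_polling()` was awaited, which distinguishes the self-scheduling case (a second attempt happens) from the guarded case (it does not).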