fix(telegram): self-reschedule reconnect when start_polling fails after network error by Mibayy · Pull Request #3177 · NousResearch/hermes-agent

Mibayy · 2026-03-26T10:19:44Z

Summary

Fixes the gateway becoming completely unresponsive after Telegram returns HTTP 502 (Bad Gateway) and the reconnect attempt also fails.

Root cause

When Telegram returns a 502, _handle_polling_network_error does two things:

1. await updater.stop(), which kills the internal updater loop
2. await updater.start_polling(...), which tries to reconnect

If step 2 raises (e.g. "Timed out"), the old code logged a warning and returned, with this comment:

# The next network error will trigger another attempt.

That comment was wrong. The polling error callback is only fired by the updater's internal loop. Once stop() has been called, that loop is dead. No further error callbacks ever fire, so the retry chain silently terminates — leaving the gateway alive but completely deaf to messages and unable to run cron jobs.
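The failure mode can be reproduced with a toy model. This is a minimal sketch, not python-telegram-bot's real Updater: ToyUpdater, its fail flag, and demo are hypothetical names invented for illustration. It only assumes what the paragraph above states, namely that the error callback is invoked from inside the polling loop and nowhere else.

```python
import asyncio


class ToyUpdater:
    """Hypothetical stand-in for a polling updater (not the real library API)."""

    def __init__(self, error_callback):
        self.error_callback = error_callback
        self.running = False
        self._task = None

    async def _loop(self):
        # The error callback is only ever invoked from inside this loop.
        while self.running:
            self.error_callback(ConnectionError("502 Bad Gateway"))
            await asyncio.sleep(0)

    async def start_polling(self, fail=False):
        if fail:
            raise TimeoutError("Timed out")
        self.running = True
        self._task = asyncio.create_task(self._loop())

    async def stop(self):
        self.running = False
        if self._task is not None:
            await self._task
            self._task = None


async def demo():
    seen = []
    updater = ToyUpdater(seen.append)
    await updater.start_polling()
    await asyncio.sleep(0)           # let the loop report one error
    await updater.stop()             # the loop is now dead
    callbacks_before = len(seen)
    try:
        await updater.start_polling(fail=True)
    except TimeoutError:
        pass                         # old behaviour: log a warning and return
    for _ in range(5):
        await asyncio.sleep(0)       # give any surviving loop a chance to run
    return callbacks_before, len(seen)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Both numbers returned by demo() are equal: once stop() has run and the restart has failed, nothing is left that could invoke the callback again.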

Fix

When start_polling() raises in the except branch, explicitly schedule a new _handle_polling_network_error task on the running event loop so the exponential-backoff retry chain continues without needing the updater loop to be alive.

except Exception as retry_err:
    logger.warning("[%s] Telegram polling reconnect failed: %s", self.name, retry_err)
    # start_polling failed — polling is now dead and no further error
    # callbacks will fire, so we must schedule the next attempt ourselves.
    loop = asyncio.get_event_loop()
    if loop.is_running() and not self.has_fatal_error:
        loop.create_task(self._handle_polling_network_error(retry_err))

Guards: only schedules when the loop is running and no fatal error is already set (avoids creating tasks during shutdown).

What this changes

After a failed reconnect, the retry chain continues using the existing exponential backoff (5s → 10s → 20s → 40s → 60s cap, up to 10 attempts)
After 10 failed attempts, _set_fatal_error is called as before → _handle_adapter_fatal_error queues the platform for background reconnection via _platform_reconnect_watcher
No change to behavior when reconnect succeeds or when a fatal error is already set
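The delays quoted above are consistent with doubling from a 5 s base and clamping at 60 s. The following is a sketch under that assumption; backoff_delay, its parameters, and MAX_ATTEMPTS are illustrative names, and the actual computation in the adapter may be written differently.

```python
MAX_ATTEMPTS = 10  # attempt limit stated in the PR description


def backoff_delay(attempt, base=5, cap=60):
    """Delay in seconds before the given 1-based attempt (assumed formula)."""
    return min(base * 2 ** (attempt - 1), cap)


# Delay before each of the attempts allowed before escalation.
schedule = [backoff_delay(n) for n in range(1, MAX_ATTEMPTS + 1)]
```

Under this formula the first four attempts wait 5, 10, 20 and 40 seconds, and every later attempt waits 60 seconds.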

Tests

4 new tests in tests/gateway/test_telegram_network_reconnect.py:

test_reconnect_self_schedules_on_start_polling_failure — regression test for the exact bug
test_reconnect_does_not_self_schedule_when_fatal_error_set — no spurious tasks during shutdown
test_reconnect_success_resets_error_count — happy path unchanged
test_reconnect_triggers_fatal_after_max_retries — escalation path unchanged

152 passed  (148 pre-existing + 4 new)
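The shape of the first two tests can be sketched with unittest.mock.AsyncMock. This is not the code in tests/gateway/test_telegram_network_reconnect.py: handle_error and run_case are hypothetical helpers that reduce the handler to its reconnect step, with the scheduling call injected so it can be observed.

```python
import asyncio
from unittest.mock import AsyncMock


async def handle_error(updater, schedule_retry, has_fatal_error=False):
    """Hypothetical reduction of the handler: stop, restart, reschedule on failure."""
    await updater.stop()
    try:
        await updater.start_polling()
    except Exception as retry_err:
        # Polling is dead here, so the next attempt must be scheduled explicitly.
        if not has_fatal_error:
            schedule_retry(retry_err)


def run_case(has_fatal_error):
    updater = AsyncMock()
    # Make the reconnect attempt fail, as in the reported bug.
    updater.start_polling.side_effect = TimeoutError("Timed out")
    scheduled = []
    asyncio.run(handle_error(updater, scheduled.append, has_fatal_error))
    return scheduled
```

run_case(False) corresponds to the regression case (one retry is scheduled), and run_case(True) to the fatal-error guard (nothing is scheduled).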

…er 502

When Telegram returns HTTP 502 (Bad Gateway), _handle_polling_network_error stops the updater then calls start_polling() again. If that second call also fails (e.g. times out), the old code logged a warning and returned, leaving polling permanently dead with no further error callbacks to trigger the next retry. The gateway process stayed alive but handled no messages and stopped running cron jobs after ~25 minutes.

Root cause: python-telegram-bot's polling error callback is only invoked by the updater's internal loop. Once stop() is called, the updater loop exits and no further callbacks ever fire, so the 'next network error will trigger another attempt' comment was simply wrong.

Fix: when start_polling() raises in the except branch, explicitly schedule a new _handle_polling_network_error task on the running event loop so the exponential-backoff retry chain continues even with no updater running.

Guards: only schedules when the loop is running and no fatal error is set (avoids redundant tasks during shutdown).

Closes NousResearch#3173

After a Telegram 502, _handle_polling_network_error calls updater.stop() then start_polling(). If start_polling() also raises, the old code logged a warning and returned, but the comment 'The next network error will trigger another attempt' was wrong. The updater loop is dead after stop(), so no further error callbacks ever fire. The gateway stays alive but permanently deaf to messages.

Fix: when start_polling() fails in the except branch, schedule a new _handle_polling_network_error task to continue the exponential backoff retry chain. The task is tracked in _background_tasks (preventing GC). Guarded by has_fatal_error to avoid spurious retries during shutdown.

Closes #3173. Salvaged from PR #3177 by Mibayy.

teknium1 · 2026-03-26T22:34:46Z

Merged via PR #3268. Your fix and analysis were spot-on — the dead updater loop meant no callbacks would ever fire again. Salvaged onto current main with a couple tweaks: asyncio.ensure_future() instead of deprecated get_event_loop(), and the retry task is now tracked in _background_tasks (consistent with a task-tracking fix that just landed). Your 4 tests were adapted to match. Thanks for the great bug report and fix!
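The two tweaks described in this comment can be sketched as follows. This is an illustration, not the merged code: ReconnectSketch and its constructor argument are hypothetical, the backoff sleep and the 10-attempt limit are left out for brevity, and only _handle_polling_network_error, _background_tasks and has_fatal_error are names taken from this PR.

```python
import asyncio


class ReconnectSketch:
    """Hypothetical, trimmed-down adapter showing the self-rescheduling pattern."""

    def __init__(self, start_polling):
        self._start_polling = start_polling  # assumed: async callable that may raise
        self._background_tasks = set()
        self.has_fatal_error = False
        self.attempts = 0

    async def _handle_polling_network_error(self, error):
        self.attempts += 1
        try:
            await self._start_polling()
        except Exception as retry_err:
            if not self.has_fatal_error:
                # Schedule the next attempt ourselves; no updater loop is alive to do it.
                task = asyncio.ensure_future(
                    self._handle_polling_network_error(retry_err)
                )
                # Hold a strong reference until the task finishes so it is not
                # garbage collected while still pending.
                self._background_tasks.add(task)
                task.add_done_callback(self._background_tasks.discard)


async def demo(failures=2, fatal=False):
    remaining = failures

    async def flaky_start_polling():
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise TimeoutError("Timed out")

    adapter = ReconnectSketch(flaky_start_polling)
    adapter.has_fatal_error = fatal
    await adapter._handle_polling_network_error(ConnectionError("502 Bad Gateway"))
    # Wait until the retry chain has finished.
    while adapter._background_tasks:
        await asyncio.gather(*adapter._background_tasks)
    return adapter.attempts
```

With two simulated failures the handler runs three times in total: the initial call plus two self-scheduled retries. With has_fatal_error set, it runs once and schedules nothing.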

teknium1 mentioned this pull request Mar 26, 2026

fix(telegram): self-reschedule reconnect when start_polling fails after 502 #3268

Merged

teknium1 closed this Mar 26, 2026

Mibayy wants to merge 1 commit into NousResearch:main from Mibayy:fix/telegram-reconnect-deadlock-3173
