fix(gateway): drain stale httpx connections on Telegram polling reconnect #16466

Closed
Mirac1eSky wants to merge 1 commit into NousResearch:main from Mirac1eSky:fix/telegram-pool-connection-drain
@Mirac1eSky

Contributor

Problem

When the Telegram polling connection drops (e.g. proxy interruption, network blip), the _handle_polling_network_error reconnect path calls updater.stop() followed by start_polling(). However, this does not close the underlying httpx connections in the HTTPXRequest connection pool.

Each network error leaves stale/half-closed connections occupying pool slots. After repeated errors (we observed 40+ per day through a sing-box proxy), the default 256-connection pool fills up entirely, causing:

Pool timeout: All connections in the connection pool are occupied.
Request was *not* sent to Telegram.

At this point the bot becomes completely unresponsive — no inbound messages, no outbound replies.
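
The failure mode can be illustrated with a toy model. This is a sketch only: `ToyPool` and `leak_until_exhausted` are hypothetical names, the pool is a plain `asyncio.Semaphore` rather than httpx, and the size is 4 instead of 256 so the example finishes quickly. Each simulated network error takes a slot and never returns it, which is what a stale connection does to a bounded pool.

```python
import asyncio


class ToyPool:
    """Toy stand-in for a bounded connection pool (not httpx itself)."""

    def __init__(self, size: int) -> None:
        self._slots = asyncio.Semaphore(size)

    async def acquire(self, timeout: float) -> None:
        # Wait for a free slot; give up after `timeout` seconds.
        try:
            await asyncio.wait_for(self._slots.acquire(), timeout)
        except asyncio.TimeoutError:
            raise RuntimeError("pool timeout: all slots occupied") from None

    def release(self) -> None:
        self._slots.release()


async def leak_until_exhausted(size: int) -> int:
    """Take slots without releasing them until acquisition times out."""
    pool = ToyPool(size)
    leaked = 0
    try:
        while True:
            await pool.acquire(timeout=0.01)
            leaked += 1  # simulated stale connection: release() is never called
    except RuntimeError:
        return leaked


def run_demo(size: int = 4) -> int:
    return asyncio.run(leak_until_exhausted(size))
```

In this model the number of leaks before the first timeout equals the pool size, so a larger pool only delays the failure; it does not prevent it.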

Fix

During reconnect in _handle_polling_network_error(), shut down and re-initialize the bot's request objects before starting a new polling session:

await self._app.bot.shutdown()   # release all stale connections
await self._app.bot.initialize() # create fresh connections

Both steps are wrapped in try/except so a failure in either one doesn't block the reconnect attempt.
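
A minimal sketch of that reconnect flow, under stated assumptions: `reconnect_with_drain` is a hypothetical free function rather than the actual `_handle_polling_network_error()` method, the position of `updater.stop()` relative to the drain is assumed, and the failures are logged here as an illustrative choice. The `bot` and `updater` arguments are `AsyncMock` stand-ins, not real python-telegram-bot objects.

```python
import asyncio
import logging
from unittest.mock import AsyncMock

logger = logging.getLogger("telegram_reconnect_sketch")


async def reconnect_with_drain(bot, updater) -> None:
    """Hypothetical reconnect: stop polling, cycle the bot, poll again."""
    await updater.stop()

    # Each step has its own try block so that a failure in shutdown()
    # does not prevent initialize() from running.
    try:
        await bot.shutdown()  # release stale connections
    except Exception:
        logger.warning("bot.shutdown() failed during reconnect", exc_info=True)

    try:
        await bot.initialize()  # create fresh connections
    except Exception:
        logger.warning("bot.initialize() failed during reconnect", exc_info=True)

    await updater.start_polling()


def demo() -> tuple:
    bot = AsyncMock()
    bot.shutdown.side_effect = RuntimeError("shutdown failed")
    updater = AsyncMock()
    asyncio.run(reconnect_with_drain(bot, updater))
    # Both later steps still ran even though shutdown() raised.
    return bot.initialize.await_count, updater.start_polling.await_count
```

Keeping the two steps in separate try blocks means a failed shutdown still leaves a chance for initialization to succeed before polling restarts.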

Evidence

Verified live: after triggering a network error (restart sing-box), the logs show:

16:26:15  WARNING  Telegram network error, scheduling reconnect
16:26:21  INFO     Bot request objects shut down before reconnect
16:26:22  INFO     Bot request objects re-initialized for reconnect
16:26:22  INFO     Telegram polling resumed after network error (attempt 1)

Tests

3 new tests added to tests/gateway/test_telegram_network_reconnect.py:

  • test_reconnect_drains_stale_connections — verifies shutdown → initialize → start_polling order
  • test_reconnect_continues_if_bot_shutdown_fails — shutdown failure doesn't block reconnect
  • test_reconnect_continues_if_bot_initialize_fails — init failure doesn't block reconnect

All 7 tests in the file pass (4 existing + 3 new).

Platforms tested

  • Linux (Ubuntu 24.04, Python 3.11)

…nect

Network errors (especially through a proxy like sing-box) leave httpx
connections in a half-closed state that occupy pool slots. After ~40
errors the 256-connection pool fills up, causing PoolTimeout and making
the bot unresponsive to both inbound and outbound messages.

Fix: during reconnect in _handle_polling_network_error(), shut down
and re-initialize the bot's request objects to release stale connections
before starting a new polling session.

Regression tests: 3 new tests cover the shutdown/init drain, and
graceful continuation if either step fails.
Copilot AI review requested due to automatic review settings April 27, 2026 09:02

Copilot AI left a comment

Contributor


Pull request overview

Fixes Telegram polling reconnection in the gateway by explicitly draining and recreating the bot’s underlying HTTPX request objects during network-error recovery, preventing the connection pool from being exhausted after repeated proxy/network interruptions.

Changes:

  • Add bot.shutdown() + bot.initialize() during _handle_polling_network_error() reconnect flow to release stale httpx pool connections before restarting polling.
  • Add async tests covering the reconnect flow and ensuring reconnect proceeds even if shutdown/initialize fail.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| gateway/platforms/telegram.py | Drains/reinitializes bot request objects prior to restarting Telegram polling after transient network errors. |
| tests/gateway/test_telegram_network_reconnect.py | Adds tests for the new reconnect behavior (drain order + resilience to shutdown/initialize failures). |

Comment on lines +198 to +212
# The order matters: shutdown → initialize → start_polling
call_order = []
for call in mock_bot.shutdown.mock_calls:
    call_order.append("shutdown")
for call in mock_updater.stop.mock_calls:
    call_order.append("stop")
for call in mock_bot.initialize.mock_calls:
    call_order.append("initialize")
for call in mock_updater.start_polling.mock_calls:
    call_order.append("start_polling")

assert "shutdown" in call_order
assert "initialize" in call_order
assert call_order.index("shutdown") < call_order.index("start_polling")
assert call_order.index("initialize") < call_order.index("start_polling")
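
Because the loops above append names in a fixed sequence, the index comparisons hold whenever each method was called at least once, regardless of the order in which the calls actually happened. One possible alternative, sketched below, records each call at the moment it is made through a shared list. The `reconnect` coroutine is a hypothetical stand-in for the code under test, and `record_call_order` is an illustrative name.

```python
import asyncio
from unittest.mock import AsyncMock


async def reconnect(bot, updater) -> None:
    # Hypothetical stand-in for the reconnect path under test.
    await updater.stop()
    await bot.shutdown()
    await bot.initialize()
    await updater.start_polling()


def record_call_order() -> list:
    events = []

    def recorder(name):
        # Returns a side effect that logs `name` when the mock is called.
        def _record(*args, **kwargs):
            events.append(name)
        return _record

    bot = AsyncMock()
    updater = AsyncMock()
    bot.shutdown.side_effect = recorder("shutdown")
    bot.initialize.side_effect = recorder("initialize")
    updater.stop.side_effect = recorder("stop")
    updater.start_polling.side_effect = recorder("start_polling")

    asyncio.run(reconnect(bot, updater))
    return events
```

A test can then compare the returned list against the expected sequence directly, and it would fail if `start_polling` ran before the drain.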
Comment on lines +392 to +397
    pass
try:
    await self._app.bot.initialize()
    logger.debug("[%s] Bot request objects re-initialized for reconnect", self.name)
except Exception:
    pass
Comment on lines +224 to +227
mock_bot = AsyncMock()
mock_bot.shutdown = AsyncMock(side_effect=Exception("shutdown failed"))
mock_bot.initialize = AsyncMock()

Comment on lines +256 to +259
mock_bot = AsyncMock()
mock_bot.shutdown = AsyncMock()
mock_bot.initialize = AsyncMock(side_effect=Exception("init failed"))

alt-glitch added labels on Apr 27, 2026: type/bug (Something isn't working), P1 (High: major feature broken, no workaround), comp/gateway (Gateway runner, session dispatch, delivery), platform/telegram (Telegram bot adapter)
Mirac1eSky force-pushed the fix/telegram-pool-connection-drain branch from 77351d6 to dd94477 on April 28, 2026 03:26
kshitijk4poor pushed a commit that referenced this pull request Apr 28, 2026
…nect

Network errors through proxies (e.g. sing-box) can leave httpx
connections in a half-closed state occupying pool slots.  After enough
reconnect cycles the 256-connection default fills up entirely, causing
Pool timeout: All connections in the connection pool are occupied.

Fix: cycle only the getUpdates request object (_request[0]) via
shut-down + re-initialize before restarting polling.  This drains stale
connections without touching the general request (_request[1]) that
concurrent send_message / edit_message calls rely on.

The drain is applied to both _handle_polling_network_error and
_handle_polling_conflict reconnect paths via a shared
_drain_polling_connections() helper.  Failures in the drain are
swallowed so reconnect always proceeds.

Based on #16466 by @Mirac1eSky.
@kshitijk4poor

Collaborator

Merged via #17015. Your commit was cherry-picked onto current main with your authorship preserved in git log.

The salvage narrows the fix to only reset the polling request object (_request[0]) instead of calling bot.shutdown() which would cycle both connection pools and race with concurrent send_message/edit_message calls. Also extended the drain to the _handle_polling_conflict path and added separate try blocks so initialize() always runs even if the prior step raises.
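
The narrowed drain might look roughly like the sketch below. The helper name and the `_request[0]` / `_request[1]` split come from the commit message; the body is an assumption, not the merged code, and `FakeRequest` / `FakeBot` are hypothetical stand-ins that only mimic async `initialize()` and `shutdown()` methods.

```python
import asyncio
import logging

logger = logging.getLogger("telegram_drain_sketch")


class FakeRequest:
    """Stand-in for a request object with async initialize()/shutdown()."""

    def __init__(self, fail_shutdown: bool = False) -> None:
        self.fail_shutdown = fail_shutdown
        self.generation = 0  # counts how many times initialize() ran

    async def shutdown(self) -> None:
        if self.fail_shutdown:
            raise RuntimeError("shutdown failed")

    async def initialize(self) -> None:
        self.generation += 1


class FakeBot:
    def __init__(self, polling: FakeRequest, general: FakeRequest) -> None:
        # Index 0: getUpdates request, index 1: general request.
        self._request = (polling, general)


async def drain_polling_connections(bot) -> None:
    """Cycle only the polling request; leave the general request alone."""
    polling_request = bot._request[0]
    try:
        await polling_request.shutdown()
    except Exception:
        logger.warning("polling request shutdown failed", exc_info=True)
    try:
        await polling_request.initialize()
    except Exception:
        logger.warning("polling request initialize failed", exc_info=True)


def demo() -> tuple:
    polling = FakeRequest(fail_shutdown=True)
    general = FakeRequest()
    asyncio.run(drain_polling_connections(FakeBot(polling, general)))
    return polling.generation, general.generation
```

In this sketch the general request is never touched, so in-flight `send_message` / `edit_message` calls would not see their pool closed underneath them.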

Thanks for identifying the root cause — the proxy-related pool exhaustion analysis was spot on!

cluricaun28 referenced this pull request in cluricaun28/Logos Apr 28, 2026
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
donald131 pushed a commit to donald131/hermes-agent that referenced this pull request May 2, 2026
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
dannyJ848 pushed a commit to dannyJ848/hermes-agent that referenced this pull request May 17, 2026
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
Seven74AI pushed a commit to Seven74AI/hermes-agent that referenced this pull request Jun 13, 2026