fix(gateway): drain stale httpx connections on Telegram polling reconnect #16466
Mirac1eSky wants to merge 1 commit
…nect

Network errors (especially through a proxy like sing-box) leave httpx connections in a half-closed state, where they keep occupying pool slots. After ~40 errors the 256-connection pool fills up, causing PoolTimeout and making the bot unresponsive to both inbound and outbound messages.

Fix: during reconnect in _handle_polling_network_error(), shut down and re-initialize the bot's request objects to release stale connections before starting a new polling session.

Regression tests: 3 new tests cover the shutdown/init drain and graceful continuation if either step fails.
Pull request overview
Fixes Telegram polling reconnection in the gateway by explicitly draining and recreating the bot’s underlying HTTPX request objects during network-error recovery, preventing the connection pool from being exhausted after repeated proxy/network interruptions.
Changes:
- Add bot.shutdown() + bot.initialize() during the _handle_polling_network_error() reconnect flow to release stale httpx pool connections before restarting polling.
- Add async tests covering the reconnect flow and ensuring reconnect proceeds even if shutdown/initialize fail.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| gateway/platforms/telegram.py | Drains/reinitializes bot request objects prior to restarting Telegram polling after transient network errors. |
| tests/gateway/test_telegram_network_reconnect.py | Adds tests for the new reconnect behavior (drain order + resilience to shutdown/initialize failures). |
    # The order matters: shutdown → initialize → start_polling
    call_order = []
    for call in mock_bot.shutdown.mock_calls:
        call_order.append("shutdown")
    for call in mock_updater.stop.mock_calls:
        call_order.append("stop")
    for call in mock_bot.initialize.mock_calls:
        call_order.append("initialize")
    for call in mock_updater.start_polling.mock_calls:
        call_order.append("start_polling")

    assert "shutdown" in call_order
    assert "initialize" in call_order
    assert call_order.index("shutdown") < call_order.index("start_polling")
    assert call_order.index("initialize") < call_order.index("start_polling")
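The excerpt above fills call_order by visiting each mock in a fixed sequence, so its index comparisons appear to hold whatever order the calls actually ran in. One way to observe the real interleaving is to attach both mocks to a shared parent and read its mock_calls. The sketch below is only an illustration: reconnect() is a hypothetical stand-in for the gateway's reconnect flow, and the position of updater.stop() in it is an assumption, not something taken from the PR.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def reconnect(bot, updater):
    """Hypothetical stand-in for the reconnect flow (not the gateway's code)."""
    await updater.stop()
    await bot.shutdown()
    await bot.initialize()
    await updater.start_polling()


def record_call_order():
    # A shared parent records calls from both children in execution order.
    parent = MagicMock()
    bot, updater = AsyncMock(), AsyncMock()
    parent.attach_mock(bot, "bot")
    parent.attach_mock(updater, "updater")
    asyncio.run(reconnect(bot, updater))
    # Each record is (name, args, kwargs); keep only the dotted name.
    return [record[0] for record in parent.mock_calls]


order = record_call_order()
assert order.index("bot.shutdown") < order.index("bot.initialize")
assert order.index("bot.initialize") < order.index("updater.start_polling")
```

Because the names come from one chronological list, swapping two awaits in reconnect() would make these assertions fail.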
            pass
        try:
            await self._app.bot.initialize()
            logger.debug("[%s] Bot request objects re-initialized for reconnect", self.name)
        except Exception:
            pass
    mock_bot = AsyncMock()
    mock_bot.shutdown = AsyncMock(side_effect=Exception("shutdown failed"))
    mock_bot.initialize = AsyncMock()

    mock_bot = AsyncMock()
    mock_bot.shutdown = AsyncMock()
    mock_bot.initialize = AsyncMock(side_effect=Exception("init failed"))
77351d6 to dd94477
…nect

Network errors through proxies (e.g. sing-box) can leave httpx connections in a half-closed state, still occupying pool slots. After enough reconnect cycles the 256-connection default fills up entirely, causing "Pool timeout: All connections in the connection pool are occupied."

Fix: cycle only the getUpdates request object (_request[0]) via shut-down + re-initialize before restarting polling. This drains stale connections without touching the general request (_request[1]) that concurrent send_message / edit_message calls rely on.

The drain is applied to both the _handle_polling_network_error and _handle_polling_conflict reconnect paths via a shared _drain_polling_connections() helper. Failures in the drain are swallowed so reconnect always proceeds.

Based on #16466 by @Mirac1eSky.
Merged via #17015. Your commit was cherry-picked onto current main with your authorship preserved in git log.

The salvage narrows the fix to reset only the polling request object (_request[0]) instead of calling bot.shutdown(), which would cycle both connection pools and race with concurrent send_message/edit_message calls. It also extends the drain to the _handle_polling_conflict path and uses separate try blocks so initialize() always runs even if the prior step raises.

Thanks for identifying the root cause. The proxy-related pool exhaustion analysis was spot on!
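The "separate try blocks" point can be sketched as follows. This is a minimal illustration, not the merged code: FakeRequest is a hypothetical stand-in for the polling request object, and drain_polling_connections() only mirrors the described behavior of the _drain_polling_connections() helper, whose real body is not shown here.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class FakeRequest:
    """Hypothetical stand-in for a request object with async shutdown/initialize."""

    def __init__(self, fail_shutdown=False):
        self.fail_shutdown = fail_shutdown
        self.events = []

    async def shutdown(self):
        self.events.append("shutdown")
        if self.fail_shutdown:
            raise RuntimeError("shutdown failed")

    async def initialize(self):
        self.events.append("initialize")


async def drain_polling_connections(polling_request, name="telegram"):
    """Cycle one request object; errors are logged and swallowed."""
    # Separate try blocks: a shutdown failure must not skip initialize().
    try:
        await polling_request.shutdown()
    except Exception:
        logger.debug("[%s] polling request shutdown failed", name, exc_info=True)
    try:
        await polling_request.initialize()
    except Exception:
        logger.debug("[%s] polling request initialize failed", name, exc_info=True)


request = FakeRequest(fail_shutdown=True)
asyncio.run(drain_polling_connections(request))
assert request.events == ["shutdown", "initialize"]
```

With a single try block around both calls, the failing shutdown() would jump straight to the handler and leave the request object uninitialized; splitting them keeps the reconnect path usable.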
Problem
When the Telegram polling connection drops (e.g. proxy interruption, network blip), the _handle_polling_network_error reconnect path calls updater.stop() followed by start_polling(). However, this does not close the underlying httpx connections in the HTTPXRequest connection pool.

Each network error leaves stale/half-closed connections occupying pool slots. After repeated errors (we observed 40+ per day through a sing-box proxy), the default 256-connection pool fills up entirely, causing:

    Pool timeout: All connections in the connection pool are occupied.

At this point the bot becomes completely unresponsive: no inbound messages, no outbound replies.
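To get a rough feel for how quickly a fixed-size pool runs out, the arithmetic can be sketched as below. The leak-per-error figures are made-up inputs for illustration, not measurements from this PR; only the 256-slot pool size comes from the description.

```python
import math


def errors_until_exhaustion(pool_size: int, leaked_per_error: int) -> int:
    """Number of network errors after which every pool slot is stale.

    Simplified model (an assumption): each error leaks a constant number
    of connections and none of them is ever reclaimed.
    """
    if leaked_per_error <= 0:
        raise ValueError("leaked_per_error must be positive")
    return math.ceil(pool_size / leaked_per_error)


for leaked in (1, 6, 7):
    print(leaked, errors_until_exhaustion(256, leaked))
```

Under this model, a leak of six or seven connections per error would exhaust a 256-slot pool after roughly 40 errors, which is in the range the description reports.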
Fix
During reconnect in _handle_polling_network_error(), shut down and re-initialize the bot's request objects before starting a new polling session.

Both steps are wrapped in try/except so a failure in either one doesn't block the reconnect attempt.

Evidence
Verified live: after triggering a network error (restart sing-box), the logs show:
Tests
3 new tests added to tests/gateway/test_telegram_network_reconnect.py:
- test_reconnect_drains_stale_connections: verifies shutdown → initialize → start_polling order
- test_reconnect_continues_if_bot_shutdown_fails: shutdown failure doesn't block reconnect
- test_reconnect_continues_if_bot_initialize_fails: init failure doesn't block reconnect

All 7 tests in the file pass (4 existing + 3 new).
Platforms tested