Skip to content

fix: Bound Telegram httpx general-pool and drain after reconnect (#31599)#37400

Open
datin-antasena wants to merge 1 commit into
NousResearch:mainfrom
datin-antasena:fix/telegram-httpx-pool-fd-leak
Open

fix: Bound Telegram httpx general-pool and drain after reconnect (#31599)#37400
datin-antasena wants to merge 1 commit into
NousResearch:mainfrom
datin-antasena:fix/telegram-httpx-pool-fd-leak

Conversation

@datin-antasena

Copy link
Copy Markdown

Summary

Fixes the Telegram adapter FD leak where half-closed httpx sockets accumulate in the general request pool (_request[1]) through proxy reconnection cycles, eventually hitting OSError: [Errno 24] Too many open files after ~2 days.

Credit: Excellent root cause analysis by @JustinHuber in #31599. The distinction between polling pool (_request[0], already drained) and general pool (_request[1], never drained) was the key insight.

Root Cause

  1. connection_pool_size=512 in request_kwargs — far more than any single adapter needs
  2. _drain_polling_connections() only resets _request[0] (polling). _request[1] (general, used by send_message/set_my_commands) is never drained
  3. _http_client_limits.py already provides platform_httpx_limits() with tighter defaults (keepalive 2s, max 10 idle), used by Signal, WeCom, DingTalk, BlueBubbles — but Telegram wasn't using it

Fix

Three changes, minimal surface:

  1. Bound the general poolconnection_pool_size reduced from 512 to 20. platform_httpx_limits() now applied to both HTTPXRequest instances via httpx_kwargs["limits"], giving max_connections=20, max_keepalive_connections=10, keepalive_expiry=2.0. HERMES_TELEGRAM_HTTP_POOL_SIZE env var still works as override.

  2. Drain general pool after reconnect — Added _drain_general_connections() (calls _drain_request_connections(1, "General")). Called after successful polling reconnect in _handle_polling_network_error, when the send path is already marked degraded. Not called during normal polling drain to avoid interrupting active sends.

  3. Refactored drain helper_drain_request_connections(request_index, label) is the shared implementation. _drain_polling_connections() and _drain_general_connections() are thin wrappers.

Tests

  • test_connect_bounds_httpx_request_pools_with_shared_limits — verifies limits applied correctly
  • test_connect_keeps_telegram_pool_size_env_override — verifies env var override still works
  • test_reconnect_drains_polling_request_only — polling drain helper doesn't touch general pool
  • test_reconnect_success_drains_general_request_after_polling_resumes — general pool drained after successful reconnect
  • 567 Telegram tests pass (canonical per-file runner)

Why not drain general pool on every reconnect attempt?

The general pool serves send_message / edit_message. Draining it mid-reconnect (before polling resumes) would interrupt in-flight user-facing sends. We drain it only after polling successfully resumes, when the send path is already degraded and callers should be using fallback delivery.

Closes #31599

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists platform/telegram Telegram bot adapter type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Telegram adapter leaks httpx general-pool connections through HTTP proxy (CLOSED sockets accumulate, fd limit hit after ~2 days)

2 participants