
fix(gateway): bound Telegram proxy httpx pools to stop fd leak #31885

Closed
konsisumer wants to merge 1 commit into NousResearch:main from konsisumer:fix/telegram-proxy-general-pool-leak
@konsisumer (Contributor)

What does this PR do?

The Telegram adapter accumulates half-closed sockets in its httpx general-request pool when running behind a tunneling proxy. As the reporter documents in #31599, httpcore does not reliably release the underlying socket on ConnectError, so over long flaky-proxy runs CLOSED connections pile up (lsof showed 280/287 fds terminating at the local proxy port) until the process fd limit is hit and every bot.send_message() / set_my_commands() fails with httpx.ConnectError: All connection attempts failed.

The existing _drain_polling_connections only resets the polling pool (_request[0]) and deliberately leaves the general pool (_request[1]) untouched, so that leak vector was unbounded.

This applies the reporter's prioritized suggestion (#1): when a proxy is configured, construct both proxied HTTPXRequest pools with bounded httpx.Limits and a finite keepalive_expiry. The cap stops unbounded growth, and the keepalive expiry lets httpx evict idle/dead connections during pool maintenance instead of pinning them for the process lifetime. All three knobs are env-tunable so high-throughput proxied deployments can widen them.

This is a bounded mitigation of the leak, not a full cure of the upstream httpcore socket-release behavior — hence Refs rather than Fixes. The reporter's deeper options (periodic general-pool drain, send-path heartbeat for observability) are left for maintainer direction.

Related Issue

Refs #31599

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • gateway/platforms/telegram.py: add module-level _proxy_http_limits() helper that builds bounded httpx.Limits (defaults: max_connections=20, max_keepalive_connections=10, keepalive_expiry=30.0s), each overridable via HERMES_TELEGRAM_PROXY_MAX_CONNECTIONS / HERMES_TELEGRAM_PROXY_MAX_KEEPALIVE / HERMES_TELEGRAM_PROXY_KEEPALIVE_EXPIRY. Pass these limits through httpx_kwargs when constructing the proxied request and get-updates pools. Non-proxy and fallback-IP paths are unchanged.
  • tests/gateway/test_telegram_proxy_limits.py: unit tests for the helper — defaults, env overrides, and garbage-env fallback.
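For readers without the diff at hand, the env-override behavior described above can be sketched roughly as follows. This is not the PR's `_proxy_http_limits()` body: `proxy_limit_values` and `_env_number` are hypothetical names, the sketch returns a plain dict rather than an `httpx.Limits` object so it runs without httpx installed, and treating non-positive values as garbage is an assumption. Only the environment variable names and the defaults come from the description.

```python
import os
from typing import Callable, Dict, Mapping, Optional, TypeVar, Union

T = TypeVar("T", int, float)


def _env_number(env: Mapping[str, str], name: str, default: T,
                cast: Callable[[str], T]) -> T:
    """Read a positive number from env, falling back to default on garbage."""
    raw = env.get(name)
    if raw is None:
        return default
    try:
        value = cast(raw.strip())
    except ValueError:
        return default
    # Assumption: zero, negative, and NaN values count as garbage too.
    return value if value > 0 else default


def proxy_limit_values(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Union[int, float]]:
    """Hypothetical helper: keyword arguments for httpx.Limits."""
    env = os.environ if env is None else env
    return {
        "max_connections": _env_number(
            env, "HERMES_TELEGRAM_PROXY_MAX_CONNECTIONS", 20, int),
        "max_keepalive_connections": _env_number(
            env, "HERMES_TELEGRAM_PROXY_MAX_KEEPALIVE", 10, int),
        "keepalive_expiry": _env_number(
            env, "HERMES_TELEGRAM_PROXY_KEEPALIVE_EXPIRY", 30.0, float),
    }


# With httpx installed, the dict would feed the real object:
#   limits = httpx.Limits(**proxy_limit_values())
```

Falling back to the default on an unparseable value keeps a typo in the environment from disabling the cap altogether.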

How to Test

  1. pytest tests/gateway/test_telegram_proxy_limits.py -q — verifies the limits defaults and env overrides.
  2. pytest tests/tools/test_send_message_telegram_proxy.py tests/gateway/test_proxy_mode.py -q — confirms the proxy send/construction paths still pass (27 passed locally).
  3. Functional: with TELEGRAM_PROXY set, start the gateway and confirm lsof -p <pid> | wc -l stabilizes under flaky-proxy reconnect cycles instead of climbing past the fd limit; the proxied pools are now capped at max_connections (20 by default) rather than 512.
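For step 3, a small script can make the fd check repeatable instead of eyeballing `wc -l`. The sketch below is not part of the PR: `count_proxy_fds` and `lsof_for_pid` are hypothetical helpers, and the parsing assumes the usual `lsof -nP` layout in which a TCP row's NAME column reads `local->remote (STATE)`.

```python
import subprocess
from typing import Tuple


def count_proxy_fds(lsof_output: str, proxy_port: int) -> Tuple[int, int]:
    """Return (fds whose remote end is the proxy port, total fds).

    Assumes the first line is the lsof header and that TCP rows carry
    a 'local->remote' pair in the NAME column.
    """
    rows = [ln for ln in lsof_output.splitlines()[1:] if ln.strip()]
    suffix = f":{proxy_port}"
    hits = 0
    for row in rows:
        _, arrow, remote = row.partition("->")
        if not arrow:
            continue  # not a connected socket row
        parts = remote.split()
        if parts and parts[0].endswith(suffix):
            hits += 1
    return hits, len(rows)


def lsof_for_pid(pid: int) -> str:
    """Capture `lsof -nP -p <pid>` output (numeric hosts and ports)."""
    result = subprocess.run(
        ["lsof", "-nP", "-p", str(pid)],
        capture_output=True, text=True, check=False,
    )
    return result.stdout
```

Sampling `count_proxy_fds(lsof_for_pid(pid), port)` every few minutes during a flaky-proxy run should show the first number levelling off rather than climbing if the cap is taking effect.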

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(gateway):)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix
  • I've run the relevant pytest suites and they pass
  • I've added tests for my changes
  • I've tested on my platform: macOS on darwin-arm64

Documentation & Housekeeping

  • I've updated relevant documentation (docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A (env-only knobs)
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) — N/A (pure httpx limits config, no platform-specific calls)

The Telegram adapter's general-request pool accumulates half-closed
sockets when running behind a tunneling proxy: httpcore does not reliably
release the underlying socket on ConnectError, so over long flaky-proxy
runs CLOSED connections pile up until the process fd limit is hit and all
sends fail. Cap the proxied pools and set a finite keepalive_expiry so
idle/dead connections are evicted during pool maintenance instead of
pinned for the process lifetime. All three knobs are env-tunable.

Refs NousResearch#31599
@alt-glitch added the type/bug (Something isn't working), P2 (Medium — degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery), and platform/telegram (Telegram bot adapter) labels on May 25, 2026
@alt-glitch (Collaborator)

Competing fix with #31687 — both address #31599 (Telegram proxy httpx fd leak). #31687 routes proxy-path pools through the shared platform_httpx_limits() helper. This PR adds a standalone _proxy_http_limits() with env-tunable knobs. Same root cause, different approaches. Recommend consolidating into a single PR.

@konsisumer (Contributor, Author)

Closing — deferring to #31687 by @konsisumer, which addresses the same issue. Reopen if that PR stalls.

@konsisumer closed this on Jun 4, 2026