Skip to content

fix(gateway): bound Telegram general pool on proxy path to cap fd leak#32003

Closed
konsisumer wants to merge 1 commit into
NousResearch:mainfrom
konsisumer:fix/telegram-proxy-pool-bound
Closed

fix(gateway): bound Telegram general pool on proxy path to cap fd leak#32003
konsisumer wants to merge 1 commit into
NousResearch:mainfrom
konsisumer:fix/telegram-proxy-pool-bound

Conversation

@konsisumer

Copy link
Copy Markdown
Contributor

Bounds the Telegram adapter's httpx connection pool when a proxy is configured, so half-closed sockets can no longer accumulate until the process hits its file-descriptor limit.

What does this PR do?

When the Telegram adapter routes through a local HTTP proxy, half-closed connections accumulate in the httpx general request pool faster than httpx evicts them — httpcore's tunnel-proxy path does not always release the underlying socket on ConnectError. With the current unbounded connection_pool_size default (512), these dead CLOSED sockets pile up over days of flaky-proxy operation until the process exceeds its fd limit and every send_message / set_my_commands fails with httpx.ConnectError: All connection attempts failed.

This implements the reporter's suggested Refs #1: pass a bounded httpx.Limits into the proxy HTTPXRequest construction. Capping max_connections/max_keepalive_connections bounds the leak and makes it surface immediately instead of silently wedging the gateway after ~2 days. The caps are env-overridable (HERMES_TELEGRAM_PROXY_MAX_CONNECTIONS, HERMES_TELEGRAM_PROXY_MAX_KEEPALIVE) for deployments that legitimately need more.

Scoped to the proxy branch only — the non-proxy and fallback-IP paths are unchanged. The issue's other suggested fixes (periodic drain of _request[1], send-path heartbeat, maxfiles startup warning) are deliberately deferred, hence Refs rather than Fixes.

Related Issue

Refs #31599

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • gateway/platforms/telegram.py: add _bounded_proxy_limits() helper returning an env-overridable httpx.Limits (defaults: max_connections=20, max_keepalive_connections=10); pass it via httpx_kwargs={"limits": ...} to the proxy HTTPXRequest and get_updates_request construction. PTB merges httpx_kwargs on top of its connection_pool_size-derived limits, so the bounded limits win.
  • tests/gateway/test_telegram_proxy_pool_bound.py: unit tests for the helper's defaults, env overrides, and invalid-env fallback.

How to Test

  1. pytest tests/gateway/test_telegram_proxy_pool_bound.py -q — passes (3 tests).
  2. Configure TELEGRAM_PROXY (or a system proxy resolved by resolve_proxy_url) pointing at a flaky local proxy; start the gateway with Telegram enabled. Over reconnect cycles, lsof -p <gateway_pid> | wc -l now plateaus near the bounded pool size instead of climbing past launchctl limit maxfiles.
  3. Note: the full pytest tests/gateway/ -k telegram selection shows 6 pre-existing order-dependent failures (MarkdownV2-escaping / model-picker tests) that reproduce on main without this change and pass in isolation; they are unrelated to this proxy fix.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 15 (darwin-arm64)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Through a flaky local HTTP proxy, half-closed sockets accumulate in the
httpx general request pool faster than httpx evicts them (httpcore's
tunnel-proxy path does not always release the socket on ConnectError).
With the unbounded connection_pool_size default the dead sockets pile up
until the process hits its fd limit and every send fails after ~2 days.

Pass a bounded httpx.Limits (max_connections=20, max_keepalive=10,
env-overridable) into the proxy HTTPXRequest construction so the leak is
capped and surfaces immediately instead of silently wedging the gateway.

Refs NousResearch#31599
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery platform/telegram Telegram bot adapter labels May 25, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Competing PR — #31885 and #31687 also address the same Telegram proxy-path fd leak (Refs #31599). All three take slightly different approaches to bounding the httpx connection pool on the proxy code path.

@hclsys

hclsys commented May 25, 2026

Copy link
Copy Markdown

The fix is sensible — bounding the proxy-path httpx pool so half-closed tunnel sockets can't accumulate to the fd limit, with env-overridable caps and an immediate-failure-over-slow-leak tradeoff, is the right shape for #31599.

But heads-up: this looks like it overlaps your own still-open #31885 (fix(gateway): bound Telegram proxy httpx pools). Both target #31599, both touch the proxy branch in gateway/platforms/telegram.py, both add an httpx.Limits(...) helper (_bounded_proxy_limits() here vs _proxy_http_limits() there) passed via httpx_kwargs={"limits": ...} on the proxy HTTPXRequest, and both add a near-identical test (test_telegram_proxy_pool_bound.py vs test_telegram_proxy_limits.py). They'll conflict on the same lines, and a maintainer would have to pick one. Might be cleanest to fold this into #31885 (or close that in favor of this) and note which is canonical, so they're not reviewed/merged against each other.

One verification I couldn't do locally (the telegram SDK isn't installed in my env): please confirm the pinned python-telegram-bot version's HTTPXRequest actually accepts httpx_kwargs={"limits": ...} and forwards limits to the underlying httpx.AsyncClient — if that kwarg name changed across PTB versions, the proxy branch would raise at construction. Assuming that's confirmed (and presumably it is, since #31885 uses the same mechanism), the bounding logic itself is sound.

@konsisumer

Copy link
Copy Markdown
Contributor Author

Closing — deferring to #31885 by @konsisumer which addresses the same. Reopen if that PR stalls.

@konsisumer konsisumer closed this Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists platform/telegram Telegram bot adapter type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants