fix(gateway): bound Telegram general pool on proxy path to cap fd leak #32003
konsisumer wants to merge 1 commit into
Through a flaky local HTTP proxy, half-closed sockets accumulate in the httpx general request pool faster than httpx evicts them (httpcore's tunnel-proxy path does not always release the socket on ConnectError). With the unbounded connection_pool_size default the dead sockets pile up until the process hits its fd limit and every send fails after ~2 days. Pass a bounded httpx.Limits (max_connections=20, max_keepalive=10, env-overridable) into the proxy HTTPXRequest construction so the leak is capped and surfaces immediately instead of silently wedging the gateway. Refs NousResearch#31599
The fix is sensible: bounding the proxy-path httpx pool so half-closed tunnel sockets can't accumulate to the fd limit, with env-overridable caps and an immediate-failure-over-slow-leak tradeoff, is the right shape for #31599. But heads-up: this looks like it overlaps your own still-open #31885 ( One verification I couldn't do locally (the
Closing: deferring to #31885 by @konsisumer, which addresses the same issue. Reopen if that PR stalls.
Bounds the Telegram adapter's httpx connection pool when a proxy is configured, so half-closed sockets can no longer accumulate until the process hits its file-descriptor limit.
What does this PR do?
When the Telegram adapter routes through a local HTTP proxy, half-closed connections accumulate in the httpx general request pool faster than httpx evicts them, because httpcore's tunnel-proxy path does not always release the underlying socket on `ConnectError`. With the current, effectively unbounded `connection_pool_size` default (512), these dead `CLOSED` sockets pile up over days of flaky-proxy operation until the process exceeds its fd limit and every `send_message` / `set_my_commands` call fails with `httpx.ConnectError: All connection attempts failed`.
This implements the reporter's first suggested fix: pass a bounded `httpx.Limits` into the proxy `HTTPXRequest` construction. Capping `max_connections` / `max_keepalive_connections` bounds the leak and makes it surface immediately instead of silently wedging the gateway after ~2 days. The caps are env-overridable (`HERMES_TELEGRAM_PROXY_MAX_CONNECTIONS`, `HERMES_TELEGRAM_PROXY_MAX_KEEPALIVE`) for deployments that legitimately need more.
Scoped to the proxy branch only: the non-proxy and fallback-IP paths are unchanged. The issue's other suggested fixes (periodic drain of `_request[1]`, send-path heartbeat, `maxfiles` startup warning) are deliberately deferred, hence `Refs` rather than `Fixes`.
Related Issue
Refs #31599
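The bounded-limits approach described under "What does this PR do?" can be sketched roughly as follows. This is a minimal illustration and not the code in the diff: the function names `_positive_int_env`, `proxy_pool_caps`, and `bounded_proxy_limits` are hypothetical, and both the fallback policy for invalid values and the keep-alive clamp are assumptions. Only the env var names and the 20/10 defaults come from this PR.

```python
import os

# Defaults and env var names as stated in the PR description.
DEFAULT_MAX_CONNECTIONS = 20
DEFAULT_MAX_KEEPALIVE = 10


def _positive_int_env(name: str, default: int) -> int:
    """Read a positive integer from the environment.

    Falls back to `default` when the variable is unset, non-numeric,
    or not positive (assumed policy, not confirmed by the PR).
    """
    raw = os.environ.get(name, "").strip()
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def proxy_pool_caps() -> tuple[int, int]:
    """Return (max_connections, max_keepalive_connections) for the proxy path."""
    max_connections = _positive_int_env(
        "HERMES_TELEGRAM_PROXY_MAX_CONNECTIONS", DEFAULT_MAX_CONNECTIONS
    )
    max_keepalive = _positive_int_env(
        "HERMES_TELEGRAM_PROXY_MAX_KEEPALIVE", DEFAULT_MAX_KEEPALIVE
    )
    # Assumed clamp: keep-alive slots never exceed the total connection cap.
    return max_connections, min(max_keepalive, max_connections)


def bounded_proxy_limits():
    """Build the httpx.Limits object for the proxy request.

    httpx is imported lazily so the env parsing above works without it.
    """
    import httpx

    max_connections, max_keepalive = proxy_pool_caps()
    return httpx.Limits(
        max_connections=max_connections,
        max_keepalive_connections=max_keepalive,
    )
```

Under these assumptions, the result would be handed to the proxy `HTTPXRequest` as `httpx_kwargs={"limits": bounded_proxy_limits()}`, which is the wiring the Changes Made section describes.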
Type of Change
Changes Made
- `gateway/platforms/telegram.py`: add `_bounded_proxy_limits()` helper returning an env-overridable `httpx.Limits` (defaults: `max_connections=20`, `max_keepalive_connections=10`); pass it via `httpx_kwargs={"limits": ...}` to the proxy `HTTPXRequest` and `get_updates_request` construction. PTB merges `httpx_kwargs` on top of its `connection_pool_size`-derived limits, so the bounded limits win.
- `tests/gateway/test_telegram_proxy_pool_bound.py`: unit tests for the helper's defaults, env overrides, and invalid-env fallback.
How to Test
1. `pytest tests/gateway/test_telegram_proxy_pool_bound.py -q` passes (3 tests).
2. With `TELEGRAM_PROXY` (or a system proxy resolved by `resolve_proxy_url`) pointing at a flaky local proxy, start the gateway with Telegram enabled. Over reconnect cycles, `lsof -p <gateway_pid> | wc -l` now plateaus near the bounded pool size instead of climbing past `launchctl limit maxfiles`.
3. The `pytest tests/gateway/ -k telegram` selection shows 6 pre-existing order-dependent failures (MarkdownV2-escaping / model-picker tests) that reproduce on `main` without this change and pass in isolation; they are unrelated to this proxy fix.
Checklist
Code
- … `fix(scope):`, `feat(scope):`, etc.)
- … `pytest tests/ -q` and all tests pass
Documentation & Housekeeping
- … `docs/`, docstrings), or N/A
- … `cli-config.yaml.example` if I added/changed config keys, or N/A
- … `CONTRIBUTING.md` or `AGENTS.md` if I changed architecture or workflows, or N/A
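For the manual check under How to Test, the `lsof` count is only meaningful relative to the process's descriptor limit. The sketch below shows one way a process could report its own usage against the soft limit. It uses only the standard library, assumes a Unix system that exposes `/dev/fd` (Linux and macOS do), and the function names are hypothetical and not part of this PR.

```python
import os
import resource


def fd_usage() -> tuple[int, int]:
    """Return (open_fd_count, soft_limit) for the current process.

    Listing /dev/fd briefly opens the directory itself, so the count
    may include one descriptor more than were open beforehand.
    """
    open_fds = len(os.listdir("/dev/fd"))
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft_limit


def describe_fd_usage() -> str:
    """Format the current usage as a single log-friendly line."""
    open_fds, soft_limit = fd_usage()
    return f"open fds: {open_fds} / soft limit: {soft_limit}"


if __name__ == "__main__":
    print(describe_fd_usage())
```

Sampling this periodically while the proxy is flaky should show the count levelling off rather than growing toward the limit, which is the behaviour the bounded pool is meant to produce.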