Problem
After ~2 days of continuous operation behind a local HTTP proxy (xray on 127.0.0.1:10808), the gateway's Telegram adapter accumulates hundreds of half-closed sockets in the httpx general-request pool. The OS-level fd count exceeds the macOS launchd default maxfiles=256, after which every subsequent bot.send_message() / set_my_commands() fails:
telegram.error.NetworkError: httpx.ConnectError: All connection attempts failed
Simultaneously, kanban dispatcher and channel-directory writes start failing with [Errno 24] Too many open files and sqlite3.OperationalError: unable to open database file.
gateway_state.json continues to report platforms.telegram.state = "connected" (stale — last updated when the pool was still healthy), so external monitoring does not detect the wedge.
Whether this duplicates an existing report was the first thing I checked. The leak vector here is distinct:
#30230 ("Gateway hits macOS fd limit (256): OSError Too many open files") blames MCP subprocess pipes/sockets in multi-profile setups. In my case there are 0 MCP servers and 1 profile, yet the gateway still reaches 287 open fds after 2 days. Per the lsof breakdown below, 280 of the 287 fds are httpx-through-proxy sockets, not MCP pipes.
Root cause
gateway/platforms/telegram.py::_drain_polling_connections (added in #17015) mitigates this for _request[0] (getUpdates) only, with explicit rationale at lines 822–824:
# We reset ONLY _request[0] (the getUpdates request) — the general
# request (_request[1]) is left untouched so concurrent
# send_message / edit_message calls are never interrupted.
That is reasonable for short outages. Over many days of flaky-proxy operation, however, the general pool accumulates half-closed connections (visible as CLOSED in lsof) faster than httpx evicts them. The cause is that the proxy=… HTTPXRequest construction goes through httpcore's tunnel-proxy path, which does not always release the underlying socket on ConnectError.
After enough cycles, every general-pool slot holds a dead connection and new sends can't acquire one → httpx.ConnectError: All connection attempts failed.
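To make the exhaustion mechanism concrete, here is a toy model in plain asyncio. It is not httpx or httpcore code: LeakyPool, its fail flag, and the acquire timeout are hypothetical names made up for illustration, and the "leak" is simply a slot that is never released after a simulated connect error.

```python
import asyncio


class LeakyPool:
    """Toy model of a fixed-size connection pool (not httpx's real implementation)."""

    def __init__(self, size: int) -> None:
        self._slots = asyncio.Semaphore(size)

    async def request(self, *, fail: bool, acquire_timeout: float = 0.05) -> str:
        # Wait a bounded time for a free slot so exhaustion shows up quickly.
        try:
            await asyncio.wait_for(self._slots.acquire(), acquire_timeout)
        except asyncio.TimeoutError:
            return "pool exhausted"
        if fail:
            # Modelled bug: the slot is never released after a connect error.
            return "connect error (slot leaked)"
        self._slots.release()
        return "ok"


async def demo() -> list[str]:
    pool = LeakyPool(size=2)
    results = []
    # Two failing requests are enough to leak both slots.
    for fail in (True, False, True, False):
        results.append(await pool.request(fail=fail))
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With two slots and two leaked failures, the fourth request cannot acquire a slot. That is the same shape as the wedge described here, only compressed from days into four calls.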
Reproduction
1. Configure the system HTTP/HTTPS proxy to point at a local proxy that occasionally drops connections (xray / clash / v2ray are typical on macOS in restricted-network environments).
2. Start the gateway with Telegram enabled, a single profile, and no MCP servers.
3. Let it run for 24–48h and watch for periodic "Telegram network error, scheduling reconnect: httpx.ConnectError" entries in gateway.log.
4. After enough cycles, lsof -p <gateway_pid> | wc -l exceeds the soft limit reported by launchctl limit maxfiles, and all sends fail.
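The same comparison can be made from inside a Python process instead of with lsof, by checking the open-descriptor count against the soft RLIMIT_NOFILE. This is only a sketch: fd_headroom and warn_if_low are hypothetical helpers, not existing gateway functions, and the 1024 threshold is an illustrative default.

```python
import os
import resource


def fd_headroom() -> tuple[int, int]:
    """Return (open_fds, soft_limit) for the current process; open_fds is -1 if unknown."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # /dev/fd lists this process's descriptors on macOS and Linux.
        # Listing it briefly opens one more descriptor, so the count is approximate.
        open_fds = len(os.listdir("/dev/fd"))
    except OSError:
        open_fds = -1
    return open_fds, soft


def warn_if_low(min_soft_limit: int = 1024) -> bool:
    """Print one warning and return True when the soft fd limit is below the threshold."""
    open_fds, soft = fd_headroom()
    if soft == resource.RLIM_INFINITY:
        return False  # unlimited, nothing to warn about
    if soft < min_soft_limit:
        print(f"WARN: soft fd limit is {soft} (currently {open_fds} open)")
        return True
    return False


if __name__ == "__main__":
    print(fd_headroom(), warn_if_low())
```

Run at startup, a check like this would flag the 256 default immediately rather than after two days of leaking.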
Workaround (confirmed)
hermes gateway restart clears the leaked sockets (fd 287 → 54, Telegram resumes). Recurs in 1–2 days.
Suggested fixes
In rough order of impact:
1. Bound the general pool when a proxy is configured: pass limits=httpx.Limits(max_connections=20, max_keepalive_connections=10) into the HTTPXRequest(..., proxy=proxy_url) construction at gateway/platforms/telegram.py:1424–1425. This caps the leak and makes it surface immediately instead of after days.
2. Periodically drain _request[1]: on a low-frequency schedule (e.g., hourly), gracefully drain the general request with a brief grace period for in-flight sends. This is symmetrical with the existing polling-pool drain and is the targeted fix.
3. Heartbeat on the send path, not just polling: update platforms.telegram.updated_at from a probe that exercises _request[1], so that a wedged-but-still-polling state is observable externally instead of being silently reported as connected.
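For the periodic drain, a rough sketch of what "graceful" could mean, using only asyncio and a fake client. DrainableClient, FakeClient, and the grace parameter are hypothetical. The real change would have to wrap whatever object backs _request[1], and I have not verified that it supports this kind of swap.

```python
import asyncio


class FakeClient:
    """Stand-in for the real HTTP client (hypothetical)."""

    def __init__(self) -> None:
        self.closed = False

    async def send(self, payload: str) -> str:
        await asyncio.sleep(0.01)  # pretend network round trip
        return f"sent {payload}"

    async def aclose(self) -> None:
        self.closed = True


class _Generation:
    """One client plus a count of the sends currently using it."""

    def __init__(self, client) -> None:
        self.client = client
        self.in_flight = 0
        self.idle = asyncio.Event()
        self.idle.set()


class DrainableClient:
    def __init__(self, factory) -> None:
        self._factory = factory
        self._gen = _Generation(factory())

    async def send(self, payload: str) -> str:
        gen = self._gen  # pin the generation this send started on
        gen.in_flight += 1
        gen.idle.clear()
        try:
            return await gen.client.send(payload)
        finally:
            gen.in_flight -= 1
            if gen.in_flight == 0:
                gen.idle.set()

    async def drain(self, grace: float = 5.0) -> bool:
        """Swap in a fresh client, wait up to `grace` s for old sends, then close the old client."""
        old, self._gen = self._gen, _Generation(self._factory())
        try:
            await asyncio.wait_for(old.idle.wait(), grace)
            clean = True
        except asyncio.TimeoutError:
            clean = False  # grace expired; remaining sends lose their client
        await old.client.aclose()
        return clean


async def demo():
    pool = DrainableClient(FakeClient)
    first = pool._gen.client
    task = asyncio.create_task(pool.send("hello"))
    await asyncio.sleep(0)  # let the send start before draining
    clean = await pool.drain(grace=1.0)
    return await task, clean, first.closed, pool._gen.client is not first


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A background task could call drain() on a low-frequency timer (hourly, say). Sends that are still running when the grace period expires would see the old client closed underneath them, so the grace value is a trade-off between interrupting slow sends and holding on to dead sockets.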
I'm happy to send a PR for fix (1) if a maintainer can confirm the approach — it's a 2-line change at telegram.py:1414–1425 and the failure mode it prevents is well-bounded.
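The send-path heartbeat from fix (3) could look roughly like the following. This is a sketch under assumptions: probe_send_path is a hypothetical helper, the "degraded" state name and the epoch-seconds updated_at format are my guesses rather than the gateway's actual schema, and probe stands for any cheap call that goes through _request[1].

```python
import asyncio
import json
import time
from pathlib import Path


async def probe_send_path(probe, state_file: Path, timeout: float = 10.0) -> str:
    """Run `probe` (a coroutine function exercising the send path) and record the outcome."""
    try:
        await asyncio.wait_for(probe(), timeout)
        state = "connected"
    except Exception:
        # Any failure or timeout on the send path is reported, not hidden.
        state = "degraded"  # hypothetical state name

    # Merge into the existing state file so other platforms' entries survive.
    data = json.loads(state_file.read_text()) if state_file.exists() else {}
    telegram = data.setdefault("platforms", {}).setdefault("telegram", {})
    telegram["state"] = state
    telegram["updated_at"] = time.time()  # assumed format: epoch seconds
    state_file.write_text(json.dumps(data))
    return state
```

Because the probe goes through the same pool as send_message, a wedged general pool would flip the reported state even while polling keeps working.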
Why this is NOT a duplicate of #30230 or #5729 / #21548
#5729 / #21548 concern the polling pool (_request[0]) while the general pool is healthy and getMe works. My case is the opposite: the polling pool is fine and reconnects via _drain_polling_connections work; the general pool (_request[1]) is the one accumulating dead connections, and eventually bot.send_message() (which routes through _request[1]) fails.
Evidence
Captured from a wedged gateway (uptime ~2 days, single profile, no MCP servers configured):
280 of the 287 fds terminate at the local proxy port. Persistent log pattern in the days leading up to the wedge:
i.e., proxy hiccups → reconnect ladder fires → polling pool gets drained correctly → but each cycle also leaks 1–2 connections in the general pool (which set_my_commands, send_message, and the resolver-fallback HTTPXRequest all use).
One more suggested fix, after the three listed earlier: check for maxfiles < 1024 at startup and emit a single WARN.
Environment
Version: 0.14.0 (commit 7f1b2b4)
Proxy: 127.0.0.1:10808 (xray)
maxfiles: 256 (default)
Related
_drain_polling_connections for polling pool only