
Telegram adapter leaks httpx general-pool connections through HTTP proxy (CLOSED sockets accumulate, fd limit hit after ~2 days) #31599

@JustinHuber


Problem

After ~2 days of continuous operation behind a local HTTP proxy (xray on 127.0.0.1:10808), the gateway's Telegram adapter accumulates hundreds of half-closed sockets in the httpx general-request pool. The OS-level fd count exceeds the macOS launchd default maxfiles=256, after which every subsequent bot.send_message() / set_my_commands() fails:

telegram.error.NetworkError: httpx.ConnectError: All connection attempts failed

Simultaneously, kanban dispatcher and channel-directory writes start failing with [Errno 24] Too many open files and sqlite3.OperationalError: unable to open database file.

gateway_state.json continues to report platforms.telegram.state = "connected" (stale — last updated when the pool was still healthy), so external monitoring does not detect the wedge.
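
Until the gateway reports send-path health itself, an external monitor could at least refuse to trust an old "connected" value. The following is a minimal sketch, not the gateway's actual logic: it assumes `gateway_state.json` carries `platforms.telegram.updated_at` as epoch seconds (the on-disk format is not confirmed here), and the function names and the 300-second threshold are hypothetical.

```python
import json
from pathlib import Path


def load_state(path: str) -> dict:
    """Read the gateway state file (path supplied by the caller)."""
    return json.loads(Path(path).read_text())


def telegram_state_is_fresh(state: dict, now: float, max_age_s: float = 300.0) -> bool:
    """Return True only if Telegram reports 'connected' AND the report is recent.

    Assumption: updated_at is a numeric epoch timestamp in seconds.
    """
    tg = state.get("platforms", {}).get("telegram", {})
    if tg.get("state") != "connected":
        return False
    updated_at = tg.get("updated_at")
    if not isinstance(updated_at, (int, float)):
        return False  # missing or non-numeric timestamp: do not trust it
    return (now - updated_at) <= max_age_s
```

A monitor would call `telegram_state_is_fresh(load_state("gateway_state.json"), time.time())` and alert when it returns False.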

Why this is NOT a duplicate of #30230 or #5729 / #21548

This was the first thing I checked. The leak vector here is distinct.

Evidence

Captured from a wedged gateway (uptime ~2 days, single profile, no MCP servers configured):

$ lsof -p <gateway_pid> | wc -l
287                                      # vs launchctl limit maxfiles soft = 256

$ lsof -p <gateway_pid> | awk '{print $5}' | sort | uniq -c | sort -rn
  235 IPv4
   42 REG
    3 unix
    ...

$ lsof -a -p <gateway_pid> -iTCP | awk '{print $NF}' | sort | uniq -c | sort -rn
  267 (CLOSED)
  117 (ESTABLISHED)
    4 (CLOSE_WAIT)

$ lsof -a -p <gateway_pid> -iTCP | awk '{print $9}' | sed 's/.*->//' | sort | uniq -c | sort -rn | head -3
  280  localhost:10808     ← local xray HTTP proxy
   12  216.38.168.230:45979
   10  localhost:13580
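
For repeated sampling, the same tallies can be computed from saved `lsof -a -p <gateway_pid> -iTCP` output. This is a rough sketch that assumes the usual lsof layout, with the connection name as the second-to-last field and the state in parentheses as the last. `SAMPLE` is made-up illustrative input, not captured data, and `tally_tcp` is a hypothetical helper name.

```python
from collections import Counter

# Illustrative input only; real input would be captured lsof output.
SAMPLE = """\
COMMAND  PID USER   FD  TYPE DEVICE SIZE/OFF NODE NAME
python  1234 user   12u IPv4 0x1         0t0  TCP localhost:50001->localhost:10808 (CLOSED)
python  1234 user   13u IPv4 0x2         0t0  TCP localhost:50002->localhost:10808 (CLOSED)
python  1234 user   14u IPv4 0x3         0t0  TCP localhost:50003->localhost:10808 (ESTABLISHED)
"""


def tally_tcp(lsof_output: str) -> tuple[Counter, Counter]:
    """Return (states, remote endpoints) counted from lsof TCP output."""
    states, remotes = Counter(), Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        state, name = fields[-1], fields[-2]
        if not (state.startswith("(") and state.endswith(")")):
            continue  # header line, or a socket listed without a state
        states[state.strip("()")] += 1
        if "->" in name:
            # Keep only the remote side of "local->remote".
            remotes[name.split("->", 1)[1]] += 1
    return states, remotes


states, remotes = tally_tcp(SAMPLE)
print(states.most_common())
print(remotes.most_common(3))
```

Logging these two counters every few minutes would show whether CLOSED sockets toward the proxy port grow in step with reconnect cycles.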

280 of the 287 fds terminate at the local proxy port. Persistent log pattern in the days leading up to the wedge:

[Telegram] Telegram network error, scheduling reconnect: httpx.ConnectError:
[Telegram] Telegram network error (attempt 1/10), reconnecting in 5s. Error: httpx.ConnectError:
[Telegram] Telegram polling reconnect failed: httpx.ConnectError:
[Telegram] Telegram polling resumed after network error (attempt N)

i.e., proxy hiccups → reconnect ladder fires → polling pool gets drained correctly → but each cycle also leaks 1–2 connections in the general pool (which set_my_commands, send_message, and the resolver-fallback HTTPXRequest all use).

Root cause

gateway/platforms/telegram.py::_drain_polling_connections (added in #17015) mitigates this for _request[0] (getUpdates) only, with explicit rationale at lines 822–824:

# We reset ONLY _request[0] (the getUpdates request) — the general
# request (_request[1]) is left untouched so concurrent
# send_message / edit_message calls are never interrupted.

Reasonable for short outages. But over many days of flaky-proxy operation, the general pool accumulates half-closed connections faster than httpx evicts them — visible as CLOSED in lsof — because the proxy=… HTTPXRequest construction goes through httpcore's tunnel-proxy path which does not always release the underlying socket on ConnectError.

After enough cycles, every general-pool slot holds a dead connection and new sends can't acquire one → httpx.ConnectError: All connection attempts failed.

Reproduction

  1. Configure system HTTP/HTTPS proxy to a local proxy that occasionally drops connections (xray / clash / v2ray are typical on macOS in restricted-network environments).
  2. Start the gateway with Telegram enabled, single profile, no MCP servers.
  3. Let it run 24–48h; observe periodic Telegram network error, scheduling reconnect: httpx.ConnectError in gateway.log.
  4. After enough cycles: lsof -p <gateway_pid> | wc -l exceeds launchctl limit maxfiles soft limit, all sends fail.
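
To watch step 4 approach without polling lsof by hand, fd usage can be compared with the soft limit from inside the process. This is a minimal sketch using only the standard library. It measures the process it runs in, so it would have to be called from within the gateway, and `open_fd_count`, `fd_headroom`, and `report` are hypothetical helper names.

```python
import os
import resource


def open_fd_count() -> int:
    """Count this process's open fds via /dev/fd (macOS and Linux).

    The listing itself briefly holds one fd, so the result may be off by one.
    """
    return len(os.listdir("/dev/fd"))


def fd_headroom(open_count: int, soft_limit: int) -> float:
    """Fraction of the soft fd limit still available (0.0 when exhausted)."""
    if soft_limit == resource.RLIM_INFINITY:
        return 1.0  # no soft limit configured
    if soft_limit <= 0:
        return 0.0
    return max(0.0, 1.0 - open_count / soft_limit)


def report() -> str:
    """Summarize current fd usage against RLIMIT_NOFILE."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    used = open_fd_count()
    return f"{used}/{soft} fds in use, headroom {fd_headroom(used, soft):.0%}"
```

Logging `report()` on a timer during the 24–48h run should show headroom shrinking as reconnect cycles accumulate.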

Workaround (confirmed)

hermes gateway restart clears the leaked sockets (fd 287 → 54, Telegram resumes). Recurs in 1–2 days.

Suggested fixes

In rough order of impact:

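  1. Bound the general pool when a proxy is configured: apply limits=httpx.Limits(max_connections=20, max_keepalive_connections=10) to the HTTPXRequest(..., proxy=proxy_url) construction at gateway/platforms/telegram.py:1424–1425. In python-telegram-bot this would likely go through connection_pool_size or httpx_kwargs, since HTTPXRequest builds its own httpx.Limits rather than accepting a limits argument. This caps the leak and makes it surface much sooner instead of after days.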
  2. Periodically drain _request[1] — e.g., on a low-frequency schedule (hourly) gracefully drain the general request with a brief grace period for in-flight sends. Symmetrical with the existing polling-pool drain. Targeted fix.
  3. Heartbeat on the send path, not just polling: update platforms.telegram.updated_at from a probe that exercises _request[1], so wedged-but-still-polling state is observable externally instead of silently lying as connected.
  4. (Cross-ref #30230, "Gateway hits macOS fd limit (256): OSError Too many open files") Detect launchd maxfiles < 1024 at startup and emit a single WARN.
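
Fix (2) could look roughly like the sketch below. It is not the gateway's code: `InFlightTracker`, `drain_once`, and `periodic_drain` are hypothetical names, and `reset_pool` stands in for whatever routine would actually reset `_request[1]`. The idea is to wait up to a grace period for in-flight sends to finish, then reset the pool either way.

```python
import asyncio
from typing import Awaitable, Callable


class InFlightTracker:
    """Counts in-flight sends so a drain can wait for them to finish."""

    def __init__(self) -> None:
        self._count = 0
        self._idle = asyncio.Event()
        self._idle.set()  # nothing in flight yet

    def __enter__(self) -> "InFlightTracker":
        self._count += 1
        self._idle.clear()
        return self

    def __exit__(self, *exc) -> None:
        self._count -= 1
        if self._count == 0:
            self._idle.set()

    async def wait_idle(self, timeout: float) -> bool:
        """Return True if all sends finished within the timeout."""
        try:
            await asyncio.wait_for(self._idle.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False


async def drain_once(
    tracker: InFlightTracker,
    reset_pool: Callable[[], Awaitable[None]],
    grace_s: float,
) -> bool:
    """Wait up to grace_s for in-flight sends, then reset the pool regardless."""
    idle = await tracker.wait_idle(grace_s)
    await reset_pool()
    return idle


async def periodic_drain(
    tracker: InFlightTracker,
    reset_pool: Callable[[], Awaitable[None]],
    interval_s: float = 3600.0,  # hourly, as suggested above
    grace_s: float = 5.0,
) -> None:
    """Run drain_once on a fixed schedule until cancelled."""
    while True:
        await asyncio.sleep(interval_s)
        await drain_once(tracker, reset_pool, grace_s)
```

Each send would wrap its request in `with tracker:`. A send still running when the grace period expires would be interrupted by the reset, so the grace value trades send reliability against drain latency.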

I'm happy to send a PR for fix (1) if a maintainer can confirm the approach — it's a 2-line change at telegram.py:1414–1425 and the failure mode it prevents is well-bounded.

Environment

  • macOS 15 (Darwin 25.5.0, Apple Silicon)
  • hermes-agent 0.14.0 (commit 7f1b2b4)
  • Python 3.11.15
  • httpx 0.28.1, httpcore 1.0.9, python-telegram-bot 22.6
  • Single profile, no MCP servers
  • Local HTTP proxy on 127.0.0.1:10808 (xray)
  • launchd maxfiles: 256 (default)


Labels: P1 (High: major feature broken, no workaround), comp/gateway (Gateway runner, session dispatch, delivery), platform/telegram (Telegram bot adapter), type/bug (Something isn't working)
