Telegram adapter leaks httpx general-pool connections through HTTP proxy (CLOSED sockets accumulate, fd limit hit after ~2 days)

## Problem

After ~2 days of continuous operation behind a local HTTP proxy (`xray` on `127.0.0.1:10808`), the gateway's Telegram adapter accumulates hundreds of dead sockets (reported as `CLOSED` by lsof) in the httpx **general-request** pool. The process's open-fd count then exceeds the macOS launchd default soft limit `maxfiles=256`, after which every subsequent `bot.send_message()` / `set_my_commands()` call fails:

```
telegram.error.NetworkError: httpx.ConnectError: All connection attempts failed
```

Simultaneously, kanban dispatcher and channel-directory writes start failing with `[Errno 24] Too many open files` and `sqlite3.OperationalError: unable to open database file`.

`gateway_state.json` continues to report `platforms.telegram.state = "connected"` (stale — last updated when the pool was still healthy), so external monitoring does not detect the wedge.

## Why this is NOT a duplicate of #30230 or #5729 / #21548

This was the first thing I checked. The leak vector here is distinct:

- **#30230** blames MCP subprocess pipes/sockets in multi-profile setups. In my case there are **0 MCP servers** and **1 profile**, yet the gateway still reaches 287 open fds after 2 days. Per the lsof breakdown below, 280 of the 287 fds are httpx-through-proxy sockets, not MCP pipes.
- **#5729 / PR #21548** describe a *cold-boot* wedge of the **polling pool** (`_request[0]`) while the general pool is healthy and `getMe` works. My case is the opposite: polling pool is fine and reconnects via `_drain_polling_connections` work; the **general pool** (`_request[1]`) is the one accumulating dead connections, and eventually `bot.send_message()` (which routes through `_request[1]`) fails.

## Evidence

Captured from a wedged gateway (uptime ~2 days, single profile, no MCP servers configured):

```
$ lsof -p <gateway_pid> | wc -l
287                                      # vs launchctl limit maxfiles soft = 256

$ lsof -p <gateway_pid> | awk '{print $5}' | sort | uniq -c | sort -rn
  235 IPv4
   42 REG
    3 unix
    ...

$ lsof -a -p <gateway_pid> -iTCP | awk '{print $NF}' | sort | uniq -c | sort -rn
  267 (CLOSED)
  117 (ESTABLISHED)
    4 (CLOSE_WAIT)

$ lsof -a -p <gateway_pid> -iTCP | awk '{print $9}' | sed 's/.*->//' | sort | uniq -c | sort -rn | head -3
  280  localhost:10808     ← local xray HTTP proxy
   12  216.38.168.230:45979
   10  localhost:13580
```

280 of the 287 fds terminate at the local proxy port. Persistent log pattern in the days leading up to the wedge:

```
[Telegram] Telegram network error, scheduling reconnect: httpx.ConnectError:
[Telegram] Telegram network error (attempt 1/10), reconnecting in 5s. Error: httpx.ConnectError:
[Telegram] Telegram polling reconnect failed: httpx.ConnectError:
[Telegram] Telegram polling resumed after network error (attempt N)
```

In other words: the proxy hiccups, the reconnect ladder fires, and the polling pool is drained correctly, but each cycle also leaks 1–2 connections in the **general** pool (which `set_my_commands`, `send_message`, and the resolver-fallback HTTPXRequest all use).
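
The per-state and per-peer tallies above can also be produced programmatically, which is handy for sampling the leak over time. The following is a minimal sketch, assuming the default `lsof -a -p <pid> -iTCP` column layout shown in this report (NAME in the second-to-last field, state in the last); `tally_lsof_tcp` is a hypothetical helper name, not something in the gateway.

```python
from collections import Counter


def tally_lsof_tcp(lsof_output: str) -> tuple[Counter, Counter]:
    """Count TCP states and remote endpoints in `lsof -a -p <pid> -iTCP` output."""
    states: Counter = Counter()
    peers: Counter = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip the header row and anything too short to be a TCP socket row.
        if len(fields) < 10 or fields[0] == "COMMAND":
            continue
        state = fields[-1]   # e.g. "(CLOSED)"
        name = fields[-2]    # e.g. "localhost:52144->localhost:10808"
        if not (state.startswith("(") and state.endswith(")")):
            continue         # row without a TCP state suffix
        states[state.strip("()")] += 1
        # Only connected sockets have a "local->remote" name.
        if "->" in name:
            peers[name.split("->", 1)[1]] += 1
    return states, peers
```

The text can come from `subprocess.run(["lsof", "-a", "-p", str(pid), "-iTCP"], capture_output=True, text=True).stdout`; logging `states["CLOSED"]` and `peers.most_common(3)` periodically should show the proxy-bound count growing between restarts.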

## Root cause

`gateway/platforms/telegram.py::_drain_polling_connections` (added in #17015) mitigates this for `_request[0]` (getUpdates) only, with explicit rationale at lines 822–824:

```python
# We reset ONLY _request[0] (the getUpdates request) — the general
# request (_request[1]) is left untouched so concurrent
# send_message / edit_message calls are never interrupted.
```

That is reasonable for short outages. Over many days of flaky-proxy operation, however, the general pool accumulates dead connections (visible as `CLOSED` in lsof) faster than httpx evicts them, because the `proxy=…` HTTPXRequest construction goes through httpcore's tunnel-proxy path, which does not always release the underlying socket on `ConnectError`.

After enough cycles, every general-pool slot holds a dead connection and new sends can't acquire one → `httpx.ConnectError: All connection attempts failed`.
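
As a rough consistency check on the timeline, the numbers in this report can be plugged into a back-of-the-envelope estimate. This is a sketch under stated assumptions: the leak is linear, the baseline is the 54 fds seen right after a restart, and `cycles_until_fd_limit` is a hypothetical helper, not gateway code.

```python
import math


def cycles_until_fd_limit(baseline_fds: int, soft_limit: int, leaked_per_cycle: float) -> int:
    """Number of reconnect cycles before open fds reach the soft limit."""
    if leaked_per_cycle <= 0:
        raise ValueError("leaked_per_cycle must be positive")
    # Headroom is how many more fds the process can open before hitting the limit.
    headroom = max(soft_limit - baseline_fds, 0)
    return math.ceil(headroom / leaked_per_cycle)


# Figures from this report: 54 fds after restart, soft limit 256, 1-2 leaked per cycle.
fastest = cycles_until_fd_limit(54, 256, 2)  # 101 cycles
slowest = cycles_until_fd_limit(54, 256, 1)  # 202 cycles
```

Spread over roughly 48 hours, that range corresponds to one leaking reconnect cycle every 14 to 29 minutes, which is in line with the periodic `ConnectError` entries in the log.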

## Reproduction

1. Configure system HTTP/HTTPS proxy to a local proxy that occasionally drops connections (xray / clash / v2ray are typical on macOS in restricted-network environments).
2. Start the gateway with Telegram enabled, single profile, no MCP servers.
3. Let it run 24–48h; observe periodic `Telegram network error, scheduling reconnect: httpx.ConnectError` in `gateway.log`.
4. After enough cycles: `lsof -p <gateway_pid> | wc -l` exceeds `launchctl limit maxfiles` soft limit, all sends fail.

## Workaround (confirmed)

`hermes gateway restart` clears the leaked sockets (open fds drop from 287 to 54 and Telegram resumes). The wedge recurs in 1–2 days.

## Suggested fixes

In rough order of impact:

1. **Bound the general pool** when proxy is configured: pass `limits=httpx.Limits(max_connections=20, max_keepalive_connections=10)` into the `HTTPXRequest(..., proxy=proxy_url)` construction at `gateway/platforms/telegram.py:1424–1425`. Caps the leak, makes it surface immediately instead of after days.
2. **Periodically drain `_request[1]`** — e.g., on a low-frequency schedule (hourly) gracefully drain the general request with a brief grace period for in-flight sends. Symmetrical with the existing polling-pool drain. Targeted fix.
3. **Heartbeat on the send path**, not just polling: update `platforms.telegram.updated_at` from a probe that exercises `_request[1]`, so wedged-but-still-polling state is observable externally instead of silently lying as `connected`.
4. (Cross-ref #30230) Detect launchd `maxfiles` < 1024 at startup and emit a single WARN.
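
For fix (2), the "brief grace period for in-flight sends" could look roughly like the sketch below. It is a generic asyncio pattern, not the adapter's actual code: `GracefulDrainer`, `track`, and the `reset_pool` callback are hypothetical names, and how `_request[1]` would actually be closed and rebuilt is left to the callback.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Awaitable, Callable


class GracefulDrainer:
    """Tracks in-flight sends so a pool reset can wait briefly for them."""

    def __init__(self, reset_pool: Callable[[], Awaitable[None]]) -> None:
        self._reset_pool = reset_pool  # hypothetical: closes and rebuilds the pool
        self._in_flight = 0
        self._idle = asyncio.Event()
        self._idle.set()               # nothing in flight initially

    @asynccontextmanager
    async def track(self):
        """Wrap each send so the drainer knows it is in flight."""
        self._in_flight += 1
        self._idle.clear()
        try:
            yield
        finally:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    async def drain(self, grace: float = 5.0) -> bool:
        """Wait up to `grace` seconds for sends to finish, then reset the pool.

        Returns True if no sends were in flight at reset time, False if the
        grace period expired first.
        """
        try:
            await asyncio.wait_for(self._idle.wait(), timeout=grace)
            idle = True
        except asyncio.TimeoutError:
            idle = False
        await self._reset_pool()
        return idle
```

A low-frequency task (hourly, as suggested above) would call `drain()`, while each send runs inside `async with drainer.track():`. The sketch does not stop new sends from starting during the reset itself; a real implementation would likely need a lock or a swap-then-close of the request object for that.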

I'm happy to send a PR for fix (1) if a maintainer can confirm the approach — it's a 2-line change at `telegram.py:1414–1425` and the failure mode it prevents is well-bounded.
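
Fix (4) is small enough to sketch with the standard library alone. The snippet below assumes a Unix host (the `resource` module is not available on Windows); the function names and the message wording are hypothetical, and the 1024 threshold is the one proposed above.

```python
import resource
from typing import Optional


def maxfiles_warning(soft_limit: int, threshold: int = 1024) -> Optional[str]:
    """Return a warning message if the soft fd limit is below `threshold`."""
    # An unlimited soft limit can be reported as a sentinel value; never warn on it.
    if soft_limit == resource.RLIM_INFINITY:
        return None
    if soft_limit < threshold:
        return (
            f"Open-file soft limit is {soft_limit} (< {threshold}); "
            "long-running gateways may hit 'Too many open files'."
        )
    return None


def check_fd_limit_at_startup(threshold: int = 1024) -> Optional[str]:
    """Read the current process's RLIMIT_NOFILE and build the warning, if any."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return maxfiles_warning(soft, threshold)
```

Calling `check_fd_limit_at_startup()` once during gateway start and logging a non-`None` result at WARN level would make the launchd default of 256 visible before the leak, or any other fd consumer, runs into it.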

## Environment

- macOS 15 (Darwin 25.5.0, Apple Silicon)
- hermes-agent `0.14.0` (commit `7f1b2b4`)
- Python 3.11.15
- httpx 0.28.1, httpcore 1.0.9, python-telegram-bot 22.6
- Single profile, no MCP servers
- Local HTTP proxy on `127.0.0.1:10808` (xray)
- launchd `maxfiles`: 256 (default)

## Related

- #30230 — same hit-the-wall symptom, different leak vector (MCP subprocesses + multi-profile)
- #5729 / PR #21548 — polling-pool wedge on cold boot; complementary fix on the *other* pool
- #17015 — merged fix that added `_drain_polling_connections` for polling pool only
- #25666 — SIGSEGV on aarch64 during httpx.ReadError reconnect; same code path, different platform
