fix(gateway): tighten Telegram proxy-pool keepalive to stop fd leak #31687

Closed
konsisumer wants to merge 1 commit into NousResearch:main from konsisumer:fix/telegram-proxy-pool-keepalive-leak

Conversation

konsisumer (Contributor) commented May 24, 2026

Tightens keepalive eviction on the Telegram proxy-path HTTPXRequest pools so half-closed sockets stop accumulating until the process hits its fd limit.

What does this PR do?

Behind a flaky local HTTP proxy, the Telegram adapter's general request pool accumulates hundreds of half-closed (CLOSED) sockets over days of operation until the process exceeds its fd budget and every bot.send_message() / set_my_commands() fails with httpx.ConnectError: All connection attempts failed (#31599).

Root cause: PTB's HTTPXRequest derives httpx.Limits from connection_pool_size but only sets max_connections, leaving max_keepalive_connections / keepalive_expiry at httpx's defaults (20 / 5s). Every other persistent adapter in the gateway (QQ Bot, Feishu, WeCom, DingTalk, Signal, BlueBubbles, WeCom-callback) already uses the shared platform_httpx_limits() keepalive hardening added in #18451 for the same class of fd exhaustion through a proxy; Telegram is the only long-lived httpx client that does not.

This PR closes that gap for the proxy path: it builds the proxy-path HTTPXRequest pools with the shared #18451 keepalive tuning (shorter keepalive_expiry so idle/dead sockets drain promptly) while keeping max_connections at the configured pool size, so concurrent sends are unaffected and the deliberate 512-slot pool is preserved. The helper returns empty kwargs when httpx is unavailable, so HTTPXRequest falls back to its own limits.
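A minimal sketch of the idea, not the PR's actual code: the function name, the two ASSUMED_* constants, and the "limits" key are hypothetical stand-ins, and the real tuning values come from platform_httpx_limits(), which is not shown here. It only illustrates keeping max_connections at the pool size while shortening keepalive_expiry, and returning empty kwargs when httpx cannot be imported.

```python
try:
    import httpx
except ImportError:  # httpx is treated as optional in this sketch
    httpx = None

# Hypothetical values for illustration only; the shared helper from #18451
# defines the real ones (and its env overrides).
ASSUMED_KEEPALIVE_EXPIRY_S = 2.0
ASSUMED_MAX_KEEPALIVE = 10


def proxy_request_httpx_kwargs(pool_size: int) -> dict:
    """Build extra client kwargs with tightened keepalive.

    Returns {} when httpx is unavailable so the caller falls back to
    its own default limits.
    """
    if httpx is None:
        return {}
    limits = httpx.Limits(
        # Concurrency ceiling stays at the configured pool size.
        max_connections=pool_size,
        # Cap how many idle sockets may be retained for reuse.
        max_keepalive_connections=min(ASSUMED_MAX_KEEPALIVE, pool_size),
        # Drop idle sockets sooner than httpx's 5 s default.
        keepalive_expiry=ASSUMED_KEEPALIVE_EXPIRY_S,
    )
    return {"limits": limits}
```

Under these assumptions, the returned dict would be forwarded to whatever constructs the underlying httpx client for the proxy path; an empty dict leaves the existing behavior untouched.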

Related Issue

Refs #31599

This addresses the proxy-path keepalive accumulation (the reporter's suggested fix 1, adapted to preserve max_connections). The periodic general-pool drain (fix 2) and send-path heartbeat observability (fix 3) the reporter also proposed are left as follow-ups, so this is Refs, not Closes.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • gateway/platforms/telegram.py: add _proxy_request_httpx_kwargs() that reuses platform_httpx_limits() (from #18451, "[Bug]: CLOSE_WAIT fd leak causes all platforms to stop responding after ~1-2 hours") to bound keepalive while keeping max_connections at the configured pool size; pass it into the proxy-path HTTPXRequest constructions.
  • tests/gateway/test_telegram_proxy_pool_limits.py: regression tests asserting keepalive is tightened below httpx's 5s default, the max_connections ceiling is preserved, kwargs are empty when httpx is unavailable, and the shared keepalive env overrides apply.

How to Test

  1. pytest tests/gateway/test_telegram_proxy_pool_limits.py tests/gateway/test_platform_http_client_limits.py -q — all pass.
  2. Manual: with a Telegram proxy configured, the proxy-path pools now build with the shared tighter httpx.Limits (keepalive_expiry < 5s, bounded max_keepalive_connections, max_connections still equal to HERMES_TELEGRAM_HTTP_POOL_SIZE), so idle/half-closed sockets through the proxy drain promptly instead of piling up as CLOSED fds.
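For the manual check, one way to watch for fd accumulation is to sample the process's open descriptor count over time. The sketch below is an assumption about how that could be done, not part of this PR: the helper names are hypothetical, and it relies on /dev/fd, which exists on Linux and macOS but not on Windows.

```python
import os
import resource
import socket


def open_fd_count() -> int:
    """Count this process's open file descriptors by listing /dev/fd."""
    return len(os.listdir("/dev/fd"))


def fd_soft_limit() -> int:
    """Return the soft RLIMIT_NOFILE value, i.e. the per-process fd ceiling."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    before = open_fd_count()
    sock = socket.socket()  # stands in for one pooled connection
    during = open_fd_count()
    sock.close()
    after = open_fd_count()
    print(f"before={before} during={during} after={after} limit={fd_soft_limit()}")
```

A count that climbs steadily toward the soft limit while the bot is idle would be consistent with the leak described in #31599; a count that returns to its baseline suggests idle sockets are being evicted.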

What platforms tested on

  • macOS on darwin-arm64 (local): pytest + ruff check pass; the change is pure-Python httpx pool configuration with no platform-specific syscalls.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass (ran the affected gateway tests)
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 15 (darwin-arm64)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A (covered by docstrings)
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A (reuses existing HERMES_GATEWAY_HTTPX_* env overrides)
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — N/A (no platform-specific code)
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

The proxy-path HTTPXRequest pools only set max_connections, leaving
keepalive_expiry at httpx's 5s default. Behind a flaky proxy, half-closed
sockets accumulate faster than they drain and exhaust the fd budget after
days of operation. Reuse the shared NousResearch#18451 keepalive tuning while keeping
the configured max_connections ceiling.

Refs NousResearch#31599

konsisumer (Contributor, Author) commented

Closing in favor of #37400 by @datin-antasena, which addresses the same issue. Reopen if that PR stalls.

konsisumer closed this Jun 4, 2026

Labels

  • comp/gateway: Gateway runner, session dispatch, delivery
  • P2: Medium; degraded but workaround exists
  • platform/telegram: Telegram bot adapter
  • type/bug: Something isn't working
