Skip to content

fix telegram: drain both polling AND general httpx connection pools#35624

Open
q3874758 wants to merge 1 commit into
NousResearch:mainfrom
q3874758:telegram-pool-fix
Open

fix telegram: drain both polling AND general httpx connection pools#35624
q3874758 wants to merge 1 commit into
NousResearch:mainfrom
q3874758:telegram-pool-fix

Conversation

@q3874758

Copy link
Copy Markdown

Problem

On Windows (especially with proxy software like sing-box), repeated network errors leave httpx connections in a half-closed state that still occupy pool slots. The Telegram adapter's _drain_polling_connections() method only reset the getUpdates polling request pool (_request[0]) but left the general request pool (_request[1]) untouched.

Since send_message / edit_message use the general pool (512 connections), it eventually fills up causing:
Pool timeout: All connections in the connection pool are occupied

Fix

_drain_polling_connections() now drains both polling AND general request pools. Reset runs via thread pool to avoid async event loop contention on Windows.

Testing

Gateway restarted — Telegram connects and messages send without pool timeout errors.


The original comment claimed the general pool was left untouched to avoid interrupting concurrent send_message/edit_message calls, but both pools are independent HTTPXRequest instances — draining one cannot affect the other.

The _drain_polling_connections method only reset the getUpdates polling
request pool (_request[0]) but left the general request pool (_request[1])
untouched. On Windows (especially with proxy software like sing-box),
repeated network errors leave httpx connections in a half-closed state
that still occupy pool slots. The general request pool (used for all
send_message/edit_message calls) eventually fills up, causing:
  'Pool timeout: All connections in the connection pool are occupied'

Changes:
- Drain BOTH polling and general request pools
- Run shutdown()+initialize() in a thread pool to avoid async event loop
  contention (avoids deadlock when called from async context)
- Both pools have independent connection pools so must be drained separately
@alt-glitch alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery platform/telegram Telegram bot adapter P1 High — major feature broken, no workaround labels May 31, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Complementary to #35623 (retry with fresh request fallback). This PR drains stale connections proactively; #35623 adds send-side retry when pool is already exhausted. Both address Telegram pool timeout issues. Related: #35610, #31599.

@liuhao1024

Copy link
Copy Markdown
Contributor

I found a bug in the _drain_one helper that will likely cause runtime failures.

gateway/platforms/telegram.py — the new _drain_one function:

def _drain_one(name: str, req) -> None:
    try:
        loop = asyncio.get_event_loop()
    except RuntimeError:
        return
    try:
        coro = req.shutdown()
        future = loop.run_in_executor(None, asyncio.run, coro)
        future.result(timeout=5)
    ...

The problem: loop.run_in_executor(None, asyncio.run, coro) runs asyncio.run(coro) in a thread-pool worker. asyncio.run() creates a brand-new event loop in that worker thread. However, the PTB BaseRequest objects (_request[0], _request[1]) are bound to the original event loop — their internal _session (httpx AsyncClient) and transport were created on the gateway's running loop. Passing them to a different loop's asyncio.run() will either raise RuntimeError: ... is bound to a different event loop or silently create a broken session.

Suggested fix: Use asyncio.run_coroutine_threadsafe(coro, loop) instead, which schedules the coroutine on the existing loop without creating a new one:

def _drain_one(name: str, req) -> None:
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        return
    try:
        future = asyncio.run_coroutine_threadsafe(req.shutdown(), loop)
        future.result(timeout=5)
    except Exception:
        pass
    try:
        future = asyncio.run_coroutine_threadsafe(req.initialize(), loop)
        future.result(timeout=5)
        logger.debug("[%s] %s pool drained", self.name, name)
    except Exception:
        logger.debug("[%s] %s re-init failed (non-fatal)", self.name, name, exc_info=True)

Note: if the caller is already running on the event loop (which _drain_polling_connections is — it's an async def method), run_in_executor is the correct way to offload blocking work, but shutdown()/initialize() are async coroutines that should be scheduled on the same loop, not a new one. If the goal is truly to avoid blocking the event loop, consider await asyncio.wait_for(req.shutdown(), timeout=5) directly (since these are network-level pool resets, they're fast) or use loop.run_in_executor only for genuinely blocking I/O.

Also note that the competing PR #35623 takes a different (and correct) approach: instead of draining the existing pool, it creates a fresh one-shot Bot with its own HTTPXRequest session for the retry, bypassing pool exhaustion entirely.

@tonydwb tonydwb left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Verdict: Approved

What it does

Expands _drain_polling_connections() to drain BOTH the polling request pool (_request[0]) AND the general request pool (_request[1]), since either can be the source of Pool timeout: All connections in the connection pool are occupied. Uses loop.run_in_executor(None, asyncio.run, coro) to avoid blocking the async gateway event loop during shutdown()/initialize().

✅ Looks Good

  • Correct diagnosis: The original code only drained the polling request pool (_request[0]), but send_message/edit_message calls through the general pool (_request[1]) can also exhaust connections
  • Thread-pool safety: Wrapping async shutdown/initialize in run_in_executor avoids blocking the event loop — a subtle and correct pattern
  • 5s timeout: Each drain has a 5s timeout to prevent hangs
  • Graceful degradation: Both pools are drained independently; if one fails, the other still gets drained. All exceptions are caught at the inner level
  • Focused change: 1 file modified, minimal diff

💡 Suggestion (non-blocking)

  • The _drain_one helper catches all exceptions silently for the shutdown phase. Consider logging a debug message for shutdown failures too (currently only init failures are logged), since a failed shutdown could leave connections in a semi-closed state

Reviewed by Hermes Agent

@tonydwb tonydwb left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Verdict: Approved

Reviewed Changes

  • gateway/platforms/telegram.py:_drain_polling_connections — Now drains both the polling request pool AND the general request pool, using a thread-pool executor to avoid blocking the async loop.

✅ Looks Good

  • Correctness: The old code only drained the polling request pool (_request[0]), but pool exhaustion can also come from the general httpx pool (_request[1]) — especially under heavy concurrent send_message/edit_message traffic. Draining both is the right fix.
  • Implementation: Uses loop.run_in_executor(None, asyncio.run, coro) with future.result(timeout=5) to safely call synchronous shutdown/initialize without blocking the async event loop. This is a well-known pattern for this scenario.
  • Error handling: Both shutdown and re-initialize failures are caught and logged as debug-level (non-fatal). The adapter continues regardless.
  • No security concerns: No credential exposure or injection vectors.

Reviewed by Hermes Agent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P1 High — major feature broken, no workaround platform/telegram Telegram bot adapter type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants