fix(gateway): detect Telegram polling-pool wedges via correct pool by iws17 · Pull Request #21548 · NousResearch/hermes-agent

iws17 · 2026-05-07T22:38:39Z

Summary

Fixes #5729 — Telegram silent wedges where getMe, getWebhookInfo, and curl all
work but messages are not handled until manual hermes gateway restart.

The existing _verify_polling_after_reconnect heartbeat probe at
gateway/platforms/telegram.py:596 calls bot.get_me(). Per PTB v22
_bot.py:717:

request = self._request[0] if endpoint == "getUpdates" else self._request[1]

bot.get_me() routes through the general pool (_request[1]); the long-poll
runs on the polling pool (_request[0]). _drain_polling_connections only
resets _request[0]. After a botched drain+reconnect — or a cold-boot wedge
where the polling pool's first getUpdates hits a stale DNS/TCP race that doesn't
raise — _request[1] stays healthy and bot.get_me() returns instantly while
_request[0] is wedged. The probe falsely passes; the gateway sits silent.

This matches @rblakemesser's 2026-05-07 field report on #5729 exactly:

at diagnosis time: curl https://api.telegram.org/ worked. getMe worked.
getWebhookInfo showed no webhook and pending_update_count=0. launchd still
reported the gateway running.

Changes

Two surgical changes (Change 2 depends on Change 1). Shipping Change 2
without Change 1 is strictly worse than the status quo because it would
schedule a probe of the wrong pool.

1. Replace the probe primitive

Bypass Bot._do_post's endpoint dispatch and call BaseRequest.post directly
on _request[0] against getMe — same connection pool the long-poll uses,
non-destructive Telegram-side semantics:

pool = self._app.bot._request[0]
async with self._probe_drain_lock:
    await asyncio.wait_for(pool.post(f"{base}/getMe"), PROBE_TIMEOUT)

Using any flavor of getUpdates would be destructive per Telegram's API
contract — offset=-1 "forgets" the queue
and concurrent getUpdates calls on the same pool are forbidden.

PROBE_TIMEOUT raised from 10s to 65s to exceed the long-poll contention
window (the polling pool may be busy with an in-flight getUpdates long-poll).

2. Schedule the probe after cold-boot `start_polling`

Previously, _verify_polling_after_reconnect was only triggered from
_handle_polling_network_error. The cold-boot connect() path called
start_polling() and fell through to _mark_connected() with no
post-start probe.

A wedge that starts at cold boot produces no exception (the long-poll task is
running, just not making progress) → no error callback → no heartbeat
scheduled → silent failure. The cold-boot probe is the only signal for that
class.

Supporting hardening

Shared asyncio.Lock between the probe and _drain_polling_connections
so the probe never hits a half-torn-down pool.
Startup shape-check on bot._request (PTB-internal API). Custom
HTTPXRequest injection via ApplicationBuilder.request() /
get_updates_request() can replace the tuple, and a PTB minor bump may
rename or restructure it. On shape mismatch the probe logs a warning and
skips — never crashes, never falsely declares wedge.
Two-strike escalation: two probe failures within 5 minutes → fatal
retryable, supervisor restarts the process. The slow ladder couldn't fix
what the first reconnect missed; a fresh interpreter clears any stale
resolver / connection state.
trigger= parameter on the probe so logs distinguish cold_boot vs
reconnect for operator triage.
Token redaction in logs: bot.base_url literally contains the bot
token (Telegram's URL format is https://api.telegram.org/bot<TOKEN>/).
Probe-related logs use the exception class name and trigger, never the
full URL or repr(probe_err).
Probe task cancellation on disconnect() to avoid Task was destroyed but it is pending warnings during shutdown.

Adjacent in-flight work (distinct surfaces)

fix(telegram): send resilience — retry transient tool sends + increase cold-boot budget (#9101, #5770) #10947 (merged): cold-boot retry budget bumped 3 → 8 with backoff cap 15s.
Mode-1 fix; this PR is the mode-2 fix on top.
fix: gateway resilience — token rejection handling, state self-heal, health endpoint #7023 (open): gateway_health.json substrate. Composes well — a follow-up
can add last_polling_pool_probe_ok_at once fix: gateway resilience — token rejection handling, state self-heal, health endpoint #7023 lands.
fix(gateway): keep startup retry failures alive for reconnect recovery #17984 (open): degraded-mode on startup retry exhaustion. Different surface
(this PR addresses the post-reconnect / cold-boot wedge that does not
exhaust startup retries).
fix: gateway reconnect watcher retries indefinitely instead of giving up after 20 attempts #17216 (open): removes the 20-attempt cap. Different surface
(rblakemesser's incident did not hit Giving up after 20 attempts).

Test plan

pytest tests/gateway/test_telegram_polling_pool_wedge.py
pytest tests/gateway/test_telegram_network_reconnect.py
pytest tests/gateway/test_telegram*.py

Result: 368 telegram tests pass.

New tests (tests/gateway/test_telegram_polling_pool_wedge.py):

test_probe_routes_through_polling_pool_not_general_pool — wedged
_request[0] with healthy _request[1] is detected; bot.get_me() is
not called.
test_cold_boot_schedules_polling_pool_probe — probe scheduled by
connect() after start_polling, not just by _handle_polling_network_error.
test_probe_skips_when_request_shape_mismatch — non-tuple bot._request
→ log warning + skip, no crash, no false-positive wedge.
test_probe_and_drain_mutually_exclude — concurrent probe blocks while
drain holds _probe_drain_lock.
test_drain_acquires_shared_probe_drain_lock — symmetric.
test_two_probe_failures_within_window_escalate_to_fatal — two failures
in 5min → _set_fatal_error("telegram_network_error", retryable=True).
test_disconnect_cancels_pending_probe_task — pending probe cancelled,
not leaked.
test_probe_does_not_advance_long_poll_offset — probe leaves Updater's
_last_update_id untouched (regression guard against any future
getUpdates-flavored probe).
test_wedge_log_distinguishes_cold_boot_vs_reconnect_trigger — operator
triage signal.

Updated tests (tests/gateway/test_telegram_network_reconnect.py):

test_heartbeat_probe_no_op_when_polling_healthy — asserts probe routes
through _request[0], not bot.get_me().
test_heartbeat_probe_reenters_ladder_when_polling_pool_times_out —
renamed from _get_me_times_out and updated for new probe.
test_heartbeat_probe_reenters_ladder_on_polling_pool_network_error —
renamed from _get_me_network_error and updated.

Manual smoke test (recommended for reviewers)

Start gateway with Telegram polling configured.

Block DNS for api.telegram.org for 60s during cold boot:

# macOS:    sudo pfctl -t blocklist -T add api.telegram.org
# Linux:    sudo iptables -A OUTPUT -d api.telegram.org -j REJECT
# restore after 60s

Without this PR: polling logs "started", no message handling, manual
restart required to recover.
With this PR: post-connect probe at T+60s detects the wedge via
_request[0] failure, re-enters the existing reconnect ladder.
Send a Telegram message at T+90s — confirm it's handled.

Residual blind spot

Flood-wait masking. If Telegram is rate-limiting the bot, getMe via
_request[0] returns success (different rate-limit bucket from getUpdates).
The probe says healthy; the polling pool is throttled, not wedged. Same
false-positive shape as the previous bot.get_me() probe — just narrower.
A future PR could inspect rate-limit response headers; out of scope here.

Rollback

Both changes are additive. Revert is a single PR revert, no migration,
no schema change.
The shape-check (bot._request validation) makes the probe self-disable
on PTB version drift — built-in escape hatch without requiring a code
change.

References

PTB v22 pool routing: _bot.py:717
Telegram API contract on getUpdates: https://core.telegram.org/bots/api#getupdates
Field reports: @alt-glitch (2026-04-30), @rblakemesser (2026-05-07)

The existing _verify_polling_after_reconnect heartbeat probe called bot.get_me(), which routes through PTB v22's general request pool (_request[1]). The long-poll runs on the polling pool (_request[0]). A wedged polling pool with a healthy general pool — the failure mode in NousResearch#5729 — passes the existing probe falsely. Two changes (Change 2 depends on Change 1): 1. Replace the probe primitive so it routes through _request[0] via a non-destructive getMe call, bypassing Bot._do_post's endpoint dispatch: pool = bot._request[0] await pool.post(f"{bot.base_url}/getMe") Using any flavor of getUpdates would be destructive per Telegram's API contract (offset=-1 forgets the queue; concurrent getUpdates is forbidden). 2. Schedule the same probe after cold-boot start_polling, not just after error-driven reconnects. The error callback can't see a silent cold-boot wedge — the long-poll task runs but never makes progress and never raises — so an explicit probe is the only signal. This is load-bearing: shipping without (1) would schedule a probe of the wrong pool. Plus four small hardening pieces: - Shared asyncio.Lock between the probe and _drain_polling_connections so the probe never hits a half-torn-down pool. - Startup shape-check on bot._request with log-and-skip fallback for custom HTTPXRequest injection or PTB version drift. - Two-strike escalation: two probe failures within 5 minutes escalate to fatal-retryable so the supervisor restarts the process from a clean interpreter (the slow ladder couldn't fix what the first reconnect missed). - trigger= parameter on the probe so logs distinguish cold_boot vs reconnect for operator triage. Probe-related logs redact bot.base_url (which contains the bot token in Telegram's URL format). The probe task is tracked on the adapter and cancelled on disconnect() to avoid "Task was destroyed but it is pending" warnings. Tests: - 9 new tests in test_telegram_polling_pool_wedge.py covering pool routing, cold-boot scheduling, shape-check skip, probe/drain mutual exclusion, two-strike escalation, disconnect cancellation, long-poll offset preservation, and trigger-distinguishing logs. - 2 existing tests in test_telegram_network_reconnect.py updated for the new probe shape. Field reports: rblakemesser (2026-05-07) and alt-glitch (2026-04-30). Fixes NousResearch#5729

Copilot

Pull request overview

This PR fixes a Telegram gateway failure mode where polling can silently wedge: getMe/curl succeed while the long-polling getUpdates pool is stuck, so the existing heartbeat falsely reports “healthy.” The change targets python-telegram-bot’s split request pools by probing the polling pool directly and adds a cold-boot probe to catch wedges that never trigger the reconnect error callback.

Changes:

Replace the heartbeat probe primitive to exercise bot._request[0] (polling pool) directly via BaseRequest.post(.../getMe) and add a two-strike escalation window.
Schedule the polling-pool probe after cold-boot start_polling() as well as after reconnect.
Add/adjust tests to pin correct pool routing, cold-boot scheduling, lock mutual exclusion with drain, escalation behavior, and shutdown cancellation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
`gateway/platforms/telegram.py`	Routes wedge probe through the polling pool, adds strike escalation + shared probe/drain lock, schedules probe on cold boot, and cancels probe on disconnect.
`tests/gateway/test_telegram_polling_pool_wedge.py`	New focused regression suite covering polling-pool wedge detection, scheduling, locking, escalation, and log safety.
`tests/gateway/test_telegram_network_reconnect.py`	Updates heartbeat-probe tests to assert polling-pool routing (`_request[0]`) instead of `bot.get_me()`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

iws17 · 2026-05-07T23:17:27Z

+        if self.has_fatal_error:
+            return
+        probe = asyncio.ensure_future(self._verify_polling_after_reconnect(trigger=trigger))
+        self._background_tasks.add(probe)
+        probe.add_done_callback(self._background_tasks.discard)
+        self._latest_probe_task = probe
+


Fixed in 535976c.

_schedule_polling_pool_probe now cancels any previously pending probe before scheduling a new one — at-most-one-probe-in-flight invariant. _verify_polling_after_reconnect also bails on self._app is None after the heartbeat sleep as defense-in-depth, so a probe whose cancellation races with disconnect can't recurse into _handle_polling_network_error against a torn-down adapter.

Tests: test_schedule_cancels_previously_pending_probe, test_probe_bails_when_app_torn_down_after_sleep.

iws17 · 2026-05-07T23:17:28Z

+            # docstring above for the dispatch rationale.
+            # Serialize against `_drain_polling_connections` so the probe
+            # never hits a half-torn-down pool.
+            base = self._app.bot.base_url


Fixed in 535976c. Probe URL is now built with str(self._app.bot.base_url).rstrip("/") before appending /getMe. PTB v22's default _parse_base_url returns no trailing slash, but callable / format-map / local_mode=True configs can produce one — rstrip covers all cases.

Test: test_probe_url_normalizes_trailing_slash_in_base_url (passes a trailing-slash base_url and asserts no //getMe in the call URL).

Two fixes from @Copilot's review on PR NousResearch#21548: 1. At-most-one-probe invariant. `_schedule_polling_pool_probe` is called from cold-boot AND reconnect paths. Without cancelling a prior pending probe, multiple 60s-sleeping probes can accumulate. `disconnect()` only cancels `_latest_probe_task`, so older probes can wake on a torn-down adapter and recurse into `_handle_polling_network_error` against `_app=None`. Fix: cancel any previous pending probe before scheduling a new one, and bail in `_verify_polling_after_reconnect` if `self._app is None` after the heartbeat sleep (defense-in-depth). 2. URL normalization. PTB's default `base_url` has no trailing slash, but callable / format-map / `local_mode=True` configs can produce one, yielding `f"{base}/getMe"` → `.../bot<TOKEN>//getMe`. Most servers normalize, but it's avoidable. Fix: `str(self._app.bot.base_url).rstrip("/")` before appending the endpoint path. Tests added: - test_schedule_cancels_previously_pending_probe - test_probe_bails_when_app_torn_down_after_sleep - test_probe_url_normalizes_trailing_slash_in_base_url 371 telegram tests pass.

iws17 · 2026-05-07T23:45:51Z

Self-review pass on the test suite — pushed 06afd31f1 with two test-validity strengthenings:

Token redaction assertion in test_probe_skips_when_request_shape_mismatch was using a weaker boolean form that could miss real token leaks; now uses a realistic token sentinel and asserts the literal token never appears in any captured log line.
Mutual-exclusion test in test_probe_and_drain_mutually_exclude was patching asyncio.sleep such that scheduling yields didn't actually advance the event loop — confirmed by manually removing the probe's lock and observing the test now correctly fails. Test loop now uses real asyncio.sleep while only the probe's heartbeat sleep is patched.

Production code unchanged. 371 telegram tests pass.

Two test-validity bugs caught in a separate review pass: 1. Token redaction assertion was too weak. The check was: not ("<REDACTED>" in msg AND "base_url" in msg) A real token-bearing URL like `.../botREAL_TOKEN/getMe` would pass without the literal word `base_url`. Replaced with a realistic-looking token sentinel and a strong assertion that the token literal never appears in any captured log message. 2. Mutual-exclusion test patched `asyncio.sleep` with `AsyncMock`, then used `await asyncio.sleep(0)` as scheduling yields. With sleep patched, the yields don't advance the event loop — the probe task may never reach the `async with self._probe_drain_lock` line, and `pool0_post.await_count == 0` could pass for the wrong reason (test never gave the probe CPU time at all). Fix: capture real `asyncio.sleep` before patching, and only patch `gateway.platforms.telegram.asyncio.sleep` (so the test loop's yields use the real one). Verified by manually removing the probe's `async with self._probe_drain_lock:` and confirming the test now correctly fails (`pool0_post.await_count == 1`); restored. Both findings are test-quality issues — production code is unchanged. 371 telegram tests pass.

iws17 · 2026-05-07T23:53:40Z

Force-pushed to overwrite the previous head — the latest commit replaces a token-shaped string in a test sentinel that GitHub's secret scanner was flagging. The new sentinel ('TEST-FAKE-TOKEN-FOR-CI-DO-NOT-SCAN') is intentionally the wrong shape to match any provider's secret regex. Test logic is unchanged.

PR head is now 2abef870 (was 06afd31f1). 371 telegram tests pass.

Copilot AI review requested due to automatic review settings May 7, 2026 22:38

Copilot started reviewing on behalf of iws17 May 7, 2026 22:39 View session

iws17 mentioned this pull request May 7, 2026

Telegram resolver failure caching + missing degraded-state detection after cold-boot resolver exhaustion #5729

Open

Copilot AI reviewed May 7, 2026

View reviewed changes

alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery platform/telegram Telegram bot adapter P1 High — major feature broken, no workaround labels May 7, 2026

iws17 force-pushed the fix/telegram-polling-pool-wedge-detection branch from 06afd31 to 2abef87 Compare May 7, 2026 23:53

JustinHuber mentioned this pull request May 24, 2026

Telegram adapter leaks httpx general-pool connections through HTTP proxy (CLOSED sockets accumulate, fd limit hit after ~2 days) #31599

Open

q3874758 mentioned this pull request May 31, 2026

[Bug] Telegram: Pool timeout causes silent send failures — no retry or fallback on exhausted general pool #35610

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(gateway): detect Telegram polling-pool wedges via correct pool#21548

fix(gateway): detect Telegram polling-pool wedges via correct pool#21548
iws17 wants to merge 3 commits into
NousResearch:mainfrom
iws17:fix/telegram-polling-pool-wedge-detection

iws17 commented May 7, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

iws17 May 7, 2026

Uh oh!

iws17 May 7, 2026

Uh oh!

iws17 commented May 7, 2026

Uh oh!

iws17 commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

iws17 commented May 7, 2026

Summary

Changes

1. Replace the probe primitive

2. Schedule the probe after cold-boot start_polling

Supporting hardening

Adjacent in-flight work (distinct surfaces)

Test plan

Manual smoke test (recommended for reviewers)

Residual blind spot

Rollback

References

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

iws17 May 7, 2026

Choose a reason for hiding this comment

Uh oh!

iws17 May 7, 2026

Choose a reason for hiding this comment

Uh oh!

iws17 commented May 7, 2026

Uh oh!

iws17 commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

2. Schedule the probe after cold-boot `start_polling`