fix(gateway): detect Telegram polling-pool wedges via correct pool#21548
fix(gateway): detect Telegram polling-pool wedges via correct pool#21548iws17 wants to merge 3 commits into
Conversation
The existing _verify_polling_after_reconnect heartbeat probe called bot.get_me(), which routes through PTB v22's general request pool (_request[1]). The long-poll runs on the polling pool (_request[0]). A wedged polling pool with a healthy general pool — the failure mode in NousResearch#5729 — passes the existing probe falsely. Two changes (Change 2 depends on Change 1): 1. Replace the probe primitive so it routes through _request[0] via a non-destructive getMe call, bypassing Bot._do_post's endpoint dispatch: pool = bot._request[0] await pool.post(f"{bot.base_url}/getMe") Using any flavor of getUpdates would be destructive per Telegram's API contract (offset=-1 forgets the queue; concurrent getUpdates is forbidden). 2. Schedule the same probe after cold-boot start_polling, not just after error-driven reconnects. The error callback can't see a silent cold-boot wedge — the long-poll task runs but never makes progress and never raises — so an explicit probe is the only signal. This is load-bearing: shipping without (1) would schedule a probe of the wrong pool. Plus four small hardening pieces: - Shared asyncio.Lock between the probe and _drain_polling_connections so the probe never hits a half-torn-down pool. - Startup shape-check on bot._request with log-and-skip fallback for custom HTTPXRequest injection or PTB version drift. - Two-strike escalation: two probe failures within 5 minutes escalate to fatal-retryable so the supervisor restarts the process from a clean interpreter (the slow ladder couldn't fix what the first reconnect missed). - trigger= parameter on the probe so logs distinguish cold_boot vs reconnect for operator triage. Probe-related logs redact bot.base_url (which contains the bot token in Telegram's URL format). The probe task is tracked on the adapter and cancelled on disconnect() to avoid "Task was destroyed but it is pending" warnings. Tests: - 9 new tests in test_telegram_polling_pool_wedge.py covering pool routing, cold-boot scheduling, shape-check skip, probe/drain mutual exclusion, two-strike escalation, disconnect cancellation, long-poll offset preservation, and trigger-distinguishing logs. - 2 existing tests in test_telegram_network_reconnect.py updated for the new probe shape. Field reports: rblakemesser (2026-05-07) and alt-glitch (2026-04-30). Fixes NousResearch#5729
There was a problem hiding this comment.
Pull request overview
This PR fixes a Telegram gateway failure mode where polling can silently wedge: getMe/curl succeed while the long-polling getUpdates pool is stuck, so the existing heartbeat falsely reports “healthy.” The change targets python-telegram-bot’s split request pools by probing the polling pool directly and adds a cold-boot probe to catch wedges that never trigger the reconnect error callback.
Changes:
- Replace the heartbeat probe primitive to exercise
bot._request[0](polling pool) directly viaBaseRequest.post(.../getMe)and add a two-strike escalation window. - Schedule the polling-pool probe after cold-boot
start_polling()as well as after reconnect. - Add/adjust tests to pin correct pool routing, cold-boot scheduling, lock mutual exclusion with drain, escalation behavior, and shutdown cancellation.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
gateway/platforms/telegram.py |
Routes wedge probe through the polling pool, adds strike escalation + shared probe/drain lock, schedules probe on cold boot, and cancels probe on disconnect. |
tests/gateway/test_telegram_polling_pool_wedge.py |
New focused regression suite covering polling-pool wedge detection, scheduling, locking, escalation, and log safety. |
tests/gateway/test_telegram_network_reconnect.py |
Updates heartbeat-probe tests to assert polling-pool routing (_request[0]) instead of bot.get_me(). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if self.has_fatal_error: | ||
| return | ||
| probe = asyncio.ensure_future(self._verify_polling_after_reconnect(trigger=trigger)) | ||
| self._background_tasks.add(probe) | ||
| probe.add_done_callback(self._background_tasks.discard) | ||
| self._latest_probe_task = probe | ||
|
|
There was a problem hiding this comment.
Fixed in 535976c.
_schedule_polling_pool_probe now cancels any previously pending probe before scheduling a new one — at-most-one-probe-in-flight invariant. _verify_polling_after_reconnect also bails on self._app is None after the heartbeat sleep as defense-in-depth, so a probe whose cancellation races with disconnect can't recurse into _handle_polling_network_error against a torn-down adapter.
Tests: test_schedule_cancels_previously_pending_probe, test_probe_bails_when_app_torn_down_after_sleep.
| # docstring above for the dispatch rationale. | ||
| # Serialize against `_drain_polling_connections` so the probe | ||
| # never hits a half-torn-down pool. | ||
| base = self._app.bot.base_url |
There was a problem hiding this comment.
Fixed in 535976c. Probe URL is now built with str(self._app.bot.base_url).rstrip("/") before appending /getMe. PTB v22's default _parse_base_url returns no trailing slash, but callable / format-map / local_mode=True configs can produce one — rstrip covers all cases.
Test: test_probe_url_normalizes_trailing_slash_in_base_url (passes a trailing-slash base_url and asserts no //getMe in the call URL).
Two fixes from @Copilot's review on PR NousResearch#21548: 1. At-most-one-probe invariant. `_schedule_polling_pool_probe` is called from cold-boot AND reconnect paths. Without cancelling a prior pending probe, multiple 60s-sleeping probes can accumulate. `disconnect()` only cancels `_latest_probe_task`, so older probes can wake on a torn-down adapter and recurse into `_handle_polling_network_error` against `_app=None`. Fix: cancel any previous pending probe before scheduling a new one, and bail in `_verify_polling_after_reconnect` if `self._app is None` after the heartbeat sleep (defense-in-depth). 2. URL normalization. PTB's default `base_url` has no trailing slash, but callable / format-map / `local_mode=True` configs can produce one, yielding `f"{base}/getMe"` → `.../bot<TOKEN>//getMe`. Most servers normalize, but it's avoidable. Fix: `str(self._app.bot.base_url).rstrip("/")` before appending the endpoint path. Tests added: - test_schedule_cancels_previously_pending_probe - test_probe_bails_when_app_torn_down_after_sleep - test_probe_url_normalizes_trailing_slash_in_base_url 371 telegram tests pass.
|
Self-review pass on the test suite — pushed 06afd31f1 with two test-validity strengthenings:
Production code unchanged. 371 telegram tests pass. |
Two test-validity bugs caught in a separate review pass:
1. Token redaction assertion was too weak. The check was:
not ("<REDACTED>" in msg AND "base_url" in msg)
A real token-bearing URL like `.../botREAL_TOKEN/getMe` would pass
without the literal word `base_url`. Replaced with a realistic-looking
token sentinel and a strong assertion that the token literal never
appears in any captured log message.
2. Mutual-exclusion test patched `asyncio.sleep` with `AsyncMock`, then
used `await asyncio.sleep(0)` as scheduling yields. With sleep patched,
the yields don't advance the event loop — the probe task may never
reach the `async with self._probe_drain_lock` line, and
`pool0_post.await_count == 0` could pass for the wrong reason
(test never gave the probe CPU time at all).
Fix: capture real `asyncio.sleep` before patching, and only patch
`gateway.platforms.telegram.asyncio.sleep` (so the test loop's
yields use the real one). Verified by manually removing the
probe's `async with self._probe_drain_lock:` and confirming the
test now correctly fails (`pool0_post.await_count == 1`); restored.
Both findings are test-quality issues — production code is unchanged.
371 telegram tests pass.
06afd31 to
2abef87
Compare
|
Force-pushed to overwrite the previous head — the latest commit replaces a token-shaped string in a test sentinel that GitHub's secret scanner was flagging. The new sentinel ('TEST-FAKE-TOKEN-FOR-CI-DO-NOT-SCAN') is intentionally the wrong shape to match any provider's secret regex. Test logic is unchanged. PR head is now |
Summary
Fixes #5729 — Telegram silent wedges where
getMe,getWebhookInfo, andcurlallwork but messages are not handled until manual
hermes gateway restart.The existing
_verify_polling_after_reconnectheartbeat probe atgateway/platforms/telegram.py:596callsbot.get_me(). Per PTB v22_bot.py:717:bot.get_me()routes through the general pool (_request[1]); the long-pollruns on the polling pool (
_request[0])._drain_polling_connectionsonlyresets
_request[0]. After a botched drain+reconnect — or a cold-boot wedgewhere the polling pool's first
getUpdateshits a stale DNS/TCP race that doesn'traise —
_request[1]stays healthy andbot.get_me()returns instantly while_request[0]is wedged. The probe falsely passes; the gateway sits silent.This matches @rblakemesser's 2026-05-07 field report on #5729 exactly:
Changes
Two surgical changes (Change 2 depends on Change 1). Shipping Change 2
without Change 1 is strictly worse than the status quo because it would
schedule a probe of the wrong pool.
1. Replace the probe primitive
Bypass
Bot._do_post's endpoint dispatch and callBaseRequest.postdirectlyon
_request[0]againstgetMe— same connection pool the long-poll uses,non-destructive Telegram-side semantics:
Using any flavor of
getUpdateswould be destructive per Telegram's APIcontract —
offset=-1"forgets" the queueand concurrent
getUpdatescalls on the same pool are forbidden.PROBE_TIMEOUTraised from 10s to 65s to exceed the long-poll contentionwindow (the polling pool may be busy with an in-flight
getUpdateslong-poll).2. Schedule the probe after cold-boot
start_pollingPreviously,
_verify_polling_after_reconnectwas only triggered from_handle_polling_network_error. The cold-bootconnect()path calledstart_polling()and fell through to_mark_connected()with nopost-start probe.
A wedge that starts at cold boot produces no exception (the long-poll task is
running, just not making progress) → no error callback → no heartbeat
scheduled → silent failure. The cold-boot probe is the only signal for that
class.
Supporting hardening
asyncio.Lockbetween the probe and_drain_polling_connectionsso the probe never hits a half-torn-down pool.
bot._request(PTB-internal API). CustomHTTPXRequestinjection viaApplicationBuilder.request()/get_updates_request()can replace the tuple, and a PTB minor bump mayrename or restructure it. On shape mismatch the probe logs a warning and
skips — never crashes, never falsely declares wedge.
retryable, supervisor restarts the process. The slow ladder couldn't fix
what the first reconnect missed; a fresh interpreter clears any stale
resolver / connection state.
trigger=parameter on the probe so logs distinguishcold_bootvsreconnectfor operator triage.bot.base_urlliterally contains the bottoken (Telegram's URL format is
https://api.telegram.org/bot<TOKEN>/).Probe-related logs use the exception class name and trigger, never the
full URL or
repr(probe_err).disconnect()to avoidTask was destroyed but it is pendingwarnings during shutdown.Adjacent in-flight work (distinct surfaces)
Mode-1 fix; this PR is the mode-2 fix on top.
gateway_health.jsonsubstrate. Composes well — a follow-upcan add
last_polling_pool_probe_ok_atonce fix: gateway resilience — token rejection handling, state self-heal, health endpoint #7023 lands.(this PR addresses the post-reconnect / cold-boot wedge that does not
exhaust startup retries).
(rblakemesser's incident did not hit
Giving up after 20 attempts).Test plan
Result: 368 telegram tests pass.
New tests (
tests/gateway/test_telegram_polling_pool_wedge.py):test_probe_routes_through_polling_pool_not_general_pool— wedged_request[0]with healthy_request[1]is detected;bot.get_me()isnot called.
test_cold_boot_schedules_polling_pool_probe— probe scheduled byconnect()afterstart_polling, not just by_handle_polling_network_error.test_probe_skips_when_request_shape_mismatch— non-tuplebot._request→ log warning + skip, no crash, no false-positive wedge.
test_probe_and_drain_mutually_exclude— concurrent probe blocks whiledrain holds
_probe_drain_lock.test_drain_acquires_shared_probe_drain_lock— symmetric.test_two_probe_failures_within_window_escalate_to_fatal— two failuresin 5min →
_set_fatal_error("telegram_network_error", retryable=True).test_disconnect_cancels_pending_probe_task— pending probe cancelled,not leaked.
test_probe_does_not_advance_long_poll_offset— probe leavesUpdater's_last_update_iduntouched (regression guard against any futuregetUpdates-flavored probe).test_wedge_log_distinguishes_cold_boot_vs_reconnect_trigger— operatortriage signal.
Updated tests (
tests/gateway/test_telegram_network_reconnect.py):test_heartbeat_probe_no_op_when_polling_healthy— asserts probe routesthrough
_request[0], notbot.get_me().test_heartbeat_probe_reenters_ladder_when_polling_pool_times_out—renamed from
_get_me_times_outand updated for new probe.test_heartbeat_probe_reenters_ladder_on_polling_pool_network_error—renamed from
_get_me_network_errorand updated.Manual smoke test (recommended for reviewers)
api.telegram.orgfor 60s during cold boot:restart required to recover.
_request[0]failure, re-enters the existing reconnect ladder.Residual blind spot
Flood-wait masking. If Telegram is rate-limiting the bot,
getMevia_request[0]returns success (different rate-limit bucket fromgetUpdates).The probe says healthy; the polling pool is throttled, not wedged. Same
false-positive shape as the previous
bot.get_me()probe — just narrower.A future PR could inspect rate-limit response headers; out of scope here.
Rollback
no schema change.
bot._requestvalidation) makes the probe self-disableon PTB version drift — built-in escape hatch without requiring a code
change.
References
_bot.py:717getUpdates: https://core.telegram.org/bots/api#getupdates