Skip to content

fix(gateway): detect Telegram polling-pool wedges via correct pool#21548

Open
iws17 wants to merge 3 commits into
NousResearch:mainfrom
iws17:fix/telegram-polling-pool-wedge-detection
Open

fix(gateway): detect Telegram polling-pool wedges via correct pool#21548
iws17 wants to merge 3 commits into
NousResearch:mainfrom
iws17:fix/telegram-polling-pool-wedge-detection

Conversation

@iws17

@iws17 iws17 commented May 7, 2026

Copy link
Copy Markdown

Summary

Fixes #5729 — Telegram silent wedges where getMe, getWebhookInfo, and curl all
work but messages are not handled until manual hermes gateway restart.

The existing _verify_polling_after_reconnect heartbeat probe at
gateway/platforms/telegram.py:596 calls bot.get_me(). Per PTB v22
_bot.py:717:

request = self._request[0] if endpoint == "getUpdates" else self._request[1]

bot.get_me() routes through the general pool (_request[1]); the long-poll
runs on the polling pool (_request[0]). _drain_polling_connections only
resets _request[0]. After a botched drain+reconnect — or a cold-boot wedge
where the polling pool's first getUpdates hits a stale DNS/TCP race that doesn't
raise — _request[1] stays healthy and bot.get_me() returns instantly while
_request[0] is wedged. The probe falsely passes; the gateway sits silent.

This matches @rblakemesser's 2026-05-07 field report on #5729 exactly:

at diagnosis time: curl https://api.telegram.org/ worked. getMe worked.
getWebhookInfo showed no webhook and pending_update_count=0. launchd still
reported the gateway running.

Changes

Two surgical changes (Change 2 depends on Change 1). Shipping Change 2
without Change 1 is strictly worse than the status quo because it would
schedule a probe of the wrong pool.

1. Replace the probe primitive

Bypass Bot._do_post's endpoint dispatch and call BaseRequest.post directly
on _request[0] against getMe — same connection pool the long-poll uses,
non-destructive Telegram-side semantics:

pool = self._app.bot._request[0]
async with self._probe_drain_lock:
    await asyncio.wait_for(pool.post(f"{base}/getMe"), PROBE_TIMEOUT)

Using any flavor of getUpdates would be destructive per Telegram's API
contract — offset=-1 "forgets" the queue
and concurrent getUpdates calls on the same pool are forbidden.

PROBE_TIMEOUT raised from 10s to 65s to exceed the long-poll contention
window (the polling pool may be busy with an in-flight getUpdates long-poll).

2. Schedule the probe after cold-boot start_polling

Previously, _verify_polling_after_reconnect was only triggered from
_handle_polling_network_error. The cold-boot connect() path called
start_polling() and fell through to _mark_connected() with no
post-start probe.

A wedge that starts at cold boot produces no exception (the long-poll task is
running, just not making progress) → no error callback → no heartbeat
scheduled → silent failure. The cold-boot probe is the only signal for that
class.

Supporting hardening

  • Shared asyncio.Lock between the probe and _drain_polling_connections
    so the probe never hits a half-torn-down pool.
  • Startup shape-check on bot._request (PTB-internal API). Custom
    HTTPXRequest injection via ApplicationBuilder.request() /
    get_updates_request() can replace the tuple, and a PTB minor bump may
    rename or restructure it. On shape mismatch the probe logs a warning and
    skips — never crashes, never falsely declares wedge.
  • Two-strike escalation: two probe failures within 5 minutes → fatal
    retryable, supervisor restarts the process. The slow ladder couldn't fix
    what the first reconnect missed; a fresh interpreter clears any stale
    resolver / connection state.
  • trigger= parameter on the probe so logs distinguish cold_boot vs
    reconnect for operator triage.
  • Token redaction in logs: bot.base_url literally contains the bot
    token (Telegram's URL format is https://api.telegram.org/bot<TOKEN>/).
    Probe-related logs use the exception class name and trigger, never the
    full URL or repr(probe_err).
  • Probe task cancellation on disconnect() to avoid Task was destroyed but it is pending warnings during shutdown.

Adjacent in-flight work (distinct surfaces)

Test plan

pytest tests/gateway/test_telegram_polling_pool_wedge.py
pytest tests/gateway/test_telegram_network_reconnect.py
pytest tests/gateway/test_telegram*.py

Result: 368 telegram tests pass.

New tests (tests/gateway/test_telegram_polling_pool_wedge.py):

  • test_probe_routes_through_polling_pool_not_general_pool — wedged
    _request[0] with healthy _request[1] is detected; bot.get_me() is
    not called.
  • test_cold_boot_schedules_polling_pool_probe — probe scheduled by
    connect() after start_polling, not just by _handle_polling_network_error.
  • test_probe_skips_when_request_shape_mismatch — non-tuple bot._request
    → log warning + skip, no crash, no false-positive wedge.
  • test_probe_and_drain_mutually_exclude — concurrent probe blocks while
    drain holds _probe_drain_lock.
  • test_drain_acquires_shared_probe_drain_lock — symmetric.
  • test_two_probe_failures_within_window_escalate_to_fatal — two failures
    in 5min → _set_fatal_error("telegram_network_error", retryable=True).
  • test_disconnect_cancels_pending_probe_task — pending probe cancelled,
    not leaked.
  • test_probe_does_not_advance_long_poll_offset — probe leaves Updater's
    _last_update_id untouched (regression guard against any future
    getUpdates-flavored probe).
  • test_wedge_log_distinguishes_cold_boot_vs_reconnect_trigger — operator
    triage signal.

Updated tests (tests/gateway/test_telegram_network_reconnect.py):

  • test_heartbeat_probe_no_op_when_polling_healthy — asserts probe routes
    through _request[0], not bot.get_me().
  • test_heartbeat_probe_reenters_ladder_when_polling_pool_times_out
    renamed from _get_me_times_out and updated for new probe.
  • test_heartbeat_probe_reenters_ladder_on_polling_pool_network_error
    renamed from _get_me_network_error and updated.

Manual smoke test (recommended for reviewers)

  1. Start gateway with Telegram polling configured.
  2. Block DNS for api.telegram.org for 60s during cold boot:
    # macOS:    sudo pfctl -t blocklist -T add api.telegram.org
    # Linux:    sudo iptables -A OUTPUT -d api.telegram.org -j REJECT
    # restore after 60s
    
  3. Without this PR: polling logs "started", no message handling, manual
    restart required to recover.
  4. With this PR: post-connect probe at T+60s detects the wedge via
    _request[0] failure, re-enters the existing reconnect ladder.
  5. Send a Telegram message at T+90s — confirm it's handled.

Residual blind spot

Flood-wait masking. If Telegram is rate-limiting the bot, getMe via
_request[0] returns success (different rate-limit bucket from getUpdates).
The probe says healthy; the polling pool is throttled, not wedged. Same
false-positive shape as the previous bot.get_me() probe — just narrower.
A future PR could inspect rate-limit response headers; out of scope here.

Rollback

  • Both changes are additive. Revert is a single PR revert, no migration,
    no schema change.
  • The shape-check (bot._request validation) makes the probe self-disable
    on PTB version drift — built-in escape hatch without requiring a code
    change.

References

The existing _verify_polling_after_reconnect heartbeat probe called
bot.get_me(), which routes through PTB v22's general request pool
(_request[1]). The long-poll runs on the polling pool (_request[0]).
A wedged polling pool with a healthy general pool — the failure mode
in NousResearch#5729 — passes the existing probe falsely.

Two changes (Change 2 depends on Change 1):

1. Replace the probe primitive so it routes through _request[0] via
   a non-destructive getMe call, bypassing Bot._do_post's endpoint
   dispatch:

       pool = bot._request[0]
       await pool.post(f"{bot.base_url}/getMe")

   Using any flavor of getUpdates would be destructive per Telegram's
   API contract (offset=-1 forgets the queue; concurrent getUpdates
   is forbidden).

2. Schedule the same probe after cold-boot start_polling, not just
   after error-driven reconnects. The error callback can't see a
   silent cold-boot wedge — the long-poll task runs but never makes
   progress and never raises — so an explicit probe is the only signal.
   This is load-bearing: shipping without (1) would schedule a probe
   of the wrong pool.

Plus four small hardening pieces:
- Shared asyncio.Lock between the probe and _drain_polling_connections
  so the probe never hits a half-torn-down pool.
- Startup shape-check on bot._request with log-and-skip fallback for
  custom HTTPXRequest injection or PTB version drift.
- Two-strike escalation: two probe failures within 5 minutes escalate
  to fatal-retryable so the supervisor restarts the process from a
  clean interpreter (the slow ladder couldn't fix what the first
  reconnect missed).
- trigger= parameter on the probe so logs distinguish cold_boot vs
  reconnect for operator triage.

Probe-related logs redact bot.base_url (which contains the bot token
in Telegram's URL format).

The probe task is tracked on the adapter and cancelled on disconnect()
to avoid "Task was destroyed but it is pending" warnings.

Tests:
- 9 new tests in test_telegram_polling_pool_wedge.py covering pool
  routing, cold-boot scheduling, shape-check skip, probe/drain mutual
  exclusion, two-strike escalation, disconnect cancellation, long-poll
  offset preservation, and trigger-distinguishing logs.
- 2 existing tests in test_telegram_network_reconnect.py updated for
  the new probe shape.

Field reports: rblakemesser (2026-05-07) and alt-glitch (2026-04-30).

Fixes NousResearch#5729

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a Telegram gateway failure mode where polling can silently wedge: getMe/curl succeed while the long-polling getUpdates pool is stuck, so the existing heartbeat falsely reports “healthy.” The change targets python-telegram-bot’s split request pools by probing the polling pool directly and adds a cold-boot probe to catch wedges that never trigger the reconnect error callback.

Changes:

  • Replace the heartbeat probe primitive to exercise bot._request[0] (polling pool) directly via BaseRequest.post(.../getMe) and add a two-strike escalation window.
  • Schedule the polling-pool probe after cold-boot start_polling() as well as after reconnect.
  • Add/adjust tests to pin correct pool routing, cold-boot scheduling, lock mutual exclusion with drain, escalation behavior, and shutdown cancellation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
gateway/platforms/telegram.py Routes wedge probe through the polling pool, adds strike escalation + shared probe/drain lock, schedules probe on cold boot, and cancels probe on disconnect.
tests/gateway/test_telegram_polling_pool_wedge.py New focused regression suite covering polling-pool wedge detection, scheduling, locking, escalation, and log safety.
tests/gateway/test_telegram_network_reconnect.py Updates heartbeat-probe tests to assert polling-pool routing (_request[0]) instead of bot.get_me().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +598 to +604
if self.has_fatal_error:
return
probe = asyncio.ensure_future(self._verify_polling_after_reconnect(trigger=trigger))
self._background_tasks.add(probe)
probe.add_done_callback(self._background_tasks.discard)
self._latest_probe_task = probe

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 535976c.

_schedule_polling_pool_probe now cancels any previously pending probe before scheduling a new one — at-most-one-probe-in-flight invariant. _verify_polling_after_reconnect also bails on self._app is None after the heartbeat sleep as defense-in-depth, so a probe whose cancellation races with disconnect can't recurse into _handle_polling_network_error against a torn-down adapter.

Tests: test_schedule_cancels_previously_pending_probe, test_probe_bails_when_app_torn_down_after_sleep.

Comment thread gateway/platforms/telegram.py Outdated
# docstring above for the dispatch rationale.
# Serialize against `_drain_polling_connections` so the probe
# never hits a half-torn-down pool.
base = self._app.bot.base_url

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 535976c. Probe URL is now built with str(self._app.bot.base_url).rstrip("/") before appending /getMe. PTB v22's default _parse_base_url returns no trailing slash, but callable / format-map / local_mode=True configs can produce one — rstrip covers all cases.

Test: test_probe_url_normalizes_trailing_slash_in_base_url (passes a trailing-slash base_url and asserts no //getMe in the call URL).

@alt-glitch alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery platform/telegram Telegram bot adapter P1 High — major feature broken, no workaround labels May 7, 2026
Two fixes from @Copilot's review on PR NousResearch#21548:

1. At-most-one-probe invariant. `_schedule_polling_pool_probe` is called
   from cold-boot AND reconnect paths. Without cancelling a prior pending
   probe, multiple 60s-sleeping probes can accumulate. `disconnect()`
   only cancels `_latest_probe_task`, so older probes can wake on a
   torn-down adapter and recurse into `_handle_polling_network_error`
   against `_app=None`.

   Fix: cancel any previous pending probe before scheduling a new one,
   and bail in `_verify_polling_after_reconnect` if `self._app is None`
   after the heartbeat sleep (defense-in-depth).

2. URL normalization. PTB's default `base_url` has no trailing slash, but
   callable / format-map / `local_mode=True` configs can produce one,
   yielding `f"{base}/getMe"` → `.../bot<TOKEN>//getMe`. Most servers
   normalize, but it's avoidable.

   Fix: `str(self._app.bot.base_url).rstrip("/")` before appending the
   endpoint path.

Tests added:
- test_schedule_cancels_previously_pending_probe
- test_probe_bails_when_app_torn_down_after_sleep
- test_probe_url_normalizes_trailing_slash_in_base_url

371 telegram tests pass.
@iws17

iws17 commented May 7, 2026

Copy link
Copy Markdown
Author

Self-review pass on the test suite — pushed 06afd31f1 with two test-validity strengthenings:

  1. Token redaction assertion in test_probe_skips_when_request_shape_mismatch was using a weaker boolean form that could miss real token leaks; now uses a realistic token sentinel and asserts the literal token never appears in any captured log line.
  2. Mutual-exclusion test in test_probe_and_drain_mutually_exclude was patching asyncio.sleep such that scheduling yields didn't actually advance the event loop — confirmed by manually removing the probe's lock and observing the test now correctly fails. Test loop now uses real asyncio.sleep while only the probe's heartbeat sleep is patched.

Production code unchanged. 371 telegram tests pass.

Two test-validity bugs caught in a separate review pass:

1. Token redaction assertion was too weak. The check was:
       not ("<REDACTED>" in msg AND "base_url" in msg)
   A real token-bearing URL like `.../botREAL_TOKEN/getMe` would pass
   without the literal word `base_url`. Replaced with a realistic-looking
   token sentinel and a strong assertion that the token literal never
   appears in any captured log message.

2. Mutual-exclusion test patched `asyncio.sleep` with `AsyncMock`, then
   used `await asyncio.sleep(0)` as scheduling yields. With sleep patched,
   the yields don't advance the event loop — the probe task may never
   reach the `async with self._probe_drain_lock` line, and
   `pool0_post.await_count == 0` could pass for the wrong reason
   (test never gave the probe CPU time at all).

   Fix: capture real `asyncio.sleep` before patching, and only patch
   `gateway.platforms.telegram.asyncio.sleep` (so the test loop's
   yields use the real one). Verified by manually removing the
   probe's `async with self._probe_drain_lock:` and confirming the
   test now correctly fails (`pool0_post.await_count == 1`); restored.

Both findings are test-quality issues — production code is unchanged.
371 telegram tests pass.
@iws17 iws17 force-pushed the fix/telegram-polling-pool-wedge-detection branch from 06afd31 to 2abef87 Compare May 7, 2026 23:53
@iws17

iws17 commented May 7, 2026

Copy link
Copy Markdown
Author

Force-pushed to overwrite the previous head — the latest commit replaces a token-shaped string in a test sentinel that GitHub's secret scanner was flagging. The new sentinel ('TEST-FAKE-TOKEN-FOR-CI-DO-NOT-SCAN') is intentionally the wrong shape to match any provider's secret regex. Test logic is unchanged.

PR head is now 2abef870 (was 06afd31f1). 371 telegram tests pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P1 High — major feature broken, no workaround platform/telegram Telegram bot adapter type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Telegram resolver failure caching + missing degraded-state detection after cold-boot resolver exhaustion

3 participants