[Bug] Telegram polling silently dies for 30+ min with no error and no auto-recovery; `pollingStallThresholdMs` watchdog suppressed by unrelated API activity
## Summary

The Telegram channel's `getUpdates` long-poll silently stopped delivering inbound updates for 39 minutes on a healthy gateway. No errors, no warnings, no recovery — the polling watchdog (`pollingStallThresholdMs`, default 120000 ms) never fired. A manual `openclaw gateway restart` recovered it instantly.

The behavior is reproducible across two unrelated hosts (macOS arm64 + Windows 11 x64) running the same OpenClaw build, and reading `dist/monitor-polling.runtime-DjS2STzm.js` shows the watchdog's stall-detect logic uses an `||` that lets unrelated API activity suppress stall detection even when `getUpdates` is dead.
## Environment

- OpenClaw: 2026.5.4 (`325df3e`) (also seen on 2026.5.3-1 before upgrade)
- OS A: Windows 11 x64, Node v22.22.2, gateway started via Task Scheduler `gateway.cmd`
- OS B: macOS Darwin 25.4.0 arm64, Node v25.9.0
- Mode: polling (no webhook configured)
- Network: US home broadband, no proxy, api.telegram.org directly reachable
- Account config (excerpt):

```json
{
  "enabled": true,
  "dmPolicy": "pairing",
  "groupPolicy": "allowlist",
  "streaming": { "mode": "off", "preview": { "toolProgress": false } },
  "timeoutSeconds": 180,
  "retry": { "attempts": 5, "minDelayMs": 500, "maxDelayMs": 30000, "jitter": 0.2 }
  // pollingStallThresholdMs not overridden — using default 120000
}
```
## Incident timeline (Windows host, all times EDT)

| Time | Evidence |
| --- | --- |
| 01:02:42 | Last Telegram-channel embedded run start event recorded by gateway. |
| 01:02:42 → 01:45 | 39 min of silence. Gateway healthy: probe ok, port listening, crons firing on schedule. Heartbeats normal. No errors, no crashes, no warnings. Watchdog never logged a stall. Webhook heartbeats showed `webhooks=0/0/0`. `compliance.jsonl` has zero inbound from the sender during this window. |
| 01:41 | User sends a Telegram DM from a paired account. Message never reaches the gateway. Telegram client shows it as delivered to the bot. |
| 01:45 | `openclaw gateway restart` issued. |
| 01:45 → now | Polling resumes immediately. Inbound updates start flowing again, including the previously-missed 01:41 message via offset replay. |

The same pattern was observed on the macOS host on a separate occasion the same week.
## Expected vs actual

Expected: With `pollingStallThresholdMs=120000` (2 min default), if `getUpdates` stops completing for >2 min, the polling watchdog restarts the polling runner and recovers.

Actual: Polling stayed wedged for 39 minutes (~19× the threshold). No `Polling stall detected` log. No transport rebuild. No restart cycle.
## Root cause hypothesis: `||` in `detectStall` allows unrelated API activity to suppress the watchdog

In `extensions/telegram/src/polling-liveness.ts` (bundled at `dist/monitor-polling.runtime-DjS2STzm.js` lines 78–87):
```js
detectStall(params) {
  const now = params.now ?? this.#now();
  const activeElapsed = this.#inFlightGetUpdates > 0 && this.#lastGetUpdatesStartedAt != null
    ? now - this.#lastGetUpdatesStartedAt
    : 0;
  const idleElapsed = this.#inFlightGetUpdates > 0
    ? 0
    : now - (this.#lastGetUpdatesFinishedAt ?? this.#lastGetUpdatesAt);
  const elapsed = this.#inFlightGetUpdates > 0 ? activeElapsed : idleElapsed;
  const apiElapsed = now - (this.#latestInFlightApiStartedAt == null
    ? this.#lastApiActivityAt
    : Math.max(this.#lastApiActivityAt, this.#latestInFlightApiStartedAt));
  if (elapsed <= params.thresholdMs || apiElapsed <= params.thresholdMs) return null;
  // ...
}
```
Two issues compound here:

### 1. `||` instead of `&&`

The stall is suppressed if either the `getUpdates` elapsed time or the any-API elapsed time is below the threshold. So any non-`getUpdates` API call (e.g. an outbound `sendMessage` from a cron, or a `getMe` health check) within the last 2 min keeps `apiElapsed` low and silences the watchdog even when `getUpdates` is fully dead.

This contradicts the watchdog's documented purpose: detect `getUpdates` liveness death and restart polling. Outbound-sender health does not imply inbound is working — they're independent code paths from Telegram's perspective.
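A minimal standalone model of the gate (a hypothetical reconstruction using the field names from the bundle excerpt, not the shipped `PollingLiveness` class) makes the suppression concrete:

```typescript
// Minimal model of the buggy stall gate (hypothetical reconstruction,
// not the shipped class; names follow the bundle excerpt above).
class LivenessModel {
  inFlightGetUpdates = 0;
  lastGetUpdatesStartedAt = 0;
  lastApiActivityAt = 0;

  detectStall(now: number, thresholdMs: number): { elapsedMs: number } | null {
    // Simplified to the in-flight branch relevant to this incident.
    const elapsed =
      this.inFlightGetUpdates > 0 ? now - this.lastGetUpdatesStartedAt : 0;
    const apiElapsed = now - this.lastApiActivityAt;
    // Buggy gate: EITHER signal being fresh suppresses the stall.
    if (elapsed <= thresholdMs || apiElapsed <= thresholdMs) return null;
    return { elapsedMs: elapsed };
  }
}

const m = new LivenessModel();
const now = 10_000_000;
m.inFlightGetUpdates = 1;
m.lastGetUpdatesStartedAt = now - 180_000; // getUpdates wedged for 3 min
m.lastApiActivityAt = now - 30_000;        // unrelated sendMessage 30 s ago

console.log(m.detectStall(now, 120_000)); // null — stall suppressed

m.lastApiActivityAt = now - 150_000;       // no other API traffic either
console.log(m.detectStall(now, 120_000)); // { elapsedMs: 180000 } — fires
```

Only when *both* signals go stale does the gate let the stall through, which is exactly why ambient outbound traffic hides a dead inbound poll.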
### 2. `noteGetUpdatesError` and `noteGetUpdatesSuccess` both bump `lastApiActivityAt`

```js
noteGetUpdatesSuccess(result, at) { ...; this.#lastApiActivityAt = at; ... }
noteGetUpdatesError(err, at) { ...; this.#lastApiActivityAt = at; ... }
```

If `getUpdates` is in a tight retry-error loop (server returning 502s, transient socket closes), every error bumps `lastApiActivityAt`, again silencing the watchdog. The watchdog should treat successful API activity as the only liveness signal, not error events.
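A small sketch of those notes (method names taken from the issue text; a hypothetical model, not verified against the real class) shows how an error loop keeps the activity clock fresh:

```typescript
// Hypothetical model of the liveness notes described above.
class ApiActivityModel {
  lastApiActivityAt = 0;

  noteGetUpdatesSuccess(at: number): void { this.lastApiActivityAt = at; }
  // Problem: an *error* also counts as "activity".
  noteGetUpdatesError(_err: unknown, at: number): void { this.lastApiActivityAt = at; }

  apiElapsed(now: number): number { return now - this.lastApiActivityAt; }
}

const activity = new ApiActivityModel();
const now = 10_000_000;
// getUpdates has returned nothing but 502s every 5 s for 10 minutes:
for (let t = now - 600_000; t <= now; t += 5_000) {
  activity.noteGetUpdatesError(new Error("502 Bad Gateway"), t);
}
// Despite 10 minutes of pure failure, the gate sees "fresh" activity:
console.log(activity.apiElapsed(now)); // 0 — well under any threshold
```

Dropping the `lastApiActivityAt` bump from the error path would let `apiElapsed` grow past the threshold during such a loop.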
## Why "no logs, no errors" in the incident

If `getUpdates` was either:

- (a) hung in-flight on a TCP zombie socket — `inFlightGetUpdates>0` but the request never returns. `lastApiActivityAt` is whatever the last completed activity was. After 2 min, `apiElapsed > thresholdMs` → the stall should fire — unless any other internal `bot.api.*` call (including grammY's own internal `getMe` / `setMyCommands` / etc.) was made within the window, suppressing the check; or
- (b) returning empty arrays via grammY's own internal long-poll loop after the runner's internal task got into a bad state where the per-cycle `bot.api.config.use` middleware no longer sees the calls. Then `lastApiActivityAt` stays at whatever it was when the bot was last "alive," and again any other API touch hides the stall.

Both are consistent with the observed evidence: no error logs, no recovery, immediate fix on restart.
## Suggested fix

1. Change `||` to `&&` on the stall-detect gate, so the watchdog fires whenever `getUpdates` itself is stale, regardless of unrelated API activity:

   ```js
   if (elapsed <= params.thresholdMs && apiElapsed <= params.thresholdMs) return null;
   ```

   Or — preferably — drop `apiElapsed` from the gate entirely. Inbound-update liveness is the only thing the watchdog is supposed to protect. Outbound API health is orthogonal.

2. Stop bumping `lastApiActivityAt` from `noteGetUpdatesError`. Errors are not liveness; they are signals that the runner is failing. Otherwise a tight error-retry loop perpetually suppresses the watchdog.

3. Add a 30-minute "last-resort" tier independent of the per-cycle watchdog: if `lastTransportActivityAt` is older than `TELEGRAM_POLLING_STALE_TRANSPORT_MS` (30 min, the same threshold `collectTelegramPollingRuntimeIssues` already uses for status-issue surfacing), force a session-level restart via a `setMyCommands`-style health probe + `runner.stop()`. The status-issue threshold and the watchdog disagree by 15× — nothing acts on the 30-min staleness today; it only surfaces in `openclaw channels status`.

4. Emit a structured warning the first time the watchdog detects a stall, even if it ends up being a false alarm. Right now stall events are logged via `opts.log` to whatever stdout/stderr the gateway has — under Task Scheduler on Windows that's discarded by default, which contributed to the debugging difficulty here.
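As a sketch, either shape of the corrected gate (illustrative helper functions, not the shipped code) fires in the incident scenario:

```typescript
// Illustrative sketches of the proposed gate, not the shipped code.

// Preferred form: only getUpdates staleness decides; unrelated API
// freshness cannot suppress the stall.
function isGetUpdatesStalled(elapsedMs: number, thresholdMs: number): boolean {
  return elapsedMs > thresholdMs;
}

// The minimal `&&` form behaves the same way in this scenario, since a
// stale getUpdates alone now breaks the gate:
function isStalledMinimalFix(
  elapsedMs: number,
  apiElapsedMs: number,
  thresholdMs: number,
): boolean {
  return !(elapsedMs <= thresholdMs && apiElapsedMs <= thresholdMs);
}

// Incident numbers: getUpdates wedged 3 min, unrelated API call 30 s ago.
console.log(isGetUpdatesStalled(180_000, 120_000));         // true
console.log(isStalledMinimalFix(180_000, 30_000, 120_000)); // true — fires now
```

Note the minimal `&&` form still fires on a stale `apiElapsed` alone, which is why dropping `apiElapsed` from the gate entirely is the cleaner semantics.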
## Repro outline

I don't have a deterministic in-vitro repro for the underlying network stall yet (it's a transient zombie-socket condition that surfaced naturally over multiple weeks). I can reproduce the watchdog suppression trivially:

1. Start the gateway with Telegram polling enabled.
2. While polling is running, drop the upstream TCP connection mid-`getUpdates` (e.g. block port 443 to api.telegram.org, or use `pfctl`/`netsh` to drop the SYN-ACK).
3. Concurrently fire any non-`getUpdates` API call every 60 s — e.g. a cron that does `bot.api.getMe()` or a `sendMessage` to a chat the network reaches. (You can simulate this with a separate process holding the same token, but in practice gateway-internal traffic alone will trigger it in many deployments.)
4. Observe: the inbound poll is dead but the `pollingStallThresholdMs` watchdog never fires — no transport rebuild, no `[telegram] Polling stall detected (...)` log.

A cleaner unit-test repro: in `polling-liveness.test.ts`, set `inFlightGetUpdates=1` and `lastGetUpdatesStartedAt=now-180000`, but call `noteApiCallStarted()` at `now-30000`. `detectStall({ thresholdMs: 120000 })` returns `null` → bug. It should return a stall.
## Cross-references / not duplicates

- Not a duplicate of the `modelOverrideSource:"auto"` persistence issue we hit; we're filing the modelOverride half separately.
## What I'm asking for

- Confirm or refute the `||` analysis in `polling-liveness.ts`
- Decide whether the watchdog should care about `apiElapsed` at all; my read is no
- Add a `[telegram][diag]` log line on stall detection even if no restart fires (currently `[telegram][diag] polling cycle finished/error reason=...` only logs on cycle exit)

Happy to test patches against the affected hosts.