
[Bug] Telegram polling silently dies for 30+ min with no error and no auto-recovery; pollingStallThresholdMs watchdog suppressed by unrelated API activity #78422

@joeywrightphoto

Description


Summary

The Telegram channel's getUpdates long-poll silently stopped delivering inbound updates for 39 minutes on a healthy gateway. No errors, no warnings, no recovery — the polling watchdog (pollingStallThresholdMs, default 120000ms) never fired. A manual openclaw gateway restart recovered it instantly.

The behavior is reproducible across two unrelated hosts (macOS arm64 + Windows 11 x64) running the same OpenClaw build. Reading dist/monitor-polling.runtime-DjS2STzm.js shows that the watchdog's stall-detect logic uses an || that lets unrelated API activity suppress stall detection even when getUpdates is dead.

Environment

  • OpenClaw: 2026.5.4 (325df3e) (also seen on 2026.5.3-1 before upgrade)
  • OS A: Windows 11 x64, Node v22.22.2, gateway started via Task Scheduler gateway.cmd
  • OS B: macOS Darwin 25.4.0 arm64, Node v25.9.0
  • Mode: polling (no webhook configured)
  • Network: US home broadband, no proxy, api.telegram.org directly reachable
  • Account config (excerpt):
    {
      "enabled": true,
      "dmPolicy": "pairing",
      "groupPolicy": "allowlist",
      "streaming": { "mode": "off", "preview": { "toolProgress": false } },
      "timeoutSeconds": 180,
      "retry": { "attempts": 5, "minDelayMs": 500, "maxDelayMs": 30000, "jitter": 0.2 }
      // pollingStallThresholdMs not overridden — using default 120000
    }

Incident timeline (Windows host, all times EDT)

Time               Evidence
01:02:42           Last Telegram-channel embedded run start event recorded by the gateway.
01:02:42 → 01:45   39 min of silence. Gateway otherwise healthy: probe ok, port listening, crons firing on schedule, heartbeats normal. No errors, no crashes, no warnings. Watchdog never logged a stall. Webhook heartbeats showed webhooks=0/0/0; compliance.jsonl has zero inbound from the sender during this window.
01:41              User sends a Telegram DM from a paired account. The message never reaches the gateway; the Telegram client shows it as delivered to the bot.
01:45              openclaw gateway restart issued.
01:45 → now        Polling resumes immediately. Inbound updates start flowing again, including the previously-missed 01:41 message via offset replay.

The same pattern was observed on the macOS host on a separate occasion the same week.

Expected vs actual

Expected: With pollingStallThresholdMs=120000 (2 min default), if getUpdates stops completing for >2 min, the polling watchdog restarts the polling runner and recovers.

Actual: Polling stayed wedged for 39 minutes (~19× the threshold). No "Polling stall detected" log. No transport rebuild. No restart cycle.

Root cause hypothesis: || in detectStall allows unrelated API activity to suppress watchdog

In extensions/telegram/src/polling-liveness.ts (bundled at dist/monitor-polling.runtime-DjS2STzm.js lines 78–87):

detectStall(params) {
    const now = params.now ?? this.#now();
    // staleness of an in-flight getUpdates request:
    const activeElapsed = this.#inFlightGetUpdates > 0 && this.#lastGetUpdatesStartedAt != null
        ? now - this.#lastGetUpdatesStartedAt : 0;
    // staleness since the last completed getUpdates request:
    const idleElapsed = this.#inFlightGetUpdates > 0
        ? 0
        : now - (this.#lastGetUpdatesFinishedAt ?? this.#lastGetUpdatesAt);
    const elapsed = this.#inFlightGetUpdates > 0 ? activeElapsed : idleElapsed;
    // staleness of ANY Bot API activity, getUpdates or not:
    const apiElapsed = now - (this.#latestInFlightApiStartedAt == null
        ? this.#lastApiActivityAt
        : Math.max(this.#lastApiActivityAt, this.#latestInFlightApiStartedAt));
    // BUG: the || lets a recent unrelated API call suppress the stall
    if (elapsed <= params.thresholdMs || apiElapsed <= params.thresholdMs) return null;
    ...
}

Two issues compound here:

1. || instead of &&

The stall is suppressed if either getUpdates-elapsed or any-API-elapsed is below threshold. So any non-getUpdates API call (e.g. an outbound sendMessage from a cron, or a getMe health check) within the last 2 min keeps apiElapsed low and silences the watchdog even when getUpdates is fully dead.

This contradicts the watchdog's documented purpose: detect getUpdates liveness death and restart polling. Outbound-sender health does not imply inbound is working — they're independent code paths from Telegram's perspective.
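To make the suppression concrete, here is the gate evaluated with illustrative numbers from the incident window (the 60 s cron cadence below is an assumption, not a measured value):

// Illustrative numbers only (all in ms); the 60 s cron cadence is assumed.
const thresholdMs = 120_000;
const elapsed = 39 * 60_000;  // getUpdates last made progress 39 min ago
const apiElapsed = 60_000;    // an unrelated sendMessage/getMe ran 1 min ago

// Shipped gate: apiElapsed <= thresholdMs is true, so the || short-circuits
// to "no stall" and detectStall returns null. The dead poll goes unreported.
console.log(elapsed <= thresholdMs || apiElapsed <= thresholdMs); // true → suppressed

// Proposed gate: elapsed <= thresholdMs is false, so the && fails and the
// watchdog fires on getUpdates staleness alone.
console.log(elapsed <= thresholdMs && apiElapsed <= thresholdMs); // false → stall fires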

2. noteGetUpdatesError and noteGetUpdatesSuccess both bump lastApiActivityAt

noteGetUpdatesSuccess(result, at) { ...; this.#lastApiActivityAt = at; ... }
noteGetUpdatesError(err, at)      { ...; this.#lastApiActivityAt = at; ... }

If getUpdates is in a tight retry-error loop (server returning 502s, transient socket close), every error bumps lastApiActivityAt, again silencing the watchdog. The watchdog should treat successful API activity as the only liveness signal, not error events.
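A minimal sketch of the proposed rule, assuming the method shapes quoted above (the class scaffolding here is illustrative, not OpenClaw source):

// Sketch only: field names mirror the dist excerpt; everything else is assumed.
class LivenessClock {
  #lastApiActivityAt = 0;
  #lastGetUpdatesFinishedAt = 0;

  noteGetUpdatesSuccess(_result: unknown, at: number) {
    this.#lastGetUpdatesFinishedAt = at;
    this.#lastApiActivityAt = at; // a completed poll is genuine liveness
  }

  noteGetUpdatesError(_err: unknown, at: number) {
    this.#lastGetUpdatesFinishedAt = at;
    // Deliberately no #lastApiActivityAt bump: in a tight 502/socket-close
    // retry loop, error events would otherwise re-arm the suppression
    // window every few seconds and keep the watchdog silent indefinitely.
  }
}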

Why "no logs, no errors" in the incident

If getUpdates was either:

  • (a) Hung in-flight on a TCP zombie socket: inFlightGetUpdates>0 but the request never returns, and lastApiActivityAt is whatever the last completed activity was. After 2 min, apiElapsed > thresholdMs, so the stall should fire — unless any other internal bot.api.* call (including grammY's own internal getMe / setMyCommands / etc.) was made within the window, suppressing the check; or
  • (b) Returning empty arrays via grammY's own internal long-poll loop after the runner's internal task got into a bad state where the per-cycle bot.api.config.use middleware no longer sees the calls. Then lastApiActivityAt stays at whatever it was when the bot was last "alive", and again any other API touch hides the stall.

Both are consistent with the observed evidence (no error logs, no recovery, immediate fix on restart).

Suggested fix

  1. Change || to && on the stall-detect gate, so the watchdog fires whenever getUpdates itself is stale, regardless of unrelated API activity:

    if (elapsed <= params.thresholdMs && apiElapsed <= params.thresholdMs) return null;

    Or, preferably, drop apiElapsed from the gate entirely: inbound-update liveness is the only thing the watchdog is supposed to protect, and outbound API health is orthogonal. (A fuller sketch of this variant follows this list.)

  2. Stop bumping lastApiActivityAt from noteGetUpdatesError. Errors are not liveness; they are signals the runner is failing. Otherwise a tight error-retry loop perpetually suppresses the watchdog.

  3. Add a 30-minute "last-resort" tier independent of the per-cycle watchdog: if lastTransportActivityAt is older than TELEGRAM_POLLING_STALE_TRANSPORT_MS (30 min, the same threshold collectTelegramPollingRuntimeIssues already uses for status-issue surfacing), force a session-level restart via a setMyCommands-style health probe plus runner.stop(). Nothing acts on the 30-min staleness today; it only surfaces in openclaw channels status, so the status-issue threshold and the watchdog disagree by 15×.

  4. Emit a structured warning the first time the watchdog detects a stall, even if it ends up being a false alarm. Right now stall events are logged via opts.log to whatever stdout/stderr the gateway has; under Task Scheduler on Windows that output is discarded by default, which made this incident considerably harder to debug.
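A minimal sketch of the preferred variant from item 1, dropping apiElapsed entirely (field names mirror the dist excerpt; the snapshot type and return shape are assumptions, not OpenClaw source):

// Sketch only: gate on getUpdates staleness alone, so unrelated
// sendMessage/getMe traffic can no longer mask a dead inbound poll.
interface LivenessSnapshot {
  inFlightGetUpdates: number;
  lastGetUpdatesStartedAt: number | null;
  lastGetUpdatesFinishedAt: number | null;
  lastGetUpdatesAt: number;
}

function detectGetUpdatesStall(
  s: LivenessSnapshot,
  thresholdMs: number,
  now = Date.now(),
): { elapsedMs: number } | null {
  const elapsed = s.inFlightGetUpdates > 0
    ? now - (s.lastGetUpdatesStartedAt ?? now)                  // hung in-flight poll
    : now - (s.lastGetUpdatesFinishedAt ?? s.lastGetUpdatesAt); // idle gap between polls
  return elapsed > thresholdMs ? { elapsedMs: elapsed } : null;
}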

Repro outline

I don't have a deterministic in-vitro repro for the underlying network stall yet (it's a transient zombie-socket condition that surfaced naturally over multiple weeks). I can repro the watchdog suppression trivially by:

  1. Start gateway with Telegram polling enabled.
  2. While polling is running, drop the upstream TCP connection mid-getUpdates (e.g. block port 443 to api.telegram.org, or use pfctl/netsh to drop SYN-ACK).
  3. Concurrently fire any non-getUpdates API call every 60s, e.g. a cron that does bot.api.getMe() or a sendMessage to a chat the network reaches. (You can simulate this with a separate process holding the same token, as sketched after this list, but in practice gateway-internal traffic alone will trigger it in many deployments.)
  4. Observe: the inbound poll is dead but pollingStallThresholdMs never fires, no transport rebuild, no [telegram] Polling stall detected (...) log.
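A hypothetical standalone suppressor for step 3, using grammY with the same bot token (the BOT_TOKEN env var name is an assumption):

// Any successful non-getUpdates call keeps apiElapsed under the 2 min threshold.
import { Bot } from "grammy";

const bot = new Bot(process.env.BOT_TOKEN!);

setInterval(() => {
  bot.api.getMe()
    .then((me) => console.log("getMe ok:", me.username))
    .catch((err) => console.error("getMe failed:", err));
}, 60_000);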

A cleaner unit-test repro: in polling-liveness.test.ts, set inFlightGetUpdates=1, lastGetUpdatesStartedAt=now-180000, but call noteApiCallStarted() at now-30000. detectStall({ thresholdMs: 120000 }) returns null → bug. Should return a stall.
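A sketch of that test with Vitest-style assertions; PollingLiveness and noteGetUpdatesStarted are assumed names, while detectStall and noteApiCallStarted are the methods referenced above:

// Sketch for polling-liveness.test.ts; the export name is hypothetical.
import { describe, expect, it } from "vitest";
import { PollingLiveness } from "../src/polling-liveness";

describe("detectStall", () => {
  it("reports a stall on a hung getUpdates despite recent unrelated API activity", () => {
    const now = 1_000_000;
    const liveness = new PollingLiveness();

    liveness.noteGetUpdatesStarted(now - 180_000); // poll in flight for 3 min
    liveness.noteApiCallStarted(now - 30_000);     // unrelated API call 30 s ago

    // Today this returns null (the bug). After the fix it must report a
    // stall, because getUpdates alone is past the 120 s threshold.
    expect(liveness.detectStall({ thresholdMs: 120_000, now })).not.toBeNull();
  });
});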

Cross-references / not duplicates

What I'm asking for

  • Confirm or refute the || analysis in polling-liveness.ts
  • Decide whether the watchdog should care about apiElapsed at all; my read is no
  • Add a [telegram][diag] log line on stall detection even if no restart fires (currently [telegram][diag] polling cycle finished/error reason=... only logs on cycle exit)

Happy to test patches against the affected hosts.
