Telegram polling silently wedges after stall — transport rebuild never starts new polling cycle (5.4 + 5.5) #78473

@Rigbay

Description

Two related bugs in dist/monitor-polling.runtime-*.js reproduced in 2026.5.4 and 2026.5.5.

Symptom

  • Gateway is running; the telegram channel reports running, connected, mode:polling, and openclaw channels status --probe succeeds
  • Zero TCP connections from the gateway PID to 149.154.x or 91.108.x (Telegram backbone)
  • pending_update_count > 0 at telegram side, growing over time
  • No getUpdates / polling log entries for hours
  • Outbound sendMessage works fine (state-drift: gateway reports healthy while inbound is dead)
  • Multiple gateway restarts (systemctl --user restart openclaw-gateway) re-enter the same wedged state
  • Self-recovery eventually (~75 min in one case, indeterminate in another) — mechanism unclear; possibly when the npm package is replaced (e.g. openclaw update)
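
The state drift described above can be checked from outside the gateway. The following is a minimal sketch, not part of openclaw: it samples pending_update_count through the Bot API's getWebhookInfo method and flags a backlog that is non-zero and not shrinking. The function names, the injected fetchImpl parameter, and the "not shrinking" heuristic are assumptions made for illustration.

```javascript
// Sketch only: an external check for "gateway says healthy, inbound is dead".
// getWebhookInfo and pending_update_count come from the Telegram Bot API;
// the function names and the heuristic below are hypothetical.

async function fetchPendingUpdateCount(token, fetchImpl = fetch) {
  const res = await fetchImpl(`https://api.telegram.org/bot${token}/getWebhookInfo`);
  const body = await res.json();
  if (!body.ok) {
    throw new Error(`getWebhookInfo failed: ${body.description ?? "unknown error"}`);
  }
  return body.result.pending_update_count;
}

// A backlog that is non-zero and has not shrunk between two samples
// suggests that nothing is consuming updates via getUpdates.
function looksWedged(previousCount, currentCount) {
  return currentCount > 0 && currentCount >= previousCount;
}
```

Sampling twice a few minutes apart and passing both counts to looksWedged gives a signal that does not depend on the gateway's own health report.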

Bug 1 — masked stall detection

File: dist/monitor-polling.runtime-DjS2STzm.js (5.4) / dist/monitor-polling.runtime-DBv9gGnS.js (5.5)

Line 84:

if (elapsed <= params.thresholdMs || apiElapsed <= params.thresholdMs) return null;

apiElapsed is updated by noteApiCallSuccess() on any successful API call, including outbound sendMessage. As a result, stall detection is suppressed whenever there is outbound activity, even if getUpdates has hung indefinitely. The condition should likely use && instead of ||, or be reduced to if (elapsed <= params.thresholdMs) return null; since the time elapsed since the last completed poll alone determines whether polling has stalled.
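
To make the masking concrete, here is a small standalone sketch. Only the quoted condition comes from the bundle; the function shapes, the parameter names other than thresholdMs, the return value, and the 90-second threshold are assumptions chosen for illustration.

```javascript
// Hypothetical reconstruction: only the `||` condition is taken from the bundle.
function detectStallCurrent({ now, lastGetUpdatesAt, lastApiSuccessAt, thresholdMs }) {
  const elapsed = now - lastGetUpdatesAt;     // time since last completed getUpdates
  const apiElapsed = now - lastApiSuccessAt;  // time since ANY successful API call
  if (elapsed <= thresholdMs || apiElapsed <= thresholdMs) return null;
  return { elapsedMs: elapsed };
}

// Proposed variant: only the polling clock decides whether polling has stalled.
function detectStallProposed({ now, lastGetUpdatesAt, thresholdMs }) {
  const elapsed = now - lastGetUpdatesAt;
  if (elapsed <= thresholdMs) return null;
  return { elapsedMs: elapsed };
}

// getUpdates has been silent for 10 minutes, but a sendMessage succeeded 5 s ago.
const sample = {
  now: 600_000,
  lastGetUpdatesAt: 0,
  lastApiSuccessAt: 595_000,
  thresholdMs: 90_000, // assumed value, not read from the bundle
};

console.log(detectStallCurrent(sample));  // null (stall masked by outbound traffic)
console.log(detectStallProposed(sample)); // { elapsedMs: 600000 }
```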

Bug 2 — transport-rebuild silent failure

When stall IS detected (e.g. before any outbound activity occurs), the recovery sequence logs:

[telegram] Polling stall detected (no completed getUpdates for 149.99s); forcing restart.
[telegram] Polling runner stop timed out after 15s; forcing restart cycle.
[telegram][diag] polling cycle finished reason=polling stall detected
[telegram] Telegram polling runner stopped (...); restarting in 2.22s.
[telegram][diag] rebuilding transport for next polling cycle

…then silence. No new polling cycle starts, no error logged. #runPollingCycle() either never re-enters or hangs in a state that doesn't surface diagnostics.
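
One way to keep this step from failing silently is to bound the rebuild with a timeout and log every outcome. The sketch below assumes the rebuild is exposed as a promise-returning function; rebuildTransport, log, timeoutMs, and the log messages are hypothetical names for illustration, not openclaw's actual API or log output.

```javascript
// Sketch only: bound an async step with a timeout so a hang becomes a visible error.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical wrapper: returns true if the rebuild finished, false if it
// threw or hung. Either way, something is logged.
async function rebuildWithDiagnostics({ rebuildTransport, log, timeoutMs }) {
  try {
    await withTimeout(rebuildTransport(), timeoutMs, "transport rebuild");
    log("transport rebuild completed");
    return true;
  } catch (err) {
    log(`transport rebuild failed: ${err.message}`);
    return false;
  }
}
```

A caller could start the next polling cycle only when this returns true and schedule a retry otherwise, so the "then silence" case always leaves a log line behind.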

Cost / impact

Inbound messaging is down for 1–3 hours per occurrence. There were two occurrences on 2026-05-06 alone.

Trigger

Both occurrences followed an external disruption (network blip from Docker WSL toggle reset; auth-profile failure from Anthropic billing exhaustion). The disruption is recoverable in itself; the polling-restart code path doesn't survive it.

Workaround

Wait for self-recovery, or run openclaw update --tag <new-version> to replace the npm package and force a fresh load of the JS files.

Suggested fix

  1. Drop the apiElapsed check in detectStall — or use && — so stall-detection isn't masked by outbound activity.
  2. Add error/timeout handling in the transport-rebuild path so silent failures surface as logs.

Versions affected

  • openclaw@2026.5.4
  • openclaw@2026.5.5

Environment

  • Node v24.13.0 (nvm), Ubuntu (WSL2 on Windows 11)
  • Gateway managed by systemd-user
