Two related bugs in dist/monitor-polling.runtime-*.js reproduced in 2026.5.4 and 2026.5.5.
Symptom
- Gateway running, telegram channel reports running, connected, mode:polling, works via openclaw channels status --probe
- ZERO TCP from gateway PID to 149.154.x or 91.108.x (Telegram backbone)
- pending_update_count > 0 at telegram side, growing over time
- No getUpdates / polling log entries for hours
- Outbound sendMessage works fine (state-drift: gateway reports healthy while inbound is dead)
- Multiple gateway restarts (systemctl --user restart openclaw-gateway) re-enter the same wedged state
- Self-recovery eventually (~75 min in one case, indeterminate in another); mechanism unclear, possibly when the npm package is replaced (e.g. openclaw update)
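For anyone trying to confirm the wedge from outside the gateway, the growing pending_update_count can be read from the Telegram Bot API's getWebhookInfo method. A minimal sketch, assuming Node 18+ (global fetch); the helper name pendingUpdateCount and the injectable fetchImpl parameter are hypothetical and not part of openclaw:

```javascript
// Sketch: read pending_update_count via the Bot API's getWebhookInfo.
// `fetchImpl` is injectable only so the helper can be exercised offline.
async function pendingUpdateCount(token, fetchImpl = fetch) {
  const url = `https://api.telegram.org/bot${token}/getWebhookInfo`;
  const res = await fetchImpl(url);
  const body = await res.json();
  if (!body.ok) {
    throw new Error(`getWebhookInfo failed: ${body.description}`);
  }
  // A value that keeps growing while the gateway reports "connected"
  // matches the symptom described above.
  return body.result.pending_update_count;
}

// Example (needs a real bot token; the env var name is an assumption):
// pendingUpdateCount(process.env.TELEGRAM_BOT_TOKEN).then(console.log);
```

Sampling this a few minutes apart while the channel status still reports healthy should show whether updates are piling up on Telegram's side.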
Bug 1 — masked stall detection
File: dist/monitor-polling.runtime-DjS2STzm.js (5.4) / monitor-polling.runtime-DBv9gGnS.js (5.5)
Line 84:
if (elapsed <= params.thresholdMs || apiElapsed <= params.thresholdMs) return null;
apiElapsed is updated by noteApiCallSuccess() on ANY successful API call (including outbound sendMessage). Result: stall-detection is suppressed during normal outbound activity, even when getUpdates has hung indefinitely. Should likely be && or just if (elapsed <= params.thresholdMs) return null; — polling-elapsed alone determines the polling stall.
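To make the masking concrete, here is a small standalone sketch of the two conditions. Only the condition itself and params.thresholdMs come from the bundled file; the function names, the timestamp fields (now, lastGetUpdatesAt, lastApiSuccessAt) and the 120 s threshold are assumptions for illustration, not the actual openclaw internals:

```javascript
// Sketch only: field names other than thresholdMs are hypothetical.
function detectStallBuggy(params) {
  const elapsed = params.now - params.lastGetUpdatesAt;
  const apiElapsed = params.now - params.lastApiSuccessAt;
  // Reported condition: a recent success of ANY API call suppresses detection.
  if (elapsed <= params.thresholdMs || apiElapsed <= params.thresholdMs) return null;
  return { elapsedMs: elapsed };
}

function detectStallFixed(params) {
  const elapsed = params.now - params.lastGetUpdatesAt;
  // Only the time since the last completed getUpdates decides a polling stall.
  if (elapsed <= params.thresholdMs) return null;
  return { elapsedMs: elapsed };
}

// getUpdates last completed 10 minutes ago; sendMessage succeeded 5 s ago.
const sample = {
  now: 600_000,
  lastGetUpdatesAt: 0,
  lastApiSuccessAt: 595_000,
  thresholdMs: 120_000, // illustrative value
};

console.log(detectStallBuggy(sample)); // null (stall masked by outbound traffic)
console.log(detectStallFixed(sample)); // { elapsedMs: 600000 }
```

With the || form, any bot that sends a message at least once per threshold window can never trip the detector, which matches the hours-long silence seen here.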
Bug 2 — transport-rebuild silent failure
When stall IS detected (e.g. before any outbound activity occurs), the recovery sequence logs:
[telegram] Polling stall detected (no completed getUpdates for 149.99s); forcing restart.
[telegram] Polling runner stop timed out after 15s; forcing restart cycle.
[telegram][diag] polling cycle finished reason=polling stall detected
[telegram] Telegram polling runner stopped (...); restarting in 2.22s.
[telegram][diag] rebuilding transport for next polling cycle
…then silence. No new polling cycle starts, no error logged. #runPollingCycle() either never re-enters or hangs in a state that doesn't surface diagnostics.
Cost / impact
Inbound is completely down for 1–3 hours per occurrence. Two occurrences on a single day (2026-05-06).
Trigger
Both occurrences followed an external disruption (network blip from Docker WSL toggle reset; auth-profile failure from Anthropic billing exhaustion). The disruption is recoverable in itself; the polling-restart code path doesn't survive it.
Workaround
Wait for self-recovery, or openclaw update --tag <new-version> to replace the npm package and force fresh JS file load.
Suggested fix
- Drop the apiElapsed check in detectStall, or use &&, so stall-detection isn't masked by outbound activity.
- Add error/timeout handling in the transport-rebuild path so silent failures surface as logs.
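For the second item, one possible shape is to bound the rebuild step with a timeout and log both outcomes, so the sequence can no longer end in silence. This is a sketch under assumptions: withTimeout, rebuildWithDiagnostics, the rebuildTransport callback and the log messages are all hypothetical, since I don't know how the real rebuild is structured inside #runPollingCycle():

```javascript
// Hypothetical helper: rejects if `promise` has not settled within `ms`.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical wrapper: `rebuildTransport` stands in for whatever the
// polling runner does after "rebuilding transport for next polling cycle".
async function rebuildWithDiagnostics(rebuildTransport, log, timeoutMs) {
  try {
    const transport = await withTimeout(
      rebuildTransport(),
      timeoutMs,
      "transport rebuild"
    );
    log("transport rebuilt; starting next polling cycle");
    return transport;
  } catch (err) {
    // Surface the failure instead of hanging or swallowing it.
    log(`transport rebuild failed: ${err.message}`);
    throw err;
  }
}
```

The caller would still need to decide what to do on rejection (retry with backoff, or mark the channel unhealthy), but at least the failure would show up in the logs and in channels status.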
Versions affected
openclaw@2026.5.4
openclaw@2026.5.5
Environment
- Node v24.13.0 (nvm), Ubuntu (WSL2 on Windows 11)
- Gateway managed by systemd-user