Telegram polling silently wedges after stall: transport rebuild never starts new polling cycle (5.4 + 5.5) #78473

Two related bugs in `dist/monitor-polling.runtime-*.js` reproduced in 2026.5.4 and 2026.5.5.

## Symptom

- Gateway running, telegram channel reports `running, connected, mode:polling, works` via `openclaw channels status --probe`
- Zero TCP connections from the gateway PID to 149.154.x or 91.108.x (Telegram backbone)
- `pending_update_count > 0` on the Telegram side, growing over time
- No `getUpdates` / `polling` log entries for hours
- Outbound `sendMessage` works fine (state-drift: gateway reports healthy while inbound is dead)
- Multiple gateway restarts (`systemctl --user restart openclaw-gateway`) re-enter the same wedged state
- Eventual self-recovery (~75 min in one case, indeterminate in another); the mechanism is unclear, possibly triggered when the npm package is replaced (e.g. `openclaw update`)
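
To confirm the growing backlog independently of the gateway's own status report, the Telegram Bot API method `getWebhookInfo` can be queried directly; it returns `pending_update_count` even when the bot uses long polling. The following is a minimal sketch, not part of openclaw: the function name `getPendingUpdateCount` and the injectable `fetchImpl` parameter are assumptions made for illustration.

```javascript
// Sketch: read the backlog Telegram is holding for the bot.
// `getPendingUpdateCount` and `fetchImpl` are hypothetical names; `fetchImpl`
// is injectable so the function can be exercised without network access.
async function getPendingUpdateCount(botToken, fetchImpl = fetch) {
  const res = await fetchImpl(
    `https://api.telegram.org/bot${botToken}/getWebhookInfo`
  );
  const body = await res.json();
  if (!body.ok) {
    throw new Error(`getWebhookInfo failed: ${body.description ?? "unknown error"}`);
  }
  return body.result.pending_update_count;
}
```

Calling this a few minutes apart and seeing the count rise while the gateway still reports `connected` matches the wedged state described above.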

## Bug 1 — masked stall detection

File: `dist/monitor-polling.runtime-DjS2STzm.js` (5.4) / `monitor-polling.runtime-DBv9gGnS.js` (5.5)

Line 84:

```js
if (elapsed <= params.thresholdMs || apiElapsed <= params.thresholdMs) return null;
```

`apiElapsed` is updated by `noteApiCallSuccess()` on any successful API call, including outbound `sendMessage`. Because the condition uses `||`, the check returns `null` (no stall) whenever either timer is under the threshold, so normal outbound activity suppresses stall detection even when `getUpdates` has hung indefinitely. The condition should likely use `&&`, or simply be `if (elapsed <= params.thresholdMs) return null;`, since the time since the last completed poll alone determines whether polling has stalled.
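
The masking can be reproduced in isolation. The sketch below is a hypothetical reduction of the check, not the bundled code: the function names, the parameter shape, the timestamps, and the 90-second threshold are all assumptions chosen for illustration.

```javascript
// Hypothetical reduction of the stall check. Variable names follow the issue
// text; the real function body in the bundle may differ.
function detectStallCurrent({ now, lastPollCompletedAt, lastApiSuccessAt, thresholdMs }) {
  const elapsed = now - lastPollCompletedAt;
  const apiElapsed = now - lastApiSuccessAt;
  // Either timer under the threshold suppresses the stall report.
  if (elapsed <= thresholdMs || apiElapsed <= thresholdMs) return null;
  return { elapsedMs: elapsed };
}

function detectStallProposed({ now, lastPollCompletedAt, thresholdMs }) {
  const elapsed = now - lastPollCompletedAt;
  // Only the time since the last completed getUpdates is considered.
  if (elapsed <= thresholdMs) return null;
  return { elapsedMs: elapsed };
}

// getUpdates last completed 10 minutes ago; sendMessage succeeded 5 seconds ago.
const now = 1_000_000;
const state = {
  now,
  lastPollCompletedAt: now - 600_000,
  lastApiSuccessAt: now - 5_000,
  thresholdMs: 90_000, // arbitrary value for the example
};

console.log(detectStallCurrent(state));  // null
console.log(detectStallProposed(state)); // { elapsedMs: 600000 }
```

With a recent outbound success, the current form reports no stall even though polling has been dead for ten minutes; the proposed form reports it.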

## Bug 2 — transport-rebuild silent failure

When stall IS detected (e.g. before any outbound activity occurs), the recovery sequence logs:

```
[telegram] Polling stall detected (no completed getUpdates for 149.99s); forcing restart.
[telegram] Polling runner stop timed out after 15s; forcing restart cycle.
[telegram][diag] polling cycle finished reason=polling stall detected
[telegram] Telegram polling runner stopped (...); restarting in 2.22s.
[telegram][diag] rebuilding transport for next polling cycle
```

…then silence. No new polling cycle starts, no error logged. `#runPollingCycle()` either never re-enters or hangs in a state that doesn't surface diagnostics.

## Cost / impact

Inbound messages are completely down for 1–3 hours per occurrence. Two occurrences in a single day (2026-05-06).

## Trigger

Both occurrences followed an external disruption (network blip from Docker WSL toggle reset; auth-profile failure from Anthropic billing exhaustion). The disruption is recoverable in itself; the polling-restart code path doesn't survive it.

## Workaround

Wait for self-recovery, or run `openclaw update --tag <new-version>` to replace the npm package and force a fresh load of the JS files.

## Suggested fix

1. Drop the `apiElapsed` check in `detectStall` — or use `&&` — so stall-detection isn't masked by outbound activity.
2. Add error/timeout handling in the transport-rebuild path so silent failures surface as logs.
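
For the second item, one possible shape is a timeout wrapper around the rebuild step so that a hang or rejection always produces a log line. This is a sketch under assumptions: `withTimeout`, `rebuildTransportOrReport`, the `rebuildTransport` and `log` parameters, the log messages, and the 30-second default are hypothetical and do not reflect the actual runtime internals.

```javascript
// Hypothetical helper: reject if `promise` does not settle within `ms`.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// `rebuildTransport` and `log` stand in for whatever the runtime actually uses;
// the messages below are illustrative, not existing openclaw log lines.
async function rebuildTransportOrReport(rebuildTransport, log, timeoutMs = 30_000) {
  try {
    const transport = await withTimeout(
      rebuildTransport(),
      timeoutMs,
      "transport rebuild"
    );
    log("transport rebuilt");
    return transport;
  } catch (err) {
    // Surface the failure instead of letting the cycle go quiet.
    log(`transport rebuild failed: ${err.message}`);
    throw err;
  }
}
```

Rethrowing after logging leaves the decision to retry or abort with the caller, while guaranteeing that the "then silence" case described under Bug 2 leaves a trace.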

## Versions affected

- `openclaw@2026.5.4`
- `openclaw@2026.5.5`

## Environment

- Node v24.13.0 (nvm), Ubuntu (WSL2 on Windows 11)
- Gateway managed by systemd-user
