
[Bug] Telegram polling silently dies for 30+ min with no error and no auto-recovery; pollingStallThresholdMs watchdog suppressed by unrelated API activity #78422

@joeywrightphoto

Description


Summary

The Telegram channel's getUpdates long-poll silently stopped delivering inbound updates for 39 minutes on a healthy gateway. No errors, no warnings, no recovery — the polling watchdog (pollingStallThresholdMs, default 120000ms) never fired. A manual openclaw gateway restart recovered it instantly.

The behavior is reproducible across two unrelated hosts (macOS arm64 + Windows 11 x64) running the same OpenClaw build. Reading dist/monitor-polling.runtime-DjS2STzm.js shows that the watchdog's stall-detect logic uses an || that lets unrelated API activity suppress stall detection even when getUpdates is dead.

Environment

  • OpenClaw: 2026.5.4 (325df3e) (also seen on 2026.5.3-1 before upgrade)
  • OS A: Windows 11 x64, Node v22.22.2, gateway started via Task Scheduler gateway.cmd
  • OS B: macOS Darwin 25.4.0 arm64, Node v25.9.0
  • Mode: polling (no webhook configured)
  • Network: US home broadband, no proxy, api.telegram.org directly reachable
  • Account config (excerpt):
    {
      "enabled": true,
      "dmPolicy": "pairing",
      "groupPolicy": "allowlist",
      "streaming": { "mode": "off", "preview": { "toolProgress": false } },
      "timeoutSeconds": 180,
      "retry": { "attempts": 5, "minDelayMs": 500, "maxDelayMs": 30000, "jitter": 0.2 }
      // pollingStallThresholdMs not overridden — using default 120000
    }

Incident timeline (Windows host, all times EDT)

Time               Evidence
01:02:42           Last Telegram-channel embedded run start event recorded by the gateway.
01:02:42 → 01:45   39 min of silence. Gateway otherwise healthy: probe ok, port listening, crons firing on schedule, heartbeats normal. No errors, no crashes, no warnings. Watchdog never logged a stall. Webhook heartbeats showed webhooks=0/0/0; compliance.jsonl has zero inbound from the sender during this window.
01:41              User sends a Telegram DM from a paired account. The message never reaches the gateway; the Telegram client shows it as delivered to the bot.
01:45              openclaw gateway restart issued.
01:45 → now        Polling resumes immediately. Inbound updates start flowing again, including the previously-missed 01:41 message via offset replay.

The same pattern was observed on the macOS host on a separate occasion the same week.

Expected vs actual

Expected: With pollingStallThresholdMs=120000 (2 min default), if getUpdates stops completing for >2 min, the polling watchdog restarts the polling runner and recovers.

Actual: Polling stayed wedged for 39 minutes (~19× the threshold). No "Polling stall detected" log. No transport rebuild. No restart cycle.

Root cause hypothesis: || in detectStall allows unrelated API activity to suppress watchdog

In extensions/telegram/src/polling-liveness.ts (bundled at dist/monitor-polling.runtime-DjS2STzm.js lines 78–87):

detectStall(params) {
    const now = params.now ?? this.#now();
    // staleness of an in-flight getUpdates request:
    const activeElapsed = this.#inFlightGetUpdates > 0 && this.#lastGetUpdatesStartedAt != null
        ? now - this.#lastGetUpdatesStartedAt : 0;
    // staleness since the last completed getUpdates request:
    const idleElapsed = this.#inFlightGetUpdates > 0
        ? 0
        : now - (this.#lastGetUpdatesFinishedAt ?? this.#lastGetUpdatesAt);
    const elapsed = this.#inFlightGetUpdates > 0 ? activeElapsed : idleElapsed;
    // staleness of ANY Bot API activity, getUpdates or not:
    const apiElapsed = now - (this.#latestInFlightApiStartedAt == null
        ? this.#lastApiActivityAt
        : Math.max(this.#lastApiActivityAt, this.#latestInFlightApiStartedAt));
    // BUG: the || lets a recent unrelated API call suppress the stall
    if (elapsed <= params.thresholdMs || apiElapsed <= params.thresholdMs) return null;
    ...
}

Two issues compound here:

1. || instead of &&

The stall is suppressed if either getUpdates-elapsed or any-API-elapsed is below threshold. So any non-getUpdates API call (e.g. an outbound sendMessage from a cron, or a getMe health check) within the last 2 min keeps apiElapsed low and silences the watchdog even when getUpdates is fully dead.

This contradicts the watchdog's documented purpose: detect getUpdates liveness death and restart polling. Outbound-sender health does not imply inbound is working — they're independent code paths from Telegram's perspective.
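To make the suppression concrete, here is the gate evaluated with illustrative numbers from the incident window (the 60 s cron cadence below is an assumption, not a measured value):

// Illustrative numbers only (all in ms); the 60 s cron cadence is assumed.
const thresholdMs = 120_000;
const elapsed = 39 * 60_000;  // getUpdates last made progress 39 min ago
const apiElapsed = 60_000;    // an unrelated sendMessage/getMe ran 1 min ago

// Shipped gate: apiElapsed <= thresholdMs is true, so the || short-circuits
// to "no stall" and detectStall returns null. The dead poll goes unreported.
console.log(elapsed <= thresholdMs || apiElapsed <= thresholdMs); // true → suppressed

// Proposed gate: elapsed <= thresholdMs is false, so the && fails and the
// watchdog fires on getUpdates staleness alone.
console.log(elapsed <= thresholdMs && apiElapsed <= thresholdMs); // false → stall fires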

2. noteGetUpdatesError and noteGetUpdatesSuccess both bump lastApiActivityAt

noteGetUpdatesSuccess(result, at) { ...; this.#lastApiActivityAt = at; ... }
noteGetUpdatesError(err, at)      { ...; this.#lastApiActivityAt = at; ... }

If getUpdates is in a tight retry-error loop (server returning 502s, transient socket close), every error bumps lastApiActivityAt, again silencing the watchdog. The watchdog should treat successful API activity as the only liveness signal, not error events.
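A minimal sketch of the proposed rule, assuming the method shapes quoted above (the class scaffolding here is illustrative, not OpenClaw source):

// Sketch only: field names mirror the dist excerpt; everything else is assumed.
class LivenessClock {
  #lastApiActivityAt = 0;
  #lastGetUpdatesFinishedAt = 0;

  noteGetUpdatesSuccess(_result: unknown, at: number) {
    this.#lastGetUpdatesFinishedAt = at;
    this.#lastApiActivityAt = at; // a completed poll is genuine liveness
  }

  noteGetUpdatesError(_err: unknown, at: number) {
    this.#lastGetUpdatesFinishedAt = at;
    // Deliberately no #lastApiActivityAt bump: in a tight 502/socket-close
    // retry loop, error events would otherwise re-arm the suppression
    // window every few seconds and keep the watchdog silent indefinitely.
  }
}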

Why "no logs, no errors" in the incident

If getUpdates was either:

  • (a) Hung in-flight on a TCP zombie socket: inFlightGetUpdates>0 but the request never returns, and lastApiActivityAt is whatever the last completed activity was. After 2 min, apiElapsed > thresholdMs, so the stall should fire — unless any other internal bot.api.* call (including grammY's own internal getMe / setMyCommands / etc.) was made within the window, suppressing the check; or
  • (b) Returning empty arrays via grammY's own internal long-poll loop after the runner's internal task got into a bad state where the per-cycle bot.api.config.use middleware no longer sees the calls. Then lastApiActivityAt stays at whatever it was when the bot was last "alive", and again any other API touch hides the stall.

Both are consistent with the observed evidence (no error logs, no recovery, immediate fix on restart).

Suggested fix

  1. Change || to && on the stall-detect gate, so the watchdog fires whenever getUpdates itself is stale, regardless of unrelated API activity:

    if (elapsed <= params.thresholdMs && apiElapsed <= params.thresholdMs) return null;

    Or, preferably, drop apiElapsed from the gate entirely: inbound-update liveness is the only thing the watchdog is supposed to protect, and outbound API health is orthogonal. (A fuller sketch of this variant follows this list.)

  2. Stop bumping lastApiActivityAt from noteGetUpdatesError. Errors are not liveness; they are signals the runner is failing. Otherwise a tight error-retry loop perpetually suppresses the watchdog.

  3. Add a 30-minute "last-resort" tier independent of the per-cycle watchdog: if lastTransportActivityAt is older than TELEGRAM_POLLING_STALE_TRANSPORT_MS (30 min, the same threshold collectTelegramPollingRuntimeIssues already uses for status-issue surfacing), force a session-level restart via a setMyCommands-style health probe plus runner.stop(). Nothing acts on the 30-min staleness today; it only surfaces in openclaw channels status, so the status-issue threshold and the watchdog disagree by 15×.

  4. Emit a structured warning the first time the watchdog detects a stall, even if it ends up being a false alarm. Right now stall events are logged via opts.log to whatever stdout/stderr the gateway has; under Task Scheduler on Windows that output is discarded by default, which made this incident considerably harder to debug.
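A minimal sketch of the preferred variant from item 1, dropping apiElapsed entirely (field names mirror the dist excerpt; the snapshot type and return shape are assumptions, not OpenClaw source):

// Sketch only: gate on getUpdates staleness alone, so unrelated
// sendMessage/getMe traffic can no longer mask a dead inbound poll.
interface LivenessSnapshot {
  inFlightGetUpdates: number;
  lastGetUpdatesStartedAt: number | null;
  lastGetUpdatesFinishedAt: number | null;
  lastGetUpdatesAt: number;
}

function detectGetUpdatesStall(
  s: LivenessSnapshot,
  thresholdMs: number,
  now = Date.now(),
): { elapsedMs: number } | null {
  const elapsed = s.inFlightGetUpdates > 0
    ? now - (s.lastGetUpdatesStartedAt ?? now)                  // hung in-flight poll
    : now - (s.lastGetUpdatesFinishedAt ?? s.lastGetUpdatesAt); // idle gap between polls
  return elapsed > thresholdMs ? { elapsedMs: elapsed } : null;
}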

Repro outline

I don't have a deterministic in-vitro repro for the underlying network stall yet (it's a transient zombie-socket condition that surfaced naturally over multiple weeks). I can repro the watchdog suppression trivially by:

  1. Start gateway with Telegram polling enabled.
  2. While polling is running, drop the upstream TCP connection mid-getUpdates (e.g. block port 443 to api.telegram.org, or use pfctl/netsh to drop SYN-ACK).
  3. Concurrently fire any non-getUpdates API call every 60s, e.g. a cron that does bot.api.getMe() or a sendMessage to a chat the network reaches. (You can simulate this with a separate process holding the same token, as sketched after this list, but in practice gateway-internal traffic alone will trigger it in many deployments.)
  4. Observe: the inbound poll is dead but pollingStallThresholdMs never fires, no transport rebuild, no [telegram] Polling stall detected (...) log.
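A hypothetical standalone suppressor for step 3, using grammY with the same bot token (the BOT_TOKEN env var name is an assumption):

// Any successful non-getUpdates call keeps apiElapsed under the 2 min threshold.
import { Bot } from "grammy";

const bot = new Bot(process.env.BOT_TOKEN!);

setInterval(() => {
  bot.api.getMe()
    .then((me) => console.log("getMe ok:", me.username))
    .catch((err) => console.error("getMe failed:", err));
}, 60_000);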

A cleaner unit-test repro: in polling-liveness.test.ts, set inFlightGetUpdates=1, lastGetUpdatesStartedAt=now-180000, but call noteApiCallStarted() at now-30000. detectStall({ thresholdMs: 120000 }) returns null → bug. Should return a stall.
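A sketch of that test with Vitest-style assertions; PollingLiveness and noteGetUpdatesStarted are assumed names, while detectStall and noteApiCallStarted are the methods referenced above:

// Sketch for polling-liveness.test.ts; the export name is hypothetical.
import { describe, expect, it } from "vitest";
import { PollingLiveness } from "../src/polling-liveness";

describe("detectStall", () => {
  it("reports a stall on a hung getUpdates despite recent unrelated API activity", () => {
    const now = 1_000_000;
    const liveness = new PollingLiveness();

    liveness.noteGetUpdatesStarted(now - 180_000); // poll in flight for 3 min
    liveness.noteApiCallStarted(now - 30_000);     // unrelated API call 30 s ago

    // Today this returns null (the bug). After the fix it must report a
    // stall, because getUpdates alone is past the 120 s threshold.
    expect(liveness.detectStall({ thresholdMs: 120_000, now })).not.toBeNull();
  });
});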

Cross-references / not duplicates

What I'm asking for

  • Confirm or refute the || analysis in polling-liveness.ts
  • Decide whether the watchdog should care about apiElapsed at all; my read is no
  • Add a [telegram][diag] log line on stall detection even if no restart fires (currently [telegram][diag] polling cycle finished/error reason=... only logs on cycle exit)

Happy to test patches against the affected hosts.
