Telegram cascade on v2026.5.7: polling stall + outbound HTTP agent never rebuilt; multi-account start-account causes event-loop starvation

## Summary

On `v2026.5.7` (Windows 11, 8 Telegram bot accounts) the gateway entered a Telegram delivery cascade with two distinct failure modes back-to-back:

1. **Pre-restart:** 148s `getUpdates` polling stall, followed by ~50 consecutive `sendChatAction` "Network request failed" errors firing every ~3s with no backoff. **Network was healthy throughout** — DNS, TCP, TLS to `api.telegram.org` all OK from the same host at the same time. The polling-cycle's internal transport rebuild fired (`closing stale transport before rebuild` → `rebuilding transport for next polling cycle`) but did **not** repair the `sendChatAction` HTTP agent — outbound failures continued unabated until a full gateway restart.
2. **Post-restart:** During `channels.telegram.start-account` (8 bots starting in parallel) the gateway hit event-loop starvation (`eventLoopDelayMaxMs=8384.4ms`, `timer delayed 5604ms, likely event-loop starvation`), causing multiple `getMe` calls to time out at 10–15s.

This appears to be a combination of, or interaction between, several already-tracked issues:

- #50040 — polling stalls causing silent outbound loss (still OPEN)
- #56096 — `sendChatAction` infinite retry loop with no backoff (still OPEN)
- #80362 — `Network request for 'X' failed!` regex too strict; drops outbound (still OPEN)
- Closed but symptomatically identical: #76164, #76172, #76258

The new signal here is that **both halves reproduce cleanly on v2026.5.7**: the closed `.4.24/.4.25` issues were not fully resolved as of `.5.7`, and the logs show that the inner polling rebuild does not heal the outbound HTTP agent.

## Environment

- OpenClaw: `webchat v2026.5.7` (per `[ws]` handshake banner)
- OS: Windows 11 Pro N (10.0.26200)
- Node: bundled
- Plugins active: `browser, device-pair, file-transfer, google, memory-core, microsoft, phone-control, talk-voice, telegram`
- Telegram accounts: 8 (`alex`, `tony`, `alex-demo`, `cllawway`, `alan`, `donald-clawmp`, `clawd`, `daniel-clawstley`)
- IPv4-only forced via `--require force-ipv4.js` (NODE_OPTIONS)

## Pre-restart cascade (excerpt)

```
11:30:22.669 [telegram] Polling stall detected (no completed getUpdates for 148.89s); forcing restart.
              [diag inFlight=0 outcome=ok durationMs=7074 offset=0 apiElapsedMs=2487]
11:30:22.692 [telegram] sendChatAction failed: Network request for 'sendChatAction' failed!
11:30:25.710 [telegram] sendChatAction failed: Network request for 'sendChatAction' failed!
11:30:28.693 [telegram] sendChatAction failed: Network request for 'sendChatAction' failed!
...  (50+ consecutive failures at ~3s cadence)
11:30:37.676 [telegram] Polling runner stop timed out after 15s; forcing restart cycle.
11:30:54.756 [telegram] [diag] closing stale transport before rebuild
11:30:54.759 [telegram] [diag] rebuilding transport for next polling cycle
11:30:55.715 [telegram] sendChatAction failed: Network request for 'sendChatAction' failed!  ← rebuild did NOT fix outbound
... (failures continue past rebuild, only fixed by full gateway restart)
```

### Network was healthy during the cascade

Run from the same host while the failures were still firing:

```
Test-NetConnection api.telegram.org:443 → True
Resolve-DnsName    api.telegram.org      → 149.154.166.110 (TTL 249)
curl https://api.telegram.org/           → http=302 dns=0.008s connect=0.115s total=0.398s
```

So the failures were not network-level. This strongly suggests a stale persistent HTTP keep-alive socket inside the `sendChatAction` HTTP agent that the polling-rebuild path doesn't touch.
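
If that hypothesis holds, the logs would be consistent with polling and outbound sends each owning their own transport, so that rebuilding one leaves the other stale. Below is a minimal sketch of the alternative, where both directions fetch the transport from one holder on every request. `Transport`, `SharedTransport`, and `createFakeTransport` are hypothetical names for illustration only, not OpenClaw or grammy APIs; in a real gateway the transport would presumably wrap something like a Node `https.Agent` or an Undici `Agent`.

```typescript
// Minimal shape a transport needs for this sketch (hypothetical, not an OpenClaw type).
interface Transport {
  readonly id: number;
  closed: boolean;
  close(): void;
}

// Owns the single live transport. Callers ask for it per request instead of
// caching it, so a rebuild is visible to polling and outbound sends alike.
class SharedTransport<T extends Transport> {
  private current: T;
  private readonly create: () => T;

  constructor(create: () => T) {
    this.create = create;
    this.current = create();
  }

  get(): T {
    return this.current;
  }

  // Swap in a fresh transport first, then close the stale one.
  rebuild(): T {
    const stale = this.current;
    this.current = this.create();
    stale.close();
    return this.current;
  }
}

// Fake transport, for illustration only.
let nextId = 1;
function createFakeTransport(): Transport {
  return {
    id: nextId++,
    closed: false,
    close() {
      this.closed = true;
    },
  };
}

const shared = new SharedTransport(createFakeTransport);
const beforeRebuild = shared.get();
shared.rebuild(); // e.g. triggered by the polling stall detector
const afterRebuild = shared.get(); // what the next sendChatAction would pick up

// prints: true false true
console.log(beforeRebuild.closed, afterRebuild.closed, beforeRebuild.id !== afterRebuild.id);
```

The point of the design is that nothing holds a long-lived reference to a transport: any code path that cached `beforeRebuild` would keep using the closed instance, which matches the symptom in the log excerpt.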

## Post-restart event-loop starvation

```
11:33:47.768 [gateway] resolving authentication…
11:35:06.071 [gateway] http server listening (9 plugins; 78.2s)
11:35:16.715 [gateway] startup model warmup timed out after 5000ms; continuing without waiting
11:35:19.942 [telegram] [alex] starting provider (@alexclawdmozibot)
11:35:20.280 [telegram] [tony] starting provider (@TonyClawbinsBot)
11:35:20.283 [telegram] [alex-demo] starting provider (@alexclawdmozidemobot)
11:35:20.285 [telegram] [cllawway] starting provider (@cllawawaybot)
11:35:20.288 [telegram] [alan] starting provider (@alanclawttsbot)
11:35:20.291 [telegram] [donald-clawmp] starting provider (@donaldclawmpbot)
11:35:20.294 [telegram] [clawd] starting provider (@clawdminsterbot)
11:35:20.297 [telegram] [daniel-clawstley] starting provider (@danielclawdslibot)

11:38:10.220 [diagnostic] liveness warning: reasons=event_loop_delay
              eventLoopDelayP99Ms=23.8 eventLoopDelayMaxMs=8384.4
              eventLoopUtilization=0.672 cpuCoreRatio=0.639
              phase=channels.telegram.start-account
11:38:31.430 [fetch-timeout] fetch timeout after 10000ms (elapsed 10682ms) operation=fetchWithTimeout
              url=https://api.telegram.org/bot823795…/getMe
11:38:47.038 [fetch-timeout] fetch timeout after 10000ms (elapsed 15604ms) timer delayed 5604ms,
              likely event-loop starvation operation=fetchWithTimeout
              url=https://api.telegram.org/bot818818…/getMe
11:39:09.345 [fetch-timeout] fetch timeout after 10000ms (elapsed 10133ms) operation=fetchWithTimeout
              url=https://api.telegram.org/bot858735…/getMe
```

Eight bots starting in parallel, plus first-call agent-model resolution, session-locks recovery, and model warmup, all landed in the same startup window. The 5.6s timer delay (`elapsed 15604ms` against a 10000ms timeout) indicates the loop was actually blocked, not just slow.
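
Bunching of this kind is what a small concurrency cap is meant to address. The following is a minimal sketch, not the gateway's actual startup code: `runWithLimit` and `startAccountStub` are hypothetical names, and the stub only simulates work with a 10 ms timer while recording how many starts are in flight.

```typescript
// Hypothetical helper: run async tasks with at most `limit` in flight.
// Results keep the order of `tasks`. If a task rejects, the returned promise
// rejects, while tasks that already started are left to finish on their own.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: T[] = new Array(tasks.length);
  let next = 0;

  // Each worker pulls the next unclaimed task until none remain.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      const task = tasks[index];
      if (task) {
        results[index] = await task();
      }
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Stand-in for the real start-account work (getMe, warmup, ...); illustrative only.
let inFlight = 0;
let peakInFlight = 0;
async function startAccountStub(name: string): Promise<string> {
  inFlight++;
  peakInFlight = Math.max(peakInFlight, inFlight);
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  inFlight--;
  return name;
}

const accounts = [
  "alex", "tony", "alex-demo", "cllawway",
  "alan", "donald-clawmp", "clawd", "daniel-clawstley",
];
const started = runWithLimit(
  accounts.map((name) => () => startAccountStub(name)),
  3,
);
started.then((names) => {
  console.log(`started ${names.length} accounts, peak concurrency ${peakInFlight}`);
});
```

A cap like this only spreads the asynchronous work out. If a single account start does long synchronous work, the loop can still block, so the cap would complement, not replace, moving heavy steps off the hot path.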

## Reproduction

Run a gateway with ≥8 Telegram accounts on Windows. Leave it idle overnight, then send an inbound message. On most days delivery works normally; on a stall day the pre-restart cascade above appears. Restarting the gateway then reliably shows the post-restart starvation pattern during the `start-account` fanout.

## Proposed fixes (for triage)

1. **Outbound HTTP agent rebuild.** When polling-rebuild fires, also dispose the outbound (`sendChatAction`/`sendMessage`) agent — or share a single Undici/grammy transport instance so a single rebuild fixes both directions. The current code path leaves outbound permanently broken until full restart.
2. **Backoff on `sendChatAction` failures.** Per #56096 — the 3s/3s/3s repeat with no jitter or escalation is a hot loop.
3. **Bound multi-account start-account fanout.** Throttle to N=2–3 concurrent `start-account` calls instead of all 8 at once, or move the first `getMe` off the hot path so model warmup + agent prep don't compound.
4. **Recognize bare grammy error string** per #80362 so `Network request for 'sendChatAction' failed!` is classified recoverable and a retry-with-fresh-socket can fire instead of dropping the send.
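
Fixes 2 and 4 can be sketched together as a small retry policy. This is an illustration under stated assumptions, not the gateway's current code: the pattern `NETWORK_FAILURE`, the function names, and the 3s base / 60s cap are all hypothetical choices. The delay uses exponential growth with "equal jitter" (half fixed, half random) so that retries escalate and do not fire in lockstep.

```typescript
// Hypothetical pattern: accepts the bare message seen in the logs, with or
// without extra detail after the "failed!" marker.
const NETWORK_FAILURE = /Network request for '([^']+)' failed!/;

function isRecoverableNetworkError(message: string): boolean {
  return NETWORK_FAILURE.test(message);
}

// Exponential backoff with equal jitter. The ceiling doubles per attempt up to
// maxMs; the delay is half the ceiling plus a random share of the other half.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 3000,
  maxMs: number = 60000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  const half = ceiling / 2;
  return Math.floor(half + random() * half);
}

// Decide what to do after a failed send: a delay in ms, or null for "do not retry".
function nextRetryDelayMs(message: string, attempt: number): number | null {
  return isRecoverableNetworkError(message) ? backoffDelayMs(attempt) : null;
}

const sample = "Network request for 'sendChatAction' failed!";
for (let attempt = 0; attempt < 6; attempt++) {
  console.log(`attempt ${attempt}: retry in ${nextRetryDelayMs(sample, attempt)} ms`);
}
```

With these defaults the delay ceiling grows 3s, 6s, 12s, 24s, 48s and then holds at 60s, instead of the fixed ~3s cadence in the log excerpt. A real implementation would probably also cap the number of attempts for something as low-value as a typing indicator.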

## Related

- #50040 (open) — polling stalls + silent outbound loss
- #56096 (open) — sendChatAction retry loop, no backoff
- #80362 (open) — strict regex drops outbound on bare grammy failures
- #76164 / #76172 / #76258 (closed) — symptomatically identical for `.4.24/.4.25`; this report demonstrates the same patterns reproduce on `v2026.5.7`
- #79380 (open) — Pi 4 CPU spin regression `.4.23 → .4.25+` (may share start-account fanout root cause)

