[Bug]: Gateway runtime degradation: pricing fetch 60s timeouts, Telegram polling stalls, slow RPC — chronic across 4.23/4.25/4.26 on Windows 11 + Node 24 #73323

@maoruilun0411-del

Description

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

The long-running gateway Node process exhibits multi-subsystem network/timer degradation (model-pricing fetches hitting 60s timeouts, Telegram polling stalls of 127–266s, RPC slowdowns of 8–83s), reproducible across 2026.4.23, 2026.4.25, and 2026.4.26 on Windows 11 build 26100.8115 + Node 24.14.1. From a standalone Node process on the same machine, fetch() to the same endpoints completes in 100–800ms.

Steps to reproduce

  1. npm i -g openclaw@2026.4.26 --omit=optional
  2. openclaw doctor --fix (gateway auto-restarts; bundled deps installed cleanly)
  3. Configure the Telegram channel: `channels.telegram.enabled=true`, a valid botToken, `dmPolicy: allowlist`, `plugins.entries.telegram.enabled=true` (see the config sketch after these steps)
  4. openclaw gateway start → /health returns 200 within ~30s, log shows "ready (2 plugins: memory-core, telegram)"
  5. Wait 2–5 minutes
  6. The first `Polling stall detected` and `pricing fetch failed (timeout 60s)` log lines appear
  7. The cycle recurs every 2–3 minutes thereafter; getUpdates and sendMessage calls fail with a bare `Network request for '...' failed!`
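
For reference, step 3 corresponds to config entries along these lines (a sketch only — the key names come from this report, but the surrounding file layout and JSONC format are assumptions about openclaw's config file):

```jsonc
{
  "channels": {
    "telegram": {
      "enabled": true,
      "botToken": "<redacted>",      // valid token, sanitized here
      "dmPolicy": "allowlist"
    }
  },
  "plugins": {
    "entries": {
      "telegram": { "enabled": true }
    }
  }
}
```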

Expected behavior

Gateway RPC and outbound HTTP fetches complete in <1s consistently, matching the timing observed when the same Node 24.14.1 binary issues fetch() to the same endpoints from a standalone process on the same host:

  • fetch('https://api.telegram.org/bot<token>/getMe') → 106ms (IPv6-first), 116ms (IPv4-first)
  • PowerShell curl to api.telegram.org → 0.1s
  • PowerShell curl to openrouter.ai/api/v1/models → 0.8s

Telegram polling and sendMessage should run continuously without 100s+ stalls.
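
The standalone probe amounts to the following (a minimal sketch — the actual test was a `node -e "fetch(...)"` one-liner; the endpoint list, token placeholder, and output format here are illustrative):

```js
// probe.mjs — standalone fetch timing probe; run with: node probe.mjs
// '<token>' is a placeholder for the sanitized bot token.
const endpoints = [
  'https://api.telegram.org/bot<token>/getMe',
  'https://openrouter.ai/api/v1/models',
];

for (const url of endpoints) {
  const t0 = performance.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(60_000) });
    console.log(`${res.status} ${url} ${(performance.now() - t0).toFixed(0)}ms`);
  } catch (err) {
    console.log(`FAIL ${url} ${(performance.now() - t0).toFixed(0)}ms ${err}`);
  }
}
```

Running this with and without `NODE_OPTIONS=--dns-result-order=ipv4first` reproduces the IPv6-first / IPv4-first comparison above.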

Actual behavior

Multi-subsystem network/timer degradation observed simultaneously inside the long-running gateway process:

```
gateway/model-pricing      | OpenRouter pricing fetch failed (timeout 60s): TimeoutError
gateway/model-pricing      | LiteLLM pricing fetch failed (timeout 60s): TimeoutError
gateway/channels/telegram  | [telegram] Polling stall detected (active getUpdates stuck for 127.45s); forcing restart.
gateway/channels/telegram  | polling cycle finished reason=polling stall detected ... durationMs=127457 error=Network request for 'getUpdates' failed!
gateway/channels/telegram  | telegram sendMessage failed: Network request for 'sendMessage' failed!
gateway/channels/telegram  | telegram message processing failed: HttpError: Network request for 'sendMessage' failed!
gateway/ws                 | res ✓ models.list 55798ms      (normally <500ms)
gateway/ws                 | res ✓ models.list 83581ms
gateway/ws                 | res ✓ doctor.memory.status 35988ms
diagnostic                 | stuck session: state=processing age=282s queueDepth=1
```

In a single 1-hour observation window: 6 polling stalls, 4 sendMessage failures, 14 pricing-fetch 60s timeouts, plus multiple models.list / doctor.memory.status / node.list RPCs taking 8–83s where they normally finish in <500ms.

Direct probes — PowerShell `curl https://api.telegram.org/bot<token>/getMe` and a separate `node -e "fetch(...)"` against the SAME endpoints — succeed in 0.1–0.8s consistently throughout these gateway-internal stalls.

OpenClaw version

2026.4.26 (be8c246) — also reproduced on 2026.4.25 (aa36ee6) and 2026.4.23 (a979721)

Operating system

Windows 11 build 26100.8115

Install method

npm global (--omit=optional); Node v24.14.1; PowerShell 5.1

Model

xiaomi/mimo-v2.5-pro (primary); reproduces regardless of model — the pricing-fetch and Telegram getUpdates stalls are independent of LLM choice

Provider / routing chain

openclaw -> Telegram polling (bundled grammyjs runner) -> api.telegram.org; openclaw -> xiaomi (mimo via api.xiaomimimo.com); openclaw -> openrouter.ai/api/v1/models + LiteLLM public pricing JSON (gateway-internal hardcoded fetches)

Additional provider/model setup details

No proxy configured (no HTTP_PROXY / HTTPS_PROXY / ALL_PROXY env vars). UK home broadband, no VPN, no corporate firewall. Fallback chain: zai/glm-5.1, xiaomi/mimo-v2.5, minimax/MiniMax-M2.7. All providers reachable when probed from a standalone Node process; degradation is gateway-internal only.

Logs, screenshots, and evidence

**What does NOT explain it (each tested):**

| Hypothesis | Evidence against |
|---|---|
| Bot token / Telegram API issue | `curl https://api.telegram.org/bot<token>/getMe` returns ok=true in 0.1s, consistently |
| Public network slow | Standalone `node -e "fetch(...)"` hits api.telegram.org and openrouter.ai/api/v1/models in 100–800ms |
| IPv6 vs IPv4 | Both `--dns-result-order=ipv4first` and default IPv6-first succeed via standalone Node fetch in <120ms; DNS resolves both A and AAAA cleanly |
| Bundled plugin runtime deps missing | `openclaw doctor --fix` reports all deps installed |
| `fetchWithSsrFGuard` connection pool | Verified in dist/fetch-guard-C10MVwBt.js that the SSRF guard creates a per-call dispatcher and disposes of it on completion. The pricing code (dist/usage-format-ZhKID6__.js) uses raw fetch + AbortSignal.timeout(60000), not the SSRF wrapper, and still times out (pattern sketched below the table) |
| OS-level network state corruption | A full Windows reboot (cold boot to gateway start) reproduces the chronic degradation within ~30 minutes |
| 4.25 / 4.26 regression | Identical signatures on 2026.4.23 (a979721) before any 4.25/4.26 install |
| Node 24 specific | Same Node 24 binary fetches fine from a standalone process — only the long-running gateway process degrades |
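
For reference, the pricing-fetch pattern described in the table rows above is essentially the following (a paraphrased sketch from reading the bundled dist file — the function and variable names are invented, not the real identifiers):

```js
// Paraphrased sketch of the bundled pricing-fetch call pattern (names invented).
// It goes through the global fetch — and therefore the shared global undici
// dispatcher — rather than the SSRF wrapper's per-call dispatcher.
async function fetchPricingJson(url) {
  const res = await fetch(url, { signal: AbortSignal.timeout(60_000) }); // rejects with TimeoutError at 60s
  if (!res.ok) throw new Error(`pricing fetch failed: HTTP ${res.status}`);
  return res.json();
}
```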

**Process resource snapshot at degradation point (PID 16776):**
- working set 616 MB / private 811 MB
- 45 threads
- **3337 handles** (notably high)
- 25 min uptime

**Workaround attempts that did NOT help:**
- `openclaw doctor --fix` (3 cycles)
- `openclaw gateway restart` (10+ cycles)
- Hard kill (`Stop-Process -Force` on PID owning :18789 + tray) → clean restart
- Full Windows 11 reboot
- Downgrade 4.25 → 4.23 → back to 4.25 → 4.26
- `channels.telegram.pollTimeoutMs: 5000` (vs default 30000)
- Force IPv4 via `NODE_OPTIONS=--dns-result-order=ipv4first`
- Removed unused providers (arcee/openrouter)
- `openclaw sessions cleanup --enforce --fix-missing`

Logs are sanitized of bot tokens / API keys; happy to share unredacted logs privately.

Impact and severity

Affected: All gateway-internal outbound HTTP — Telegram polling/sendMessage, model-pricing fetch, in-process gateway RPC (models.list, doctor.memory.status, node.list).
Severity: High — Telegram bot replies are blocked or delayed 5+ minutes; gateway RPC is slow enough that openclaw-sweep tools fail or return partial results. The user has to fall back to the Tray UI / webchat for any reliable use.
Frequency: Always — the degradation recurs every 2–3 minutes once the gateway has been up >5 min, on this Windows 11 + Node 24.14.1 host, across 2026.4.23, 2026.4.25, and 2026.4.26.
Consequence: Telegram channel effectively unusable; missed/delayed messages; gateway needs constant restart; /health flickers between 200 and timeout.

Additional information

Hypotheses (ranked) for maintainers:

  1. Shared global undici dispatcher / Agent state degrades over time. Multiple subsystems (model-pricing, the Telegram grammyjs runner, doctor.memory.status) all use the shared global undici dispatcher and all start failing together. Keep-alive socket hand-off / reaping appears to break — getUpdates requests sit 127–266s past their AbortSignal timeout, suggesting the abort/timer layer is no longer firing as expected (isolation probe sketched after this list).
  2. The Telegram grammyjs polling runner's long-poll keep-alive sockets go stale, and the runner's stall detector only catches this after 127–197s. This plausibly correlates with the pricing-fetch / RPC slowdowns if all three subsystems share the same global dispatcher.
  3. Event-loop starvation during the channels-and-sidecars phase — models.list at 55–83s, node.list at 8.9s, and doctor.memory.status at 35s suggest a long-running synchronous task is blocking the loop, which would also explain pricing-fetch timers not firing (lag monitor sketched after this list).
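
Two small probes could help separate hypotheses 1 and 3 (both are sketches, not code from openclaw). For hypothesis 1, route one subsystem's fetches through a dedicated undici Agent — if those keep succeeding while the rest of the gateway stalls, the shared global dispatcher is implicated (assumes the undici package is installed; Node's built-in fetch accepts its non-standard `dispatcher` option):

```js
// Hypothesis 1 probe: isolate one class of requests onto its own dispatcher.
import { Agent } from 'undici';

const isolatedAgent = new Agent({ connections: 4, keepAliveTimeout: 10_000 });

const res = await fetch('https://openrouter.ai/api/v1/models', {
  dispatcher: isolatedAgent,           // bypasses the shared global dispatcher
  signal: AbortSignal.timeout(60_000),
});
console.log(res.status);
```

For hypothesis 3, an event-loop lag monitor using Node's built-in perf_hooks, run inside the gateway process — a sustained multi-second p99 during a stall window would point at loop starvation rather than socket state:

```js
// Hypothesis 3 probe: log event-loop delay percentiles every 10s.
// monitorEventLoopDelay reports nanoseconds; converted to ms for logging.
import { monitorEventLoopDelay } from 'node:perf_hooks';

const h = monitorEventLoopDelay({ resolution: 20 });
h.enable();
setInterval(() => {
  const ms = (ns) => (ns / 1e6).toFixed(1);
  console.log(`loop delay p50=${ms(h.percentile(50))}ms p99=${ms(h.percentile(99))}ms max=${ms(h.max)}ms`);
  h.reset();
}, 10_000);
```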

Note on Codex / parallel diagnostic: An independent agent (Codex) ran a parallel diagnostic on the same machine and concurs that the runtime degradation is process-internal, not network-side. openclaw-sweep runId 8924e8d6-d776-4ed5-94be-a87fd194372b available on request.

Last known good version: unknown — bug present in oldest version we could test (2026.4.23). Not a recent regression.

Happy to provide: full gateway log (C:\tmp\openclaw\openclaw-2026-04-2*.log), --inspect profile, OPENCLAW_DEBUG_INGRESS_TIMING=1 / OPENCLAW_DEBUG_HEALTH=1 traces, doctor --deep output (216s runtime; no actionable network-layer findings), or any other diagnostic that helps narrow which layer (undici Agent, grammyjs runner, gateway event loop, Windows-specific socket behavior) is degrading.
