[Bug]: Gateway runtime degradation: pricing fetch 60s timeouts, Telegram polling stalls, slow RPC — chronic across 4.23/4.25/4.26 on Windows 11 + Node 24 #73323

@maoruilun0411-del

Description

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

The long-running gateway Node process exhibits multi-subsystem network/timer degradation (model-pricing fetches hitting 60s timeouts, Telegram polling stalls of 127–266s, RPC slowdowns of 8–83s), reproducible across 2026.4.23, 2026.4.25, and 2026.4.26 on Windows 11 build 26100.8115 + Node 24.14.1. From a standalone Node process on the same machine, fetch() to the same endpoints completes in 100–800ms.

Steps to reproduce

  1. npm i -g openclaw@2026.4.26 --omit=optional
  2. openclaw doctor --fix (gateway auto-restarts; bundled deps installed cleanly)
  3. Configure the Telegram channel: `channels.telegram.enabled=true`, a valid botToken, `dmPolicy: allowlist`, `plugins.entries.telegram.enabled=true` (see the config sketch after these steps)
  4. openclaw gateway start → /health returns 200 within ~30s, log shows "ready (2 plugins: memory-core, telegram)"
  5. Wait 2–5 minutes
  6. The first `Polling stall detected` and `pricing fetch failed (timeout 60s)` log lines appear
  7. The cycle recurs every 2–3 minutes thereafter; getUpdates and sendMessage calls fail with a bare `Network request for '...' failed!`
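
For reference, step 3 corresponds to config entries along these lines (a sketch only — the key names come from this report, but the surrounding file layout and JSONC format are assumptions about openclaw's config file):

```jsonc
{
  "channels": {
    "telegram": {
      "enabled": true,
      "botToken": "<redacted>",      // valid token, sanitized here
      "dmPolicy": "allowlist"
    }
  },
  "plugins": {
    "entries": {
      "telegram": { "enabled": true }
    }
  }
}
```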

Expected behavior

Gateway RPC and outbound HTTP fetches complete in <1s consistently, matching the timing observed when the same Node 24.14.1 binary issues fetch() to the same endpoints from a standalone process on the same host:

  • fetch('https://api.telegram.org/bot<token>/getMe') → 106ms (IPv6-first), 116ms (IPv4-first)
  • PowerShell curl to api.telegram.org → 0.1s
  • PowerShell curl to openrouter.ai/api/v1/models → 0.8s

Telegram polling and sendMessage should run continuously without 100s+ stalls.
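
The standalone probe amounts to the following (a minimal sketch — the actual test was a `node -e "fetch(...)"` one-liner; the endpoint list, token placeholder, and output format here are illustrative):

```js
// probe.mjs — standalone fetch timing probe; run with: node probe.mjs
// '<token>' is a placeholder for the sanitized bot token.
const endpoints = [
  'https://api.telegram.org/bot<token>/getMe',
  'https://openrouter.ai/api/v1/models',
];

for (const url of endpoints) {
  const t0 = performance.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(60_000) });
    console.log(`${res.status} ${url} ${(performance.now() - t0).toFixed(0)}ms`);
  } catch (err) {
    console.log(`FAIL ${url} ${(performance.now() - t0).toFixed(0)}ms ${err}`);
  }
}
```

Running this with and without `NODE_OPTIONS=--dns-result-order=ipv4first` reproduces the IPv6-first / IPv4-first comparison above.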

Actual behavior

Multi-subsystem network/timer degradation observed simultaneously inside the long-running gateway process:

```
gateway/model-pricing      | OpenRouter pricing fetch failed (timeout 60s): TimeoutError
gateway/model-pricing      | LiteLLM pricing fetch failed (timeout 60s): TimeoutError
gateway/channels/telegram  | [telegram] Polling stall detected (active getUpdates stuck for 127.45s); forcing restart.
gateway/channels/telegram  | polling cycle finished reason=polling stall detected ... durationMs=127457 error=Network request for 'getUpdates' failed!
gateway/channels/telegram  | telegram sendMessage failed: Network request for 'sendMessage' failed!
gateway/channels/telegram  | telegram message processing failed: HttpError: Network request for 'sendMessage' failed!
gateway/ws                 | res ✓ models.list 55798ms      (normally <500ms)
gateway/ws                 | res ✓ models.list 83581ms
gateway/ws                 | res ✓ doctor.memory.status 35988ms
diagnostic                 | stuck session: state=processing age=282s queueDepth=1
```

In a single 1-hour observation window: 6 polling stalls, 4 sendMessage failures, 14 pricing-fetch 60s timeouts, plus multiple models.list / doctor.memory.status / node.list RPCs taking 8–83s where they normally finish in <500ms.

Direct probes — PowerShell `curl https://api.telegram.org/bot<token>/getMe` and a separate `node -e "fetch(...)"` against the SAME endpoints — succeed in 0.1–0.8s consistently throughout these gateway-internal stalls.

OpenClaw version

2026.4.26 (be8c246) — also reproduced on 2026.4.25 (aa36ee6) and 2026.4.23 (a979721)

Operating system

Windows 11 build 26100.8115

Install method

npm global (--omit=optional); Node v24.14.1; PowerShell 5.1

Model

xiaomi/mimo-v2.5-pro (primary); reproduces regardless of model — the pricing-fetch and Telegram getUpdates stalls are independent of LLM choice

Provider / routing chain

openclaw -> Telegram polling (bundled grammyjs runner) -> api.telegram.org; openclaw -> xiaomi (mimo via api.xiaomimimo.com); openclaw -> openrouter.ai/api/v1/models + LiteLLM public pricing JSON (gateway-internal hardcoded fetches)

Additional provider/model setup details

No proxy configured (no HTTP_PROXY / HTTPS_PROXY / ALL_PROXY env vars). UK home broadband, no VPN, no corporate firewall. Fallback chain: zai/glm-5.1, xiaomi/mimo-v2.5, minimax/MiniMax-M2.7. All providers reachable when probed from a standalone Node process; degradation is gateway-internal only.

Logs, screenshots, and evidence

**What does NOT explain it (each tested):**

| Hypothesis | Evidence against |
|---|---|
| Bot token / Telegram API issue | `curl https://api.telegram.org/bot<token>/getMe` returns ok=true in 0.1s, consistently |
| Public network slow | Standalone `node -e "fetch(...)"` hits api.telegram.org and openrouter.ai/api/v1/models in 100–800ms |
| IPv6 vs IPv4 | Both `--dns-result-order=ipv4first` and default IPv6-first succeed via standalone Node fetch in <120ms; DNS resolves both A and AAAA cleanly |
| Bundled plugin runtime deps missing | `openclaw doctor --fix` reports all deps installed |
| `fetchWithSsrFGuard` connection pool | Verified in dist/fetch-guard-C10MVwBt.js that the SSRF guard creates a per-call dispatcher and disposes of it on completion. The pricing code (dist/usage-format-ZhKID6__.js) uses raw fetch + AbortSignal.timeout(60000), not the SSRF wrapper, and still times out (pattern sketched below the table) |
| OS-level network state corruption | A full Windows reboot (cold boot to gateway start) reproduces the chronic degradation within ~30 minutes |
| 4.25 / 4.26 regression | Identical signatures on 2026.4.23 (a979721) before any 4.25/4.26 install |
| Node 24 specific | Same Node 24 binary fetches fine from a standalone process — only the long-running gateway process degrades |
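
For reference, the pricing-fetch pattern described in the table rows above is essentially the following (a paraphrased sketch from reading the bundled dist file — the function and variable names are invented, not the real identifiers):

```js
// Paraphrased sketch of the bundled pricing-fetch call pattern (names invented).
// It goes through the global fetch — and therefore the shared global undici
// dispatcher — rather than the SSRF wrapper's per-call dispatcher.
async function fetchPricingJson(url) {
  const res = await fetch(url, { signal: AbortSignal.timeout(60_000) }); // rejects with TimeoutError at 60s
  if (!res.ok) throw new Error(`pricing fetch failed: HTTP ${res.status}`);
  return res.json();
}
```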

**Process resource snapshot at degradation point (PID 16776):**
- working set 616 MB / private 811 MB
- 45 threads
- **3337 handles** (notably high)
- 25 min uptime

**Workaround attempts that did NOT help:**
- `openclaw doctor --fix` (3 cycles)
- `openclaw gateway restart` (10+ cycles)
- Hard kill (`Stop-Process -Force` on PID owning :18789 + tray) → clean restart
- Full Windows 11 reboot
- Downgrade 4.25 → 4.23 → back to 4.25 → 4.26
- `channels.telegram.pollTimeoutMs: 5000` (vs default 30000)
- Force IPv4 via `NODE_OPTIONS=--dns-result-order=ipv4first`
- Removed unused providers (arcee/openrouter)
- `openclaw sessions cleanup --enforce --fix-missing`

Logs are sanitized of bot tokens / API keys; happy to share unredacted logs privately.

Impact and severity

Affected: All gateway-internal outbound HTTP — Telegram polling/sendMessage, model-pricing fetch, in-process gateway RPC (models.list, doctor.memory.status, node.list).
Severity: High — Telegram bot replies are blocked or delayed 5+ minutes; gateway RPC is slow enough that openclaw-sweep tools fail or return partial results. The user has to fall back to the Tray UI / webchat for any reliable use.
Frequency: Always — the degradation recurs every 2–3 minutes once the gateway has been up >5 min, on this Windows 11 + Node 24.14.1 host, across 2026.4.23, 2026.4.25, and 2026.4.26.
Consequence: Telegram channel effectively unusable; missed/delayed messages; gateway needs constant restart; /health flickers between 200 and timeout.

Additional information

Hypotheses (ranked) for maintainers:

  1. Shared global undici dispatcher / Agent state degrades over time. Multiple subsystems (model-pricing, the Telegram grammyjs runner, doctor.memory.status) all use the shared global undici dispatcher and all start failing together. Keep-alive socket hand-off / reaping appears to break — getUpdates requests sit 127–266s past their AbortSignal timeout, suggesting the abort/timer layer is no longer firing as expected (isolation probe sketched after this list).
  2. The Telegram grammyjs polling runner's long-poll keep-alive sockets go stale, and the runner's stall detector only catches this after 127–197s. This plausibly correlates with the pricing-fetch / RPC slowdowns if all three subsystems share the same global dispatcher.
  3. Event-loop starvation during the channels-and-sidecars phase — models.list at 55–83s, node.list at 8.9s, and doctor.memory.status at 35s suggest a long-running synchronous task is blocking the loop, which would also explain pricing-fetch timers not firing (lag monitor sketched after this list).
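
Two small probes could help separate hypotheses 1 and 3 (both are sketches, not code from openclaw). For hypothesis 1, route one subsystem's fetches through a dedicated undici Agent — if those keep succeeding while the rest of the gateway stalls, the shared global dispatcher is implicated (assumes the undici package is installed; Node's built-in fetch accepts its non-standard `dispatcher` option):

```js
// Hypothesis 1 probe: isolate one class of requests onto its own dispatcher.
import { Agent } from 'undici';

const isolatedAgent = new Agent({ connections: 4, keepAliveTimeout: 10_000 });

const res = await fetch('https://openrouter.ai/api/v1/models', {
  dispatcher: isolatedAgent,           // bypasses the shared global dispatcher
  signal: AbortSignal.timeout(60_000),
});
console.log(res.status);
```

For hypothesis 3, an event-loop lag monitor using Node's built-in perf_hooks, run inside the gateway process — a sustained multi-second p99 during a stall window would point at loop starvation rather than socket state:

```js
// Hypothesis 3 probe: log event-loop delay percentiles every 10s.
// monitorEventLoopDelay reports nanoseconds; converted to ms for logging.
import { monitorEventLoopDelay } from 'node:perf_hooks';

const h = monitorEventLoopDelay({ resolution: 20 });
h.enable();
setInterval(() => {
  const ms = (ns) => (ns / 1e6).toFixed(1);
  console.log(`loop delay p50=${ms(h.percentile(50))}ms p99=${ms(h.percentile(99))}ms max=${ms(h.max)}ms`);
  h.reset();
}, 10_000);
```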

Note on Codex / parallel diagnostic: An independent agent (Codex) ran a parallel diagnostic on the same machine and concurs that the runtime degradation is process-internal, not network-side. openclaw-sweep runId 8924e8d6-d776-4ed5-94be-a87fd194372b available on request.

Last known good version: unknown — bug present in oldest version we could test (2026.4.23). Not a recent regression.

Happy to provide: full gateway log (C:\tmp\openclaw\openclaw-2026-04-2*.log), --inspect profile, OPENCLAW_DEBUG_INGRESS_TIMING=1 / OPENCLAW_DEBUG_HEALTH=1 traces, doctor --deep output (216s runtime; no actionable network-layer findings), or any other diagnostic that helps narrow which layer (undici Agent, grammyjs runner, gateway event loop, Windows-specific socket behavior) is degrading.
