[Bug]: undici HTTP/2 hang on Windows extends from Telegram polling into the LLM model dispatcher (related to #66885) #73831

@joeywrightphoto

Description

[Bug]: undici HTTP/2 hang on Windows extends from Telegram polling into the LLM model dispatcher (related to #66885 / #10795 / #4847)

Summary

On Windows, running OpenClaw 2026.4.23 and 2026.4.26 with Node 24.13.0, all outbound fetch-based HTTP calls intermittently hang for 90–200 seconds before failing. This affects:

  1. Telegram getUpdates long-polling (already noted in [Bug]: Telegram polling stall + subagent announce timeout on Windows (4.12) — undici HTTP/2 root cause #66885)
  2. Telegram sendMessage outbound (logged as Network request for 'sendMessage' failed!)
  3. Model dispatcher LLM calls (e.g. openai/claude-opus-4-7) — LLM request timed out after the configured 97s

The third item is new — #66885 only covers Telegram polling and subagent announce, but the same undici socket-pool hang is now blocking actual model invocations on the main agent. After layering every reasonable client-side mitigation, Telegram bot-internal commands like /status work (no LLM call involved), but any real agent run on a long prompt times out.

Affected versions

  • 2026.4.26 (be8c246) — first observed today (2026-04-28). Telegram polling stalls every 10–15 min, sendMessage failures, model timeouts.
  • 2026.4.23 (a979721) — same behavior after rolling back. Bug is not version-specific within this range.

Environment

  • OS: Windows 10.0.26200 (x64)
  • Node: 24.13.0
  • OpenClaw user config: agents.defaults.model.primary = openai/claude-opus-4-7, channels.telegram.streaming.mode=partial
  • Network: Tailscale + LAN, behind Comcast NAT, IPv6 already disabled at adapter binding (Disable-NetAdapterBinding -Name Ethernet -ComponentID ms_tcpip6 shows Enabled: False)
  • Comparable Mac on identical 2026.4.26: zero stalls. Issue is Windows-specific.

Mitigations already applied (none fully resolve)

  1. channels.telegram.streaming.mode=partial, autoSelectFamily=true, dnsResultOrder=ipv4first (set by OpenClaw runtime — see [telegram/network] dnsResultOrder=ipv4first (default-node22) log line)
  2. Add-MpPreference -ExclusionPath for the openclaw npm node_modules path
  3. Add-MpPreference -ExclusionProcess "node.exe"
  4. ✅ Inserted set "NODE_OPTIONS=--dns-result-order=ipv4first" into gateway.cmd before the node.exe launch line (process-level, not just runtime hint)
  5. Disable-NetAdapterBinding -ComponentID ms_tcpip6 on the active Ethernet adapter (was already disabled)
  6. ✅ Hard reboot of the Windows host to flush stuck undici sockets
  7. ✅ Full gateway restart (multiple times)
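Mitigation 4 is a one-line process-level change. For reference, a sketch of the patched gateway.cmd (the node.exe launch line here is illustrative — the script name and arguments in the real gateway.cmd may differ):

```bat
rem gateway.cmd (sketch) — pin DNS result ordering for the whole node process,
rem not just as a runtime hint inside OpenClaw
set "NODE_OPTIONS=--dns-result-order=ipv4first"
node.exe gateway.js %*
```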

After all of the above, /status and other bot-internal commands respond instantly. Long prompts to the main agent still time out at 97s on the Anthropic call.

Logs

Telegram polling stall pattern (recurring all afternoon, ~every 10–15 min)

[telegram] Polling stall detected (active getUpdates stuck for 178.44s); forcing restart.
   [diag inFlight=1 outcome=started startedAt=1777406174509 finishedAt=1777406174509 durationMs=30356 offset=0]
[telegram][diag] polling cycle finished reason=polling stall detected
   error=Network request for 'getUpdates' failed!
Telegram polling runner stopped (polling stall detected); restarting in 3.78s.
[telegram][diag] rebuilding transport for next polling cycle

Telegram sendMessage failure pattern (slash command replies dropped)

telegram sendMessage failed: Network request for 'sendMessage' failed!
telegram slash block reply failed: HttpError: Network request for 'sendMessage' failed!
telegram sendMessage failed: Network request for 'sendMessage' failed!
telegram slash final reply failed: HttpError: Network request for 'sendMessage' failed!

(and intermittently, the same path succeeds: telegram sendMessage ok chat=… message=17754 2 seconds later.)

NEW: model dispatcher timeout (this is the part #66885 doesn't cover)

lane task error: lane=session:agent:main:main durationMs=96963 error="FailoverError: LLM request timed out."
lane task error: lane=main durationMs=7477 error="FailoverError: openrouter (openai/gpt-5.5) returned a billing error..."
Embedded agent failed before reply: All models failed (2):
   openai/claude-opus-4-7: LLM request timed out. (timeout)
   openrouter/openai/gpt-5.5: 402 This request requires more credits…

The 96963 ms duration fills the envelope between the [default] starting provider log line and the LLM request timed out error almost exactly (configured timeout: 97 s) — the same undici hang shape as the Telegram stalls, but on the model call.

Direct API tests bypassing OpenClaw (PowerShell, same machine, same network)

Invoke-RestMethod  https://api.telegram.org/bot$bot/getMe          → 404 ms ✅
Invoke-RestMethod  https://api.telegram.org/bot$bot/getUpdates?…  → 399 ms ✅ (returned 2 pending updates)

So api.telegram.org, DNS, TLS, the bot token, and Windows TCP all work. The hang is inside undici's connection pool when the same calls go through Node's built-in fetch.

Suggested root cause (from forum / prior issue references)

Per #66885 and #10795: Node 22+ undici implements Happy Eyeballs but ignores net.setDefaultAutoSelectFamily. When allowH2: true (default) and the host advertises HTTP/2 + IPv6, undici can keep an HTTP/2 stream half-open against an IPv6 path that Windows can't actually route. The dispatcher sits in inFlight until the watchdog kills it.

#66885 fixed this for web_fetch in 4.7 by setting allowH2: false on that dispatcher. The same fix appears not to have been applied to the two clients listed below.

Suggested fix

Apply the same allowH2: false (and explicit autoSelectFamily: false in the underlying Agent) to:

  1. The Telegram channel's outbound HTTP client (covering both getUpdates and sendMessage)
  2. The model dispatcher used by agents/harness for provider calls

Both should use a shared undici Agent configured to:

new Agent({
  allowH2: false,
  connect: { autoSelectFamily: false, family: 4 },
})

…or accept a user-supplied dispatcher via env (UNDICI_HTTP1_ONLY=1 or similar) so Windows users without IPv6 routability can opt in without code changes.

Why this matters

The current state on Windows is that:

  • Bot-internal commands work (no LLM call)
  • Cron jobs and any prompt to a main agent intermittently time out at the model call layer
  • The watchdog masks the issue for telegram (eventually retries) but not for model calls (one shot, 97s, fail)

Mac users on identical OpenClaw versions are entirely unaffected because their IPv6 stack is healthy enough to negotiate HTTP/2.

Related issues

cc @steipete — given you tracked #71325 to landing, this may be in your area.
