fix(gateway): prevent probe timeout from deferred ESM module evaluation #845

Open
BingqingLyu wants to merge 5 commits into main from fork-pr-48270-fix-probe-event-loop-starvation

BingqingLyu commented Apr 27, 2026

Summary

  • Fixes gateway probe always reporting timeout on Windows after upgrading to 2026.3.13
  • Adds waitForEventLoopReady() before opening the probe WebSocket to ensure deferred ESM module evaluation has completed

Root cause

The auth-profiles ESM bundle triggers deferred synchronous work (primarily AJV schema compilation) that blocks the Node.js event loop for ~7 seconds after the top-level import() promise resolves. This blocking starts after the first event loop cycle completes — setTimeout(0) fires on time, but setTimeout(100) is delayed by 7+ seconds.

The probe's resolveProbeBudgetMs caps local loopback budget at 800ms and the overall default is 3000ms. Both expire while the event loop is blocked, because the WebSocket's open/message callbacks cannot fire until the synchronous work finishes.

Evidence from debugging on a Windows 10 machine with Node 24.14:

| Test | Connect time |
| --- | --- |
| Raw net.connect after import | 3ms |
| http.request after import | ~7000ms |
| ws WebSocket after import | ~7000ms |
| ws WebSocket without import | 8ms |
| Event loop stall detected via setInterval(100) | 7234ms |
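
The setInterval(100) stall measurement in the last row above can be approximated with a small harness. This is a minimal sketch, not the script used for the debugging session: `measureMaxStall` and `blockFor` are hypothetical names, and the busy-wait only stands in for the deferred synchronous work described in the root cause.

```typescript
// Hypothetical helper: reports the largest delay (in ms) by which a
// setInterval callback fired later than its nominal period.
function measureMaxStall(periodMs: number, durationMs: number): Promise<number> {
  return new Promise((resolve) => {
    const startedAt = Date.now();
    let last = startedAt;
    let maxStall = 0;
    const timer = setInterval(() => {
      const now = Date.now();
      // Time since the previous callback, minus the period we asked for.
      maxStall = Math.max(maxStall, now - last - periodMs);
      last = now;
      if (now - startedAt >= durationMs) {
        clearInterval(timer);
        resolve(maxStall);
      }
    }, periodMs);
  });
}

// Hypothetical stand-in for deferred synchronous work: a busy-wait that
// keeps the event loop from running any callbacks for `ms` milliseconds.
function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin
  }
}
```

Starting `measureMaxStall(100, 1000)` and then scheduling `blockFor(500)` from a `setTimeout` should report a stall of roughly the blocked duration, which mirrors how a multi-second block surfaces as a single late interval callback.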

The gateway status command (which uses callGateway with a 10s timeout) was unaffected because its budget outlasts the stall.

Fix

waitForEventLoopReady() schedules 20ms timers and checks each callback for abnormal drift (> 200ms). It resolves only after two consecutive on-time callbacks, which indicates that the deferred evaluation has finished. On systems without the blocking issue, this adds only ~40ms of overhead.
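
The drift check described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the option names and the returned elapsed time are hypothetical, and only the 20ms period, the 200ms drift threshold, and the two-consecutive-callbacks rule come from the description.

```typescript
// Hypothetical option names; defaults follow the values in the description.
interface EventLoopReadyOptions {
  intervalMs?: number; // timer period
  maxDriftMs?: number; // drift above this counts as a stall
  requiredOnTime?: number; // consecutive on-time callbacks needed
}

// Resolves with the elapsed wait (ms) once enough consecutive timers
// have fired without abnormal drift.
function waitForEventLoopReady(opts: EventLoopReadyOptions = {}): Promise<number> {
  const intervalMs = opts.intervalMs ?? 20;
  const maxDriftMs = opts.maxDriftMs ?? 200;
  const requiredOnTime = opts.requiredOnTime ?? 2;
  const startedAt = Date.now();

  return new Promise((resolve) => {
    let onTime = 0;
    const schedule = (): void => {
      const scheduledAt = Date.now();
      setTimeout(() => {
        // How much later than requested did this callback run?
        const drift = Date.now() - scheduledAt - intervalMs;
        // A late callback resets the streak; an on-time one extends it.
        onTime = drift > maxDriftMs ? 0 : onTime + 1;
        if (onTime >= requiredOnTime) {
          resolve(Date.now() - startedAt);
          return;
        }
        schedule();
      }, intervalMs);
    };
    schedule();
  });
}
```

On an idle event loop the promise settles after two timer periods, which is where the ~40ms overhead figure comes from.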

A longer-term fix would be to lazy-compile AJV schemas instead of evaluating them at module scope, which would eliminate the event loop stall entirely.
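
One way to picture the lazy-compile approach is a memoized factory, so that importing the module registers the work without performing it. This is a minimal sketch: `lazy`, `compileSchema`, and `getProfileValidator` are hypothetical names, and `compileSchema` is a cheap stand-in for an expensive compile step rather than AJV's actual API.

```typescript
// Wraps an expensive factory so it runs at most once, on first use.
function lazy<T>(factory: () => T): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (cached === undefined) {
      cached = { value: factory() };
    }
    return cached.value;
  };
}

type Validator = (data: unknown) => boolean;

// Counts compilations so the deferral is observable.
let compileCount = 0;

// Hypothetical stand-in for an expensive schema compiler.
function compileSchema(requiredKey: string): Validator {
  compileCount += 1;
  return (data) => typeof data === "object" && data !== null && requiredKey in data;
}

// Module scope only registers the factory; nothing is compiled at import time.
const getProfileValidator = lazy(() => compileSchema("profileId"));
```

With this shape, the compile cost moves from module evaluation to the first call of `getProfileValidator()`, and later calls reuse the cached validator.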

Test plan

  • Verified openclaw gateway probe returns Reachable: yes (21ms latency) on the affected Windows machine after patch
  • Existing probe.test.ts uses mocked GatewayClient, so waitForEventLoopReady completes instantly — no test breakage expected
  • CI tests pass

Related issues

Fixes openclaw#45940 — False negative from openclaw gateway probe on Windows
Fixes openclaw#46226 — Gateway probe shows 3000ms budget but uses 800ms internally — false timeout on healthy local loopback
Related openclaw#46316 — devices list / nodes status timeout while gateway status shows RPC probe: ok (regression in 2026.3.12/2026.3.13)
Related openclaw#46000 — Windows local gateway reissues operator device token without operator.read on 2026.3.13, breaking status/probe/health
Related openclaw#47640, openclaw#47307

https://www.answeroverflow.com/m/1482583046749163692

🤖 Generated with Claude Code

wongcode and others added 5 commits March 16, 2026 10:21
On Windows (and potentially other platforms with slower module evaluation),
the auth-profiles ESM bundle triggers deferred synchronous work (primarily
AJV schema compilation) that blocks the event loop for ~7 seconds *after*
the top-level import promise resolves. The probe's 800ms loopback budget
expires during this window because WebSocket data callbacks cannot fire,
causing `gateway probe` to always report "timeout" on 2026.3.13.

Add `waitForEventLoopReady()` that schedules short timers and watches for
abnormal drift, resolving only after two consecutive on-time callbacks.
This guarantees deferred module evaluation has finished before opening
any network connections. On unaffected systems this adds ~40ms overhead.

Fixes: probe timeout regression on Windows after upgrading to 2026.3.13
Related: openclaw#47640, openclaw#47307

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move waitForEventLoopReady into a shared module (event-loop-ready.ts)
and call it in executeGatewayRequestWithScopes in addition to
probeGateway. This fixes commands like `cron list`, `devices list`,
and any other CLI path that goes through callGateway — they hit the
same deferred ESM module evaluation stall that was causing probe
timeouts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Addresses review feedback: if the event loop remains starved beyond the
deadline (default 10 s), resolve anyway so that callers' own timeout
logic can take over rather than hanging indefinitely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move event-loop-ready import before method-scopes to satisfy
alphabetical import ordering enforced by the formatter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Pass the caller-supplied timeoutMs to waitForEventLoopReady so the
readiness preflight respects the probe/call timeout budget instead of
using the 10 s default. This prevents commands with tight budgets
(e.g. 800 ms loopback probe) from exceeding their timeout contract.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
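
The deadline and budget pass-through described in the last commits can be sketched together. This is an illustration only: `waitForEventLoopReadyBounded`, `probeWithBudget`, and the `connect` callback are hypothetical names, and deducting the preflight time from the remaining budget is an assumption rather than something the commit messages state.

```typescript
type ReadyResult = "ready" | "deadline";

// Hypothetical bounded variant: gives up once maxWaitMs has elapsed so the
// caller's own timeout logic can take over. The deadline can only be
// observed when a timer callback actually runs.
function waitForEventLoopReadyBounded(maxWaitMs = 10_000): Promise<ReadyResult> {
  const intervalMs = 20;
  const maxDriftMs = 200;
  const startedAt = Date.now();

  return new Promise((resolve) => {
    let onTime = 0;
    const schedule = (): void => {
      const scheduledAt = Date.now();
      setTimeout(() => {
        const now = Date.now();
        const drift = now - scheduledAt - intervalMs;
        onTime = drift > maxDriftMs ? 0 : onTime + 1;
        if (onTime >= 2) {
          resolve("ready");
          return;
        }
        if (now - startedAt >= maxWaitMs) {
          resolve("deadline");
          return;
        }
        schedule();
      }, intervalMs);
    };
    schedule();
  });
}

// Hypothetical caller: the preflight shares the caller's budget, and the
// connection attempt only receives whatever time is left (an assumption).
async function probeWithBudget(
  timeoutMs: number,
  connect: (remainingMs: number) => Promise<boolean>,
): Promise<boolean> {
  const startedAt = Date.now();
  await waitForEventLoopReadyBounded(timeoutMs);
  const remainingMs = Math.max(0, timeoutMs - (Date.now() - startedAt));
  return connect(remainingMs);
}
```

Passing the caller's `timeoutMs` instead of relying on the 10 s default keeps an 800 ms loopback probe from waiting longer in the preflight than its whole budget.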