Summary
Provider auth pre-warming blocks the Node.js event loop for 60-90 seconds during gateway startup, causing all concurrent I/O operations (HTTP API calls, MCP server startup, WebSocket connections) to time out. Affects all tested remote model providers.
Environment
- OS: Windows 10 Pro 22H2 (19045)
- OpenClaw: 2026.5.22 (a374c3a)
- Node.js: v24.14.1
- Shell: bash (via VS Code)
Reproduction Steps
- Start OpenClaw gateway with any remote model provider (e.g., DashScope, CTYun).
- Observe logs during the first 2 minutes of startup.
Reproduction rate: 100% across multiple restarts and model switches.
Tested Providers (All Affected)
| Provider | Model | API Latency | Auth Pre-Warm | Event Loop Max |
|---|---|---|---|---|
| DashScope | qwen-plus | 0.9s | 86,014 ms | 67,109 ms |
| DashScope | qwen-vl-max | ~1s | 78,464 ms | 61,942 ms |
| CTYun (Tianyi) | GLM-5-Pro | 3.0s | 90,402 ms | 70,733 ms |
Model APIs respond quickly via direct curl/PowerShell (0.2-1.0s). The issue is in OpenClaw's auth pre-warming, not upstream APIs.
Cascade Failure Chain
Gateway Start
  |
  └─ Auth Pre-Warming (event loop blocked 60-90s)
       ├─ [TIMEOUT] Feishu tenant_access_token API (30s)
       ├─ [TIMEOUT] Feishu bot identity ping (30s, 5 retries over 15 min)
       ├─ [TIMEOUT] MCP server startup (30s, no tools for agent)
       ├─ [DELAY] Health check: 3-22s (normal <0.01s)
       ├─ [DELAY] Control UI: agents.list 14-16s, models.list 16-23s
       ├─ [DELAY] Message response: 80-126s (normal <10s)
       └─ [CPU] 97% utilization, 666MB memory for gateway process
Key Log Evidence
provider auth state pre-warmed in 90402ms eventLoopMax=70732.7ms
liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu
eventLoopDelayP99Ms=22968 eventLoopUtilization=0.973 cpuCoreRatio=0.967
feishu[default]: bot info probe timed out after 30000ms; continuing startup
(repeats 5 times over 15 minutes before bot identity resolves)
failed to start server "windows-automation" (...): MCP server connection timed out after 30000ms
stalled session: age=162s classification=stalled_agent_run
Feishu APIs (tenant_access_token, bot ping) respond in ~0.2s when tested directly with curl, but time out (30s) when called from the Node.js process during auth pre-warming.
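The delay and utilization figures in the liveness warning can be cross-checked independently with Node's built-in `perf_hooks` module. The sketch below is an illustration under assumptions: whether OpenClaw derives `eventLoopDelayP99Ms` and `eventLoopUtilization` from these same APIs is not known from this report, and `startLoopMonitor`, `blockFor`, `sleep`, and `demo` are hypothetical names.

```javascript
const { monitorEventLoopDelay, performance } = require('node:perf_hooks');

// Starts sampling event loop delay and utilization from this point on.
function startLoopMonitor({ resolutionMs = 20 } = {}) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  const utilizationStart = performance.eventLoopUtilization();
  return {
    snapshot() {
      // Passing the earlier reading returns the delta since monitoring began.
      const elu = performance.eventLoopUtilization(utilizationStart);
      return {
        maxMs: histogram.max / 1e6, // histogram values are in nanoseconds
        p99Ms: histogram.percentile(99) / 1e6,
        utilization: elu.utilization, // fraction of time the loop was busy
      };
    },
    stop() {
      histogram.disable();
    },
  };
}

// Busy-wait to simulate a synchronous stall on the main thread.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* spin */ }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function demo() {
  const monitor = startLoopMonitor();
  await sleep(60); // give the sampler time to tick before the stall
  blockFor(200); // simulated stall, far shorter than the 60-90s reported
  await sleep(60); // let the sampler observe the gap
  const snapshot = monitor.snapshot();
  monitor.stop();
  return snapshot;
}

demo().then((snapshot) => console.log(snapshot));
```

A monitor like this could be preloaded into the gateway process (for example with Node's `--require` flag, assuming the launch command allows it) to confirm the stall without relying on the gateway's own liveness log.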
Message Response Timeline (Real Example)
| Time | Event |
|---|---|
| 09:00:16 | Message received from Feishu |
| 09:01:24 | Core plugin tools loaded (+68s) |
| 09:01:57 | MCP server timeout (+33s, total 101s) |
| 09:02:08 | Stream ready (+11s, total 112s) |
| 09:02:22 | Reply sent (+14s, total 126s) |
Per-message overhead (post-auth-warming):
- tool-policy: 2.3s
- image-tool: 1.2s
- plugin-tools: 1.6s
- system-prompt: 3.3-5.1s
- session-resource-loader: 4.1-7.2s
Total overhead before model call: 15-17s
Workaround
- Wait 2-3 minutes after gateway start before sending messages.
- Feishu bot identity may need 5-15 minutes to recover via background retry.
Expected Behavior
Auth pre-warming should be non-blocking (async) or run in worker threads so that it does not starve the event loop. A 60-90s synchronous block on the main event loop makes the gateway unusable during startup and keeps degrading reliability afterward, since Feishu bot identity can take 5-15 minutes to recover.
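As a rough illustration of the worker-thread option, the sketch below moves a CPU-bound step off the main thread with `node:worker_threads` and counts heartbeat ticks to show the main loop staying responsive. What pre-warming actually computes is not known from this report, so the busy loop is a stand-in, and `workerSource`, `preWarmInWorker`, `demo`, and `example-provider` are hypothetical.

```javascript
const { Worker } = require('node:worker_threads');

// Worker body, passed as a string so the example stays in one file.
// The busy loop is a stand-in for the real CPU-heavy pre-warming step.
const workerSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  const end = Date.now() + workerData.busyMs;
  let rounds = 0;
  while (Date.now() < end) { rounds++; }
  parentPort.postMessage({ provider: workerData.provider, rounds });
`;

// Runs the stand-in pre-warm step in a worker and resolves with its result.
function preWarmInWorker(provider, busyMs) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { provider, busyMs },
    });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}

// Counts main-thread timer ticks while the worker is busy.
async function demo(busyMs = 400) {
  let ticks = 0;
  const heartbeat = setInterval(() => { ticks++; }, 10);
  const result = await preWarmInWorker('example-provider', busyMs);
  clearInterval(heartbeat);
  return { ...result, ticks };
}

demo().then((result) => console.log(result));
```

If the heavy step cannot be moved to a worker (for example because it depends on main-thread state), splitting it into chunks that yield between iterations with `setImmediate` is another option, at the cost of a longer total pre-warm time.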
Affected Components
| Component | Severity | Impact |
|---|---|---|
| Feishu | CRITICAL | bot identity cannot resolve, messages undeliverable for 5-15 min |
| WeCom | Moderate | WebSocket-based, somewhat resilient but messages delayed 47-99s |
| MCP Servers | CRITICAL | all MCP servers fail to start (30s timeout) |
| Control UI | Moderate | API calls delayed 14-25s |
| Health Check | Minor | occasionally slow (3-22s vs normal <0.01s) |
Additional Context
- Issue persists across multiple restarts and different model providers.
- No error logs from the model APIs themselves; the calls succeed once the event loop becomes free.
- The problem appears to be architectural: the gateway's startup sequence does not follow Node.js non-blocking design principles.