[BUG] Auth pre-warming blocks event loop for 60-90s, causing cascading timeouts #86506

@Macxiaxia

Summary

Provider auth pre-warming blocks the Node.js event loop for 60-90 seconds during gateway startup, causing all concurrent I/O operations (HTTP API calls, MCP server startup, WebSocket connections) to time out. Affects all tested remote model providers.

Environment

  • OS: Windows 10 Pro 22H2 (19045)
  • OpenClaw: 2026.5.22 (a374c3a)
  • Node.js: v24.14.1
  • Shell: bash (via VS Code)

Reproduction Steps

  1. Start OpenClaw gateway with any remote model provider (e.g., DashScope, CTYun).
  2. Observe logs during the first 2 minutes of startup.

Reproduction rate: 100% across multiple restarts and model switches.

Tested Providers (All Affected)

| Provider | Model | API Latency | Auth Pre-Warm | Event Loop Max |
| --- | --- | --- | --- | --- |
| DashScope | qwen-plus | 0.9s | 86,014 ms | 67,109 ms |
| DashScope | qwen-vl-max | ~1s | 78,464 ms | 61,942 ms |
| CTYun (Tianyi) | GLM-5-Pro | 3.0s | 90,402 ms | 70,733 ms |

Model APIs respond quickly via direct curl/PowerShell (0.2-1.0s). The issue is in OpenClaw's auth pre-warming, not upstream APIs.

Cascade Failure Chain

Gateway Start
  |
  └─ Auth Pre-Warming (event loop blocked 60-90s)
       ├─ [TIMEOUT] Feishu tenant_access_token API (30s)
       ├─ [TIMEOUT] Feishu bot identity ping (30s, 5 retries over 15 min)
       ├─ [TIMEOUT] MCP server startup (30s, no tools for agent)
       ├─ [DELAY] Health check: 3-22s (normal <0.01s)
       ├─ [DELAY] Control UI: agents.list 14-16s, models.list 16-23s
       ├─ [DELAY] Message response: 80-126s (normal <10s)
       └─ [CPU] 97% utilization, 666MB memory for gateway process

Key Log Evidence

provider auth state pre-warmed in 90402ms eventLoopMax=70732.7ms
liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu 
   eventLoopDelayP99Ms=22968 eventLoopUtilization=0.973 cpuCoreRatio=0.967

feishu[default]: bot info probe timed out after 30000ms; continuing startup
   (repeats 5 times over 15 minutes before bot identity resolves)

failed to start server "windows-automation" (...): MCP server connection timed out after 30000ms

stalled session: age=162s classification=stalled_agent_run

Feishu APIs (tenant_access_token, bot ping) respond in ~0.2s when tested directly with curl, but time out (30s) when called from the Node.js process during auth pre-warming.
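A fast upstream API combined with 30s timeouts inside the process is what a long synchronous block on the main thread would produce: timers and socket callbacks cannot run until the blocking call returns. Below is a minimal Node.js sketch that reproduces the effect at small scale and measures it with `monitorEventLoopDelay` from `node:perf_hooks`. `blockEventLoop` and `runDemo` are hypothetical names, and the busy loop is only a stand-in, since this report does not identify what pre-warming actually executes.

```javascript
const { monitorEventLoopDelay, performance } = require('node:perf_hooks');

// Hypothetical stand-in for a synchronous pre-warming step: spins the CPU for `ms`.
function blockEventLoop(ms) {
  const end = performance.now() + ms;
  while (performance.now() < end) {
    // Busy wait: nothing else on this thread can run in the meantime.
  }
}

// Schedules a timer for `timeoutMs`, then blocks for `blockMs`, and reports
// how late the timer fired plus the maximum delay the histogram observed.
function runDemo(blockMs = 300, timeoutMs = 100) {
  return new Promise((resolve) => {
    const histogram = monitorEventLoopDelay({ resolution: 10 });
    histogram.enable();

    // Let the histogram take a few baseline samples before blocking.
    setTimeout(() => {
      const scheduledAt = performance.now();

      // Stands in for an I/O callback or timeout that is waiting on the loop.
      setTimeout(() => {
        const firedAfterMs = performance.now() - scheduledAt;

        // Allow one more sampling interval so the long delay is recorded.
        setTimeout(() => {
          histogram.disable();
          // Histogram values are reported in nanoseconds.
          resolve({ firedAfterMs, eventLoopMaxMs: histogram.max / 1e6 });
        }, 50);
      }, timeoutMs);

      blockEventLoop(blockMs);
    }, 50);
  });
}
```

Calling `runDemo().then(console.log)` should show the 100ms timer firing only after the 300ms block has finished, with `eventLoopMaxMs` of the same order as the block. This is the same relationship the `eventLoopMax` field in the logs above has to the pre-warm duration.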

Message Response Timeline (Real Example)

| Time | Event |
| --- | --- |
| 09:00:16 | Message received from Feishu |
| 09:01:24 | Core plugin tools loaded (+68s) |
| 09:01:57 | MCP server timeout (+33s, total 101s) |
| 09:02:08 | Stream ready (+11s, total 112s) |
| 09:02:22 | Reply sent (+14s, total 126s) |

Per-message overhead (post-auth-warming):

  • tool-policy: 2.3s
  • image-tool: 1.2s
  • plugin-tools: 1.6s
  • system-prompt: 3.3-5.1s
  • session-resource-loader: 4.1-7.2s

Total overhead before model call: 15-17s

Workaround

  • Wait 2-3 minutes after gateway start before sending messages.
  • Feishu bot identity may need 5-15 minutes to recover via background retry.

Expected Behavior

Auth pre-warming should be non-blocking (async) or run in worker threads so that it does not starve the event loop. A 60-90s synchronous block on the main thread makes the gateway unusable during startup and keeps degrading reliability well afterwards, because dependent integrations such as the Feishu bot identity take 5-15 minutes to recover.
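One way to read the worker-thread suggestion, as a hedged sketch rather than a description of OpenClaw's code: move the CPU-heavy part into a `node:worker_threads` worker so the main thread keeps servicing timers and I/O. `prewarmInWorker`, the `providerId` and `spinMs` fields, and the busy loop inside the worker are all assumptions made for illustration.

```javascript
const { Worker } = require('node:worker_threads');

// Worker body as an inline script. The busy loop is a hypothetical stand-in
// for CPU-heavy pre-warming work; real code would do the actual auth setup here.
const workerSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  const end = Date.now() + workerData.spinMs;
  while (Date.now() < end) { /* simulated CPU-bound work */ }
  parentPort.postMessage({ providerId: workerData.providerId, ready: true });
`;

// Runs the simulated pre-warm off the main thread and resolves with its result.
function prewarmInWorker(providerId, spinMs = 200) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { providerId, spinMs },
    });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      // No effect if the promise already resolved with a message.
      if (code !== 0) reject(new Error(`pre-warm worker exited with code ${code}`));
    });
  });
}

// Counts how often a 10ms interval fires on the main thread while the worker runs.
async function demoResponsiveness() {
  let ticks = 0;
  const timer = setInterval(() => { ticks += 1; }, 10);
  const result = await prewarmInWorker('example-provider', 200);
  clearInterval(timer);
  return { result, ticks };
}
```

In this sketch the interval keeps ticking during the 200ms of simulated work, which would not happen if the same loop ran on the main thread. Whether the real pre-warming state can be passed between threads (it must survive structured cloning) is something only the maintainers can confirm.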

Affected Components

| Component | Severity | Impact |
| --- | --- | --- |
| Feishu | Critical | Bot identity cannot resolve; messages undeliverable for 5-15 min |
| WeCom | Moderate | WebSocket-based, somewhat resilient, but messages delayed 47-99s |
| MCP Servers | Critical | All MCP servers fail to start (30s timeout) |
| Control UI | Moderate | API calls delayed 14-25s |
| Health Check | Minor | Occasionally slow (3-22s vs normal <0.01s) |

Additional Context

  • Issue persists across multiple restarts and different model providers.
  • No error logs from the model APIs themselves; the calls succeed once the event loop becomes free.
  • The problem appears to be architectural: the gateway's startup sequence does not follow Node.js non-blocking design principles.
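If the pre-warming work can be split into independent units (for example one per provider, which is an assumption and not something the logs confirm), a lighter alternative to worker threads is to yield to the event loop between units. The sketch below uses `setImmediate` from `node:timers/promises`; `prewarmAllCooperatively` and `prewarmOne` are hypothetical names.

```javascript
const { setImmediate: yieldToEventLoop } = require('node:timers/promises');

// Runs `prewarmOne` for each provider in turn, giving the event loop a chance
// to process due timers and I/O callbacks after every unit of work.
async function prewarmAllCooperatively(providers, prewarmOne) {
  const results = [];
  for (const provider of providers) {
    // `prewarmOne` may be synchronous or return a promise.
    results.push(await prewarmOne(provider));
    // Resolves in the check phase of the event loop, after pending
    // timers and I/O have had a turn.
    await yieldToEventLoop();
  }
  return results;
}
```

This only bounds each stall to the duration of the slowest single unit. If one unit alone blocks for tens of seconds, yielding between units does not help and the work has to move off the main thread.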
