[BUG] Auth pre-warming blocks event loop for 60-90s, causing cascading timeouts #86506

## Summary
Provider auth pre-warming blocks the Node.js event loop for 60-90 seconds during gateway startup, causing all concurrent I/O operations (HTTP API calls, MCP server startup, WebSocket connections) to time out. Affects all tested remote model providers.

## Environment
- **OS**: Windows 10 Pro 22H2 (19045)
- **OpenClaw**: 2026.5.22 (a374c3a)
- **Node.js**: v24.14.1
- **Shell**: bash (via VS Code)

## Reproduction Steps
1. Start OpenClaw gateway with any remote model provider (e.g., DashScope, CTYun).
2. Observe logs during the first 2 minutes of startup.

**Reproduction rate**: 100% across multiple restarts and model switches.

## Tested Providers (All Affected)
| Provider       | Model           | API Latency | Auth Pre-Warm | Event Loop Max |
|----------------|----------------|-------------|---------------|----------------|
| DashScope      | qwen-plus      | 0.9s        | 86,014 ms     | 67,109 ms      |
| DashScope      | qwen-vl-max    | ~1s         | 78,464 ms     | 61,942 ms      |
| CTYun (Tianyi) | GLM-5-Pro      | 3.0s        | 90,402 ms     | 70,733 ms      |

> Model APIs respond quickly via direct curl/PowerShell (0.2-1.0s). The issue is in OpenClaw's auth pre-warming, not upstream APIs.

## Cascade Failure Chain
```
Gateway Start
  |
  └─ Auth Pre-Warming (event loop blocked 60-90s)
       ├─ [TIMEOUT] Feishu tenant_access_token API (30s)
       ├─ [TIMEOUT] Feishu bot identity ping (30s, 5 retries over 15 min)
       ├─ [TIMEOUT] MCP server startup (30s, no tools for agent)
       ├─ [DELAY] Health check: 3-22s (normal <0.01s)
       ├─ [DELAY] Control UI: agents.list 14-16s, models.list 16-23s
       ├─ [DELAY] Message response: 80-126s (normal <10s)
       └─ [CPU] 97% utilization, 666MB memory for gateway process
```

## Key Log Evidence
```
provider auth state pre-warmed in 90402ms eventLoopMax=70732.7ms
liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu 
   eventLoopDelayP99Ms=22968 eventLoopUtilization=0.973 cpuCoreRatio=0.967

feishu[default]: bot info probe timed out after 30000ms; continuing startup
   (repeats 5 times over 15 minutes before bot identity resolves)

failed to start server "windows-automation" (...): MCP server connection timed out after 30000ms

stalled session: age=162s classification=stalled_agent_run
```

> Feishu APIs (`tenant_access_token`, `bot ping`) respond in ~0.2s when tested directly with curl, but time out (30s) when called from the Node.js process during auth pre-warming.
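
The `eventLoopMax`, `eventLoopDelayP99Ms`, and `eventLoopUtilization` figures above are the kind of values Node.js exposes through `perf_hooks`. As a rough way to reproduce the symptom outside the gateway, the sketch below samples the event loop while a synchronous busy-wait runs. `blockFor` and `measureStall` are hypothetical names, and the busy-wait is only a stand-in, since this report does not show what the pre-warm code actually does.

```javascript
const { monitorEventLoopDelay, performance } = require("node:perf_hooks");

// Hypothetical stand-in for blocking work; the real pre-warm code is unknown.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // Busy-wait: nothing else can run on this thread until the loop exits.
  }
}

// Runs `work` synchronously while sampling the event loop, then reports the stall.
function measureStall(work, { resolutionMs = 10, settleMs = 50 } = {}) {
  return new Promise((resolve) => {
    const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
    histogram.enable();
    // Let the sampler tick a few times first so it has a baseline.
    setTimeout(() => {
      const before = performance.eventLoopUtilization();
      work();
      // Yield again so the late sampler tick is recorded before reading results.
      setTimeout(() => {
        histogram.disable();
        const elu = performance.eventLoopUtilization(before);
        resolve({
          maxDelayMs: histogram.max / 1e6, // histogram values are nanoseconds
          p99DelayMs: histogram.percentile(99) / 1e6,
          utilization: elu.utilization, // share of time the loop was busy
        });
      }, settleMs);
    }, settleMs);
  });
}
```

Calling `measureStall(() => blockFor(500)).then(console.log)` should report a `maxDelayMs` near the length of the block and a utilization close to 1. Any timer or socket callback due during that window, including a 30s request timeout, can only run after the block ends.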

## Message Response Timeline (Real Example)
| Time     | Event                                      |
|----------|--------------------------------------------|
| 09:00:16 | Message received from Feishu               |
| 09:01:24 | Core plugin tools loaded (+68s)            |
| 09:01:57 | MCP server timeout (+33s, total 101s)      |
| 09:02:08 | Stream ready (+11s, total 112s)            |
| 09:02:22 | Reply sent (+14s, total 126s)              |

**Per-message overhead (post-auth-warming)**:
- tool-policy: 2.3s
- image-tool: 1.2s
- plugin-tools: 1.6s
- system-prompt: 3.3-5.1s
- session-resource-loader: 4.1-7.2s

**Total overhead before model call**: 15-17s

## Workaround
- Wait 2-3 minutes after gateway start before sending messages.
- Feishu bot identity may need 5-15 minutes to recover via background retry.
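
Instead of waiting a fixed 2-3 minutes, the wait could be scripted by polling until the gateway answers quickly several times in a row. The following is a minimal sketch, not an OpenClaw feature: `waitUntilResponsive`, its options, and the idea of probing a health-check URL are all assumptions, and the probe is injected because this report does not give the actual endpoint.

```javascript
// Small helper: resolve after `ms` milliseconds.
const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls `probe` repeatedly until it succeeds within `thresholdMs`
// for `requiredStreak` consecutive attempts. All names are hypothetical.
async function waitUntilResponsive(probe, options = {}) {
  const {
    thresholdMs = 500,
    requiredStreak = 3,
    intervalMs = 1000,
    maxAttempts = 300,
  } = options;
  let streak = 0;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const started = Date.now();
    let ok = true;
    try {
      await probe();
    } catch {
      ok = false; // a failed probe resets the streak
    }
    const latencyMs = Date.now() - started;
    streak = ok && latencyMs <= thresholdMs ? streak + 1 : 0;
    if (streak >= requiredStreak) {
      return { attempts: attempt, latencyMs };
    }
    await pause(intervalMs);
  }
  throw new Error(`still slow after ${maxAttempts} probes`);
}

// Example call; the URL is a placeholder, not taken from OpenClaw:
// waitUntilResponsive(() => fetch("http://127.0.0.1:PORT/health"));
```

Requiring a streak rather than a single fast reply guards against a probe that happens to land in a brief gap between stalls.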

## Expected Behavior
Auth pre-warming should be **non-blocking (async)** or run in **worker threads** so that it does not starve the event loop. A 60-90s synchronous block on the main thread makes the gateway unusable during startup, and the effects outlast startup: Feishu bot identity takes 5-15 minutes to recover through background retries.
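
To illustrate the worker-thread option, the sketch below moves a blocking task off the main thread with `node:worker_threads` and counts main-thread timer ticks while it runs. This is not OpenClaw's code: `preWarmInWorker` and `runDemo` are hypothetical names, and the busy loop only stands in for the pre-warm work, whose real nature (CPU-bound or synchronous I/O) this report does not identify.

```javascript
const { Worker } = require("node:worker_threads");

// Hypothetical stand-in for the pre-warm work; this source runs inside the worker.
const workerSource = `
  const { parentPort, workerData } = require("node:worker_threads");
  const end = Date.now() + workerData.busyMs;
  let iterations = 0;
  while (Date.now() < end) iterations++; // blocks only the worker thread
  parentPort.postMessage({ iterations });
`;

// Starts the blocking work in a worker and resolves with its result message.
function preWarmInWorker(busyMs) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true, // treat the first argument as source code, not a file path
      workerData: { busyMs },
    });
    worker.once("message", resolve);
    worker.once("error", reject);
    worker.once("exit", (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}

// Counts main-thread timer ticks while the worker is busy.
async function runDemo(busyMs = 300) {
  let ticks = 0;
  const timer = setInterval(() => ticks++, 10);
  const result = await preWarmInWorker(busyMs);
  clearInterval(timer);
  return { ticks, result };
}
```

With the work in a worker, `runDemo()` should report many ticks, because the main event loop stays free to serve timers and sockets. Running the same busy loop directly on the main thread would leave `ticks` at zero until the loop finished.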

## Affected Components
| Component      | Severity | Impact                                                       |
|----------------|----------|--------------------------------------------------------------|
| Feishu         | CRITICAL | bot identity cannot resolve, messages undeliverable for 5-15 min |
| WeCom          | Moderate | WebSocket-based, somewhat resilient but messages delayed 47-99s |
| MCP Servers    | CRITICAL | all MCP servers fail to start (30s timeout)                  |
| Control UI     | Moderate | API calls delayed 14-25s                                     |
| Health Check   | Minor    | occasionally slow (3-22s vs normal <0.01s)                   |

## Additional Context
- Issue persists across multiple restarts and different model providers.
- No error logs from the model APIs themselves; the calls succeed once the event loop becomes free.
- The problem appears to be architectural: the gateway's startup sequence does not follow Node.js non-blocking design principles.

