Bug Report: WebSocket disconnect (code 1006) during Gateway startup due to event loop starvation
Summary
When the Gateway starts with a QQ Bot channel configured, the Control UI WebSocket
connection drops with code 1006 multiple times during the first 5 minutes.
Root cause: channels.qqbot.start-account and model-prewarm consume the
Node.js event loop for extended periods without yielding, starving WebSocket
keepalive pings. The built-in liveness detector confirms this.
Environment
- OpenClaw version: 2026.5.7
- Node.js: v24.15.0
- OS: Windows 10.0.26200 (x64)
- Channels configured: qqbot (enabled)
- Model: deepseek/deepseek-v4-pro (thinking=high)
- Plugins loaded (7): browser, device-pair, file-transfer, memory-core,
phone-control, qqbot, talk-voice
- Gateway bind: loopback-only (127.0.0.1:18789)
Timeline (key events, times in local UTC+8)
09:58:47 Gateway started (PID 14284)
09:58:52 HTTP server listening
09:58:54 ┌─ "gateway ready"
├─ QQ Bot: Starting gateway (appId=<APPID>)
├─ QQ Bot: Registering approval.native runtime context
├─ Browser control listening on :18791
└─ Heartbeat started (interval: 30min)
09:58:55 ┌─ Control UI connect attempt #1
│ REJECTED: code=1013 reason="gateway starting"
│ (cause=startup-sidecars-pending, durationMs=1215)
└─ This is expected behavior
09:58:55 QQ Bot: Access token obtained successfully
09:58:56 ┌─ Control UI connect attempt #2
│ REJECTED: code=1005 (still starting)
│ (handshake=pending, durationMs=230)
└─ Still expected
09:58:56 Control UI connected (conn=A) ← first successful connection
09:58:58 QQ Bot: Connecting to wss://api.sgroup.qq.com/websocket
09:58:58 QQ Bot: WebSocket connected
09:58:58 QQ Bot: Gateway ready
09:58:58 npm registry fetch timeout (2.5s) — non-blocking
Initial startup API calls (most ~1.0-1.2 seconds, already slow)
09:58:57 health 284ms (cached)
09:58:58 commands.list 1006ms
09:58:58 node.list 1188ms ← notably slow for a local API
09:58:58 device.pair.list 1190ms
09:58:58 models.list 1197ms
09:58:58 sessions.list 1202ms
09:58:58 chat.history 1211ms
Observation: Apart from the cached health check, every call in the first batch
takes 1.0-1.2 seconds, even though the Gateway reports "ready". Something is
occupying the event loop.
🔴 Disconnect #1 (code 1006 — abnormal)
10:00:04 webchat disconnected code=1006 conn=A
(connection A lasted ~68 seconds)
10:00:05 webchat connected conn=B (auto-reconnect, ~850ms gap)
Reconnect API calls (slower, ~1.9-2.3s each)
10:00:07 health 722ms
10:00:08 commands.list 1910ms
10:00:08 models.list 1914ms
10:00:08 sessions.list 1922ms
10:00:09 node.list 2330ms ← getting worse
10:00:09 device.pair.list 2334ms
10:00:09 chat.history 2347ms
🔴 Disconnect #2 (code 1006 — abnormal)
10:01:47 node.list 7940ms ← ~8 seconds for a local API call!
10:01:48 webchat disconnected code=1006 conn=B
(connection B lasted ~103 seconds)
10:01:50 webchat connected conn=C (auto-reconnect, ~2s gap)
🩺 Liveness Warning (the smoking gun)
At 10:01:51 (3 seconds after disconnect #2), the Gateway's own diagnostics fired:
liveness warning:
reasons = event_loop_delay
interval = 30s
eventLoopDelayMaxMs = 7944
eventLoopDelayP99Ms = 990.9
eventLoopUtilization = 0.604 (event loop busy, i.e. not idle, for 60.4% of the interval)
cpuCoreRatio = 0.618
phase = channels.qqbot.start-account ← culprit!
recentPhases = sidecars.subagent-recovery:4ms,
sidecars.main-session-recovery:2ms,
sidecars.session-locks:33ms,
sidecars.model-prewarm:2222ms, ← also blocking!
post-ready.maintenance:333ms
active work = agent:main:main (processing/model_call, age=1s)
queued work = agent:main:main (processing/model_call, age=1s)
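For anyone who wants to reproduce these numbers outside the Gateway, the sketch below shows one way fields like eventLoopDelayMaxMs, eventLoopDelayP99Ms and eventLoopUtilization could be collected with Node's perf_hooks module. It is not the Gateway's actual detector: createLoopSampler and the returned field names are hypothetical. It also assumes the raw histogram samples include the sampling interval itself, which is why the resolution is subtracted.

```javascript
const { monitorEventLoopDelay, performance } = require("node:perf_hooks");

// Hypothetical sampler; the name and output shape are illustrative only.
function createLoopSampler(resolutionMs = 20) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  let lastElu = performance.eventLoopUtilization();

  // Histogram values are in nanoseconds. Assumption: each sample is the time
  // between two timer ticks, so the configured resolution is subtracted.
  const toDelayMs = (ns) => Math.max(0, ns / 1e6 - resolutionMs);

  return {
    sample() {
      const current = performance.eventLoopUtilization();
      // Passing both readings returns the utilization between them.
      const delta = performance.eventLoopUtilization(current, lastElu);
      lastElu = current;
      const result = {
        eventLoopDelayMaxMs: toDelayMs(histogram.max),
        eventLoopDelayP99Ms: toDelayMs(histogram.percentile(99)),
        eventLoopUtilization: delta.utilization, // fraction of time the loop was busy
      };
      histogram.reset();
      return result;
    },
    stop() {
      histogram.disable();
    },
  };
}
```

Calling sample() on a fixed interval (the report above uses 30s) yields one reading per window. A high max with a low utilization points to a few long synchronous blocks rather than sustained load.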
Reconnect #2 API calls (still slow, ~1.7-2.1s)
10:01:52 health 630ms
10:01:53 commands.list 1724ms
10:01:53 chat.history 1727ms
10:01:53 models.list 1730ms
10:01:53 sessions.list 1735ms
10:01:53 node.list 2104ms
10:01:53 device.pair.list 2108ms
🔴 Disconnect #3 (code 1001 — browser going away)
10:04:42 webchat connected conn=D (new browser tab/window)
10:04:44 health 672ms
10:04:45 commands.list 1869ms
10:04:45 models.list 1872ms
10:04:45 sessions.list 1877ms
10:04:45 node.list 2246ms
10:04:45 device.pair.list 2249ms
10:04:45 chat.history 2253ms
10:04:50 webchat disconnected code=1001 conn=D (graceful close)
10:04:52 liveness warning eventLoopDelayMaxMs=2246 util=23.3%
Persistent liveness warnings (proving it's not a one-time startup spike)
10:05:22 liveness: eventLoopDelayMaxMs=1056 util=5.1%
phase=channels.qqbot.start-account ← still in this phase!
10:07:22 liveness: eventLoopDelayMaxMs=1066 util=4.8%
phase=channels.qqbot.start-account
10:14:23 liveness: eventLoopDelayMaxMs=13639 ← 13.6 seconds!
util=58.5%
phase=channels.qqbot.start-account
10:16:23 liveness: eventLoopDelayMaxMs=1449 util=16%
phase=channels.qqbot.start-account
The phase is always channels.qqbot.start-account, and
recentPhases consistently includes sidecars.model-prewarm:2222ms.
Root Cause Analysis
┌─────────────────────────────────────────────────────────┐
│ Node.js Single-Threaded Event Loop │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ QQ Bot start-account (sync blocks) │ ← phase stuck │
│ │ model-prewarm: 2222ms (sync block) │ here │
│ │ │ │
│ │ ██████████████████████████████████ │ Event loop │
│ │ ██████████████████████████████████ │ utilization │
│ │ ██████████████████████████████████ │ up to 60.4% │
│ └──────────────────────────────────────┘ │
│ │
│ WebSocket keepalive (ping/pong) ── queued, starved │
│ ↓ │
│ Browser WS timeout (~10s) → code 1006 disconnect │
│ ↓ │
│ Auto-reconnect → triggers node.list, sessions.list etc │
│ ↓ │
│ These also queue behind the blocking phase → still slow│
│ ↓ │
│ Browser times out again → disconnect → repeat 🔁 │
└─────────────────────────────────────────────────────────┘
Why API calls are slow
API calls like node.list, sessions.list, and device.pair.list are not
intrinsically slow. They are waiting in the event loop queue behind:
- channels.qqbot.start-account, which appears to contain long synchronous
operations that don't yield to the event loop
- sidecars.model-prewarm:2222ms, a 2.2-second synchronous model
warmup that blocks the entire process
The measured API call duration includes queue wait time, not just execution
time. This is why node.list reports 7940ms — it spent most of that time
just waiting for the event loop to become available.
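The effect can be reproduced in isolation. The following is a minimal sketch with made-up names (blockFor, fakeNodeList, timeQueuedCall are all hypothetical, not Gateway code): a trivial handler is queued, a synchronous busy-wait then holds the loop, and the measured duration splits into queue wait and execution time.

```javascript
const { performance } = require("node:perf_hooks");

// Busy-wait that never yields; stands in for a long synchronous phase.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin
  }
}

// Hypothetical handler whose own work is trivial.
function fakeNodeList() {
  return ["node-a", "node-b"];
}

// Times a call from "request arrival" to completion, the way a
// server-side duration log would see it.
function timeQueuedCall(blockMs) {
  return new Promise((resolve) => {
    const arrivedAt = performance.now();
    // The handler is queued on the event loop...
    setImmediate(() => {
      const startedAt = performance.now();
      const result = fakeNodeList();
      const finishedAt = performance.now();
      resolve({
        queueWaitMs: startedAt - arrivedAt,
        executionMs: finishedAt - startedAt,
        result,
      });
    });
    // ...but synchronous work holds the loop before it can run.
    blockFor(blockMs);
  });
}
```

With timeQueuedCall(100), nearly all of the reported time lands in queueWaitMs while executionMs stays close to zero, which matches the pattern seen for node.list above.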
Impact
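- Duration: Disconnects occur in the first ~5 minutes after Gateway restart;
liveness warnings were still firing at 10:16:23, more than 17 minutes after start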
- Frequency: ~2-4 WebSocket disconnects during this window
- Data loss: No. Server-side session data is preserved. Only the UI
refreshes (browser auto-reconnects as a new WebSocket session)
- User visible: Control UI chat history appears to clear on reconnect
(because it's a new WS session), but reloading the page restores it
- Reproducibility: Every Gateway restart with QQ Bot channel enabled
Suggested Fixes
1. Break up channels.qqbot.start-account into async chunks
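The start-account phase should periodically yield control back to the event
loop using setImmediate() or await new Promise(r => setTimeout(r, 0)).
process.nextTick() is not suitable here: its callbacks run before the loop
moves on to I/O, so pending WebSocket frames would still be starved.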
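// Instead of:
for (const item of largeArray) {
  heavySyncOperation(item);
}
// Use (inside an async function):
for (const [i, item] of largeArray.entries()) {
  heavySyncOperation(item);
  // Yield after every 10th item so timers and socket I/O can run
  if ((i + 1) % 10 === 0) await new Promise(r => setImmediate(r));
}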
2. Make model-prewarm non-blocking
A 2222ms synchronous operation should be moved to a Worker thread or broken
into async chunks. It should never block the event loop for >2 seconds.
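As a rough illustration of the Worker option, the sketch below offloads a CPU-bound loop to a worker thread so the main event loop stays free. It assumes the prewarm work is CPU-bound and can run without main-thread state, which this report has not verified. prewarmInWorker and the loop body are hypothetical stand-ins.

```javascript
const { Worker } = require("node:worker_threads");

// Worker body kept inline so the sketch fits in one file; a real
// implementation would more likely load a separate module.
const workerSource = `
  const { parentPort, workerData } = require("node:worker_threads");
  // Stand-in for CPU-heavy warmup work (hypothetical).
  let acc = 0;
  for (let i = 0; i < workerData.iterations; i++) acc += Math.sqrt(i);
  parentPort.postMessage({ ok: true, acc });
`;

// Runs the warmup off the main thread and resolves with its result.
function prewarmInWorker(iterations) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { iterations },
    });
    worker.once("message", resolve);
    worker.once("error", reject);
    worker.once("exit", (code) => {
      // No effect if the promise has already been resolved by "message".
      if (code !== 0) reject(new Error(`prewarm worker exited with code ${code}`));
    });
  });
}
```

The caller awaits prewarmInWorker(...) while timers and WebSocket pings on the main thread continue to run during the warmup.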
3. Increase WebSocket keepalive tolerance (mitigation only)
As a secondary mitigation, the Control UI client could increase its WS
timeout tolerance during the startup window, or the server-side WS could
use a longer ping interval during startup phases.
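One possible shape for the server-side variant is a policy that tolerates more missed pongs while the Gateway is still starting. The class below is a library-agnostic sketch. KeepalivePolicy, its option names and its default limits are all assumptions for illustration, not existing Gateway or WebSocket-library APIs.

```javascript
// Hypothetical keepalive policy: lenient during startup, strict afterwards.
class KeepalivePolicy {
  constructor({
    startedAt,
    startupGraceMs = 5 * 60_000, // assumed startup window
    maxMissedStartup = 4,        // assumed tolerance during startup
    maxMissedSteady = 1,         // assumed tolerance afterwards
  }) {
    this.startedAt = startedAt;
    this.startupGraceMs = startupGraceMs;
    this.maxMissedStartup = maxMissedStartup;
    this.maxMissedSteady = maxMissedSteady;
    this.missed = 0;
  }

  // Call when a pong arrives from the peer.
  onPong() {
    this.missed = 0;
  }

  // Call once per ping interval; returns the action the caller should take.
  tick(now) {
    const inStartup = now - this.startedAt < this.startupGraceMs;
    const limit = inStartup ? this.maxMissedStartup : this.maxMissedSteady;
    if (this.missed >= limit) return "terminate";
    this.missed += 1; // counts the ping about to be sent; cleared by onPong()
    return "ping";
  }
}
```

The caller would invoke tick(Date.now()) from its ping timer and call onPong() from the socket's pong handler. This only hides the symptom: API calls still queue behind the blocking phase.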
4. Consider moving QQ Bot init to post-ready
If start-account doesn't need to complete before Gateway reports "ready",
it could be deferred to run after the ready signal, reducing the window
where Control UI connections are vulnerable.
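A minimal sketch of such a deferral, assuming the Gateway exposes something like an EventEmitter that fires a "ready" event (the event name and startAfterReady are hypothetical):

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical helper: runs startChannel only after `gateway` emits "ready",
// and on a later loop turn so I/O callbacks that are already queued run first.
function startAfterReady(gateway, startChannel) {
  return new Promise((resolve, reject) => {
    gateway.once("ready", () => {
      setImmediate(() => {
        // Wrapping in a promise chain handles both sync and async starters.
        Promise.resolve().then(startChannel).then(resolve, reject);
      });
    });
  });
}
```

Deferral only moves the work later. If start-account still blocks synchronously once it runs, it would need to be combined with yielding or a worker to avoid the same stalls.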
Note on plugins.allow
Setting plugins.allow: ["qqbot"] was tested as a mitigation — it eliminates
the startup warning about plugin discovery but does not resolve this issue.
The bottleneck is not plugin scanning.
Diagnosis performed on 2026-05-16. Log excerpts redacted: hostname, username,
and AppID replaced with placeholders. No message content, API keys, or tokens
are included.