WebSocket disconnect (code 1006) during Gateway startup due to event loop starvation #82398

@Nameless1949

Bug Report: WebSocket disconnect (code 1006) during Gateway startup due to event loop starvation

Summary

When the Gateway starts with a QQ Bot channel configured, the Control UI WebSocket
connection drops with code 1006 multiple times during the first 5 minutes.
Apparent root cause: channels.qqbot.start-account and model-prewarm hold the
Node.js event loop for extended periods without yielding, starving WebSocket
keepalive pings. The built-in liveness detector reports event_loop_delay
while these phases are active, which supports this.

Environment

  • OpenClaw version: 2026.5.7
  • Node.js: v24.15.0
  • OS: Windows 10.0.26200 (x64)
  • Channels configured: qqbot (enabled)
  • Model: deepseek/deepseek-v4-pro (thinking=high)
  • Plugins loaded (7): browser, device-pair, file-transfer, memory-core,
    phone-control, qqbot, talk-voice
  • Gateway bind: loopback-only (127.0.0.1:18789)

Timeline (key events, times in local UTC+8)

09:58:47  Gateway started (PID 14284)
09:58:52  HTTP server listening
09:58:54  ┌─ "gateway ready"
          ├─ QQ Bot: Starting gateway (appId=<APPID>)
          ├─ QQ Bot: Registering approval.native runtime context
          ├─ Browser control listening on :18791
          └─ Heartbeat started (interval: 30min)
09:58:55  ┌─ Control UI connect attempt #1
          │  REJECTED: code=1013 reason="gateway starting"
          │  (cause=startup-sidecars-pending, durationMs=1215)
          └─ This is expected behavior
09:58:55  QQ Bot: Access token obtained successfully
09:58:56  ┌─ Control UI connect attempt #2
          │  REJECTED: code=1005 (still starting)
          │  (handshake=pending, durationMs=230)
          └─ Still expected
09:58:56  Control UI connected (conn=A)  ← first successful connection
09:58:58  QQ Bot: Connecting to wss://api.sgroup.qq.com/websocket
09:58:58  QQ Bot: WebSocket connected
09:58:58  QQ Bot: Gateway ready
09:58:58  npm registry fetch timeout (2.5s) — non-blocking

Normal startup API calls (mostly ~1.0-1.2 seconds, already slow; health is cached)

09:58:57  health             284ms  (cached)
09:58:58  commands.list      1006ms
09:58:58  node.list          1188ms  ← notably slow for a local API
09:58:58  device.pair.list   1190ms
09:58:58  models.list        1197ms
09:58:58  sessions.list      1202ms
09:58:58  chat.history       1211ms

Observation: Apart from the cached health check, even the first batch of API
calls takes 1.0-1.2 seconds each, despite the Gateway being "ready". Something
is occupying the event loop.

🔴 Disconnect #1 (code 1006 — abnormal)

10:00:04  webchat disconnected  code=1006  conn=A
          (connection A lasted ~68 seconds)
10:00:05  webchat connected     conn=B  (auto-reconnect, ~850ms gap)

Reconnect API calls (slower, ~1.9-2.3s each)

10:00:07  health             722ms
10:00:08  commands.list      1910ms
10:00:08  models.list        1914ms
10:00:08  sessions.list      1922ms
10:00:09  node.list          2330ms  ← getting worse
10:00:09  device.pair.list   2334ms
10:00:09  chat.history       2347ms

🔴 Disconnect #2 (code 1006 — abnormal)

10:01:47  node.list          7940ms  ← ~8 seconds for a local API call!
10:01:48  webchat disconnected  code=1006  conn=B
          (connection B lasted ~103 seconds)
10:01:50  webchat connected     conn=C  (auto-reconnect, ~2s gap)

🩺 Liveness Warning (the smoking gun)

At 10:01:51 (3 seconds after disconnect #2), the Gateway's own diagnostics fired:

liveness warning:
  reasons              = event_loop_delay
  interval             = 30s
  eventLoopDelayMaxMs  = 7944
  eventLoopDelayP99Ms  = 990.9
  eventLoopUtilization = 0.604   (event loop busy, i.e. not idle, for 60.4% of the interval)
  cpuCoreRatio         = 0.618

  phase          = channels.qqbot.start-account   ← culprit!
  recentPhases   = sidecars.subagent-recovery:4ms,
                   sidecars.main-session-recovery:2ms,
                   sidecars.session-locks:33ms,
                   sidecars.model-prewarm:2222ms,   ← also blocking!
                   post-ready.maintenance:333ms

  active work    = agent:main:main (processing/model_call, age=1s)
  queued work    = agent:main:main (processing/model_call, age=1s)

Reconnect #2 API calls (still slow, ~1.7-2.1s)

10:01:52  health             630ms
10:01:53  commands.list      1724ms
10:01:53  chat.history       1727ms
10:01:53  models.list        1730ms
10:01:53  sessions.list      1735ms
10:01:53  node.list          2104ms
10:01:53  device.pair.list   2108ms

🔴 Disconnect #3 (code 1001 — browser going away)

10:04:42  webchat connected     conn=D  (new browser tab/window)
10:04:44  health             672ms
10:04:45  commands.list      1869ms
10:04:45  models.list        1872ms
10:04:45  sessions.list      1877ms
10:04:45  node.list          2246ms
10:04:45  device.pair.list   2249ms
10:04:45  chat.history       2253ms
10:04:50  webchat disconnected  code=1001  conn=D  (graceful close)
10:04:52  liveness warning   eventLoopDelayMaxMs=2246  util=23.3%

Persistent liveness warnings (proving it's not a one-time startup spike)

10:05:22  liveness: eventLoopDelayMaxMs=1056  util=5.1%
          phase=channels.qqbot.start-account  ← still in this phase!
10:07:22  liveness: eventLoopDelayMaxMs=1066  util=4.8%
          phase=channels.qqbot.start-account
10:14:23  liveness: eventLoopDelayMaxMs=13639 ← 13.6 seconds!
          util=58.5%
          phase=channels.qqbot.start-account
10:16:23  liveness: eventLoopDelayMaxMs=1449  util=16%
          phase=channels.qqbot.start-account

The phase is always channels.qqbot.start-account, and
recentPhases consistently includes sidecars.model-prewarm:2222ms.

Root Cause Analysis

┌─────────────────────────────────────────────────────────┐
│              Node.js Single-Threaded Event Loop          │
│                                                         │
│  ┌──────────────────────────────────────┐               │
│  │  QQ Bot start-account (sync blocks)  │ ← phase stuck │
│  │  model-prewarm: 2222ms (sync block)  │   here        │
│  │                                      │               │
│  │  ██████████████████████████████████   │ Event loop    │
│  │  ██████████████████████████████████   │ utilization   │
│  │  ██████████████████████████████████   │ up to 60.4%   │
│  └──────────────────────────────────────┘               │
│                                                         │
│  WebSocket keepalive (ping/pong) ── queued, starved     │
│       ↓                                                 │
│  Browser WS timeout (~10s) → code 1006 disconnect       │
│       ↓                                                 │
│  Auto-reconnect → triggers node.list, sessions.list etc │
│       ↓                                                 │
│  These also queue behind the blocking phase → still slow│
│       ↓                                                 │
│  Browser times out again → disconnect → repeat 🔁       │
└─────────────────────────────────────────────────────────┘

Why API calls are slow

API calls like node.list, sessions.list, device.pair.list are not
intrinsically slow — they're waiting in the event loop queue behind:

  1. channels.qqbot.start-account — appears to contain long synchronous
    operations that don't yield to the event loop
  2. sidecars.model-prewarm:2222ms — a 2.2-second model warmup that, if it
    runs synchronously as it appears to, blocks the entire process

The measured API call duration includes queue wait time, not just execution
time. This is why node.list reports 7940ms — it spent most of that time
just waiting for the event loop to become available.
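
The distinction between queue wait and execution time can be illustrated in isolation. This is a minimal sketch and not Gateway code: measureQueuedCall and blockFor are hypothetical names, the "API call" is a trivial callback, and a busy-wait stands in for the blocking phase.

```javascript
const { performance } = require('node:perf_hooks');

// Hypothetical stand-in for a blocking phase: spin for `ms`.
function blockFor(ms) {
  const end = performance.now() + ms;
  while (performance.now() < end) { /* busy-wait */ }
}

// Schedule a trivial "API call", then block the loop before it can run.
function measureQueuedCall(blockMs) {
  return new Promise((resolve) => {
    const enqueuedAt = performance.now();

    setTimeout(() => {
      const startedAt = performance.now();
      const result = ['node-a', 'node-b'].length; // the actual work is trivial
      const finishedAt = performance.now();
      resolve({
        result,
        queueWaitMs: startedAt - enqueuedAt, // time spent waiting for the loop
        execMs: finishedAt - startedAt,      // time spent doing the work
      });
    }, 0);

    // Synchronous work that runs first and holds the event loop.
    blockFor(blockMs);
  });
}
```

With measureQueuedCall(200), queueWaitMs comes out at roughly the blocked duration while execMs stays near zero, which is the same shape as a 7940ms node.list.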

Impact

  • Duration: First ~5 minutes after Gateway restart
  • Frequency: ~2-4 WebSocket disconnects during this window
  • Data loss: No. Server-side session data is preserved. Only the UI
    refreshes (browser auto-reconnects as a new WebSocket session)
  • User visible: Control UI chat history appears to clear on reconnect
    (because it's a new WS session), but reloading the page restores it
  • Reproducibility: Every Gateway restart with QQ Bot channel enabled

Suggested Fixes

1. Break up channels.qqbot.start-account into async chunks

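The start-account phase should periodically yield control back to the event
loop using setImmediate() or await new Promise(r => setTimeout(r, 0)).
process.nextTick() is not suitable here: its callbacks run before the loop
moves on to I/O and timers, so it does not let pending WebSocket work run: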

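// Instead of:
for (const item of largeArray) {
  heavySyncOperation(item);
}

// Use (inside an async function):
for (const [i, item] of largeArray.entries()) {
  heavySyncOperation(item);
  // After every 10th item, let pending I/O and timers run before continuing.
  if ((i + 1) % 10 === 0) await new Promise(r => setImmediate(r));
}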
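
A fixed item count only helps if every item is cheap. A variant that yields based on elapsed time may be more robust. The following is a minimal sketch under that assumption: processInChunks, handleItem, and the 8 ms budget are illustrative and not taken from the OpenClaw code base.

```javascript
const { performance } = require('node:perf_hooks');

// Run handleItem over items, yielding to the event loop whenever the
// current slice has used up its time budget. Returns the number of yields.
async function processInChunks(items, handleItem, budgetMs = 8) {
  let sliceStart = performance.now();
  let yields = 0;

  for (const item of items) {
    handleItem(item);

    if (performance.now() - sliceStart >= budgetMs) {
      // setImmediate resumes after the loop has processed pending timers and I/O.
      await new Promise((resolve) => setImmediate(resolve));
      yields += 1;
      sliceStart = performance.now();
    }
  }
  return yields;
}
```

The budget bounds how long a single slice can hold the loop, apart from the cost of one oversized item, which would still need to be split or moved off the main thread.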

2. Make model-prewarm non-blocking

A 2222ms synchronous operation should be moved to a Worker thread or broken
into async chunks. It should never block the event loop for >2 seconds.
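
As a rough illustration of the Worker thread option, here is a minimal sketch using node:worker_threads. It assumes the prewarm work is CPU-bound and can run without access to main-thread state, which the logs do not confirm. prewarmInWorker and the inline worker source are hypothetical, and the busy loop only simulates the work.

```javascript
const { Worker } = require('node:worker_threads');

// Hypothetical stand-in for the prewarm work: a CPU-bound busy loop.
const prewarmSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  const end = Date.now() + workerData.busyMs;
  while (Date.now() < end) { /* simulate CPU-bound warmup */ }
  parentPort.postMessage({ warmed: true });
`;

// Run the warmup off the main thread; the main event loop stays responsive.
function prewarmInWorker(busyMs) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(prewarmSource, {
      eval: true,
      workerData: { busyMs },
    });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`prewarm worker exited with code ${code}`));
    });
  });
}
```

While prewarmInWorker(2222) is pending, timers and WebSocket frames on the main thread continue to be serviced, because the blocking loop runs on the worker's own event loop.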

3. Increase WebSocket keepalive tolerance (mitigation only)

As a secondary mitigation, the Control UI client could increase its WS
timeout tolerance during the startup window, or the server-side WS could
use a longer ping interval during startup phases.
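
One way to express such a tolerance is a small policy function that the heartbeat check consults. This is a minimal sketch only: makeHeartbeatPolicy, its parameter names, and all thresholds are assumptions for illustration and do not correspond to existing Gateway or Control UI settings.

```javascript
// Build a predicate that decides whether a connection should be treated as
// dead, allowing a longer pong timeout while the process is still starting.
function makeHeartbeatPolicy({
  startupWindowMs = 5 * 60_000,   // assumed length of the startup window
  startupPongTimeoutMs = 30_000,  // assumed tolerance during startup
  steadyPongTimeoutMs = 10_000,   // assumed tolerance afterwards
} = {}) {
  return function isConnectionDead({ processUptimeMs, msSinceLastPong }) {
    const limit = processUptimeMs < startupWindowMs
      ? startupPongTimeoutMs
      : steadyPongTimeoutMs;
    return msSinceLastPong > limit;
  };
}

const isConnectionDead = makeHeartbeatPolicy();

// One minute after start, a 14-second gap is tolerated.
console.log(isConnectionDead({ processUptimeMs: 60_000, msSinceLastPong: 14_000 }));
// Ten minutes after start, the same gap is treated as a dead connection.
console.log(isConnectionDead({ processUptimeMs: 600_000, msSinceLastPong: 14_000 }));
```

This only hides the symptom: a 13.6-second stall was also logged at 10:14, outside the assumed startup window, so fixes 1 and 2 remain the actual remedy.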

4. Consider moving QQ Bot init to post-ready

If start-account doesn't need to complete before Gateway reports "ready",
it could be deferred to run after the ready signal, reducing the window
where Control UI connections are vulnerable.

Note on plugins.allow

Setting plugins.allow: ["qqbot"] was tested as a mitigation — it eliminates
the startup warning about plugin discovery but does not resolve this issue.
The bottleneck is not plugin scanning.


Diagnosis performed on 2026-05-16. Log excerpts redacted: hostname, username,
and AppID replaced with placeholders. No message content, API keys, or tokens
are included.
