Bug Description
When a stalled agent run starves the Node.js event loop long enough to drop the Slack WebSocket heartbeat, the gateway's stopChannel() cleanup path hits the 5000ms timeout and leaves manuallyStopped set for the Slack channel account. The gateway process stays alive but the Slack socket never reconnects — manuallyStopped.has(rKey) is true, so the auto-restart loop exits immediately without scheduling a reconnect.
Environment
- OpenClaw version: 2026.5.3-1 (2eae30e)
- Platform: macOS (Darwin, Apple Silicon)
- Channel: Slack (socket mode, two accounts: default + archivist)
Failure chain
stalled model call (~10 min, auditor:main, lmstudio-lab1)
→ event loop blocked (P99 delay 7692ms, utilization 0.922)
→ Slack SDK WS heartbeat fails → connection drops
→ health monitor aborts stalled session → calls stopChannel()
→ stopChannel(): manuallyStopped.add(rKey) ← poison pill set
→ waitForChannelStopGracefully() times out at 5000ms (loop still starved)
→ timeout branch: setRuntime(running: true), return ← no cleanup
→ event loop clears, gateway process continues alive
→ auto-restart loop: manuallyStopped.has(rKey) === true → returns, no reconnect
→ Slack dead indefinitely; only fix is launchctl kickstart -k
Relevant log sequence
gateway.err.log:
[diagnostic] liveness warning: reasons=event_loop_delay,cpu interval=33s eventLoopDelayP99Ms=7692.4 eventLoopDelayMaxMs=7893.7 eventLoopUtilization=0.922 cpuCoreRatio=0.944
[slack] [default] channel stop exceeded 5000ms after abort; continuing shutdown
gateway.log (after the above — no further Slack events until manual kickstart):
[ws] ⇄ res ✓ health ... ← gateway WS still alive
[ws] ⇄ res ✓ health ...
... (silence from Slack)
Code location
server-channels-DtnF0i8E.js (compiled), stopChannel(), line ~512:
    // CHANNEL_STOP_ABORT_TIMEOUT_MS = 5e3
    if (!await waitForChannelStopGracefully(task, CHANNEL_STOP_ABORT_TIMEOUT_MS)) {
      log.warn?.(`[${id}] channel stop exceeded ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms after abort; continuing shutdown`);
      setRuntime(channelId, id, {
        accountId: id,
        running: true, // ← should not be true; connection is dead
        restartPending: false,
        lastError: `channel stop timed out after ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms`
      });
      return; // ← exits without store.aborts.delete / store.tasks.delete
              //   and manuallyStopped remains set from line ~495
    }
    // happy path clears aborts, tasks, sets running:false
    store.aborts.delete(id);
    store.tasks.delete(id);
manuallyStopped.add(rKey) is called unconditionally at the top of stopChannel() (line ~495), before the timeout check. On the timeout path it is never cleared, so the auto-restart loop at line ~354 sees manuallyStopped.has(rKey) === true and returns without reconnecting.
Expected behavior
When waitForChannelStopGracefully times out, the channel should either:
Option A (minimal fix): Remove rKey from manuallyStopped in the timeout branch, set running: false, and let the auto-restart loop reconnect.
Option B (explicit reconnect): After the timeout, schedule a reconnect attempt directly (bypassing manuallyStopped) with a short delay to let the event loop recover.
Either option prevents the "ghost alive" state where the gateway is running but the Slack socket is permanently dead.
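A minimal sketch of Option A, written as a standalone model rather than a patch against the compiled bundle. Only manuallyStopped, rKey, store.aborts, store.tasks and CHANNEL_STOP_ABORT_TIMEOUT_MS come from the code above; createChannelState, handleStopTimeout, shouldAutoRestart and the "slack:default" key format are hypothetical names used for illustration.

```javascript
const CHANNEL_STOP_ABORT_TIMEOUT_MS = 5e3;

// Hypothetical container for the state that stopChannel() touches.
function createChannelState() {
  return {
    manuallyStopped: new Set(),
    store: { aborts: new Map(), tasks: new Map() },
    runtime: new Map(),
  };
}

// Hypothetical helper: the cleanup the timeout branch could run under Option A.
function handleStopTimeout(state, rKey, id) {
  state.manuallyStopped.delete(rKey); // unblock the auto-restart guard
  state.store.aborts.delete(id);      // same cleanup as the happy path
  state.store.tasks.delete(id);
  state.runtime.set(id, {
    accountId: id,
    running: false, // the connection is dead, so do not report it as running
    restartPending: false,
    lastError: `channel stop timed out after ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms`,
  });
}

// Hypothetical stand-in for the guard at the top of the auto-restart loop.
function shouldAutoRestart(state, rKey) {
  return !state.manuallyStopped.has(rKey);
}

// Usage: simulate the timeout path for the default Slack account.
const state = createChannelState();
state.manuallyStopped.add("slack:default"); // set at the top of stopChannel()
handleStopTimeout(state, "slack:default", "default");
console.log(shouldAutoRestart(state, "slack:default")); // true
```

One caveat with this approach: a stop that was genuinely requested by the user and then timed out would also become eligible for auto-restart, so a real fix may need to tell a user-requested stop apart from one initiated by the health monitor.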
Workaround
Until fixed, a watchdog cron job can recover the socket automatically. It runs launchctl kickstart -k gui/<uid>/ai.openclaw.gateway whenever the most recent "channel stop exceeded" entry in the logs is newer than the most recent "socket mode connected" entry.
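The detection step of such a watchdog could look roughly like the sketch below. It assumes every log line starts with an ISO-8601 timestamp, which the excerpts in this report do not show, so the parser will need adjusting to the real format. lastTimestamp and needsKickstart are hypothetical names.

```javascript
// Assumption: lines look like "2026-05-04T10:15:00.000Z [slack] ...".
const STOP_TIMEOUT_MARKER = "channel stop exceeded";
const CONNECTED_MARKER = "socket mode connected";

// Latest timestamp (ms since epoch) among lines containing the marker,
// or null when no matching line has a parseable timestamp.
function lastTimestamp(lines, marker) {
  let latest = null;
  for (const line of lines) {
    if (!line.includes(marker)) continue;
    const ts = Date.parse(line.split(" ", 1)[0]);
    if (Number.isNaN(ts)) continue;
    if (latest === null || ts > latest) latest = ts;
  }
  return latest;
}

// True when the most recent stop timeout is newer than the most recent connect.
function needsKickstart(lines) {
  const stopped = lastTimestamp(lines, STOP_TIMEOUT_MARKER);
  if (stopped === null) return false;
  const connected = lastTimestamp(lines, CONNECTED_MARKER);
  return connected === null || stopped > connected;
}
```

The cron job would read the lines of gateway.log and gateway.err.log into one array, call needsKickstart on it, and run the launchctl command above only when it returns true.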
Related