[Bug]: Slack socket permanently dead after event-loop starvation — manuallyStopped suppresses auto-reconnect #77651

@Gusty3055

Bug Description

When a stalled agent run starves the Node.js event loop long enough to drop the Slack WebSocket heartbeat, the gateway's stopChannel() cleanup path hits its 5000ms timeout and returns with manuallyStopped still set for the Slack channel account. The gateway process stays alive, but the Slack socket never reconnects: because manuallyStopped.has(rKey) is true, the auto-restart loop exits immediately without scheduling a reconnect.

Environment

  • OpenClaw version: 2026.5.3-1 (2eae30e)
  • Platform: macOS (Darwin, Apple Silicon)
  • Channel: Slack (socket mode, two accounts: default + archivist)

Failure chain

stalled model call (~10 min, auditor:main, lmstudio-lab1)
  → event loop blocked (P99 delay 7692ms, utilization 0.922)
  → Slack SDK WS heartbeat fails → connection drops
  → health monitor aborts stalled session → calls stopChannel()
  → stopChannel(): manuallyStopped.add(rKey)          ← poison pill set
  → waitForChannelStopGracefully() times out at 5000ms (loop still starved)
  → timeout branch: setRuntime(running: true), return  ← no cleanup
  → event loop clears, gateway process continues alive
  → auto-restart loop: manuallyStopped.has(rKey) === true → returns, no reconnect
  → Slack dead indefinitely; only fix is launchctl kickstart -k

Relevant log sequence

gateway.err.log:

[diagnostic] liveness warning: reasons=event_loop_delay,cpu interval=33s eventLoopDelayP99Ms=7692.4 eventLoopDelayMaxMs=7893.7 eventLoopUtilization=0.922 cpuCoreRatio=0.944
[slack] [default] channel stop exceeded 5000ms after abort; continuing shutdown

gateway.log (after the above — no further Slack events until manual kickstart):

[ws] ⇄ res ✓ health ...   ← gateway WS still alive
[ws] ⇄ res ✓ health ...
... (silence from Slack)

Code location

server-channels-DtnF0i8E.js (compiled), stopChannel(), line ~512:

// CHANNEL_STOP_ABORT_TIMEOUT_MS = 5e3
if (!await waitForChannelStopGracefully(task, CHANNEL_STOP_ABORT_TIMEOUT_MS)) {
    log.warn?.(`[${id}] channel stop exceeded ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms after abort; continuing shutdown`);
    setRuntime(channelId, id, {
        accountId: id,
        running: true,          // ← should not be true; connection is dead
        restartPending: false,
        lastError: `channel stop timed out after ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms`
    });
    return;  // ← exits without store.aborts.delete / store.tasks.delete
             //   and manuallyStopped remains set from line ~495
}
// happy path clears aborts, tasks, sets running:false
store.aborts.delete(id);
store.tasks.delete(id);

manuallyStopped.add(rKey) is called unconditionally at the top of stopChannel() (line ~495), before the timeout check. On the timeout path it is never cleared, so the auto-restart loop at line ~354 sees manuallyStopped.has(rKey) === true and returns without reconnecting.

Expected behavior

When waitForChannelStopGracefully() times out, stopChannel() should do one of the following:

Option A (minimal fix): Remove rKey from manuallyStopped in the timeout branch, set running: false, and let the auto-restart loop reconnect.

Option B (explicit reconnect): After the timeout, schedule a reconnect attempt directly (bypassing manuallyStopped) with a short delay to let the event loop recover.

Either option prevents the "ghost alive" state where the gateway is running but the Slack socket is permanently dead.
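
To make the difference concrete, here is a minimal sketch of Option A against a toy model of the bookkeeping. This is not the gateway's actual code: createChannelManager, the clearOnTimeout flag, the stoppedInTime argument (standing in for the result of waitForChannelStopGracefully()), autoRestart, and the "slack:default" key format are all hypothetical. Only manuallyStopped, stopChannel, rKey and the running/lastError fields come from the snippet above.

```javascript
// Toy model of the channel bookkeeping described in this report.
// Hypothetical names: createChannelManager, clearOnTimeout, stoppedInTime,
// autoRestart, and the "slack:default" key format.
function createChannelManager({ clearOnTimeout }) {
  const manuallyStopped = new Set();
  const runtime = new Map();
  let reconnects = 0;

  // Stand-in for stopChannel(). `stoppedInTime` replaces the real
  // waitForChannelStopGracefully() result so the model stays synchronous.
  function stopChannel(rKey, stoppedInTime) {
    manuallyStopped.add(rKey); // set unconditionally, as in the report
    if (!stoppedInTime) {
      if (clearOnTimeout) {
        // Option A: drop the manual-stop marker and report the channel as down.
        manuallyStopped.delete(rKey);
        runtime.set(rKey, { running: false, lastError: "channel stop timed out" });
      } else {
        // Reported behavior: marker stays set and runtime claims running: true.
        runtime.set(rKey, { running: true, lastError: "channel stop timed out" });
      }
      return;
    }
    // Happy path: the channel stopped cleanly and stays manually stopped.
    runtime.set(rKey, { running: false, lastError: null });
  }

  // Stand-in for the guard at the top of the auto-restart loop.
  function autoRestart(rKey) {
    if (manuallyStopped.has(rKey)) return false; // no reconnect scheduled
    reconnects += 1;
    runtime.set(rKey, { running: true, lastError: null });
    return true;
  }

  return { stopChannel, autoRestart, runtime, reconnectCount: () => reconnects };
}

// Reported behavior: the timed-out stop leaves the marker set, so no reconnect.
const buggy = createChannelManager({ clearOnTimeout: false });
buggy.stopChannel("slack:default", false);
const buggyReconnected = buggy.autoRestart("slack:default");

// Option A: the timeout branch clears the marker, so the loop reconnects.
const fixed = createChannelManager({ clearOnTimeout: true });
fixed.stopChannel("slack:default", false);
const fixedReconnected = fixed.autoRestart("slack:default");

console.log({ buggyReconnected, fixedReconnected });
// → { buggyReconnected: false, fixedReconnected: true }
```

One caveat with this sketch: it clears the marker on every timeout, so a stop requested by an operator that happens to time out would also be restarted. A real fix would probably need to distinguish operator-initiated stops from stops triggered by the health monitor.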

Workaround

Until this is fixed, a watchdog cron job can recover the socket automatically: it runs launchctl kickstart -k gui/<uid>/ai.openclaw.gateway whenever the logs show the failure pattern, that is, the last "channel stop exceeded" timestamp is later than the last "socket mode connected" timestamp.
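
The detection step of such a watchdog could look roughly like the sketch below. It rests on two assumptions that are not confirmed by the excerpts in this report: that every log line begins with an ISO 8601 timestamp, and that a successful connection is logged with the literal text "socket mode connected". The function names and the sample log lines (including their dates) are made up for illustration.

```javascript
// Assumption: each log line starts with an ISO 8601 timestamp followed by a
// space, e.g. "2026-05-04T10:15:00Z [slack] ...". Adjust to the real format.
function lastTimestampMatching(logText, needle) {
  let latest = null;
  for (const line of logText.split("\n")) {
    if (!line.includes(needle)) continue;
    const ms = Date.parse(line.split(" ", 1)[0]); // first token = timestamp
    if (!Number.isNaN(ms) && (latest === null || ms > latest)) latest = ms;
  }
  return latest; // epoch milliseconds, or null if no matching line was found
}

// True when the newest stop timeout is later than the newest successful
// connection (or when there has been a stop timeout and no connection at all).
function slackSocketLooksDead(errLogText, mainLogText) {
  const lastStopTimeout = lastTimestampMatching(errLogText, "channel stop exceeded");
  // Assumption: this is the wording of the connect message in gateway.log.
  const lastConnected = lastTimestampMatching(mainLogText, "socket mode connected");
  if (lastStopTimeout === null) return false;
  return lastConnected === null || lastStopTimeout > lastConnected;
}

// Made-up sample lines in the assumed format.
const errLog =
  "2026-05-04T10:15:00Z [slack] [default] channel stop exceeded 5000ms after abort; continuing shutdown";
const mainLog = "2026-05-04T10:00:00Z [slack] [default] socket mode connected";

console.log(slackSocketLooksDead(errLog, mainLog)); // → true
```

A cron wrapper would read gateway.err.log and gateway.log, pass their contents to slackSocketLooksDead(), and run the launchctl kickstart command above only when it returns true.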
