[Bug]: Slack socket permanently dead after event-loop starvation — manuallyStopped suppresses auto-reconnect

## Bug Description

When a stalled agent run starves the Node.js event loop long enough to drop the Slack WebSocket heartbeat, the gateway's `stopChannel()` cleanup path hits its 5000ms timeout and returns with `manuallyStopped` still set for the Slack channel account. The gateway process stays alive, but the Slack socket never reconnects: because `manuallyStopped.has(rKey)` is `true`, the auto-restart loop exits immediately without scheduling a reconnect.

## Environment

- **OpenClaw version:** 2026.5.3-1 (2eae30e)
- **Platform:** macOS (Darwin, Apple Silicon)
- **Channel:** Slack (socket mode, two accounts: `default` + `archivist`)

## Failure chain

```
stalled model call (~10 min, auditor:main, lmstudio-lab1)
  → event loop blocked (P99 delay 7692ms, utilization 0.922)
  → Slack SDK WS heartbeat fails → connection drops
  → health monitor aborts stalled session → calls stopChannel()
  → stopChannel(): manuallyStopped.add(rKey)          ← poison pill set
  → waitForChannelStopGracefully() times out at 5000ms (loop still starved)
  → timeout branch: setRuntime(running: true), return  ← no cleanup
  → event loop clears, gateway process continues alive
  → auto-restart loop: manuallyStopped.has(rKey) === true → returns, no reconnect
  → Slack dead indefinitely; only fix is launchctl kickstart -k
```

## Relevant log sequence

**gateway.err.log:**
```
[diagnostic] liveness warning: reasons=event_loop_delay,cpu interval=33s eventLoopDelayP99Ms=7692.4 eventLoopDelayMaxMs=7893.7 eventLoopUtilization=0.922 cpuCoreRatio=0.944
[slack] [default] channel stop exceeded 5000ms after abort; continuing shutdown
```

**gateway.log** (after the above — no further Slack events until manual kickstart):
```
[ws] ⇄ res ✓ health ...   ← gateway WS still alive
[ws] ⇄ res ✓ health ...
... (silence from Slack)
```

## Code location

`server-channels-DtnF0i8E.js` (compiled), `stopChannel()`, line ~512:

```js
// CHANNEL_STOP_ABORT_TIMEOUT_MS = 5e3
if (!await waitForChannelStopGracefully(task, CHANNEL_STOP_ABORT_TIMEOUT_MS)) {
    log.warn?.(`[${id}] channel stop exceeded ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms after abort; continuing shutdown`);
    setRuntime(channelId, id, {
        accountId: id,
        running: true,          // ← should not be true; connection is dead
        restartPending: false,
        lastError: `channel stop timed out after ${CHANNEL_STOP_ABORT_TIMEOUT_MS}ms`
    });
    return;  // ← exits without store.aborts.delete / store.tasks.delete
             //   and manuallyStopped remains set from line ~495
}
// happy path clears aborts, tasks, sets running:false
store.aborts.delete(id);
store.tasks.delete(id);
```

`manuallyStopped.add(rKey)` is called unconditionally at the top of `stopChannel()` (line ~495), before the timeout check. On the timeout path it is never cleared, so the auto-restart loop at line ~354 sees `manuallyStopped.has(rKey) === true` and returns without reconnecting.
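The interaction can be reduced to a few lines. The following is a simplified model, not the gateway's actual code: `simulateStopTimeout`, the `Map`-based runtime, and the `"slack:default"` key format are hypothetical stand-ins. Only `manuallyStopped`, `rKey`, and the `running` / `restartPending` fields come from the snippet above.

```js
// Simplified model of the timeout path described above.
// simulateStopTimeout and the runtime Map are hypothetical stand-ins;
// the real logic lives in the compiled stopChannel() and auto-restart loop.
function simulateStopTimeout(rKey) {
  const manuallyStopped = new Set();
  const runtime = new Map();

  // stopChannel() marks the account as manually stopped up front.
  manuallyStopped.add(rKey);

  // Timeout branch: runtime is reported as running, then the function returns
  // without clearing the flag.
  runtime.set(rKey, { running: true, restartPending: false });

  // Auto-restart gate: a set flag means "do not reconnect".
  const willReconnect = !manuallyStopped.has(rKey);

  return { reportedRunning: runtime.get(rKey).running, willReconnect };
}

// The key format here is an assumption made for illustration.
const state = simulateStopTimeout("slack:default");
console.log(state); // reportedRunning is true while willReconnect is false
```

The combination of `reportedRunning: true` and `willReconnect: false` is the stuck state: the runtime claims the channel is up, and the only mechanism that could bring it back is disabled.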

## Expected behavior

When `waitForChannelStopGracefully` times out, the channel should recover in one of two ways:

**Option A (minimal fix):** Remove `rKey` from `manuallyStopped` in the timeout branch, set `running: false`, and let the auto-restart loop reconnect.

**Option B (explicit reconnect):** After the timeout, schedule a reconnect attempt directly (bypassing `manuallyStopped`) with a short delay to let the event loop recover.

Either option prevents the "ghost alive" state where the gateway is running but the Slack socket is permanently dead.
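A rough sketch of both options, under stated assumptions: `recoverFromStopTimeout` and `scheduleReconnectAfterTimeout` are hypothetical helper names, the collaborators are passed in as arguments so the sketch does not depend on gateway internals, and the 2000ms default delay is an arbitrary placeholder. Whether `store.aborts` and `store.tasks` can be safely deleted while the stalled task is still in flight is not covered here.

```js
// Sketch only: these helpers are hypothetical and receive their collaborators
// as arguments; they are not functions from the gateway.

// Option A: clear the manual-stop flag and report the channel as not running,
// so the existing auto-restart loop is free to reconnect.
function recoverFromStopTimeout({ manuallyStopped, setRuntime, channelId, id, rKey, timeoutMs }) {
  manuallyStopped.delete(rKey);
  setRuntime(channelId, id, {
    accountId: id,
    running: false,
    restartPending: false,
    lastError: `channel stop timed out after ${timeoutMs}ms`,
  });
}

// Option B: schedule a reconnect directly after a short delay. The schedule
// parameter defaults to setTimeout and exists so the delay can be stubbed.
// The 2000ms default is an assumed value, not one taken from the gateway.
function scheduleReconnectAfterTimeout(reconnect, delayMs = 2000, schedule = setTimeout) {
  return schedule(() => reconnect(), delayMs);
}

// Usage with stand-in state (key format and setRuntime stub are assumptions).
const manuallyStopped = new Set(["slack:default"]);
const runtimeUpdates = [];
recoverFromStopTimeout({
  manuallyStopped,
  setRuntime: (channelId, id, patch) => runtimeUpdates.push({ channelId, id, patch }),
  channelId: "slack",
  id: "default",
  rKey: "slack:default",
  timeoutMs: 5000,
});
console.log(manuallyStopped.has("slack:default")); // false
```

Option A keeps the change local to the timeout branch and reuses the existing restart logic. Option B adds a second restart path, which may be easier to reason about if `manuallyStopped` must stay set for other reasons.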

## Workaround

Until this is fixed, a watchdog cron job can recover the socket automatically. It runs `launchctl kickstart -k gui/<uid>/ai.openclaw.gateway` whenever the logs show the failure pattern, that is, when the last `channel stop exceeded` timestamp is later than the last `socket mode connected` timestamp.
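The detection step of such a watchdog could look roughly like this. It is a sketch under a loud assumption: every log line is taken to start with an ISO 8601 timestamp followed by a space, which the excerpts in this report do not show, so the parsing will need adjusting to the real log format. `latestTimestamp` and `slackSocketLooksDead` are hypothetical names, and the sample lines are made up.

```js
// Hypothetical helpers for the watchdog. ASSUMPTION: every log line starts
// with an ISO 8601 timestamp followed by a space; adjust to the real format.
function latestTimestamp(lines, needle) {
  let latest = -Infinity;
  for (const line of lines) {
    if (!line.includes(needle)) continue;
    const ts = Date.parse(line.split(" ", 1)[0]);
    if (!Number.isNaN(ts) && ts > latest) latest = ts;
  }
  return latest; // -Infinity when the needle never appears
}

// True when the most recent stop timeout is newer than the most recent connect.
function slackSocketLooksDead(lines) {
  const lastTimeout = latestTimestamp(lines, "channel stop exceeded");
  const lastConnect = latestTimestamp(lines, "socket mode connected");
  return lastTimeout > lastConnect;
}

// Made-up sample lines in the assumed format.
const sample = [
  "2026-05-04T10:00:00Z [slack] [default] socket mode connected",
  "2026-05-04T10:20:00Z [slack] [default] channel stop exceeded 5000ms after abort; continuing shutdown",
];
console.log(slackSocketLooksDead(sample)); // true
```

When the check returns `true`, the cron job would run the `launchctl kickstart -k` command above. If the two messages land in different log files, the lines from both files can be concatenated before the check, since only the timestamps are compared.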

## Related

- Issue #77634 (Discord fetch timeout blocking event loop) — same root category (event-loop starvation), different failure surface.
- Issue #77626 (Liveness-based turn timeouts) — would mitigate the stalled model call trigger.
