
[Bug]: channels.slack.start-account phase blocks event loop 5+ minutes while a model_call is in flight (Windows, 2026.5.4) #78435

@litlmike


Summary

On Windows, the Slack channel start-account startup phase can block the gateway event loop for 5+ minutes when a model call is in flight at the same time. While blocked, no other channel can dispatch, no agent run can progress, and the gateway appears externally "stopped" even though port 18789 is still listening and /health still responds.

This is not the same as #77651 (manuallyStopped poison after stop-timeout), nor the now-fixed startup runtime-deps stalls from #75747. The trigger here is a mid-life Slack account (re)start that overlaps with a long model call, not boot, and the phase reported by the liveness diagnostic is channels.slack.start-account, not gateway.startup or plugins.runtime-deps.

Environment

  • OpenClaw version: 2026.5.4
  • OS: Windows 10 Pro (build 19041 / 1909)
  • Node: v20.x (bundled in pnpm global path)
  • Channel: Slack (Socket Mode)
  • Install: Scheduled Task (OpenClaw Gateway) running as InteractiveToken, gateway bound to 127.0.0.1:18789
  • Concurrent activity at time of stall: one active embedded agent run, model call in flight (processing/model_call, queue depth 1)

Liveness diagnostic at the failure

Most recent occurrence (today's gateway log):

[diagnostic] liveness warning: reasons=event_loop_delay
  phase=channels.slack.start-account 303223ms
  work=[active=agent:main:main(processing/model_call,q=1,age=105s last=model_call:started)]
  eventLoopDelayMaxMs=1795.2
  eventLoopUtilization=...

Earlier occurrence on the same host (May 1 stability log):

[diagnostic] liveness warning: reasons=event_loop_delay
  phase=channels.slack.start-account 136969ms
  work=[active=agent:main:main(processing/model_call,...)]
  eventLoopDelayMaxMs=...

Two data points so far: 303 seconds (5m 3s) and 137 seconds (2m 17s). Both show phase=channels.slack.start-account, and both happen while a model call is in flight.

Symptom from the operator's side

  • "OpenClaw stops working on occasion" — gateway looks alive externally (/health 200, port 18789 open) but no Slack messages dispatch and no replies come back.
  • After several minutes, traffic resumes on its own (no manual restart needed). The gateway never crashed, it just refused to do anything for the duration of the blocked phase.
  • Pattern is intermittent; correlates with periods when the agent is doing a long-running model call.

Adjacent observations (may be related)

  • OpenClaw-KillOrphanMCP scheduled task on the same host has been firing meaningful kills: 21 orphan child processes killed on one day, 9 the next. So MCP child leakage is non-trivial on this host.
  • Gateway working set hits ~1.4–1.6 GB during these periods (heapTotal ~705 MB, heapUsed ~619 MB). Multiple event_loop_delay warnings precede an unhandled rejection in some sessions.
  • 30+ node processes regularly present — combination of gateway, embedded agent runs, and MCP children. Cleanup script is on a 6-hour cadence.

Hypothesis

The channels.slack.start-account phase is doing synchronous work on the main thread while competing for the loop with an in-flight model call (the agent:main:main work item in the diagnostic). When the model call holds the loop, the Slack startup phase can't finish; and the startup phase itself appears to hold something the rest of the gateway needs, or at least consumes enough event-loop time to starve everything else.

Either the Slack start-account phase needs to be moved off the main thread / made async-safe, or it needs a hard timeout that doesn't allow it to block other channel dispatch indefinitely.

Repro

I have not been able to reliably trigger this on demand, but the conditions appear to be:

  1. Gateway running for at least several hours, with concurrent Slack account activity.
  2. An agent run with at least one in-flight model call (long thinking turn).
  3. Slack channel re-init or start-account triggered during step 2.

If a maintainer has a way to deliberately fire start-account while a model call is in flight, that should reproduce the block.

What's clearly not the cause

  • Not #77651: that is manuallyStopped poison after a stop-timeout; nothing here was stopped, and the failing phase is channels.slack.start-account, not a stop path.
  • Not #75747: those were boot-time plugins.runtime-deps stalls and are now fixed; this stall happens mid-life, hours after startup.

Asks

  • Confirm whether channels.slack.start-account is documented to be allowed to block the event loop for minutes at a time, or whether this is unintended.
  • If unintended: consider one of (a) a hard upper bound on the phase duration, (b) yielding the loop more aggressively during the phase, (c) running the phase off the main thread.
  • A diagnostic that names which sub-step inside start-account is consuming the time would make this much easier to root-cause from the user side.
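On that last ask: a per-sub-step wrapper would make the phase self-reporting. This is a hypothetical sketch (timedStep, the step names, and the 1 s threshold are all my inventions, not existing OpenClaw code):

```javascript
// Sketch: time each sub-step of a phase so the liveness log can name the
// slow one. Hypothetical helper, not OpenClaw code.
async function timedStep(phase, step, fn) {
  const t0 = process.hrtime.bigint();
  try {
    return await fn();
  } finally {
    const ms = Number(process.hrtime.bigint() - t0) / 1e6;
    if (ms > 1000) {
      console.warn(`[diagnostic] ${phase}.${step} took ${ms.toFixed(0)}ms`);
    }
  }
}
```

Usage would look like `await timedStep('channels.slack.start-account', 'open-socket', () => openSocket())` (step name invented). One limit: the warning only prints after the step finishes, so a step that never yields still can't report until the loop is free.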

Workarounds I'm using locally (not asks for upstream fixes)

  • Scheduled-task RestartOnFailure and removed the 72h ExecutionTimeLimit so the gateway recovers from any clean crash.
  • Tightened the orphan-MCP cleanup cadence from PT6H → PT1H to keep child-process count from compounding between sweeps.
  • Added a 5-minute health probe that captures gateway port + /health + node/chrome process counts to JSONL so the next occurrence will have full context (process counts and HTTP latency leading up to the block).

If the next occurrence on the new probe captures more useful data, I'll attach it to this issue.
