Summary
On Windows, the Slack channel start-account startup phase can block the gateway event loop for 5+ minutes when a model call is in flight at the same time. While blocked, no other channel can dispatch, no agent run can progress, and the gateway appears externally "stopped" even though port 18789 is still listening and /health still responds.
This is not the same as #77651 (manuallyStopped poison after stop-timeout) or the now-fixed startup runtime-deps stalls from #75747 — the trigger here is a mid-life Slack account (re)start that overlaps with a long model call, not boot, and the phase reported by the liveness diagnostic is channels.slack.start-account, not gateway.startup or plugins.runtime-deps.
Environment
- OpenClaw version: 2026.5.4
- OS: Windows 10 Pro (build 19041 / 1909)
- Node: v20.x (bundled in pnpm global path)
- Channel: Slack (Socket Mode)
- Install: Scheduled Task (OpenClaw Gateway) running as InteractiveToken, gateway bound to 127.0.0.1:18789
- Concurrent activity at time of stall: one active embedded agent run, model call in flight (processing/model_call, queue depth 1)
Liveness diagnostic at the failure
Most recent occurrence (today's gateway log):
[diagnostic] liveness warning: reasons=event_loop_delay
phase=channels.slack.start-account 303223ms
work=[active=agent:main:main(processing/model_call,q=1,age=105s last=model_call:started)]
eventLoopDelayMaxMs=1795.2
eventLoopUtilization=...
Earlier occurrence on the same host (May 1 stability log):
[diagnostic] liveness warning: reasons=event_loop_delay
phase=channels.slack.start-account 136969ms
work=[active=agent:main:main(processing/model_call,...)]
eventLoopDelayMaxMs=...
Two data points so far: 303 seconds (5m 3s) and 136 seconds (2m 16s). Both show phase=channels.slack.start-account and both happen while a model call is in flight.
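For anyone who wants to pull more data points out of older logs, a small hedged helper is sketched below. It assumes nothing beyond the "phase=<name> <N>ms" token visible in the two excerpts above; the script and log file names are made up.

```ts
// scan-liveness.ts: hedged sketch that extracts blocked phases from a gateway log.
// Assumes only the "phase=<name> <N>ms" token seen in the excerpts above.
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

const PHASE_RE = /phase=(\S+)\s+(\d+)ms/;

async function scanLog(path: string): Promise<void> {
  const lines = createInterface({ input: createReadStream(path), crlfDelay: Infinity });
  for await (const line of lines) {
    const match = PHASE_RE.exec(line);
    if (!match) continue;
    const [, phase, ms] = match;
    // e.g. "channels.slack.start-account blocked for 303s"
    console.log(`${phase} blocked for ${Math.round(Number(ms) / 1000)}s`);
  }
}

scanLog(process.argv[2] ?? "gateway.log").catch((err) => console.error(err));
```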
Symptom from the operator's side
- "OpenClaw stops working on occasion" — gateway looks alive externally (
/health 200, port 18789 open) but no Slack messages dispatch and no replies come back.
- After several minutes, traffic resumes on its own (no manual restart needed). The gateway never crashed; it just refused to do anything for the duration of the blocked phase.
- Pattern is intermittent; correlates with periods when the agent is doing a long-running model call.
Adjacent observations (may be related)
- OpenClaw-KillOrphanMCP scheduled task on the same host has been firing meaningful kills: 21 orphan child processes killed on one day, 9 the next. So MCP child leakage is non-trivial on this host.
- Gateway working set hits ~1.4–1.6 GB during these periods (heapTotal ~705 MB, heapUsed ~619 MB). Multiple event_loop_delay warnings precede an unhandled rejection in some sessions.
- 30+ node processes regularly present — combination of gateway, embedded agent runs, and MCP children. Cleanup script is on a 6-hour cadence.
Hypothesis
The channels.slack.start-account phase appears to be doing synchronous work on the main thread while contending for that thread with the in-flight model call (the agent:main:main work item in the diagnostic). When the model call holds the loop, the Slack startup phase can't finish, and the startup phase itself appears to hold something the rest of the gateway needs (or at least to consume enough event-loop time to starve everything else).
Either the Slack start-account phase needs to be moved off the main thread / made async-safe, or it needs a hard timeout that doesn't allow it to block other channel dispatch indefinitely.
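For illustration only (none of this is OpenClaw code): a minimal Node sketch of the mechanism this hypothesis describes. Any long synchronous span on the main thread delays every other callback, and perf_hooks' monitorEventLoopDelay records it in roughly the way the eventLoopDelayMaxMs figure above presumably does.

```ts
// starve-demo.ts: minimal sketch (not OpenClaw code) of main-thread starvation.
// A synchronous "phase" holds the loop, so the 100 ms "dispatch" timer stops firing
// and the event-loop-delay histogram records the stall.
import { monitorEventLoopDelay } from "node:perf_hooks";

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

// Stand-in for other channel dispatch: should tick every 100 ms.
const dispatchTimer = setInterval(() => {
  console.log(`dispatch tick, max loop delay so far: ${(histogram.max / 1e6).toFixed(1)} ms`);
}, 100);

// Stand-in for a synchronous start-account-style phase: holds the loop for ~2 s.
setTimeout(() => {
  const until = Date.now() + 2000;
  while (Date.now() < until) {
    // busy-wait: nothing else on this thread can run, including the dispatch timer above
  }
}, 300);

// Wind down after 3 s so the script exits cleanly.
setTimeout(() => {
  clearInterval(dispatchTimer);
  histogram.disable();
  console.log(`final max event-loop delay: ${(histogram.max / 1e6).toFixed(1)} ms`);
}, 3000);
```

Run under Node 20 and the dispatch ticks pause for roughly two seconds; the diagnostic above is the same effect stretched to 303 seconds.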
Repro
I have not been able to reliably trigger this on demand, but the conditions appear to be:
1. Gateway running for at least several hours, with concurrent Slack account activity.
2. An agent run with at least one in-flight model call (long thinking turn).
3. Slack channel re-init or start-account triggered during step 2.
If a maintainer has a way to deliberately fire start-account while a model call is in flight, that should reproduce the block.
What's clearly not the cause
- Not the boot-time stalls fixed in #75747: the blocked phase here is channels.slack.start-account, fired well after gateway ready.
- Not the #77651 manuallyStopped poison: traffic resumes on its own once the phase clears, with no kickstart -k needed.
Asks
- Confirm whether channels.slack.start-account is documented to be allowed to block the event loop for minutes at a time, or whether this is unintended.
- If unintended: consider one of (a) a hard upper bound on the phase duration, (b) yielding the loop more aggressively during the phase, (c) running the phase off the main thread. A rough sketch of (a) and (b) follows this list.
- A diagnostic that names which sub-step inside start-account is consuming the time would make this much easier to root-cause from the user side.
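Here is a rough, hedged sketch of what (a) plus (b) could look like. It is not OpenClaw's actual API; runPhaseWithDeadline and the idea of breaking start-account into small synchronous steps are assumptions for illustration.

```ts
// Hedged sketch, not OpenClaw code: combine a hard phase deadline (a) with
// aggressive loop yielding (b). The phase is expressed as small synchronous
// steps; the loop is yielded between steps and the phase is deferred once
// the deadline passes instead of blocking dispatch indefinitely.
import { setImmediate as yieldLoop } from "node:timers/promises";

async function runPhaseWithDeadline(
  name: string,
  steps: Iterable<() => void>, // hypothetical: start-account split into bounded chunks
  deadlineMs: number,
): Promise<boolean> {
  const startedAt = Date.now();
  for (const step of steps) {
    if (Date.now() - startedAt > deadlineMs) {
      console.warn(`[${name}] exceeded ${deadlineMs} ms, deferring remaining steps`);
      return false; // caller can reschedule the remainder rather than stall the gateway
    }
    step(); // one bounded chunk of synchronous work
    await yieldLoop(); // (b): give other channels and the agent run a turn on the loop
  }
  return true;
}
```

Option (c) would move the work to a worker thread entirely, but (a) plus (b) alone would already cap how long other channels can be starved.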
Workarounds I'm using locally (not asks for upstream fixes)
- Enabled scheduled-task RestartOnFailure and removed the 72h ExecutionTimeLimit so the gateway recovers from any clean crash.
- Tightened the orphan-MCP cleanup cadence from PT6H → PT1H to keep child-process count from compounding between sweeps.
- Added a 5-minute health probe that captures gateway port + /health + node/chrome process counts to JSONL, so the next occurrence will have full context (process counts and HTTP latency leading up to the block). A minimal sketch of the probe is below.
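For reference, this is roughly what the probe does. It is a hedged sketch, assuming the defaults described above (gateway on 127.0.0.1:18789 with /health) and Windows tasklist for process counts; file names are arbitrary.

```ts
// gateway-probe.ts: hedged sketch of the 5-minute health probe (run from a scheduled task).
// Assumes the gateway defaults described above; output goes to a JSONL file.
import { appendFile } from "node:fs/promises";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Count Windows processes by image name via tasklist (CSV output, no header).
async function countProcesses(image: string): Promise<number> {
  const { stdout } = await run("tasklist", ["/FI", `IMAGENAME eq ${image}`, "/FO", "CSV", "/NH"]);
  const text = stdout.trim();
  return text.startsWith('"') ? text.split("\n").length : 0; // prints "INFO: No tasks..." when none match
}

async function probeOnce(): Promise<void> {
  const startedAt = Date.now();
  let healthStatus = -1; // -1 = request failed / port closed
  try {
    const res = await fetch("http://127.0.0.1:18789/health");
    healthStatus = res.status;
  } catch {
    /* leave healthStatus at -1 */
  }
  const record = {
    ts: new Date().toISOString(),
    healthStatus,
    healthLatencyMs: Date.now() - startedAt,
    nodeProcesses: await countProcesses("node.exe"),
    chromeProcesses: await countProcesses("chrome.exe"),
  };
  await appendFile("gateway-probe.jsonl", JSON.stringify(record) + "\n");
}

// One sample per invocation; the 5-minute cadence comes from the scheduled task trigger.
probeOnce().catch((err) => console.error("probe failed:", err));
```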
If the next occurrence on the new probe captures more useful data, I'll attach it to this issue.