
[Bug]: channels.slack.start-account phase blocks event loop 5+ minutes while a model_call is in flight (Windows, 2026.5.4) #78435

@litlmike


Summary

On Windows, the Slack channel start-account startup phase can block the gateway event loop for 5+ minutes when a model call is in flight at the same time. While blocked, no other channel can dispatch, no agent run can progress, and the gateway appears externally "stopped" even though port 18789 is still listening and /health still responds.

This is not the same as #77651 (manuallyStopped poison after stop-timeout), nor the now-fixed startup runtime-deps stalls from #75747. The trigger here is a mid-life Slack account (re)start that overlaps with a long model call, not boot, and the phase reported by the liveness diagnostic is channels.slack.start-account, not gateway.startup or plugins.runtime-deps.

Environment

  • OpenClaw version: 2026.5.4
  • OS: Windows 10 Pro (build 19041 / 1909)
  • Node: v20.x (bundled in pnpm global path)
  • Channel: Slack (Socket Mode)
  • Install: Scheduled Task (OpenClaw Gateway) running as InteractiveToken, gateway bound to 127.0.0.1:18789
  • Concurrent activity at time of stall: one active embedded agent run, model call in flight (processing/model_call, queue depth 1)

Liveness diagnostic at the failure

Most recent occurrence (today's gateway log):

[diagnostic] liveness warning: reasons=event_loop_delay
  phase=channels.slack.start-account 303223ms
  work=[active=agent:main:main(processing/model_call,q=1,age=105s last=model_call:started)]
  eventLoopDelayMaxMs=1795.2
  eventLoopUtilization=...

Earlier occurrence on the same host (May 1 stability log):

[diagnostic] liveness warning: reasons=event_loop_delay
  phase=channels.slack.start-account 136969ms
  work=[active=agent:main:main(processing/model_call,...)]
  eventLoopDelayMaxMs=...

Two data points so far: 303 seconds (5m 3s) and 137 seconds (2m 17s). Both show phase=channels.slack.start-account, and both happen while a model call is in flight.

Symptom from the operator's side

  • "OpenClaw stops working on occasion" — gateway looks alive externally (/health 200, port 18789 open) but no Slack messages dispatch and no replies come back.
  • After several minutes, traffic resumes on its own (no manual restart needed). The gateway never crashed, it just refused to do anything for the duration of the blocked phase.
  • Pattern is intermittent; correlates with periods when the agent is doing a long-running model call.

Adjacent observations (may be related)

  • OpenClaw-KillOrphanMCP scheduled task on the same host has been firing meaningful kills: 21 orphan child processes killed on one day, 9 the next. So MCP child leakage is non-trivial on this host.
  • Gateway working set hits ~1.4–1.6 GB during these periods (heapTotal ~705 MB, heapUsed ~619 MB). Multiple event_loop_delay warnings precede an unhandled rejection in some sessions.
  • 30+ node processes regularly present — combination of gateway, embedded agent runs, and MCP children. Cleanup script is on a 6-hour cadence.

Hypothesis

The channels.slack.start-account phase is doing synchronous work on the main thread while competing for the loop with an in-flight model call (the agent:main:main work item in the diagnostic). When the model call holds the loop, the Slack startup phase can't finish; and the startup phase itself appears to hold something the rest of the gateway needs, or at least consumes enough event-loop time to starve everything else.

Either the Slack start-account phase needs to be moved off the main thread / made async-safe, or it needs a hard timeout that doesn't allow it to block other channel dispatch indefinitely.

Repro

I have not been able to reliably trigger this on demand, but the conditions appear to be:

  1. Gateway running for at least several hours, with concurrent Slack account activity.
  2. An agent run with at least one in-flight model call (long thinking turn).
  3. Slack channel re-init or start-account triggered during step 2.

If a maintainer has a way to deliberately fire start-account while a model call is in flight, that should reproduce the block.

What's clearly not the cause

  • Not #77651: that is manuallyStopped poison after a stop-timeout; nothing here was stopped, and the failing phase is channels.slack.start-account, not a stop path.
  • Not #75747: those were boot-time plugins.runtime-deps stalls and are now fixed; this stall happens mid-life, hours after startup.

Asks

  • Confirm whether channels.slack.start-account is documented to be allowed to block the event loop for minutes at a time, or whether this is unintended.
  • If unintended: consider one of (a) a hard upper bound on the phase duration, (b) yielding the loop more aggressively during the phase, (c) running the phase off the main thread.
  • A diagnostic that names which sub-step inside start-account is consuming the time would make this much easier to root-cause from the user side.
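On that last ask: a per-sub-step wrapper would make the phase self-reporting. This is a hypothetical sketch (timedStep, the step names, and the 1 s threshold are all my inventions, not existing OpenClaw code):

```javascript
// Sketch: time each sub-step of a phase so the liveness log can name the
// slow one. Hypothetical helper, not OpenClaw code.
async function timedStep(phase, step, fn) {
  const t0 = process.hrtime.bigint();
  try {
    return await fn();
  } finally {
    const ms = Number(process.hrtime.bigint() - t0) / 1e6;
    if (ms > 1000) {
      console.warn(`[diagnostic] ${phase}.${step} took ${ms.toFixed(0)}ms`);
    }
  }
}
```

Usage would look like `await timedStep('channels.slack.start-account', 'open-socket', () => openSocket())` (step name invented). One limit: the warning only prints after the step finishes, so a step that never yields still can't report until the loop is free.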

Workarounds I'm using locally (not asks for upstream fixes)

  • Scheduled-task RestartOnFailure and removed the 72h ExecutionTimeLimit so the gateway recovers from any clean crash.
  • Tightened the orphan-MCP cleanup cadence from PT6H → PT1H to keep child-process count from compounding between sweeps.
  • Added a 5-minute health probe that captures gateway port + /health + node/chrome process counts to JSONL so the next occurrence will have full context (process counts and HTTP latency leading up to the block).

If the next occurrence on the new probe captures more useful data, I'll attach it to this issue.
