Summary
Slack Socket Mode health currently treats old inbound app-event activity (lastEventAt) as evidence that the websocket is stale. On quiet Slack workspaces this creates false-positive stale-socket restarts even when the Slack SDK is not reporting websocket keepalive failures.
This should be treated as a stability bug/refactor target, not only a configuration tuning issue.
Evidence
An investigation of a live OpenClaw host on 2026.4.15 found two separate behaviors:
- Real Slack/socket transport instability on some days:
  - Slack SDK keepalive warnings: missing pongs/pings
  - HTTP 408s from Socket Mode connection attempts
  - no active connection / client is not ready
  - Slack Web API HTTP failures
- OpenClaw-created stale-socket churn on quiet days:
  - Several days had no matching Slack SDK keepalive failures in the inspected error logs.
  - The gateway still restarted Slack for stale-socket on a regular cadence.
  - In the inspected host, gateway.channelStaleEventThresholdMinutes was 120, and restart cadence matched that threshold plus the health-monitor interval.
That indicates at least some stale restarts are false positives caused by the health policy, not real websocket failures.
Current Code Path
The relevant source path appears to be:
- extensions/slack/src/monitor/provider.ts
  - publishSlackConnectedStatus() seeds lastEventAt when Slack connects.
  - trackEvent() updates lastEventAt / lastInboundAt only from accepted inbound Slack app events.
  - Socket Mode uses new App({ socketMode: true, ... }), so Bolt constructs the SocketModeReceiver implicitly.
- src/gateway/channel-status-patches.ts
  - createConnectedChannelStatusPatch() sets connected, lastConnectedAt, and lastEventAt to the connect timestamp.
- src/gateway/channel-health-policy.ts
  - evaluateChannelHealth() returns stale-socket when connected === true and now - lastEventAt > staleEventThresholdMs.
- src/gateway/channel-health-monitor.ts
  - The monitor restarts channels marked unhealthy, turning this stale-event policy into a stop/start loop.
- extensions/slack/src/monitor/reconnect-policy.ts
  - waitForSlackSocketDisconnect() waits on disconnected, unable_to_socket_mode_start, and error.
  - It does not model lower-level SDK close / reconnecting lifecycle.
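The interaction between the connect patch, the stale-event rule, and the monitor loop can be sketched as below. This is a simplified model, not the real gateway code: the names ChannelStatusSketch, connectedPatchSketch, isStaleSocketSketch, and simulateQuietRestarts are hypothetical, and the 5-minute monitor interval is an assumed value chosen only for illustration (the 120-minute threshold is the one from the inspected host).

```typescript
// Hypothetical, simplified status shape; the real status object has more fields.
interface ChannelStatusSketch {
  connected: boolean;
  lastConnectedAt: number; // epoch ms
  lastEventAt: number; // epoch ms
}

// Mirrors the described connect patch: lastEventAt is seeded with the connect time.
function connectedPatchSketch(now: number): ChannelStatusSketch {
  return { connected: true, lastConnectedAt: now, lastEventAt: now };
}

// Mirrors the described rule; this is not the real evaluateChannelHealth().
function isStaleSocketSketch(
  status: ChannelStatusSketch,
  now: number,
  staleEventThresholdMs: number,
): boolean {
  return status.connected && now - status.lastEventAt > staleEventThresholdMs;
}

// Simulates a quiet workspace: no inbound app events, monitor ticks at a fixed interval.
// Returns the minutes (since first connect) at which the monitor would restart Slack.
function simulateQuietRestarts(
  totalMinutes: number,
  thresholdMinutes: number,
  monitorIntervalMinutes: number,
): number[] {
  const MINUTE_MS = 60000;
  const restartMinutes: number[] = [];
  let status = connectedPatchSketch(0);
  for (
    let minute = monitorIntervalMinutes;
    minute <= totalMinutes;
    minute += monitorIntervalMinutes
  ) {
    const now = minute * MINUTE_MS;
    if (isStaleSocketSketch(status, now, thresholdMinutes * MINUTE_MS)) {
      restartMinutes.push(minute);
      // The stop/start reconnects and reseeds lastEventAt, so the cycle repeats.
      status = connectedPatchSketch(now);
    }
  }
  return restartMinutes;
}

// Eight quiet hours, 120-minute threshold, assumed 5-minute monitor interval.
const restarts = simulateQuietRestarts(8 * 60, 120, 5);
console.log(restarts);
```

Under these assumptions the simulated restarts land at minutes 125, 250, and 375: the threshold plus one monitor interval, with no socket failure anywhere in the model.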
Why This Is Flawed
The stale-socket policy is intended to detect half-dead websockets, but it uses normal inbound Slack app-event recency as the proxy. These are different signals:
- A quiet Slack workspace can have no app events for hours while the Socket Mode websocket is healthy.
- A degraded websocket should be detected by socket lifecycle/keepalive state, not by user/message traffic volume.
- Re-enabling agent heartbeats or posting periodic Slack messages can mask the symptom, but does not fix the underlying transport health model.
There is also a reconnect ownership mismatch:
- Slack SDK SocketModeClient defaults autoReconnectEnabled to true.
- On websocket close, the SDK can emit reconnecting and start a new socket internally.
- OpenClaw's outer reconnect loop waits for higher-level disconnected / error / start-failure events, so it can miss the SDK's normal reconnect lifecycle.
- Bolt's SocketModeReceiver does accept autoReconnectEnabled, clientPingTimeout, serverPingTimeout, and pingPongLoggingEnabled, but OpenClaw's implicit new App({ socketMode: true }) path does not pass those options.
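A minimal sketch of the ownership mismatch, using a stand-in emitter rather than the real SDK client. FakeSocketClient and observeOuterLoop are hypothetical names, and the close / reconnecting / connected sequence is an assumed example of an SDK-internal reconnect, not a captured trace.

```typescript
type Listener = () => void;

// Minimal stand-in for the SDK client; this is not the real SocketModeClient.
class FakeSocketClient {
  private listeners: Record<string, Listener[]> = {};

  on(event: string, listener: Listener): void {
    const list = this.listeners[event] ?? [];
    list.push(listener);
    this.listeners[event] = list;
  }

  emit(event: string): void {
    for (const listener of this.listeners[event] ?? []) {
      listener();
    }
  }
}

// The events the outer reconnect loop is described as waiting on.
const OUTER_LOOP_EVENTS = ["disconnected", "unable_to_socket_mode_start", "error"];

// Records which events the outer loop would actually observe.
function observeOuterLoop(client: FakeSocketClient): string[] {
  const seen: string[] = [];
  for (const event of OUTER_LOOP_EVENTS) {
    client.on(event, () => {
      seen.push(event);
    });
  }
  return seen;
}

const client = new FakeSocketClient();
const seenByOuterLoop = observeOuterLoop(client);

// Assumed SDK-internal reconnect: the socket drops and is replaced without
// any of the higher-level events firing.
client.emit("close");
client.emit("reconnecting");
client.emit("connected");

console.log(seenByOuterLoop.length);
```

In this model the outer loop observes nothing during the reconnect, so any status it publishes stays at its last known value while the underlying socket has been replaced.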
Proposed Direction
Prefer one of these two designs:
Option A: OpenClaw owns reconnects
- Construct Bolt SocketModeReceiver explicitly for Slack socket mode.
- Pass autoReconnectEnabled: false.
- Let OpenClaw's reconnect loop be the single lifecycle owner.
- Publish explicit status transitions for connecting, connected, disconnecting, disconnected, reconnecting, and failed.
- Add tests for SDK close / reconnect behavior, not only disconnected.
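One way to make "single lifecycle owner" concrete is a small state holder that publishes every transition and rejects ones that should not happen. The sketch below covers only that piece; it does not show the Bolt receiver construction. SlackLifecycleOwner and the ALLOWED_TRANSITIONS table are hypothetical, and which edges are legal is an assumption to be settled in the real design.

```typescript
type SocketLifecycleState =
  | "connecting"
  | "connected"
  | "disconnecting"
  | "disconnected"
  | "reconnecting"
  | "failed";

// Hypothetical transition table; the allowed edges are an illustrative assumption.
const ALLOWED_TRANSITIONS: Record<SocketLifecycleState, SocketLifecycleState[]> = {
  connecting: ["connected", "disconnected", "failed"],
  connected: ["disconnecting", "disconnected"],
  disconnecting: ["disconnected"],
  disconnected: ["connecting", "reconnecting"],
  reconnecting: ["connecting", "failed"],
  failed: ["reconnecting"],
};

interface StatusTransition {
  from: SocketLifecycleState;
  to: SocketLifecycleState;
  at: number; // epoch ms
}

class SlackLifecycleOwner {
  readonly published: StatusTransition[] = [];
  private state: SocketLifecycleState = "disconnected";

  get current(): SocketLifecycleState {
    return this.state;
  }

  // Returns false and publishes nothing when the table does not allow the move.
  moveTo(next: SocketLifecycleState, at: number): boolean {
    if (ALLOWED_TRANSITIONS[this.state].indexOf(next) === -1) {
      return false;
    }
    this.published.push({ from: this.state, to: next, at });
    this.state = next;
    return true;
  }
}

const owner = new SlackLifecycleOwner();
owner.moveTo("connecting", 1);
owner.moveTo("connected", 2);
// With SDK auto-reconnect disabled, a socket drop surfaces here and the
// outer loop drives the reconnect itself.
owner.moveTo("disconnected", 10);
owner.moveTo("reconnecting", 11);
owner.moveTo("connecting", 12);
owner.moveTo("connected", 13);
console.log(owner.current, owner.published.length);
```

Because every state change goes through moveTo(), tests for close / reconnect behavior can assert on the published sequence rather than on timing.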
Option B: Slack SDK owns reconnects
- Keep SDK auto-reconnect enabled.
- Subscribe to/report SDK lifecycle events such as connecting, connected, authenticated, reconnecting, close, disconnected, and error.
- Update gateway health to use those lifecycle signals instead of restarting solely because lastEventAt is old.
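If the SDK keeps ownership, OpenClaw mostly needs to fold the lifecycle events into a reportable status. A minimal sketch of such a reducer follows. The event names are the ones listed above; SocketReport and applySdkEvent are hypothetical, and the mapping choices (for example treating authenticated as "not yet usable", or leaving the state unchanged on close) are assumptions.

```typescript
type SdkLifecycleEvent =
  | "connecting"
  | "connected"
  | "authenticated"
  | "reconnecting"
  | "close"
  | "disconnected"
  | "error";

type ReportedSocketState = "connecting" | "connected" | "reconnecting" | "disconnected";

// Hypothetical status slice, reusing field names proposed in this issue.
interface SocketReport {
  socketState: ReportedSocketState;
  lastSocketConnectedAt: number | null;
  lastSocketClosedAt: number | null;
  lastSocketErrorAt: number | null;
  lastReconnectAttemptAt: number | null;
}

function applySdkEvent(prev: SocketReport, event: SdkLifecycleEvent, at: number): SocketReport {
  switch (event) {
    case "connecting":
    case "authenticated":
      // Assumption: "authenticated" precedes "connected", so the socket is not usable yet.
      return { ...prev, socketState: "connecting" };
    case "connected":
      return { ...prev, socketState: "connected", lastSocketConnectedAt: at };
    case "close":
      // Record the close but keep the state; the SDK may reconnect on its own.
      return { ...prev, lastSocketClosedAt: at };
    case "reconnecting":
      return { ...prev, socketState: "reconnecting", lastReconnectAttemptAt: at };
    case "disconnected":
      return { ...prev, socketState: "disconnected" };
    case "error":
      return { ...prev, lastSocketErrorAt: at };
  }
}

const initialReport: SocketReport = {
  socketState: "disconnected",
  lastSocketConnectedAt: null,
  lastSocketClosedAt: null,
  lastSocketErrorAt: null,
  lastReconnectAttemptAt: null,
};

// Assumed event sequence: initial connect, then an SDK-driven reconnect.
const observed: Array<[SdkLifecycleEvent, number]> = [
  ["connecting", 1],
  ["authenticated", 2],
  ["connected", 3],
  ["close", 10],
  ["reconnecting", 11],
  ["connected", 15],
];

const finalReport = observed.reduce(
  (report, [event, at]) => applySdkEvent(report, event, at),
  initialReport,
);
console.log(finalReport);
```

Keeping the reducer pure makes it straightforward to test against recorded event sequences without a live socket.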
Health Model Refactor
Regardless of reconnect ownership:
- Keep lastEventAt as an app-event freshness diagnostic only.
- Add distinct socket/transport status fields, for example:
  - socketState
  - lastSocketConnectedAt
  - lastSocketClosedAt
  - lastSocketErrorAt
  - lastReconnectAttemptAt
  - lastReconnectAt
  - lastKeepaliveAt or equivalent SDK-observable liveness
- Do not automatically classify Slack as stale-socket based only on quiet inbound app traffic.
- Classify outbound Slack failures against current transport state so operators can distinguish Slack Web API failure, socket reconnecting, app-event quietness, and agent/orchestration stalls.
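A rough sketch of a health decision built on transport state rather than app-event recency. SlackHealthInput, HealthDecision, and evaluateSlackHealthSketch are hypothetical; the reason labels, the two thresholds, and the choice of which states trigger a restart are assumptions for illustration, not a proposed final policy.

```typescript
// Hypothetical input, using a subset of the fields proposed above.
interface SlackHealthInput {
  socketState: "connected" | "reconnecting" | "disconnected" | "failed";
  lastKeepaliveAt: number | null; // or equivalent SDK-observable liveness, epoch ms
  lastEventAt: number | null; // app-event freshness, diagnostic only
}

interface HealthDecision {
  restart: boolean;
  reason: "ok" | "event-quiet" | "reconnecting" | "disconnected" | "stale-transport";
}

function evaluateSlackHealthSketch(
  input: SlackHealthInput,
  now: number,
  keepaliveTimeoutMs: number,
  quietThresholdMs: number,
): HealthDecision {
  if (input.socketState === "disconnected" || input.socketState === "failed") {
    return { restart: true, reason: "disconnected" };
  }
  if (input.socketState === "reconnecting") {
    // A reconnect is already in flight; a restart here would compete with it.
    return { restart: false, reason: "reconnecting" };
  }
  if (input.lastKeepaliveAt !== null && now - input.lastKeepaliveAt > keepaliveTimeoutMs) {
    // Transport liveness is what marks the socket stale, not message traffic.
    return { restart: true, reason: "stale-transport" };
  }
  if (input.lastEventAt !== null && now - input.lastEventAt > quietThresholdMs) {
    // Quiet workspace: report it, but do not restart.
    return { restart: false, reason: "event-quiet" };
  }
  return { restart: false, reason: "ok" };
}

const HOUR_MS = 60 * 60 * 1000;
const nowMs = 10 * HOUR_MS;

// Connected, keepalive seen 10 seconds ago, no app events for 5 hours.
const quietDecision = evaluateSlackHealthSketch(
  { socketState: "connected", lastKeepaliveAt: nowMs - 10000, lastEventAt: nowMs - 5 * HOUR_MS },
  nowMs,
  60000,
  2 * HOUR_MS,
);

// Connected on paper, but no keepalive for 5 minutes.
const staleDecision = evaluateSlackHealthSketch(
  { socketState: "connected", lastKeepaliveAt: nowMs - 5 * 60000, lastEventAt: nowMs - 1000 },
  nowMs,
  60000,
  2 * HOUR_MS,
);

console.log(quietDecision, staleDecision);
```

The ordering matters: transport checks come first, so event quietness can only ever downgrade the label on an otherwise live socket and never triggers a restart on its own.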
Acceptance Criteria
- Quiet Slack workspaces do not restart repeatedly only because no inbound events arrived.
- Real Socket Mode keepalive failures still move the channel into a degraded/reconnecting/disconnected state.
- /status or channel status output can distinguish:
  - connected but event-quiet
  - reconnecting
  - disconnected
  - Web API send failure
  - stale/failed transport
- Tests cover SDK close / reconnecting behavior and the health monitor's treatment of old lastEventAt on Slack.