Slack Socket Mode stale-socket health uses inbound app-event recency as transport liveness #69157

@bill492

Summary

Slack Socket Mode health currently treats old inbound app-event activity (lastEventAt) as evidence that the websocket is stale. On quiet Slack workspaces this creates false-positive stale-socket restarts even when the Slack SDK is not reporting websocket keepalive failures.

This should be treated as a stability bug/refactor target, not only a configuration tuning issue.

Evidence

An investigation of a live OpenClaw host on 2026.4.15 found two separate behaviors:

  1. Real Slack/socket transport instability on some days:
     • Slack SDK keepalive warnings: missing pongs/pings
     • HTTP 408s from Socket Mode connection attempts
     • no active connection / client is not ready
     • Slack Web API HTTP failures
  2. OpenClaw-created stale-socket churn on quiet days:
     • Several days had no matching Slack SDK keepalive failures in the inspected error logs.
     • The gateway still restarted Slack for stale-socket on a regular cadence.
     • On the inspected host, gateway.channelStaleEventThresholdMinutes was 120, and the restart cadence matched that threshold plus the health-monitor interval.

That indicates at least some stale restarts are false positives caused by the health policy, not real websocket failures.
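
The cadence observation can be reproduced with a small idealized model. This is only a sketch: it assumes the monitor checks at fixed multiples of its interval, that restarts are instantaneous and reseed lastEventAt, and that no inbound app events arrive. The 5-minute check interval and the function names are hypothetical; only the 120-minute threshold comes from the inspected host.

```typescript
const CADENCE_MINUTE_MS = 60000;

// Returns the time of the first monitor check at which
// `now - connectedAt > staleThresholdMs` holds (strictly greater, as in the policy).
function firstStaleCheckAt(
  connectedAtMs: number,
  staleThresholdMs: number,
  checkIntervalMs: number,
): number {
  const staleAfterMs = connectedAtMs + staleThresholdMs;
  // A check landing exactly on staleAfterMs is not yet stale, so take the next tick.
  const nextTick = Math.floor(staleAfterMs / checkIntervalMs) + 1;
  return nextTick * checkIntervalMs;
}

// Simulates consecutive stale-socket restarts on a workspace with no inbound events.
function simulateRestartTimes(
  restarts: number,
  staleThresholdMs: number,
  checkIntervalMs: number,
): number[] {
  const times: number[] = [];
  let connectedAtMs = 0;
  for (let i = 0; i < restarts; i++) {
    const restartAtMs = firstStaleCheckAt(connectedAtMs, staleThresholdMs, checkIntervalMs);
    times.push(restartAtMs);
    connectedAtMs = restartAtMs; // reconnect reseeds lastEventAt
  }
  return times;
}

const restartMinutes = simulateRestartTimes(
  3,
  120 * CADENCE_MINUTE_MS, // threshold from the inspected host
  5 * CADENCE_MINUTE_MS, // hypothetical monitor interval
).map((ms) => ms / CADENCE_MINUTE_MS);
// restartMinutes is [125, 250, 375]: one restart every threshold + interval.
```

Under these assumptions a fully quiet workspace restarts on a fixed period equal to the threshold plus one check interval, which is the pattern a policy-driven false positive would produce.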

Current Code Path

The relevant source path appears to be:

  • extensions/slack/src/monitor/provider.ts
     • publishSlackConnectedStatus() seeds lastEventAt when Slack connects.
     • trackEvent() updates lastEventAt / lastInboundAt only from accepted inbound Slack app events.
     • Socket Mode uses new App({ socketMode: true, ... }), so Bolt constructs the SocketModeReceiver implicitly.
  • src/gateway/channel-status-patches.ts
     • createConnectedChannelStatusPatch() sets connected, lastConnectedAt, and lastEventAt to the connect timestamp.
  • src/gateway/channel-health-policy.ts
     • evaluateChannelHealth() returns stale-socket when connected === true and now - lastEventAt > staleEventThresholdMs.
  • src/gateway/channel-health-monitor.ts
     • The monitor restarts channels marked unhealthy, turning this stale-event policy into a stop/start loop.
  • extensions/slack/src/monitor/reconnect-policy.ts
     • waitForSlackSocketDisconnect() waits on disconnected, unable_to_socket_mode_start, and error.
     • It does not model the lower-level SDK close / reconnecting lifecycle.
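
A minimal sketch of the policy condition described above, to make the false positive concrete. It is not the actual source: the type shapes, the "not-connected" verdict, and the function name evaluateChannelHealthSketch are hypothetical simplifications; only the connected === true and now - lastEventAt > staleEventThresholdMs condition comes from the code path listed here.

```typescript
// Hypothetical, simplified snapshot; the real status type may carry more fields.
interface ChannelSnapshot {
  connected: boolean;
  // Epoch ms. Seeded at connect time, then updated only by inbound app events.
  lastEventAt: number | null;
}

type HealthVerdict = "healthy" | "stale-socket" | "not-connected";

function evaluateChannelHealthSketch(
  snapshot: ChannelSnapshot,
  now: number,
  staleEventThresholdMs: number,
): HealthVerdict {
  if (!snapshot.connected) return "not-connected";
  if (snapshot.lastEventAt !== null && now - snapshot.lastEventAt > staleEventThresholdMs) {
    // No socket lifecycle or keepalive signal is consulted here.
    return "stale-socket";
  }
  return "healthy";
}

const POLICY_MINUTE_MS = 60000;
const policyThresholdMs = 120 * POLICY_MINUTE_MS;

// A healthy socket on a quiet workspace: connected at t=0, no app events since.
const quietWorkspace: ChannelSnapshot = { connected: true, lastEventAt: 0 };

const verdictAtThreshold = evaluateChannelHealthSketch(
  quietWorkspace,
  120 * POLICY_MINUTE_MS,
  policyThresholdMs,
); // "healthy"
const verdictPastThreshold = evaluateChannelHealthSketch(
  quietWorkspace,
  121 * POLICY_MINUTE_MS,
  policyThresholdMs,
); // "stale-socket"
```

The second verdict flips to stale-socket purely because time passed without app events; nothing about the websocket itself changed between the two calls.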

Why This Is Flawed

The stale-socket policy is intended to detect half-dead websockets, but it uses normal inbound Slack app-event recency as the proxy. These are different signals:

  • A quiet Slack workspace can have no app events for hours while the Socket Mode websocket is healthy.
  • A degraded websocket should be detected by socket lifecycle/keepalive state, not by user/message traffic volume.
  • Re-enabling agent heartbeats or posting periodic Slack messages can mask the symptom, but does not fix the underlying transport health model.

There is also a reconnect ownership mismatch:

  • Slack SDK SocketModeClient defaults autoReconnectEnabled to true.
  • On websocket close, the SDK can emit reconnecting and start a new socket internally.
  • OpenClaw's outer reconnect loop waits for higher-level disconnected / error / start-failure events, so it can miss the SDK's normal reconnect lifecycle.
  • Bolt's SocketModeReceiver does accept autoReconnectEnabled, clientPingTimeout, serverPingTimeout, and pingPongLoggingEnabled, but OpenClaw's implicit new App({ socketMode: true }) path does not pass those options.

Proposed Direction

Prefer one of these two designs:

Option A: OpenClaw owns reconnects

  • Construct Bolt SocketModeReceiver explicitly for Slack socket mode.
  • Pass autoReconnectEnabled: false.
  • Let OpenClaw's reconnect loop be the single lifecycle owner.
  • Publish explicit status transitions for connecting, connected, disconnecting, disconnected, reconnecting, and failed.
  • Add tests for SDK close / reconnect behavior, not only disconnected.

Option B: Slack SDK owns reconnects

  • Keep SDK auto-reconnect enabled.
  • Subscribe to and report SDK lifecycle events such as connecting, connected, authenticated, reconnecting, close, disconnected, and error.
  • Update gateway health to use those lifecycle signals instead of restarting solely because lastEventAt is old.

Health Model Refactor

Regardless of reconnect ownership:

  • Keep lastEventAt as an app-event freshness diagnostic only.
  • Add distinct socket/transport status fields, for example:
     • socketState
     • lastSocketConnectedAt
     • lastSocketClosedAt
     • lastSocketErrorAt
     • lastReconnectAttemptAt
     • lastReconnectAt
     • lastKeepaliveAt or equivalent SDK-observable liveness
  • Do not automatically classify Slack as stale-socket based only on quiet inbound app traffic.
  • Classify outbound Slack failures against current transport state so operators can distinguish Slack Web API failure, socket reconnecting, app-event quietness, and agent/orchestration stalls.
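
One way the proposed fields could be maintained is a pure reducer over socket lifecycle events. This is a sketch under assumptions, not a description of the Slack SDK or of OpenClaw: the event names are the ones listed in this issue, while the state set, the mapping of error to a failed state, and the names SocketTransportStatus and applySocketEvent are hypothetical. lastKeepaliveAt is left out because it depends on what liveness the SDK actually exposes.

```typescript
type SocketState = "connecting" | "connected" | "reconnecting" | "disconnected" | "failed";

type SocketLifecycleEvent =
  | "connecting"
  | "connected"
  | "reconnecting"
  | "close"
  | "disconnected"
  | "error";

// Transport-only status, kept separate from lastEventAt (app-event freshness).
interface SocketTransportStatus {
  socketState: SocketState;
  lastSocketConnectedAt: number | null;
  lastSocketClosedAt: number | null;
  lastSocketErrorAt: number | null;
  lastReconnectAttemptAt: number | null;
  lastReconnectAt: number | null;
}

const initialSocketStatus: SocketTransportStatus = {
  socketState: "disconnected",
  lastSocketConnectedAt: null,
  lastSocketClosedAt: null,
  lastSocketErrorAt: null,
  lastReconnectAttemptAt: null,
  lastReconnectAt: null,
};

function applySocketEvent(
  status: SocketTransportStatus,
  event: SocketLifecycleEvent,
  atMs: number,
): SocketTransportStatus {
  switch (event) {
    case "connecting":
      return { ...status, socketState: "connecting" };
    case "connected":
      return {
        ...status,
        socketState: "connected",
        lastSocketConnectedAt: atMs,
        // Only a connect that follows a reconnect attempt counts as a reconnect.
        lastReconnectAt: status.socketState === "reconnecting" ? atMs : status.lastReconnectAt,
      };
    case "reconnecting":
      return { ...status, socketState: "reconnecting", lastReconnectAttemptAt: atMs };
    case "close":
      // Record the close but keep the state; a reconnecting event may follow.
      return { ...status, lastSocketClosedAt: atMs };
    case "disconnected":
      return { ...status, socketState: "disconnected" };
    case "error":
      // Assumption: an error marks the transport as failed until the next connect.
      return { ...status, socketState: "failed", lastSocketErrorAt: atMs };
  }
}

// Example: connect, then a socket close followed by an internal reconnect.
const lifecycle: Array<[SocketLifecycleEvent, number]> = [
  ["connecting", 500],
  ["connected", 1000],
  ["close", 2000],
  ["reconnecting", 2100],
  ["connected", 2500],
];
const finalSocketStatus = lifecycle.reduce(
  (acc, [event, atMs]) => applySocketEvent(acc, event, atMs),
  initialSocketStatus,
);
```

Keeping the reducer pure makes the close / reconnecting paths straightforward to cover in unit tests, independent of which side ends up owning reconnects.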

Acceptance Criteria

  • Quiet Slack workspaces do not restart repeatedly only because no inbound events arrived.
  • Real Socket Mode keepalive failures still move the channel into a degraded/reconnecting/disconnected state.
  • /status or channel status output can distinguish:
     • connected but event-quiet
     • reconnecting
     • disconnected
     • Web API send failure
     • stale/failed transport
  • Tests cover SDK close / reconnecting behavior and the health monitor's treatment of old lastEventAt on Slack.
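
The status distinctions above could be produced by a classifier that checks transport state first and treats lastEventAt only as a diagnostic. The sketch below is illustrative: the label strings, the lastSendFailureAt field, the option names, and classifySlackStatus itself are hypothetical and not part of the current code.

```typescript
type TransportState = "connecting" | "connected" | "reconnecting" | "disconnected" | "failed";

interface SlackStatusInput {
  socketState: TransportState;
  lastEventAt: number | null; // app-event freshness only
  lastSendFailureAt: number | null; // hypothetical: last failed Web API send
}

type SlackStatusLabel =
  | "connecting"
  | "connected"
  | "connected-event-quiet"
  | "reconnecting"
  | "disconnected"
  | "web-api-send-failure"
  | "failed-transport";

interface ClassifyOptions {
  nowMs: number;
  eventQuietAfterMs: number;
  sendFailureWindowMs: number;
}

function classifySlackStatus(input: SlackStatusInput, opts: ClassifyOptions): SlackStatusLabel {
  // Transport state decides first; event recency never overrides it.
  switch (input.socketState) {
    case "failed":
      return "failed-transport";
    case "disconnected":
      return "disconnected";
    case "reconnecting":
      return "reconnecting";
    case "connecting":
      return "connecting";
    case "connected":
      break;
  }
  // Socket is connected: a recent send failure points at the Web API, not the socket.
  if (
    input.lastSendFailureAt !== null &&
    opts.nowMs - input.lastSendFailureAt <= opts.sendFailureWindowMs
  ) {
    return "web-api-send-failure";
  }
  // Old or missing app events are reported as quietness, never as a stale socket.
  if (input.lastEventAt === null || opts.nowMs - input.lastEventAt > opts.eventQuietAfterMs) {
    return "connected-event-quiet";
  }
  return "connected";
}

const HOUR_MS = 3600000;
const classifyOpts: ClassifyOptions = {
  nowMs: 3 * HOUR_MS,
  eventQuietAfterMs: 2 * HOUR_MS,
  sendFailureWindowMs: 5 * 60000,
};

const quietLabel = classifySlackStatus(
  { socketState: "connected", lastEventAt: 0, lastSendFailureAt: null },
  classifyOpts,
); // "connected-event-quiet"
```

With this shape, a health monitor could restrict restarts to transport-level labels, so a quiet workspace surfaces as connected-event-quiet in /status without triggering a stop/start cycle.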
