Slack Socket Mode stale-socket health uses inbound app-event recency as transport liveness

## Summary

Slack Socket Mode health currently treats an old `lastEventAt` (the timestamp of the last inbound app event) as evidence that the websocket is stale. On quiet Slack workspaces this causes false-positive `stale-socket` restarts even when the Slack SDK reports no websocket keepalive failures.

This should be treated as a stability bug/refactor target, not only a configuration tuning issue.

## Evidence

An investigation of a live OpenClaw host on `2026.4.15` found two separate behaviors:

1. Real Slack/socket transport instability on some days:
   - Slack SDK keepalive warnings: missing pongs/pings
   - HTTP 408s from Socket Mode connection attempts
   - `no active connection` / `client is not ready`
   - Slack Web API HTTP failures

2. OpenClaw-created stale-socket churn on quiet days:
   - Several days had no matching Slack SDK keepalive failures in the inspected error logs.
   - The gateway still restarted Slack for `stale-socket` on a regular cadence.
   - On the inspected host, `gateway.channelStaleEventThresholdMinutes` was `120`, and the restart cadence matched that threshold plus the health-monitor interval.

That indicates at least some stale restarts are false positives caused by the health policy, not real websocket failures.

## Current Code Path

The relevant source path appears to be:

- `extensions/slack/src/monitor/provider.ts`
  - `publishSlackConnectedStatus()` seeds `lastEventAt` when Slack connects.
  - `trackEvent()` updates `lastEventAt` / `lastInboundAt` only from accepted inbound Slack app events.
  - Socket Mode uses `new App({ socketMode: true, ... })`, so Bolt constructs the `SocketModeReceiver` implicitly.

- `src/gateway/channel-status-patches.ts`
  - `createConnectedChannelStatusPatch()` sets `connected`, `lastConnectedAt`, and `lastEventAt` to the connect timestamp.

- `src/gateway/channel-health-policy.ts`
  - `evaluateChannelHealth()` returns `stale-socket` when `connected === true` and `now - lastEventAt > staleEventThresholdMs`.

- `src/gateway/channel-health-monitor.ts`
  - The monitor restarts channels marked unhealthy, turning this stale-event policy into a stop/start loop.

- `extensions/slack/src/monitor/reconnect-policy.ts`
  - `waitForSlackSocketDisconnect()` waits on `disconnected`, `unable_to_socket_mode_start`, and `error`.
  - It does not model the lower-level SDK `close` / `reconnecting` lifecycle events.
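
The false positive can be reproduced on paper. The following is a minimal sketch reconstructed from the description above, not the actual OpenClaw source: the `ChannelSnapshot` shape, the `simulateQuietWorkspace` helper, and the 5-minute monitor interval are assumptions made for illustration. Only the comparison `now - lastEventAt > staleEventThresholdMs` and the `120` minute threshold come from the inspected host.

```typescript
// Reconstruction of the described behavior, NOT the real channel-health-policy.ts.
interface ChannelSnapshot {
  connected: boolean;
  lastEventAt: number; // epoch ms of the last accepted inbound app event
}

type Health = "healthy" | "stale-socket" | "disconnected";

function evaluateChannelHealth(
  snapshot: ChannelSnapshot,
  now: number,
  staleEventThresholdMs: number,
): Health {
  if (!snapshot.connected) return "disconnected";
  return now - snapshot.lastEventAt > staleEventThresholdMs ? "stale-socket" : "healthy";
}

const MINUTE = 60_000;
const thresholdMs = 120 * MINUTE; // gateway.channelStaleEventThresholdMinutes = 120
const monitorIntervalMs = 5 * MINUTE; // hypothetical: the real interval is not stated here

// Simulate a healthy socket on a workspace that receives no app events at all.
function simulateQuietWorkspace(durationMs: number): number[] {
  const restartMinutes: number[] = [];
  let snapshot: ChannelSnapshot = { connected: true, lastEventAt: 0 };
  for (let now = monitorIntervalMs; now <= durationMs; now += monitorIntervalMs) {
    if (evaluateChannelHealth(snapshot, now, thresholdMs) === "stale-socket") {
      restartMinutes.push(now / MINUTE);
      // The restart reconnects, which re-seeds lastEventAt with the connect timestamp.
      snapshot = { connected: true, lastEventAt: now };
    }
  }
  return restartMinutes;
}

const restartMinutes = simulateQuietWorkspace(8 * 60 * MINUTE);
console.log(restartMinutes); // [ 125, 250, 375 ]
```

With these assumed numbers the restarts land every 125 minutes, that is, the threshold plus one monitor interval, which is the cadence pattern reported in the evidence section. No socket failure is modeled anywhere in the simulation.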

## Why This Is Flawed

The stale-socket policy is intended to detect half-dead websockets, but it uses normal inbound Slack app-event recency as the proxy. These are different signals:

- A quiet Slack workspace can have no app events for hours while the Socket Mode websocket is healthy.
- A degraded websocket should be detected by socket lifecycle/keepalive state, not by user/message traffic volume.
- Re-enabling agent heartbeats or posting periodic Slack messages can mask the symptom, but does not fix the underlying transport health model.

There is also a reconnect ownership mismatch:

- Slack SDK `SocketModeClient` defaults `autoReconnectEnabled` to `true`.
- On websocket `close`, the SDK can emit `reconnecting` and start a new socket internally.
- OpenClaw's outer reconnect loop waits for higher-level `disconnected` / `error` / start-failure events, so it can miss the SDK's normal reconnect lifecycle.
- Bolt's `SocketModeReceiver` does accept `autoReconnectEnabled`, `clientPingTimeout`, `serverPingTimeout`, and `pingPongLoggingEnabled`, but OpenClaw's implicit `new App({ socketMode: true })` path does not pass those options.
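
To make the ownership mismatch concrete, here is a small sketch that uses a stand-in emitter rather than the Slack SDK. `FakeSocketClient` is hypothetical, and the emitted sequence is an assumed illustration of an SDK-internal reconnect, not a captured trace. The subscribed event names are the ones listed for `waitForSlackSocketDisconnect()`.

```typescript
type Listener = () => void;

// Minimal stand-in for an event-emitting socket client; this is not the Slack SDK.
class FakeSocketClient {
  private listeners = new Map<string, Listener[]>();

  on(event: string, listener: Listener): void {
    const list = this.listeners.get(event) ?? [];
    list.push(listener);
    this.listeners.set(event, list);
  }

  emit(event: string): void {
    for (const listener of this.listeners.get(event) ?? []) listener();
  }
}

const client = new FakeSocketClient();
const seenByOuterLoop: string[] = [];

// The outer reconnect loop only subscribes to these three events.
for (const event of ["disconnected", "unable_to_socket_mode_start", "error"]) {
  client.on(event, () => seenByOuterLoop.push(event));
}

// Assumed sequence for an internal reconnect with autoReconnectEnabled: true.
const sdkSequence = ["connected", "close", "reconnecting", "connecting", "connected"];
for (const event of sdkSequence) client.emit(event);

console.log(seenByOuterLoop.length); // 0
```

Under that assumed sequence the outer loop observes nothing, so any status it publishes cannot reflect the time the socket spent closed or reconnecting.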

## Proposed Direction

Prefer one of these two designs:

### Option A: OpenClaw owns reconnects

- Construct Bolt `SocketModeReceiver` explicitly for Slack socket mode.
- Pass `autoReconnectEnabled: false`.
- Let OpenClaw's reconnect loop be the single lifecycle owner.
- Publish explicit status transitions for connecting, connected, disconnecting, disconnected, reconnecting, and failed.
- Add tests for SDK `close` / reconnect behavior, not only `disconnected`.
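
A rough sketch of the options Option A would pass is shown below. The interface is a local mirror of the option names mentioned in this issue so that the snippet stands alone; real code would use the types exported by `@slack/bolt`, and the names should be verified against the installed version. `buildOwnedReconnectReceiverOptions` is a hypothetical helper, and the timeout value in the usage line is illustrative only.

```typescript
// Local mirror of the receiver options named in this issue (assumption, not Bolt's own type).
interface SocketReceiverOptionsSketch {
  appToken: string;
  autoReconnectEnabled: boolean;
  clientPingTimeout?: number;
  serverPingTimeout?: number;
  pingPongLoggingEnabled?: boolean;
}

// Hypothetical helper: callers may tune keepalive settings, but never reconnect ownership.
function buildOwnedReconnectReceiverOptions(
  appToken: string,
  overrides: Partial<Omit<SocketReceiverOptionsSketch, "appToken">> = {},
): SocketReceiverOptionsSketch {
  return {
    ...overrides,
    appToken,
    // Forced last so an override cannot re-enable SDK-side reconnects.
    autoReconnectEnabled: false,
  };
}

const receiverOptions = buildOwnedReconnectReceiverOptions("xapp-placeholder", {
  pingPongLoggingEnabled: true,
  clientPingTimeout: 5_000, // illustrative value, not a recommendation
});

// In real code, roughly:
//   const receiver = new SocketModeReceiver(receiverOptions);
//   const app = new App({ token, receiver });
console.log(receiverOptions.autoReconnectEnabled); // false
```

Forcing `autoReconnectEnabled: false` in one place keeps the single-owner rule from depending on every call site remembering it.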

### Option B: Slack SDK owns reconnects

- Keep SDK auto-reconnect enabled.
- Subscribe to/report SDK lifecycle events such as `connecting`, `connected`, `authenticated`, `reconnecting`, `close`, `disconnected`, and `error`.
- Update gateway health to use those lifecycle signals instead of restarting solely because `lastEventAt` is old.
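
For Option B, one possible shape is a pure reducer that folds SDK lifecycle events into a published socket status. This is a sketch under assumptions: the mapping from events to states is a design choice made here for illustration, `applyLifecycleEvent` is a hypothetical name, and the status fields borrow the names proposed in the next section.

```typescript
type SdkLifecycleEvent =
  | "connecting" | "connected" | "authenticated"
  | "reconnecting" | "close" | "disconnected" | "error";

type SocketState = "connecting" | "connected" | "reconnecting" | "disconnected";

interface SocketStatus {
  socketState: SocketState;
  lastSocketConnectedAt?: number;
  lastSocketClosedAt?: number;
  lastSocketErrorAt?: number;
  lastReconnectAttemptAt?: number;
}

function applyLifecycleEvent(
  status: SocketStatus,
  event: SdkLifecycleEvent,
  at: number,
): SocketStatus {
  switch (event) {
    case "connecting":
      // Keep "reconnecting" visible while the SDK opens a replacement socket.
      return status.socketState === "reconnecting"
        ? status
        : { ...status, socketState: "connecting" };
    case "authenticated":
      return status; // informational only in this sketch
    case "connected":
      return { ...status, socketState: "connected", lastSocketConnectedAt: at };
    case "close":
      // Assumption: with auto-reconnect enabled, a close leads to a reconnect attempt.
      return { ...status, socketState: "reconnecting", lastSocketClosedAt: at };
    case "reconnecting":
      return { ...status, socketState: "reconnecting", lastReconnectAttemptAt: at };
    case "disconnected":
      return { ...status, socketState: "disconnected" };
    case "error":
      return { ...status, lastSocketErrorAt: at };
  }
}

// Replay an assumed reconnect sequence (timestamps are arbitrary).
const events: Array<[SdkLifecycleEvent, number]> = [
  ["connecting", 0], ["connected", 1], ["close", 50],
  ["reconnecting", 51], ["connecting", 52], ["connected", 53],
];
let status: SocketStatus = { socketState: "disconnected" };
for (const [event, at] of events) status = applyLifecycleEvent(status, event, at);
console.log(status.socketState, status.lastSocketClosedAt, status.lastSocketConnectedAt);
```

Keeping the reducer pure makes the `close` / `reconnecting` paths easy to cover in unit tests without a live socket.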

## Health Model Refactor

Regardless of reconnect ownership:

- Keep `lastEventAt` as an app-event freshness diagnostic only.
- Add distinct socket/transport status fields, for example:
  - `socketState`
  - `lastSocketConnectedAt`
  - `lastSocketClosedAt`
  - `lastSocketErrorAt`
  - `lastReconnectAttemptAt`
  - `lastReconnectAt`
  - `lastKeepaliveAt` or equivalent SDK-observable liveness
- Do not automatically classify Slack as `stale-socket` based only on quiet inbound app traffic.
- Classify outbound Slack failures against current transport state so operators can distinguish Slack Web API failure, socket reconnecting, app-event quietness, and agent/orchestration stalls.
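
A health evaluation along these lines might look like the sketch below. It is illustrative only: `evaluateSlackHealth`, the `HealthVerdict` shape, and the `keepaliveTimeoutMs` option are hypothetical, and whether the SDK exposes a usable keepalive timestamp is an open question. The point is that `lastEventAt` feeds a diagnostic flag while the restart decision depends on transport fields.

```typescript
type SocketState = "connecting" | "connected" | "reconnecting" | "disconnected" | "failed";

interface SlackTransportStatus {
  socketState: SocketState;
  lastEventAt: number; // app-event freshness, diagnostic only
  lastKeepaliveAt?: number; // last SDK-observable liveness signal, if available
}

interface HealthOptions {
  staleEventThresholdMs: number;
  keepaliveTimeoutMs: number; // hypothetical setting
}

interface HealthVerdict {
  shouldRestart: boolean;
  reason: "ok" | "reconnecting" | "disconnected" | "stale-socket";
  eventQuiet: boolean;
}

function evaluateSlackHealth(
  status: SlackTransportStatus,
  now: number,
  options: HealthOptions,
): HealthVerdict {
  // Quietness is reported but never decides the verdict on its own.
  const eventQuiet = now - status.lastEventAt > options.staleEventThresholdMs;

  switch (status.socketState) {
    case "disconnected":
    case "failed":
      return { shouldRestart: true, reason: "disconnected", eventQuiet };
    case "connecting":
    case "reconnecting":
      // A reconnect is already in progress; restarting would only add churn.
      return { shouldRestart: false, reason: "reconnecting", eventQuiet };
    case "connected": {
      const keepaliveOverdue =
        status.lastKeepaliveAt !== undefined &&
        now - status.lastKeepaliveAt > options.keepaliveTimeoutMs;
      return keepaliveOverdue
        ? { shouldRestart: true, reason: "stale-socket", eventQuiet }
        : { shouldRestart: false, reason: "ok", eventQuiet };
    }
  }
}

const options: HealthOptions = { staleEventThresholdMs: 120 * 60_000, keepaliveTimeoutMs: 30_000 };
const threeHours = 3 * 60 * 60_000;

// Quiet for three hours, but keepalive was seen 5 seconds ago.
const quietButAlive = evaluateSlackHealth(
  { socketState: "connected", lastEventAt: 0, lastKeepaliveAt: threeHours - 5_000 },
  threeHours,
  options,
);
console.log(quietButAlive); // { shouldRestart: false, reason: 'ok', eventQuiet: true }
```

The 30-second keepalive timeout is a placeholder; a real value would need to be derived from the SDK's ping/pong settings.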

## Acceptance Criteria

- Quiet Slack workspaces do not restart repeatedly only because no inbound events arrived.
- Real Socket Mode keepalive failures still move the channel into a degraded/reconnecting/disconnected state.
- `/status` or channel status output can distinguish:
  - connected but event-quiet
  - reconnecting
  - disconnected
  - Web API send failure
  - stale/failed transport
- Tests cover SDK `close` / `reconnecting` behavior and the health monitor's treatment of old `lastEventAt` on Slack.
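
The status distinctions above could be pinned down with a table-driven check such as the following sketch. `labelSlackStatus`, the `StatusInput` fields, and the precedence between labels are assumptions chosen for illustration; the label strings are the ones listed in the criteria, plus a plain `connected` for the non-quiet case.

```typescript
type SocketState = "connected" | "reconnecting" | "disconnected" | "failed";

interface StatusInput {
  socketState: SocketState;
  eventQuiet: boolean; // lastEventAt older than the configured threshold
  keepaliveOverdue: boolean; // SDK-observable liveness missed
  lastSendFailed: boolean; // most recent Web API send failed
}

type StatusLabel =
  | "connected"
  | "connected but event-quiet"
  | "reconnecting"
  | "disconnected"
  | "Web API send failure"
  | "stale/failed transport";

// Transport problems take precedence over send failures, which take precedence over quietness.
function labelSlackStatus(input: StatusInput): StatusLabel {
  if (input.socketState === "failed") return "stale/failed transport";
  if (input.socketState === "disconnected") return "disconnected";
  if (input.socketState === "reconnecting") return "reconnecting";
  if (input.keepaliveOverdue) return "stale/failed transport";
  if (input.lastSendFailed) return "Web API send failure";
  return input.eventQuiet ? "connected but event-quiet" : "connected";
}

const base: StatusInput = {
  socketState: "connected",
  eventQuiet: false,
  keepaliveOverdue: false,
  lastSendFailed: false,
};

const cases: Array<[StatusInput, StatusLabel]> = [
  [{ ...base, eventQuiet: true }, "connected but event-quiet"],
  [{ ...base, socketState: "reconnecting", eventQuiet: true }, "reconnecting"],
  [{ ...base, socketState: "disconnected" }, "disconnected"],
  [{ ...base, lastSendFailed: true }, "Web API send failure"],
  [{ ...base, keepaliveOverdue: true, eventQuiet: true }, "stale/failed transport"],
];

const failures = cases.filter(([input, expected]) => labelSlackStatus(input) !== expected);
console.log(failures.length); // 0
```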

