
feat: detect stale Slack sockets and auto-restart #30153

Merged
Takhoffman merged 2 commits into openclaw:main from derankin:slack-stale-socket-detection
Mar 1, 2026

Conversation


@derankin derankin commented Mar 1, 2026

Summary

  • Track event liveness on every inbound Slack event (messages, reactions, member join/leave, channel create/rename/id-change, pins) by calling setStatus({ lastEventAt, lastInboundAt }) to populate the already-existing ChannelAccountSnapshot fields
  • Extend the channel health monitor to detect stale sockets: if a channel has been running longer than a configurable threshold (default 30 min) and lastEventAt is older than that threshold, flag it as unhealthy and trigger an automatic restart
  • Log restart reason as stale-socket for observability, distinct from existing stuck, stopped, and gave-up reasons

Problem

Slack Socket Mode WebSocket connections can silently stop delivering events while still appearing connected — health checks pass, the socket stays open, but no messages arrive. This "half-dead socket" problem causes all messages across all channels to go unanswered until a manual restart.

The existing health monitor only checked running and connected flags, so it never detected this failure mode.

Approach

Layer 1 — Event liveness tracking (src/slack/monitor/):

Every inbound Slack event handler now calls a trackEvent() callback that updates lastEventAt and lastInboundAt on the channel account snapshot via setStatus. The MonitorSlackOpts type gains setStatus and getStatus callbacks, and the Slack channel plugin passes them through from the gateway context.
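A minimal sketch of what such a liveness callback could look like. The `LivenessStatus` shape, the trimmed-down `MonitorSlackOpts`, and the `createTrackEvent` helper are hypothetical names for illustration; the real types in `src/slack/monitor/types.ts` likely carry more fields.

```typescript
// Hypothetical, trimmed-down shapes; not the project's actual type definitions.
type LivenessStatus = { lastEventAt?: number; lastInboundAt?: number };

type MonitorSlackOpts = {
  setStatus?: (patch: LivenessStatus) => void;
  getStatus?: () => LivenessStatus;
};

// Builds the callback that inbound event handlers invoke.
// `now` is injectable so the behaviour can be checked deterministically.
function createTrackEvent(
  opts: MonitorSlackOpts,
  now: () => number = Date.now,
): () => void {
  return () => {
    const ts = now();
    // No-op when the caller did not supply setStatus.
    opts.setStatus?.({ lastEventAt: ts, lastInboundAt: ts });
  };
}

// Usage: an in-memory snapshot stands in for the gateway context.
let snapshot: LivenessStatus = {};
const trackEvent = createTrackEvent(
  {
    setStatus: (patch) => {
      snapshot = { ...snapshot, ...patch };
    },
    getStatus: () => snapshot,
  },
  () => 1000, // fixed clock for the example
);
trackEvent();
```

Keeping the callback optional means handlers can call it unconditionally while non-gateway callers simply omit `setStatus`.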

Files changed: types.ts, provider.ts, message-handler.ts, events.ts, events/messages.ts, events/reactions.ts, events/members.ts, events/channels.ts, events/pins.ts, extensions/slack/src/channel.ts

Layer 2 — Health monitor stale socket detection (src/gateway/channel-health-monitor.ts):

isChannelHealthy() now accepts lastEventAt and lastStartAt from the snapshot and checks whether the channel has gone too long without events. The logic:

  1. If channel uptime > stale threshold AND last event age > stale threshold → unhealthy
  2. New channels or recently started channels get a natural grace period (uptime must exceed the threshold first)
  3. Channels that never populated lastEventAt/lastStartAt (non-Slack channels) are unaffected

The threshold is configurable via staleEventThresholdMs on ChannelHealthMonitorDeps (default: 30 minutes). Existing cooldown (2 cycles) and rate limiting (3 restarts/hour) prevent restart storms.
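The three rules above can be sketched as a pure function. This is an illustration under stated assumptions, not the code in `channel-health-monitor.ts`: `SnapshotLike` is a hypothetical subset of `ChannelAccountSnapshot`, a missing `lastStartAt` is assumed to opt a channel out of stale detection, and a channel that has never recorded an event is assumed to be measured from its start time.

```typescript
// Hypothetical subset of the snapshot; only fields named in this PR.
type SnapshotLike = {
  running?: boolean;
  connected?: boolean;
  lastStartAt?: number;
  lastEventAt?: number;
};

const DEFAULT_STALE_EVENT_THRESHOLD_MS = 30 * 60 * 1000; // 30 minutes

function isChannelHealthy(
  snapshot: SnapshotLike,
  now: number,
  staleEventThresholdMs: number = DEFAULT_STALE_EVENT_THRESHOLD_MS,
): boolean {
  // Pre-existing checks: the channel must be running and not disconnected.
  if (!snapshot.running) return false;
  if (snapshot.connected === false) return false;

  // Assumption: channels that never report lastStartAt skip stale detection.
  if (snapshot.lastStartAt === undefined) return true;

  // Grace period: uptime must exceed the threshold before staleness applies.
  const uptimeMs = now - snapshot.lastStartAt;
  if (uptimeMs <= staleEventThresholdMs) return true;

  // Assumption: with no event ever recorded, measure from the start time.
  const lastActivityAt = snapshot.lastEventAt ?? snapshot.lastStartAt;
  return now - lastActivityAt <= staleEventThresholdMs;
}

// Usage: timestamps in milliseconds, relative to a fixed "now".
const MIN = 60 * 1000;
const now = 100 * MIN;
const up = { running: true, connected: true, lastStartAt: now - 60 * MIN };

const fresh = isChannelHealthy({ ...up, lastEventAt: now - 5 * MIN }, now);
const stale = isChannelHealthy({ ...up, lastEventAt: now - 45 * MIN }, now);
const neverSeen = isChannelHealthy(up, now);
const inGrace = isChannelHealthy({ ...up, lastStartAt: now - 10 * MIN }, now);
const untracked = isChannelHealthy({ running: true, connected: true }, now);
const custom = isChannelHealthy(
  { ...up, lastStartAt: now - 20 * MIN, lastEventAt: now - 15 * MIN },
  now,
  10 * MIN,
);
```

Here `fresh`, `inGrace`, and `untracked` evaluate to healthy, while `stale`, `neverSeen`, and `custom` evaluate to unhealthy, mirroring the five scenarios listed in the test plan.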

Test plan

  • All 16 existing health monitor tests pass unchanged
  • 5 new tests for stale socket detection:
    • Restarts channel with no events past stale threshold
    • Skips channels with recent events
    • Skips channels still within startup grace window
    • Restarts channel that never received any event past threshold
    • Respects custom staleEventThresholdMs
  • Manual testing: deploy to a live gateway and verify lastEventAt is populated in health snapshots
  • Manual testing: simulate stale socket by blocking Slack events and verify auto-restart triggers

@openclaw-barnacle openclaw-barnacle bot added the "channel: slack", "gateway", and "size: M" labels Mar 1, 2026

greptile-apps bot commented Mar 1, 2026

Greptile Summary

Implemented stale Slack socket detection to automatically restart channels when the WebSocket connection silently stops delivering events. The implementation correctly tracks event liveness across message handlers and most event types (reactions, members, channels, pins), with comprehensive test coverage.

Key changes:

  • Added trackEvent callback that updates lastEventAt and lastInboundAt on every inbound Slack event
  • Extended health monitor with stale detection logic: channels running longer than threshold (default 30 min) without events are automatically restarted
  • Added 5 new tests covering various stale socket scenarios

Architecture:

  • Event tracking flows through monitorSlackProvider → event handlers → trackEvent() → setStatus()
  • Health monitor checks both uptime and event age before marking as unhealthy
  • Graceful handling of non-Slack channels (skip stale detection if fields not present)
  • Existing rate limiting and cooldown mechanisms prevent restart storms
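To illustrate how a cooldown combined with an hourly budget keeps a flapping channel from restarting in a loop, here is a hedged sketch. `createRestartGuard` is a hypothetical helper, not the gateway's actual bookkeeping, and the cooldown is expressed in milliseconds with an assumed value because the length of a monitor cycle is not stated in this PR.

```typescript
// Hypothetical guard: allows a restart only when both limits permit it.
function createRestartGuard(
  maxPerHour: number = 3,
  cooldownMs: number = 10 * 60 * 1000, // assumed stand-in for "2 cycles"
): (now: number) => boolean {
  let restarts: number[] = []; // timestamps (ms) of restarts already granted

  return function tryRestart(now: number): boolean {
    // Forget restarts that fell out of the one-hour sliding window.
    const hourAgo = now - 60 * 60 * 1000;
    restarts = restarts.filter((t) => t > hourAgo);

    // Cooldown: refuse if the previous restart was too recent.
    const last = restarts.length > 0 ? restarts[restarts.length - 1] : undefined;
    if (last !== undefined && now - last < cooldownMs) return false;

    // Rate limit: refuse once the hourly budget is spent.
    if (restarts.length >= maxPerHour) return false;

    restarts.push(now);
    return true;
  };
}

// Usage: replay a sequence of restart attempts at fixed times.
const MINUTE = 60 * 1000;
const tryRestart = createRestartGuard(3, 10 * MINUTE);
const decisions = [0, 5, 10, 20, 30, 61].map((m) => tryRestart(m * MINUTE));
```

In this replay the attempt at minute 5 is refused by the cooldown, the attempt at minute 30 is refused by the hourly budget, and the attempt at minute 61 succeeds again because the first restart has left the window.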

Minor observation: Interaction events (button clicks, modal submissions) don't currently track liveness, which could be a consideration for interaction-heavy workspaces.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - it adds defensive monitoring without modifying core event handling logic
  • Score reflects solid implementation with comprehensive tests and no critical issues found. The stale socket detection logic is correct, event tracking is properly wired through most event types, and existing health monitor safeguards (cooldown, rate limiting) prevent runaway restarts. Minor enhancement opportunity exists for interaction event tracking, but this doesn't affect correctness or safety.
  • No files require special attention - implementation is straightforward and well-tested

Last reviewed commit: f5a6ce7

@greptile-apps greptile-apps bot left a comment

11 files reviewed, 1 comment


ctx: SlackMonitorContext;
account: ResolvedSlackAccount;
handleSlackMessage: SlackMessageHandler;
/** Called on each inbound event to update liveness tracking. */

Consider passing trackEvent to registerSlackInteractionEvents so that Block Kit interactions (button clicks, modal submissions) also update event liveness. Currently, workspaces that primarily use interactive features won't benefit from stale socket detection as effectively.
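One way the suggestion could be wired, as a sketch only. The parameter object, the `InteractionSource` stand-in, and the `onInteraction` callback are hypothetical; the real `registerSlackInteractionEvents` signature is not shown in this PR and will differ.

```typescript
type TrackEvent = () => void;
type InteractionPayload = { type: string };
type InteractionHandler = (payload: InteractionPayload) => void;

// Stand-in for whatever delivers interaction payloads; not a real Slack SDK type.
type InteractionSource = { subscribe: (handler: InteractionHandler) => void };

// Hypothetical signature showing trackEvent threaded through registration.
function registerSlackInteractionEvents(params: {
  source: InteractionSource;
  trackEvent?: TrackEvent;
  onInteraction: InteractionHandler;
}): void {
  const { source, trackEvent, onInteraction } = params;
  source.subscribe((payload) => {
    trackEvent?.(); // an interaction is also proof that the socket delivers events
    onInteraction(payload);
  });
}

// Usage with a fake source that records subscribed handlers.
const handlers: InteractionHandler[] = [];
const fakeSource: InteractionSource = {
  subscribe: (handler) => {
    handlers.push(handler);
  },
};
let livenessUpdates = 0;
const seenTypes: string[] = [];

registerSlackInteractionEvents({
  source: fakeSource,
  trackEvent: () => {
    livenessUpdates += 1;
  },
  onInteraction: (payload) => {
    seenTypes.push(payload.type);
  },
});
handlers.forEach((handler) => handler({ type: "block_actions" }));
```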

Path: src/slack/monitor/events.ts
Line: 15


derankin and others added 2 commits March 1, 2026 10:44
Slack Socket Mode connections can silently stop delivering events while
still appearing connected (health checks pass, WebSocket stays open).
This "half-dead socket" problem causes messages to go unanswered.

This commit adds two layers of protection:

1. **Event liveness tracking**: Every inbound Slack event (messages,
   reactions, member joins/leaves, channel events, pins) now calls
   `setStatus({ lastEventAt, lastInboundAt })` to update the channel
   account snapshot with the timestamp of the last received event.

2. **Health monitor stale socket detection**: The channel health monitor
   now checks `lastEventAt` against a configurable threshold (default
   30 minutes). If a channel has been running longer than the threshold
   and hasn't received any events in that window, it is flagged as
   unhealthy and automatically restarted — the same way disconnected
   or crashed channels are already handled.

The restart reason is logged as "stale-socket" for observability, and
the existing cooldown/rate-limit logic (3 restarts/hour max) prevents
restart storms.
@Takhoffman Takhoffman force-pushed the slack-stale-socket-detection branch from f5a6ce7 to 9c0f336 Compare March 1, 2026 16:53
@Takhoffman Takhoffman merged commit a28a4b1 into openclaw:main Mar 1, 2026
26 of 27 checks passed
@Takhoffman

Landed salvage patch for stale-socket liveness tracking hardening.

What changed

  • trackEvent now runs only after mismatch and validity gates pass for Slack channel/member/pin/reaction/message handlers.
  • Added regression tests to lock behavior:
    • mismatched events do not update liveness
    • accepted events do update liveness
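The ordering described here can be sketched as follows. The event shape, the team ID comparison used as the mismatch gate, and the `handleReactionEvent` helper are assumptions for illustration; the actual handlers and their gate checks are not shown in this PR.

```typescript
// Hypothetical, simplified event shape.
type ReactionEventLike = {
  team_id?: string;
  item?: { channel?: string; ts?: string };
};

function handleReactionEvent(params: {
  payload: ReactionEventLike;
  expectedTeamId: string;
  trackEvent?: () => void;
  handle: (payload: ReactionEventLike) => void;
}): "dropped" | "accepted" {
  const { payload, expectedTeamId, trackEvent, handle } = params;

  // Mismatch gate (assumed form): ignore events meant for another workspace.
  if (payload.team_id !== expectedTeamId) return "dropped";

  // Validity gate: ignore payloads missing the fields the handler needs.
  if (!payload.item?.channel || !payload.item.ts) return "dropped";

  // Only an accepted event counts as evidence of a live socket.
  trackEvent?.();
  handle(payload);
  return "accepted";
}

// Usage: count liveness updates across one mismatched, one invalid, one valid event.
let tracked = 0;
const bump = () => {
  tracked += 1;
};
const ignore = () => {};

const mismatched = handleReactionEvent({
  payload: { team_id: "T_OTHER", item: { channel: "C1", ts: "1.0" } },
  expectedTeamId: "T_MINE",
  trackEvent: bump,
  handle: ignore,
});
const invalid = handleReactionEvent({
  payload: { team_id: "T_MINE", item: {} },
  expectedTeamId: "T_MINE",
  trackEvent: bump,
  handle: ignore,
});
const accepted = handleReactionEvent({
  payload: { team_id: "T_MINE", item: { channel: "C1", ts: "1.0" } },
  expectedTeamId: "T_MINE",
  trackEvent: bump,
  handle: ignore,
});
```

Calling `trackEvent` before the gates would let traffic the handler discards keep `lastEventAt` fresh, which is the false-positive case the regression tests are meant to lock out.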

Validation

  • pnpm check
  • pnpm vitest run src/slack/monitor/events/channels.test.ts src/slack/monitor/events/members.test.ts src/slack/monitor/events/pins.test.ts src/slack/monitor/events/reactions.test.ts src/slack/monitor/message-handler.test.ts

Impact

  • Prevents false-positive liveness updates from dropped/irrelevant inbound events.
  • Reduces the risk that irrelevant events mask a stale Slack socket and delay its auto-restart.

Residual risk

  • Some event classes outside this path may still contribute to liveness differently; monitor restart telemetry for the first 24 hours.

zooqueen added a commit to hanzoai/bot that referenced this pull request Mar 1, 2026
ansh pushed a commit to vibecode/openclaw that referenced this pull request Mar 2, 2026
* feat: detect stale Slack sockets and auto-restart

* Slack: gate liveness tracking to accepted events

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
steipete pushed a commit to Sid-Qin/openclaw that referenced this pull request Mar 2, 2026
safzanpirani pushed a commit to safzanpirani/clawdbot that referenced this pull request Mar 2, 2026
steipete pushed a commit to Sid-Qin/openclaw that referenced this pull request Mar 2, 2026
amitmiran137 pushed a commit to amitmiran137/openclaw that referenced this pull request Mar 2, 2026
robertchang-ga pushed a commit to robertchang-ga/openclaw that referenced this pull request Mar 2, 2026
hanqizheng pushed a commit to hanqizheng/openclaw that referenced this pull request Mar 2, 2026
execute008 pushed a commit to execute008/openclaw that referenced this pull request Mar 2, 2026
dorgonman pushed a commit to kanohorizonia/openclaw that referenced this pull request Mar 3, 2026
sachinkundu pushed a commit to sachinkundu/openclaw that referenced this pull request Mar 6, 2026
zooqueen pushed a commit to hanzoai/bot that referenced this pull request Mar 6, 2026
zooqueen added a commit to hanzoai/bot that referenced this pull request Mar 6, 2026

Labels

channel: slack, gateway, size: L


2 participants