feat: detect stale Slack sockets and auto-restart by derankin · Pull Request #30153 · openclaw/openclaw

derankin · 2026-03-01T00:11:40Z

Summary

- Track event liveness on every inbound Slack event (messages, reactions, member join/leave, channel create/rename/id-change, pins) by calling setStatus({ lastEventAt, lastInboundAt }) to populate the already-existing ChannelAccountSnapshot fields
- Extend the channel health monitor to detect stale sockets: if a channel has been running longer than a configurable threshold (default 30 min) and lastEventAt is older than that threshold, flag it as unhealthy and trigger an automatic restart
- Log restart reason as stale-socket for observability, distinct from existing stuck, stopped, and gave-up reasons

Problem

Slack Socket Mode WebSocket connections can silently stop delivering events while still appearing connected — health checks pass, the socket stays open, but no messages arrive. This "half-dead socket" problem causes all messages across all channels to go unanswered until a manual restart.

The existing health monitor only checked running and connected flags, so it never detected this failure mode.

Approach

Layer 1 — Event liveness tracking (src/slack/monitor/):

Every inbound Slack event handler now calls a trackEvent() callback that updates lastEventAt and lastInboundAt on the channel account snapshot via setStatus. The MonitorSlackOpts type gains setStatus and getStatus callbacks, and the Slack channel plugin passes them through from the gateway context.

Files changed: types.ts, provider.ts, message-handler.ts, events.ts, events/messages.ts, events/reactions.ts, events/members.ts, events/channels.ts, events/pins.ts, extensions/slack/src/channel.ts
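The Layer 1 wiring can be pictured with a minimal sketch. It assumes timestamps are epoch milliseconds and uses a hypothetical `createTrackEvent` factory and an in-memory object standing in for ChannelAccountSnapshot; only `setStatus`, `lastEventAt`, and `lastInboundAt` are names taken from this PR.

```typescript
// Subset of the snapshot fields named in the PR; everything else is illustrative.
type LivenessPatch = { lastEventAt?: number; lastInboundAt?: number };
type SetStatus = (patch: LivenessPatch) => void;

// Hypothetical factory: builds the trackEvent callback handed to each event handler.
// The clock is injectable so the behaviour can be checked deterministically.
function createTrackEvent(
  setStatus: SetStatus,
  now: () => number = Date.now,
): () => void {
  return () => {
    const ts = now();
    setStatus({ lastEventAt: ts, lastInboundAt: ts });
  };
}

// Usage: an in-memory object stands in for the channel account snapshot.
const snapshot: LivenessPatch = {};
const trackEvent = createTrackEvent(
  (patch) => {
    Object.assign(snapshot, patch);
  },
  () => 1000, // fixed clock for the example
);
trackEvent();
```

Keeping the callback parameterless means handlers need no knowledge of the snapshot shape; they only signal that an event arrived.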

Layer 2 — Health monitor stale socket detection (src/gateway/channel-health-monitor.ts):

isChannelHealthy() now accepts lastEventAt and lastStartAt from the snapshot and checks whether the channel has gone too long without events. The logic:

- If channel uptime > stale threshold AND last event age > stale threshold → unhealthy
- New channels or recently started channels get a natural grace period (uptime must exceed the threshold first)
- Channels that never populated lastEventAt/lastStartAt (non-Slack channels) are unaffected
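The three rules above can be sketched as a single predicate. This is an illustration rather than the code in channel-health-monitor.ts: `isStaleSocket` is a hypothetical helper name, timestamps are assumed to be epoch milliseconds, and treating a started channel with no lastEventAt as "silent since start" is an assumption drawn from the test plan item about channels that never received any event.

```typescript
type LivenessSnapshot = {
  lastStartAt?: number | null;
  lastEventAt?: number | null;
};

const DEFAULT_STALE_EVENT_THRESHOLD_MS = 30 * 60 * 1000; // 30 minutes

function isStaleSocket(
  snapshot: LivenessSnapshot,
  now: number,
  thresholdMs: number = DEFAULT_STALE_EVENT_THRESHOLD_MS,
): boolean {
  // No start time recorded: uptime is unknown, so leave the channel alone.
  if (snapshot.lastStartAt == null) return false;

  // Grace period: uptime must exceed the threshold before staleness counts.
  const uptimeMs = now - snapshot.lastStartAt;
  if (uptimeMs <= thresholdMs) return false;

  // Assumption: with no event ever seen, measure silence from the start time.
  const lastActivityAt = snapshot.lastEventAt ?? snapshot.lastStartAt;
  return now - lastActivityAt > thresholdMs;
}

const MIN = 60 * 1000;
const now = 100 * MIN;
isStaleSocket({ lastStartAt: now - 45 * MIN, lastEventAt: now - 40 * MIN }, now); // true
isStaleSocket({ lastStartAt: now - 45 * MIN, lastEventAt: now - 1 * MIN }, now); // false
isStaleSocket({ lastStartAt: now - 10 * MIN }, now); // false (grace period)
```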

The threshold is configurable via staleEventThresholdMs on ChannelHealthMonitorDeps (default: 30 minutes). Existing cooldown (2 cycles) and rate limiting (3 restarts/hour) prevent restart storms.
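How the cooldown and rate limit bound restarts can be illustrated with a small sketch. The limiter below is hypothetical: `createRestartLimiter` and the check-interval length are assumptions, and only the figures "2 cycles" and "3 restarts/hour" come from the description above.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical limiter: returns a function that decides whether a restart
// may proceed at time `now` and records it if so.
function createRestartLimiter(
  checkIntervalMs: number, // assumed length of one health-check cycle
  cooldownCycles: number = 2,
  maxRestartsPerHour: number = 3,
): (now: number) => boolean {
  let restarts: number[] = [];

  return (now: number): boolean => {
    // Keep only restarts inside the sliding one-hour window.
    restarts = restarts.filter((t) => t > now - HOUR_MS);

    // Cooldown: wait at least `cooldownCycles` check cycles after the last restart.
    const last = restarts.length > 0 ? Math.max(...restarts) : undefined;
    if (last !== undefined && now - last < cooldownCycles * checkIntervalMs) {
      return false;
    }

    // Rate limit: cap the number of restarts per hour.
    if (restarts.length >= maxRestartsPerHour) return false;

    restarts.push(now);
    return true;
  };
}

// Usage with an assumed 5-minute check cycle.
const FIVE_MIN = 5 * 60 * 1000;
const tryRestart = createRestartLimiter(FIVE_MIN);
const first = tryRestart(0); // allowed
const second = tryRestart(FIVE_MIN); // blocked by the 2-cycle cooldown
```

With these guards, a socket that stays stale after a restart is retried at most three times in an hour rather than on every check cycle.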

Test plan

- All 16 existing health monitor tests pass unchanged
- 5 new tests for stale socket detection:
  - Restarts channel with no events past stale threshold
  - Skips channels with recent events
  - Skips channels still within startup grace window
  - Restarts channel that never received any event past threshold
  - Respects custom staleEventThresholdMs
- Manual testing: deploy to a live gateway and verify lastEventAt is populated in health snapshots
- Manual testing: simulate stale socket by blocking Slack events and verify auto-restart triggers

greptile-apps · 2026-03-01T00:20:49Z

Greptile Summary

Implemented stale Slack socket detection to automatically restart channels when the WebSocket connection silently stops delivering events. The implementation correctly tracks event liveness across message handlers and most event types (reactions, members, channels, pins), with comprehensive test coverage.

Key changes:

- Added trackEvent callback that updates lastEventAt and lastInboundAt on every inbound Slack event
- Extended health monitor with stale detection logic: channels running longer than threshold (default 30 min) without events are automatically restarted
- Added 5 new tests covering various stale socket scenarios

Architecture:

- Event tracking flows through monitorSlackProvider → event handlers → trackEvent() → setStatus()
- Health monitor checks both uptime and event age before marking as unhealthy
- Graceful handling of non-Slack channels (skip stale detection if fields not present)
- Existing rate limiting and cooldown mechanisms prevent restart storms

Minor observation: Interaction events (button clicks, modal submissions) don't currently track liveness, which could be a consideration for interaction-heavy workspaces.

Confidence Score: 4/5

- This PR is safe to merge with minimal risk: it adds defensive monitoring without modifying core event handling logic
- Score reflects solid implementation with comprehensive tests and no critical issues found. The stale socket detection logic is correct, event tracking is properly wired through most event types, and existing health monitor safeguards (cooldown, rate limiting) prevent runaway restarts. Minor enhancement opportunity exists for interaction event tracking, but this doesn't affect correctness or safety.
- No files require special attention: implementation is straightforward and well-tested

Last reviewed commit: f5a6ce7

greptile-apps

11 files reviewed, 1 comment


greptile-apps · 2026-03-01T00:20:53Z

src/slack/monitor/events.ts

  ctx: SlackMonitorContext;
  account: ResolvedSlackAccount;
  handleSlackMessage: SlackMessageHandler;
+  /** Called on each inbound event to update liveness tracking. */


Consider passing trackEvent to registerSlackInteractionEvents so that Block Kit interactions (button clicks, modal submissions) also update event liveness. Currently, workspaces that primarily use interactive features won't benefit from stale socket detection as effectively.
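One way the suggestion might look is sketched below. The `InteractionApp` interface is a stand-in rather than the real Slack app type, and `registerInteractionEventsSketch` with its parameter object is hypothetical; the actual `registerSlackInteractionEvents` signature is not shown in this PR.

```typescript
// Stand-in for the slice of a Slack app object used here; not the real Bolt types.
type AckArgs = { ack: () => Promise<void> };
type InteractionHandler = (args: AckArgs) => Promise<void>;

interface InteractionApp {
  action(matcher: RegExp, handler: InteractionHandler): void;
  view(matcher: RegExp, handler: InteractionHandler): void;
}

// Hypothetical signature: trackEvent is optional so existing callers keep working.
function registerInteractionEventsSketch(params: {
  app: InteractionApp;
  trackEvent?: () => void;
}): void {
  const { app, trackEvent } = params;
  const withLiveness: InteractionHandler = async ({ ack }) => {
    trackEvent?.(); // runs synchronously, before the first await
    await ack();
  };
  app.action(/.*/, withLiveness);
  app.view(/.*/, withLiveness);
}

// In-memory fake that records the registered handlers.
const registered: InteractionHandler[] = [];
const fakeApp: InteractionApp = {
  action: (_matcher, handler) => {
    registered.push(handler);
  },
  view: (_matcher, handler) => {
    registered.push(handler);
  },
};

let livenessUpdates = 0;
registerInteractionEventsSketch({
  app: fakeApp,
  trackEvent: () => {
    livenessUpdates += 1;
  },
});
for (const handler of registered) {
  void handler({ ack: async () => {} });
}
```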


Takhoffman · 2026-03-01T16:58:35Z

Landed salvage patch for stale-socket liveness tracking hardening.

What changed

- trackEvent now runs only after mismatch and validity gates pass for Slack channel/member/pin/reaction/message handlers.
- Added regression tests to lock behavior:
  - mismatched events do not update liveness
  - accepted events do update liveness
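The ordering described here can be sketched as follows. The wrapper, the envelope shape, and the `isMismatched`/`isValid` predicates are hypothetical stand-ins for the real gates in the Slack handlers; the point is only that trackEvent runs after both checks pass.

```typescript
// Illustrative envelope: the real mismatch gate may inspect other fields.
type Envelope = { api_app_id?: string };

function createGatedHandler<E>(opts: {
  isMismatched: (envelope: Envelope) => boolean; // hypothetical gate
  isValid: (event: E) => boolean; // hypothetical gate
  trackEvent?: () => void;
  handle: (event: E) => void;
}): (event: E, envelope: Envelope) => boolean {
  return (event, envelope) => {
    // Dropped events must not count as proof that the socket is alive for this account.
    if (opts.isMismatched(envelope)) return false;
    if (!opts.isValid(event)) return false;

    // Only accepted events update liveness.
    opts.trackEvent?.();
    opts.handle(event);
    return true;
  };
}

// Usage: events for another app id are dropped without touching liveness.
let tracked = 0;
const handled: string[] = [];
const onPin = createGatedHandler<{ channel?: string }>({
  isMismatched: (envelope) => envelope.api_app_id !== "A123",
  isValid: (event) => typeof event.channel === "string",
  trackEvent: () => {
    tracked += 1;
  },
  handle: (event) => {
    handled.push(event.channel ?? "");
  },
});

const droppedResult = onPin({ channel: "C1" }, { api_app_id: "OTHER" });
const acceptedResult = onPin({ channel: "C1" }, { api_app_id: "A123" });
```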

Validation

pnpm check
pnpm vitest run src/slack/monitor/events/channels.test.ts src/slack/monitor/events/members.test.ts src/slack/monitor/events/pins.test.ts src/slack/monitor/events/reactions.test.ts src/slack/monitor/message-handler.test.ts

Impact

- Prevents false-positive liveness updates from dropped/irrelevant inbound events.
- Reduces risk of masking stale Slack sockets and delayed auto-restart.

Residual risk

Some event classes outside this path may still contribute liveness differently; monitor restart telemetry for first 24h.

Cherry-pick of upstream a28a4b1.

* feat: detect stale Slack sockets and auto-restart

  Slack Socket Mode connections can silently stop delivering events while still appearing connected (health checks pass, WebSocket stays open). This "half-dead socket" problem causes messages to go unanswered.

  This commit adds two layers of protection:

  1. **Event liveness tracking**: Every inbound Slack event (messages, reactions, member joins/leaves, channel events, pins) now calls `setStatus({ lastEventAt, lastInboundAt })` to update the channel account snapshot with the timestamp of the last received event.
  2. **Health monitor stale socket detection**: The channel health monitor now checks `lastEventAt` against a configurable threshold (default 30 minutes). If a channel has been running longer than the threshold and hasn't received any events in that window, it is flagged as unhealthy and automatically restarted, the same way disconnected or crashed channels are already handled.

  The restart reason is logged as "stale-socket" for observability, and the existing cooldown/rate-limit logic (3 restarts/hour max) prevents restart storms.

* Slack: gate liveness tracking to accepted events

---------

Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>

openclaw-barnacle bot added channel: slack Channel integration: slack gateway Gateway runtime size: M labels Mar 1, 2026

greptile-apps bot reviewed Mar 1, 2026

View reviewed changes

derankin and others added 2 commits March 1, 2026 10:44

Slack: gate liveness tracking to accepted events

9c0f336

Takhoffman force-pushed the slack-stale-socket-detection branch from f5a6ce7 to 9c0f336 Compare March 1, 2026 16:53

openclaw-barnacle bot added size: L and removed size: M labels Mar 1, 2026

Takhoffman merged commit a28a4b1 into openclaw:main Mar 1, 2026
26 of 27 checks passed

vabole mentioned this pull request Mar 1, 2026

ops: decision needed for upstream sync (1 disputed files) vabole/openclaw#4

Merged

1 task

github-actions bot mentioned this pull request Mar 1, 2026

📡 Upstream Digest — 2026-03-01 18:22 UTC curtismercier/openclaw-mods#154

Open

zooqueen added a commit to hanzoai/bot that referenced this pull request Mar 1, 2026

feat: detect stale Slack sockets and auto-restart (openclaw#30153)

312aca2

Cherry-pick of upstream a28a4b1.

Diaspar4u mentioned this pull request Mar 3, 2026

fix(gateway): skip stale-socket check for channels without event tracking #33393

Open

zooqueen added a commit to hanzoai/bot that referenced this pull request Mar 6, 2026

feat: detect stale Slack sockets and auto-restart (openclaw#30153)

5627fd7

Cherry-pick of upstream a28a4b1.

Takhoffman merged 2 commits into openclaw:main from derankin:slack-stale-socket-detection
