
fix(slack): reconnect socket mode after DNS failure via watchdog#27241

Closed
byungsker wants to merge 1 commit into openclaw:main from byungsker:fix/slack-socket-mode-reconnect-dns-27077

Conversation

@byungsker
Contributor

Problem

When a transient DNS failure causes the Slack socket mode WebSocket to close, SocketModeClient calls delayReconnectAttempt(this.start) to reconnect. Internally, the WebClient retries apps.connections.open up to 100 times (hence the 2,800+ retry count in the report). If those retries are exhausted before DNS recovers, the resulting RequestError is swallowed as an unhandled rejection inside delayReconnectAttempt, and:

  • State.Disconnected is never emitted
  • No new State.Connected event arrives
  • The socket is permanently dead; the gateway requires a manual restart

This is the same class of bug as #13506 (WhatsApp).

Root cause

WebSocket closes
  → SocketModeClient.on(close) fires
  → delayReconnectAttempt(this.start) is called
    → this.start() → retrieveWSSURL() → apps.connections.open() (100 HTTP retries)
    → all retries fail (ENOTFOUND) → ErrorCode.RequestError → isRecoverable = false → throws
  → error propagates into delayReconnectAttempt().then(res) with no .catch()
  → unhandled rejection — socket silently dead forever
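The chain above can be reproduced in miniature. This is a hedged stand-in, not the SDK's actual source: delayReconnectAttempt and start below are simplified sketches of the pattern, showing how a .then() with no .catch() turns a terminal retry failure into a silent unhandled rejection.

```typescript
// Minimal reproduction of the failure pattern (illustrative, not SDK code).
const swallowed: string[] = [];

// Stand-in for the SDK's delayed-reconnect helper: it chains .then() on the
// reconnect attempt but attaches no .catch(), so a rejection is unhandled.
function delayReconnectAttempt(fn: () => Promise<string>): void {
  setTimeout(() => {
    fn().then((res) => {
      console.log("reconnected:", res); // never reached when fn rejects
    }); // no .catch() here -- the rejection escapes silently
  }, 10);
}

// Stand-in for this.start() after apps.connections.open exhausts its retries.
async function start(): Promise<string> {
  throw new Error("ENOTFOUND");
}

// In the real gateway nothing observes this, so the socket just dies;
// the handler here only makes the swallowed error visible for the demo.
process.on("unhandledRejection", (err) => {
  swallowed.push((err as Error).message);
});

delayReconnectAttempt(start);
```

No State.Disconnected, no State.Connected, no crash: from the outside the process looks healthy while the socket is gone.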

Fix

After a successful app.start() in socket mode, attach listeners to the underlying SocketModeClient (accessed via the internal receiver) for its "close" and "connected" events:

  • On "close" — start a 3-minute watchdog timer (long enough for the SDK's own HTTP retry loop to succeed during normal transient outages).
  • On "connected" — clear the timer (SDK reconnected on its own; nothing to do).
  • If the timer fires — the SDK gave up silently; call app.stop() then app.start() again on the same App instance (same event handlers, new WebSocket) and back off exponentially (1 s → 30 s) between attempts.

HTTP mode is unaffected.
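The close/connected/watchdog logic above can be sketched as a small supervisor over anything that emits "close" and "connected" events. This is a hypothetical helper, not the PR's actual diff: watchSocket, the baseBackoffMs knob, and the restart callback (standing in for app.stop() + app.start()) are illustrative; only the event names, the 3-minute default, and the 1 s → 30 s backoff come from the description.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical watchdog sketch. `client` stands in for the SocketModeClient;
// `restart` stands in for an app.stop()/app.start() cycle on the same App.
function watchSocket(
  client: EventEmitter,
  restart: () => Promise<void>,
  watchdogMs: number = 3 * 60_000,   // generous: outlives the SDK's own retry loop
  baseBackoffMs: number = 1_000,     // exponential backoff start, capped at 30 s
): () => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  let attempt = 0;

  // Watchdog fired: the SDK gave up silently. Back off, then force a restart.
  const fire = async (): Promise<void> => {
    timer = undefined;
    const backoff = Math.min(30_000, baseBackoffMs * 2 ** attempt);
    attempt += 1;
    await new Promise((resolve) => setTimeout(resolve, backoff));
    await restart();
  };

  // On "close": arm the watchdog and give the SDK a chance to recover itself.
  const onClose = (): void => {
    if (timer === undefined) {
      timer = setTimeout(fire, watchdogMs);
    }
  };

  // On "connected": the SDK reconnected on its own; stand down and reset backoff.
  const onConnected = (): void => {
    if (timer !== undefined) {
      clearTimeout(timer);
      timer = undefined;
    }
    attempt = 0;
  };

  client.on("close", onClose);
  client.on("connected", onConnected);

  // Detach function: the caller removes listeners (e.g. in a finally block)
  // so reconnect cycles never stack duplicates.
  return () => {
    client.off("close", onClose);
    client.off("connected", onConnected);
    if (timer !== undefined) clearTimeout(timer);
  };
}
```

Because restarting goes through the same App instance, registered event handlers survive; only the WebSocket is replaced.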

Implementation notes

  • SocketModeClient emits "close" whenever the underlying SlackWebSocket disconnects (pong timeout, server close frame, etc.) — this is the reliable hook point.
  • SocketModeClient.start() resets this.shuttingDown = false and creates a fresh SlackWebSocket, so calling it again after stop() is safe.
  • The 3-minute watchdog is intentionally generous: SocketModeClient uses retryConfig: { retries: 100, factor: 1.3 } for apps.connections.open, so a short watchdog would race against the SDK's own recovery.
  • Listeners are removed in the finally block of each iteration to prevent duplicates across reconnect cycles.
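The last note's attach/detach-in-finally pattern can be shown in isolation. The oneCycle helper below is hypothetical; only the "close"/"connected" event names come from the notes. The point is that each reconnect iteration always removes its own listeners, even if the body throws.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical per-iteration listener hygiene: attach handlers, run the
// iteration body, and always detach in `finally` so repeated reconnect
// cycles never accumulate duplicate listeners on the client.
async function oneCycle(
  client: EventEmitter,
  body: () => Promise<void>,
): Promise<void> {
  const onClose = (): void => { /* would start the watchdog timer */ };
  const onConnected = (): void => { /* would clear the watchdog timer */ };
  client.on("close", onClose);
  client.on("connected", onConnected);
  try {
    await body(); // runs until the watchdog forces a restart or shutdown
  } finally {
    client.off("close", onClose);       // guaranteed cleanup, even on throw
    client.off("connected", onConnected);
  }
}
```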

Fixes #27077

When the Bolt SDK's socket mode WebSocket closes due to a transient DNS
failure, SocketModeClient calls delayReconnectAttempt(this.start) to
reconnect.  The internal WebClient retries apps.connections.open up to
100 times; if those retries are exhausted before DNS recovers, the
resulting RequestError is swallowed (unhandled rejection inside
delayReconnectAttempt) and the socket goes permanently dead — neither
State.Disconnected nor a new State.Connected event is ever emitted.

Fix: after a successful app.start() in socket mode, attach listeners to
the underlying SocketModeClient for the "close" and "connected" events.
On "close", start a watchdog timer (3 min — enough time for the SDK's
own retry loop to succeed in normal transient outages).  If "connected"
arrives the timer is cleared.  If the timer fires it means the SDK gave
up silently; we call app.stop() and app.start() again (same App instance,
new WebSocket), then back off exponentially (1 s → 30 s) before the next
attempt.

For HTTP mode the existing block-until-abort behaviour is unchanged.

Fixes openclaw#27077
@greptile-apps
Contributor

greptile-apps bot commented Feb 26, 2026

Greptile Summary

Adds a watchdog-based reconnection mechanism for Slack socket mode to recover from a known @slack/bolt SDK failure mode where DNS failures exhaust the internal HTTP retry loop, silently killing the WebSocket with no recovery path.

  • Wraps the socket mode lifecycle in a while loop that monitors SocketModeClient "close" / "connected" events via a 3-minute watchdog timer
  • If the SDK fails to reconnect within the watchdog window, forces a full app.stop() / app.start() cycle with exponential backoff (1s to 30s)
  • Properly cleans up event listeners in finally blocks to prevent leaks across reconnect cycles
  • HTTP mode is unchanged aside from moving the abortSignal wait into the else branch (previously shared)
  • Well-commented implementation with clear rationale for the 3-minute watchdog threshold

Confidence Score: 4/5

  • This PR is safe to merge — it adds a resilience layer around an existing SDK limitation with careful cleanup and no regressions to HTTP mode.
  • Score of 4 reflects a well-implemented, narrowly scoped fix with thorough comments and proper resource cleanup. The internal cast to access SocketModeClient is a necessary trade-off that degrades gracefully (no-op if the cast fails). The only gap is the absence of unit tests, though the reconnect behavior is difficult to test without mocking SDK internals. No logical errors or security concerns found.
  • No files require special attention.

Last reviewed commit: 13fb8f2

Contributor

@markshields-tl left a comment


Review

Best root cause analysis of the three PRs targeting #17847. The insight that delayReconnectAttempt swallows the RequestError as an unhandled rejection — leaving the socket permanently dead with no event emitted — is the key finding.

What's strong:

  • The 3-minute watchdog timer is the right approach for catching the "silent death" where no disconnect event fires
  • Correctly identifies that SocketModeClient internal retry exhaustion is a terminal failure with no external signal
  • close → start timer, connected → clear timer pattern is clean

One concern:

  • This only catches the DNS-failure-triggered variant. PR #27232 covers a broader set of disconnect/error events. Ideally both approaches should be combined: event-driven reconnect for cases where events DO fire, plus a staleness watchdog for cases where they don't.

Suggestion: The codebase already tracks lastInboundAt on every inbound event. The watchdog timer could also check this — if lastInboundAt is stale by N minutes while the socket claims connected, that's a smoking gun regardless of what caused the silence.
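The suggested staleness check reduces to a small predicate. Only the lastInboundAt name comes from the comment; isSocketStale and the 5-minute threshold are illustrative assumptions, not code from either PR.

```typescript
// Hypothetical staleness predicate: the socket claims to be connected, yet
// nothing inbound has arrived for longer than the threshold -- treat the
// connection as dead regardless of which disconnect event was lost.
function isSocketStale(
  lastInboundAt: number,              // epoch ms of the last inbound Slack event
  claimsConnected: boolean,           // what the socket currently reports
  now: number = Date.now(),
  thresholdMs: number = 5 * 60_000,   // illustrative N-minute threshold
): boolean {
  return claimsConnected && now - lastInboundAt > thresholdMs;
}
```

Run periodically, this catches silent deaths even when neither "close" nor an error ever fires, complementing the event-driven watchdog.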

Would love to see this merged alongside or combined with #27232.

— Mort (AI assistant reviewing on behalf of @markshields-tl)

@Takhoffman
Contributor

Thanks for the contribution and reliability focus.

Closing as superseded by #27232, which is the canonical reconnect fix selected for this workstream's deduped landing path.

@Takhoffman Takhoffman closed this Mar 1, 2026

Labels

  • channel: slack (Channel integration: slack)
  • close:superseded (PR close reason)
  • size: S


Development

Successfully merging this pull request may close these issues.

Slack socket mode connection does not recover after transient DNS failure (same root cause as #13506)
