
fix(slack): reconnect socket mode after DNS failure via watchdog#27241

Closed
byungsker wants to merge 1 commit into openclaw:main from byungsker:fix/slack-socket-mode-reconnect-dns-27077

Conversation

@byungsker
Contributor

Problem

When a transient DNS failure causes the Slack socket mode WebSocket to close, SocketModeClient calls delayReconnectAttempt(this.start) to reconnect. Internally, the WebClient retries apps.connections.open up to 100 times (hence the 2,800+ retry count in the report). If those retries are exhausted before DNS recovers, the resulting RequestError is swallowed as an unhandled rejection inside delayReconnectAttempt, and:

  • State.Disconnected is never emitted
  • No new State.Connected event arrives
  • The socket is permanently dead; the gateway requires a manual restart

This is the same class of bug as #13506 (WhatsApp).

Root cause

WebSocket closes
  → SocketModeClient.on(close) fires
  → delayReconnectAttempt(this.start) is called
    → this.start() → retrieveWSSURL() → apps.connections.open() (100 HTTP retries)
    → all retries fail (ENOTFOUND) → ErrorCode.RequestError → isRecoverable = false → throws
  → error propagates into delayReconnectAttempt().then(res) with no .catch()
  → unhandled rejection — socket silently dead forever
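The chain above can be reproduced in miniature. This is a hedged stand-in, not the SDK's actual source: delayReconnectAttempt and start below are simplified sketches of the pattern, showing how a .then() with no .catch() turns a terminal retry failure into a silent unhandled rejection.

```typescript
// Minimal reproduction of the failure pattern (illustrative, not SDK code).
const swallowed: string[] = [];

// Stand-in for the SDK's delayed-reconnect helper: it chains .then() on the
// reconnect attempt but attaches no .catch(), so a rejection is unhandled.
function delayReconnectAttempt(fn: () => Promise<string>): void {
  setTimeout(() => {
    fn().then((res) => {
      console.log("reconnected:", res); // never reached when fn rejects
    }); // no .catch() here -- the rejection escapes silently
  }, 10);
}

// Stand-in for this.start() after apps.connections.open exhausts its retries.
async function start(): Promise<string> {
  throw new Error("ENOTFOUND");
}

// In the real gateway nothing observes this, so the socket just dies;
// the handler here only makes the swallowed error visible for the demo.
process.on("unhandledRejection", (err) => {
  swallowed.push((err as Error).message);
});

delayReconnectAttempt(start);
```

No State.Disconnected, no State.Connected, no crash: from the outside the process looks healthy while the socket is gone.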

Fix

After a successful app.start() in socket mode, attach listeners to the underlying SocketModeClient (accessed via the internal receiver) for its "close" and "connected" events:

  • On "close" — start a 3-minute watchdog timer (long enough for the SDK's own HTTP retry loop to succeed during normal transient outages).
  • On "connected" — clear the timer (SDK reconnected on its own; nothing to do).
  • If the timer fires — the SDK gave up silently; call app.stop() then app.start() again on the same App instance (same event handlers, new WebSocket) and back off exponentially (1 s → 30 s) between attempts.

HTTP mode is unaffected.
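The close/connected/watchdog logic above can be sketched as a small supervisor over anything that emits "close" and "connected" events. This is a hypothetical helper, not the PR's actual diff: watchSocket, the baseBackoffMs knob, and the restart callback (standing in for app.stop() + app.start()) are illustrative; only the event names, the 3-minute default, and the 1 s → 30 s backoff come from the description.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical watchdog sketch. `client` stands in for the SocketModeClient;
// `restart` stands in for an app.stop()/app.start() cycle on the same App.
function watchSocket(
  client: EventEmitter,
  restart: () => Promise<void>,
  watchdogMs: number = 3 * 60_000,   // generous: outlives the SDK's own retry loop
  baseBackoffMs: number = 1_000,     // exponential backoff start, capped at 30 s
): () => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  let attempt = 0;

  // Watchdog fired: the SDK gave up silently. Back off, then force a restart.
  const fire = async (): Promise<void> => {
    timer = undefined;
    const backoff = Math.min(30_000, baseBackoffMs * 2 ** attempt);
    attempt += 1;
    await new Promise((resolve) => setTimeout(resolve, backoff));
    await restart();
  };

  // On "close": arm the watchdog and give the SDK a chance to recover itself.
  const onClose = (): void => {
    if (timer === undefined) {
      timer = setTimeout(fire, watchdogMs);
    }
  };

  // On "connected": the SDK reconnected on its own; stand down and reset backoff.
  const onConnected = (): void => {
    if (timer !== undefined) {
      clearTimeout(timer);
      timer = undefined;
    }
    attempt = 0;
  };

  client.on("close", onClose);
  client.on("connected", onConnected);

  // Detach function: the caller removes listeners (e.g. in a finally block)
  // so reconnect cycles never stack duplicates.
  return () => {
    client.off("close", onClose);
    client.off("connected", onConnected);
    if (timer !== undefined) clearTimeout(timer);
  };
}
```

Because restarting goes through the same App instance, registered event handlers survive; only the WebSocket is replaced.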

Implementation notes

  • SocketModeClient emits "close" whenever the underlying SlackWebSocket disconnects (pong timeout, server close frame, etc.) — this is the reliable hook point.
  • SocketModeClient.start() resets this.shuttingDown = false and creates a fresh SlackWebSocket, so calling it again after stop() is safe.
  • The 3-minute watchdog is intentionally generous: SocketModeClient uses retryConfig: { retries: 100, factor: 1.3 } for apps.connections.open, so a short watchdog would race against the SDK's own recovery.
  • Listeners are removed in the finally block of each iteration to prevent duplicates across reconnect cycles.
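The last note's attach/detach-in-finally pattern can be shown in isolation. The oneCycle helper below is hypothetical; only the "close"/"connected" event names come from the notes. The point is that each reconnect iteration always removes its own listeners, even if the body throws.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical per-iteration listener hygiene: attach handlers, run the
// iteration body, and always detach in `finally` so repeated reconnect
// cycles never accumulate duplicate listeners on the client.
async function oneCycle(
  client: EventEmitter,
  body: () => Promise<void>,
): Promise<void> {
  const onClose = (): void => { /* would start the watchdog timer */ };
  const onConnected = (): void => { /* would clear the watchdog timer */ };
  client.on("close", onClose);
  client.on("connected", onConnected);
  try {
    await body(); // runs until the watchdog forces a restart or shutdown
  } finally {
    client.off("close", onClose);       // guaranteed cleanup, even on throw
    client.off("connected", onConnected);
  }
}
```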

Fixes #27077

When the Bolt SDK's socket mode WebSocket closes due to a transient DNS
failure, SocketModeClient calls delayReconnectAttempt(this.start) to
reconnect.  The internal WebClient retries apps.connections.open up to
100 times; if those retries are exhausted before DNS recovers, the
resulting RequestError is swallowed (unhandled rejection inside
delayReconnectAttempt) and the socket goes permanently dead — neither
State.Disconnected nor a new State.Connected event is ever emitted.

Fix: after a successful app.start() in socket mode, attach listeners to
the underlying SocketModeClient for the "close" and "connected" events.
On "close", start a watchdog timer (3 min — enough time for the SDK's
own retry loop to succeed in normal transient outages).  If "connected"
arrives the timer is cleared.  If the timer fires it means the SDK gave
up silently; we call app.stop() and app.start() again (same App instance,
new WebSocket), then back off exponentially (1 s → 30 s) before the next
attempt.

For HTTP mode the existing block-until-abort behaviour is unchanged.

Fixes openclaw#27077
@greptile-apps
Contributor

greptile-apps bot commented Feb 26, 2026

Greptile Summary

Adds a watchdog-based reconnection mechanism for Slack socket mode to recover from a known @slack/bolt SDK failure mode where DNS failures exhaust the internal HTTP retry loop, silently killing the WebSocket with no recovery path.

  • Wraps the socket mode lifecycle in a while loop that monitors SocketModeClient "close" / "connected" events via a 3-minute watchdog timer
  • If the SDK fails to reconnect within the watchdog window, forces a full app.stop() / app.start() cycle with exponential backoff (1s to 30s)
  • Properly cleans up event listeners in finally blocks to prevent leaks across reconnect cycles
  • HTTP mode is unchanged aside from moving the abortSignal wait into the else branch (previously shared)
  • Well-commented implementation with clear rationale for the 3-minute watchdog threshold

Confidence Score: 4/5

  • This PR is safe to merge — it adds a resilience layer around an existing SDK limitation with careful cleanup and no regressions to HTTP mode.
  • Score of 4 reflects a well-implemented, narrowly scoped fix with thorough comments and proper resource cleanup. The internal cast to access SocketModeClient is a necessary trade-off that degrades gracefully (no-op if the cast fails). The only gap is the absence of unit tests, though the reconnect behavior is difficult to test without mocking SDK internals. No logical errors or security concerns found.
  • No files require special attention.

Last reviewed commit: 13fb8f2

Contributor

@markshields-tl left a comment


Review

Best root cause analysis of the three PRs targeting #17847. The insight that delayReconnectAttempt swallows the RequestError as an unhandled rejection — leaving the socket permanently dead with no event emitted — is the key finding.

What's strong:

  • The 3-minute watchdog timer is the right approach for catching the "silent death" where no disconnect event fires
  • Correctly identifies that SocketModeClient internal retry exhaustion is a terminal failure with no external signal
  • close → start timer, connected → clear timer pattern is clean

One concern:

  • This only catches the DNS-failure-triggered variant. PR #27232 covers a broader set of disconnect/error events. Ideally both approaches should be combined: event-driven reconnect for cases where events DO fire, plus a staleness watchdog for cases where they don't.

Suggestion: The codebase already tracks lastInboundAt on every inbound event. The watchdog timer could also check this — if lastInboundAt is stale by N minutes while the socket claims connected, that's a smoking gun regardless of what caused the silence.
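The suggested staleness check reduces to a small predicate. Only the lastInboundAt name comes from the comment; isSocketStale and the 5-minute threshold are illustrative assumptions, not code from either PR.

```typescript
// Hypothetical staleness predicate: the socket claims to be connected, yet
// nothing inbound has arrived for longer than the threshold -- treat the
// connection as dead regardless of which disconnect event was lost.
function isSocketStale(
  lastInboundAt: number,              // epoch ms of the last inbound Slack event
  claimsConnected: boolean,           // what the socket currently reports
  now: number = Date.now(),
  thresholdMs: number = 5 * 60_000,   // illustrative N-minute threshold
): boolean {
  return claimsConnected && now - lastInboundAt > thresholdMs;
}
```

Run periodically, this catches silent deaths even when neither "close" nor an error ever fires, complementing the event-driven watchdog.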

Would love to see this merged alongside or combined with #27232.

— Mort (AI assistant reviewing on behalf of @markshields-tl)

@Takhoffman
Contributor

Thanks for the contribution and reliability focus.

Closing as superseded by #27232, which is the canonical reconnect fix selected for this workstream's deduped landing path.

@Takhoffman Takhoffman closed this Mar 1, 2026

Labels

  • channel: slack (Channel integration: slack)
  • close:superseded (PR close reason)
  • size: S


Development

Successfully merging this pull request may close these issues.

Slack socket mode connection does not recover after transient DNS failure (same root cause as #13506)
