fix(slack): reconnect socket mode after DNS failure via watchdog#27241
byungsker wants to merge 1 commit into openclaw:main
When the Bolt SDK's socket mode WebSocket closes due to a transient DNS failure, SocketModeClient calls delayReconnectAttempt(this.start) to reconnect. The internal WebClient retries apps.connections.open up to 100 times; if those retries are exhausted before DNS recovers, the resulting RequestError is swallowed (unhandled rejection inside delayReconnectAttempt) and the socket goes permanently dead: neither State.Disconnected nor a new State.Connected event is ever emitted.
Fix: after a successful app.start() in socket mode, attach listeners to the underlying SocketModeClient for the "close" and "connected" events. On "close", start a watchdog timer (3 min, enough time for the SDK's own retry loop to succeed in normal transient outages). If "connected" arrives, the timer is cleared. If the timer fires, the SDK has given up silently; we call app.stop() and app.start() again (same App instance, new WebSocket), then back off exponentially (1 s → 30 s) before the next attempt. For HTTP mode the existing block-until-abort behaviour is unchanged.
Fixes openclaw#27077
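The exponential back-off between watchdog-triggered restarts could be computed along the following lines. This is a minimal sketch, not the PR's code: the helper name backoffDelayMs is hypothetical, the 1 s floor and 30 s cap come from the description above, and the doubling factor is an assumption.

```typescript
// Hypothetical helper: delay before the n-th watchdog-triggered restart (attempt 0 is the first).
// The 1 s base and 30 s cap mirror the description; the factor of 2 is an assumption.
function backoffDelayMs(attempt: number, baseMs: number = 1000, capMs: number = 30000, factor: number = 2): number {
  if (!Number.isInteger(attempt) || attempt < 0) {
    throw new RangeError("attempt must be a non-negative integer");
  }
  // Grow geometrically, but never exceed the cap.
  return Math.min(capMs, baseMs * Math.pow(factor, attempt));
}

// Delays for the first seven attempts: 1000, 2000, 4000, 8000, 16000, 30000, 30000
const backoffSchedule: number[] = [0, 1, 2, 3, 4, 5, 6].map((n) => backoffDelayMs(n));
```

Capping the delay keeps a long outage from pushing the next restart arbitrarily far out, while the short initial delay lets a quick recovery reconnect almost immediately.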
Greptile Summary
Adds a watchdog-based reconnection mechanism for Slack socket mode to recover from a known
Confidence Score: 4/5
Last reviewed commit: 13fb8f2
markshields-tl left a comment
Review
Best root cause analysis of the three PRs targeting #17847. The insight that delayReconnectAttempt swallows the RequestError as an unhandled rejection — leaving the socket permanently dead with no event emitted — is the key finding.
What's strong:
- The 3-minute watchdog timer is the right approach for catching the "silent death" where no disconnect event fires
- Correctly identifies that SocketModeClient internal retry exhaustion is a terminal failure with no external signal
- close → start timer, connected → clear timer pattern is clean
One concern:
- This only catches the DNS-failure-triggered variant. PR #27232 covers a broader set of disconnect/error events. Ideally both approaches should be combined: event-driven reconnect for cases where events DO fire, plus a staleness watchdog for cases where they don't.
Suggestion: The codebase already tracks lastInboundAt on every inbound event. The watchdog timer could also check this — if lastInboundAt is stale by N minutes while the socket claims connected, that's a smoking gun regardless of what caused the silence.
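A staleness check of the kind suggested here might look like the sketch below. Only the lastInboundAt name comes from the comment above; the snapshot shape, the claimsConnected field, the function name and the five-minute threshold are all assumptions for illustration.

```typescript
// Hypothetical snapshot of connection state; the shape is an assumption for illustration.
interface ConnectionSnapshot {
  claimsConnected: boolean; // what the socket client currently reports
  lastInboundAt: number; // epoch milliseconds of the most recent inbound event
}

// True when the socket says it is connected but nothing has arrived for longer than maxSilenceMs.
function isSilentlyStale(snapshot: ConnectionSnapshot, nowMs: number, maxSilenceMs: number): boolean {
  if (!snapshot.claimsConnected) {
    return false; // a known disconnect is left to the normal reconnect path
  }
  return nowMs - snapshot.lastInboundAt > maxSilenceMs;
}

const FIVE_MINUTES_MS = 5 * 60 * 1000; // illustrative threshold, not taken from the PR
const sampleNowMs = 1700000000000;
const recentTraffic = isSilentlyStale({ claimsConnected: true, lastInboundAt: sampleNowMs - 60000 }, sampleNowMs, FIVE_MINUTES_MS);
const longSilence = isSilentlyStale({ claimsConnected: true, lastInboundAt: sampleNowMs - 360000 }, sampleNowMs, FIVE_MINUTES_MS);
```

The threshold is a trade-off: a workspace with little traffic may be silent for legitimate reasons, so too small a value could trigger needless restarts.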
Would love to see this merged alongside or combined with #27232.
— Mort (AI assistant reviewing on behalf of @markshields-tl)
Thanks for the contribution and reliability focus. Closing as superseded by #27232, which is the canonical reconnect fix selected for this workstream's deduped landing path.
Problem
When a transient DNS failure causes the Slack socket mode WebSocket to close, SocketModeClient calls delayReconnectAttempt(this.start) to reconnect. Internally, the WebClient retries apps.connections.open up to 100 times (hence the 2 800+ retry count in the report). If those retries are exhausted before DNS recovers, the resulting RequestError is swallowed (unhandled rejection inside delayReconnectAttempt) and:
- State.Disconnected is never emitted
- no new State.Connected event arrives
The socket is left permanently dead. This is the same class of bug as #13506 (WhatsApp).
Root cause
Fix
After a successful app.start() in socket mode, attach listeners to the underlying SocketModeClient (accessed via the internal receiver) for its "close" and "connected" events:
- "close": start a 3-minute watchdog timer (long enough for the SDK's own HTTP retry loop to succeed during normal transient outages).
- "connected": clear the timer (SDK reconnected on its own; nothing to do).
- Timer fires: call app.stop() then app.start() again on the same App instance (same event handlers, new WebSocket) and back off exponentially (1 s → 30 s) between attempts.
HTTP mode is unaffected.
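The close/connected/timer logic described above can be sketched as a small state holder. This is an illustration under stated assumptions, not the PR's implementation: ReconnectWatchdog, Scheduler and FakeScheduler are hypothetical names, and the timer is injected so the behaviour can be checked without waiting three real minutes.

```typescript
// Minimal timer abstraction (hypothetical) so the watchdog can run against a fake clock.
interface Scheduler {
  setTimeout(fn: () => void, ms: number): number;
  clearTimeout(id: number): void;
}

// Hypothetical watchdog: arm on "close", disarm on "connected", fire if neither happens in time.
class ReconnectWatchdog {
  private timerId: number | undefined = undefined;
  private readonly timeoutMs: number;
  private readonly onExpired: () => void;
  private readonly scheduler: Scheduler;

  constructor(timeoutMs: number, onExpired: () => void, scheduler: Scheduler) {
    this.timeoutMs = timeoutMs;
    this.onExpired = onExpired;
    this.scheduler = scheduler;
  }

  // Intended to be wired to the client's "close" event.
  handleClose(): void {
    if (this.timerId !== undefined) return; // already armed; keep the original deadline
    this.timerId = this.scheduler.setTimeout(() => {
      this.timerId = undefined;
      this.onExpired(); // the SDK is presumed to have given up silently
    }, this.timeoutMs);
  }

  // Intended to be wired to the client's "connected" event.
  handleConnected(): void {
    if (this.timerId === undefined) return;
    this.scheduler.clearTimeout(this.timerId);
    this.timerId = undefined;
  }
}

// Deterministic fake clock, for illustration and tests only.
class FakeScheduler implements Scheduler {
  private nextId = 1;
  private nowMs = 0;
  private pending: { id: number; at: number; fn: () => void }[] = [];

  setTimeout(fn: () => void, ms: number): number {
    const id = this.nextId++;
    this.pending.push({ id: id, at: this.nowMs + ms, fn: fn });
    return id;
  }

  clearTimeout(id: number): void {
    this.pending = this.pending.filter((t) => t.id !== id);
  }

  // Move the clock forward and run every timer whose deadline has passed.
  advance(ms: number): void {
    this.nowMs += ms;
    const due = this.pending.filter((t) => t.at <= this.nowMs);
    this.pending = this.pending.filter((t) => t.at > this.nowMs);
    due.forEach((t) => t.fn());
  }
}
```

In real use the scheduler would wrap the runtime's own timers, with the handle type adapted to whatever they return, and onExpired would perform the app.stop() / app.start() cycle described above.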
Implementation notes
- SocketModeClient emits "close" whenever the underlying SlackWebSocket disconnects (pong timeout, server close frame, etc.); this is the reliable hook point.
- SocketModeClient.start() resets this.shuttingDown = false and creates a fresh SlackWebSocket, so calling it again after stop() is safe.
- SocketModeClient uses retryConfig: { retries: 100, factor: 1.3 } for apps.connections.open, so a short watchdog would race against the SDK's own recovery.
- Listeners are removed in the finally block of each iteration to prevent duplicates across reconnect cycles.
Fixes #27077
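The cleanup-in-finally pattern from the last note can be illustrated with a stand-in emitter. Everything here is a hypothetical sketch: TinyEmitter replaces the real client (which is only assumed to expose comparable on/off methods), runIteration is an invented name, and the body is synchronous to keep the example deterministic, whereas the real loop would await inside the try block.

```typescript
type Listener = () => void;

// Tiny stand-in for an event emitter; not the SDK's class.
class TinyEmitter {
  private listeners: { [event: string]: Listener[] } = {};

  on(event: string, fn: Listener): void {
    this.listeners[event] = (this.listeners[event] || []).concat([fn]);
  }

  off(event: string, fn: Listener): void {
    this.listeners[event] = (this.listeners[event] || []).filter((l) => l !== fn);
  }

  emit(event: string): void {
    (this.listeners[event] || []).forEach((l) => l());
  }

  listenerCount(event: string): number {
    return (this.listeners[event] || []).length;
  }
}

// One reconnect iteration: attach listeners, run the body, always detach afterwards.
function runIteration(client: TinyEmitter, onClose: Listener, onConnected: Listener, body: () => void): void {
  client.on("close", onClose);
  client.on("connected", onConnected);
  try {
    body();
  } finally {
    // Runs on success and on failure, so listeners never pile up across cycles.
    client.off("close", onClose);
    client.off("connected", onConnected);
  }
}

// Three cycles, the second of which fails; each "close" should be handled exactly once.
const demoClient = new TinyEmitter();
let closeHandled = 0;
const listenersDuringBody: number[] = [];
for (let i = 0; i < 3; i++) {
  try {
    runIteration(demoClient, () => { closeHandled += 1; }, () => {}, () => {
      listenersDuringBody.push(demoClient.listenerCount("close"));
      demoClient.emit("close");
      if (i === 1) throw new Error("simulated failure");
    });
  } catch (_err) {
    // ignored for the demo; the next cycle starts with a clean emitter
  }
}
```

Without the finally block, the failing second cycle would leave its listeners attached, and the third "close" would be handled twice.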