Slack socket mode connection does not recover after transient DNS failure (same root cause as #13506) #27077

@mecampbellsoup

Summary

When the Slack socket mode WebSocket connection is lost due to a transient DNS failure (e.g., network change, WiFi dropout), the gateway process stays alive but the Slack channel remains unresponsive until the gateway is manually restarted with openclaw gateway stop && openclaw gateway install.

The Slack WebClient retries individual API calls indefinitely (observed 2800+ retries), but the underlying WebSocket is never re-established.

This is the same root cause as #13506 (WhatsApp), which was fixed in PR #9727 — but the fix was only applied to the WhatsApp channel, not Slack.

Log Evidence

Gateway 2026.2.24, macOS, socket mode, OpenAI Codex provider.

Healthy startup (21:41 UTC):

[slack] socket mode connected
[slack] users resolved: U087GL2J2PQ→U087GL2J2PQ

DNS failure begins (21:46 UTC) — agent completes LLM run but can't deliver reply:

[WARN] bolt-app http request failed getaddrinfo ENOTFOUND slack.com
[WARN] bolt-app http request failed getaddrinfo ENOTFOUND slack.com
[WARN] socket-mode:SlackWebSocket:1 A pong wasn't received from the server before the timeout of 5000ms!
[WARN] web-api:WebClient:125 http request failed getaddrinfo ENOTFOUND slack.com

DNS resolves again (verified via nslookup slack.com on the host), but the gateway never reconnects. The WebClient retry counter climbs to 2800+ over the next hour, while the socket mode connection stays dead.

Only fix: openclaw gateway stop && openclaw gateway install

Root Cause

The Slack channel monitor (src/slack/monitor.ts, visible in bundled reply-Cx57rl6c.js:38654) has no reconnect loop:

try {
    await app.start();
    runtime.log?.("slack socket mode connected");
    // Blocks forever — no reconnect on socket death
    await new Promise((resolve) => {
        opts.abortSignal?.addEventListener("abort", () => resolve(), { once: true });
    });
} finally {
    await app.stop().catch(() => void 0);
}

Once app.start() succeeds and the socket later dies, recovery depends entirely on Bolt SDK internals. When the SDK's SocketModeClient fails to re-establish the WebSocket (known issues: slackapi/node-slack-sdk#1495, slackapi/bolt-js#1151), the channel is permanently dead.

Compare with the WhatsApp fix in PR #9727, which wraps the equivalent listenerFactory() call in a retry loop with backoff and maxAttempts.

Expected Behavior

After a transient DNS failure resolves, the Slack socket mode connection should be re-established automatically using the same backoff/retry strategy as WhatsApp (#9727).

Reproduction

  1. Start gateway with Slack in socket mode
  2. Verify slack socket mode connected in logs
  3. Simulate DNS failure (e.g., switch WiFi networks, or temporarily block slack.com in /etc/hosts)
  4. Restore DNS
  5. Observe: WebClient retries individual API calls but socket mode never reconnects

Environment

  • OpenClaw: 2026.2.24
  • Node: 24.13.0
  • macOS Darwin 25.3.0
  • Slack mode: socket
  • Trigger: network change (inflight WiFi → airport WiFi)

Suggested Fix

Apply the same strategy as PR #9727 to the Slack channel monitor: wrap the app.start() + socket lifetime in a reconnect loop with exponential backoff and maxAttempts. On socket death or unrecoverable Bolt SDK error, tear down and retry the full app.start() cycle.
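
A minimal sketch of what such a reconnect loop could look like, written without access to the PR #9727 implementation. Every name here (runWithReconnect, backoffDelay, ReconnectOptions, runSession) is hypothetical and does not come from the OpenClaw source. It assumes runSession wraps the full app.start() / wait / app.stop() cycle, rejects when the connection cannot be started or fails, and resolves when a previously healthy session ends. How the monitor decides the socket is dead is left out and would need to be determined separately.

```typescript
// Hypothetical sketch: none of these names exist in OpenClaw or in PR #9727.
export interface ReconnectOptions {
  maxAttempts: number; // consecutive failures tolerated before giving up
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number; // upper bound on any single delay
  abortSignal?: AbortSignal; // same role as opts.abortSignal in the monitor
  sleep?: (ms: number) => Promise<void>; // injectable so tests can skip real waits
}

// Capped exponential backoff: base * 2^(attempt - 1), where attempt is 1-based.
export function backoffDelay(
  attempt: number,
  baseDelayMs: number,
  maxDelayMs: number,
): number {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
}

// Runs runSession repeatedly until the abort signal fires or maxAttempts
// consecutive failures have occurred.
export async function runWithReconnect(
  runSession: () => Promise<void>,
  opts: ReconnectOptions,
): Promise<"aborted" | "gave_up"> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));
  let failures = 0;

  while (!opts.abortSignal?.aborted) {
    try {
      await runSession();
      failures = 0; // the session ran and ended without error: reset the counter
    } catch {
      failures += 1;
      if (failures >= opts.maxAttempts) return "gave_up";
    }
    if (opts.abortSignal?.aborted) break;
    // After a clean session end, wait the base delay before reconnecting.
    await sleep(
      backoffDelay(Math.max(1, failures), opts.baseDelayMs, opts.maxDelayMs),
    );
  }
  return "aborted";
}
```

With baseDelayMs: 1000 and maxDelayMs: 30000, the waits between consecutive failed attempts would be 1 s, 2 s, 4 s, and so on, capped at 30 s. Resetting the failure counter after a clean session end means a long-lived gateway does not exhaust maxAttempts across unrelated outages. Whether that matches the WhatsApp behaviour is an assumption.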
