fix(whatsapp): catch connection-phase errors in reconnect loop by nikolasdehor · Pull Request #26785 · openclaw/openclaw

nikolasdehor · 2026-02-25T18:39:41Z

Summary

Combines the best parts of three closed PRs (#9727 by @luizlf, #14484 by @onthway, #17487) to fix the long-standing WhatsApp reconnection bug reported in #13371.

Problem: When the WhatsApp WebSocket connection hits a DNS resolution failure (ENOTFOUND), network timeout, or TLS error during the initial connection phase, the exception propagates uncaught out of the while(true) reconnect loop in monitorWebChannel. The channel exits permanently and never reconnects — even though the network may recover seconds later.

Additionally, after reaching maxAttempts, the monitor breaks out of the loop permanently ("Stopping web monitoring"), with no way to recover short of a full gateway restart.

Root cause: The listenerFactory() call (line ~199 in monitor.ts) is not wrapped in a try-catch. The existing reconnect logic only handles disconnects that arrive after a successful connection via listener.onClose. Connection-phase errors bypass the entire backoff/retry mechanism.

Changes

Catch connection-phase errors (src/web/auto-reply/monitor.ts): Wrap listener creation in try-catch. On failure, increment reconnectAttempts, update status with lastDisconnect info, apply exponential backoff, and continue the loop.
Periodic recovery after max attempts (src/web/auto-reply/monitor.ts): Instead of break after reaching maxAttempts, switch to periodic recovery (one attempt per heartbeatSeconds interval). The existing reconnectAttempts reset logic (when uptime exceeds heartbeat interval) handles counter reset after a successful recovery.
Socket leak prevention (src/web/inbound/monitor.ts): Wrap waitForWaConnection(sock) in try-catch that calls sock.ws?.close() on failure before re-throwing. Prevents FD/connection leaks when the connection handshake fails.

Testing

Verified on a macOS gateway (v2026.2.24) that experiences regular DNS failures after sleep/wake cycles. Before this fix, the WhatsApp channel would exit permanently after the first ENOTFOUND. After applying, the channel retries with backoff and recovers once the network is available.

Prior art

PR fix(whatsapp): retry reconnect loop on initial connection failure #9727 (closed by stale triage): initial connection try-catch
PR fix(whatsapp): catch connection-phase errors in reconnect loop #14484 (closed as dup): improved version with socket leak fix and richer status reporting
PR fix: WhatsApp connection stability - continue reconnection after max attempts #17487 (closed without merge): periodic recovery after max attempts

Closes #13371

Greptile Summary

Addresses a critical WhatsApp reconnection bug where DNS failures, timeouts, or TLS errors during the initial connection phase would cause the monitor to exit permanently instead of retrying.

Key changes:

Wrapped listenerFactory call in try-catch to treat connection-phase errors as retryable disconnects with exponential backoff (lines 193-243 in monitor.ts)
Changed max-attempts behavior from permanent exit (break) to periodic recovery with heartbeat-interval retries (lines 442-463 in monitor.ts)
Added socket cleanup (sock.ws?.close()) in monitorWebInbox when waitForWaConnection fails to prevent file descriptor leaks (lines 43-53 in inbound/monitor.ts)

Potential concern:

The reconnect counter reset logic (line 383) only triggers after successful connections that later disconnect, not for connection-phase failures. This means repeated connection-phase errors won't benefit from the "healthy stretch" reset, potentially causing the backoff to max out even with long delays between attempts.

Confidence Score: 4/5

This PR is safe to merge with one minor concern about reconnect counter behavior
The fix correctly addresses the root cause of permanent exit on connection-phase errors and adds proper socket cleanup. The logic is sound and well-tested according to the PR description. Score is 4 (not 5) due to the reconnect counter reset issue that could cause suboptimal backoff behavior in edge cases with intermittent connection-phase failures.
No files require special attention - changes are straightforward error handling improvements

_{Last reviewed commit: 330cd0a}

_{(4/5) You can add custom instructions or style guidelines for the agent here!}

…event permanent exit Combines fixes from closed PRs openclaw#9727, openclaw#14484, and openclaw#17487 to address openclaw#13371. Three changes: 1. Wrap listener creation in try-catch so DNS/timeout/TLS errors during the connection phase are treated as retryable disconnects instead of crashing the reconnect while-loop. 2. After max reconnect attempts, switch to periodic recovery (one attempt per heartbeat interval) instead of breaking out of the loop permanently. The network may recover after a transient outage. 3. Close the Baileys socket on connection failure in monitorWebInbox to prevent FD/connection leaks when waitForWaConnection throws. Closes openclaw#13371

greptile-apps

_{2 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-02-25T18:42:05Z

Additional Comments (1)

src/web/auto-reply/monitor.ts
The reconnect counter reset logic won't be reached when connection-phase errors occur (line 216), since those errors continue before reaching this code. This means connection-phase failures won't benefit from the "healthy stretch" reset that post-connection disconnects receive.

The result: if the listener repeatedly fails during connection (e.g., intermittent DNS), reconnectAttempts keeps climbing and never resets, even if there are long delays between failures. Consider applying the same "uptime reset" logic in the connection-phase catch block (after successful sleep but before continue).

Prompt To Fix With AI

This is a comment left during a code review.
Path: src/web/auto-reply/monitor.ts
Line: 381-384

Comment:
The reconnect counter reset logic won't be reached when connection-phase errors occur (line 216), since those errors `continue` before reaching this code. This means connection-phase failures won't benefit from the "healthy stretch" reset that post-connection disconnects receive.

The result: if the listener repeatedly fails during connection (e.g., intermittent DNS), `reconnectAttempts` keeps climbing and never resets, even if there are long delays between failures. Consider applying the same "uptime reset" logic in the connection-phase catch block (after successful sleep but before `continue`).

How can I resolve this? If you propose a fix, please make it concise.

Copilot

Pull request overview

This PR fixes a critical WhatsApp reconnection bug where connection-phase errors (DNS failures, timeouts, TLS errors) during the initial connection attempt would crash the reconnect loop permanently, even though the network might recover shortly after. The fix wraps listener creation in try-catch, applies exponential backoff on connection failures, and switches to periodic recovery after exhausting max reconnection attempts.

Changes:

Wrap waitForWaConnection in try-catch with socket cleanup to prevent file descriptor leaks on connection failures
Catch connection-phase errors in the reconnect loop and treat them as retryable disconnects with backoff
Replace permanent exit after max attempts with periodic recovery attempts using the heartbeat interval

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
src/web/inbound/monitor.ts	Adds socket cleanup (ws.close) when waitForWaConnection fails to prevent connection/FD leaks
src/web/auto-reply/monitor.ts	Wraps listener creation in try-catch to handle connection-phase errors; switches from permanent exit to periodic recovery after max attempts

Copilot · 2026-02-25T18:44:11Z

src/web/auto-reply/monitor.ts

+      // Instead of giving up permanently, wait one heartbeat interval and
+      // retry — the network may recover after a transient outage.
+      try {
+        await sleep(heartbeatSeconds * 1000, abortSignal);
+      } catch {
+        break;
+      }
+      continue;


The existing test "stops after hitting max reconnect attempts" at line 191-230 expects the monitor to exit after reaching max reconnection attempts. With this change, the monitor now continues retrying indefinitely with periodic recovery attempts instead of stopping, which will cause this test to hang.

The test needs to be updated to reflect the new never-terminate behavior, or it needs to use an abort signal to stop the monitor after verifying the periodic recovery behavior.

Copilot · 2026-02-25T18:44:11Z

src/web/auto-reply/monitor.ts

+    } catch (err) {
+      // Connection-phase failure (DNS, timeout, TLS, etc.) — treat as a
+      // retryable disconnect instead of crashing the reconnect loop.
+      const errorStr = formatError(err);
+      reconnectAttempts += 1;
+      status.connected = false;
+      status.lastEventAt = Date.now();
+      status.lastDisconnect = {
+        at: status.lastEventAt,
+        status: undefined,
+        error: errorStr,
+        loggedOut: false,
+      };
+      status.lastError = errorStr;
+      status.reconnectAttempts = reconnectAttempts;
+      emitStatus();
+
+      reconnectLogger.warn(
+        { connectionId, error: errorStr, reconnectAttempts },
+        "web reconnect: initial connection failed; will retry",
+      );
+      runtime.error(
+        `WhatsApp Web connection failed (${errorStr}). Retry ${reconnectAttempts}/${reconnectPolicy.maxAttempts || "∞"} in ${formatDurationPrecise(computeBackoff(reconnectPolicy, reconnectAttempts))}…`,
+      );
+
+      const delay = computeBackoff(reconnectPolicy, reconnectAttempts);
+      try {
+        await sleep(delay, abortSignal);
+      } catch {
+        break;
+      }
+      continue;
+    }


The connection-phase error handler does not check if max reconnection attempts have been reached. This creates an inconsistency with the post-connection error handling (lines 442-464), which switches to periodic recovery mode after max attempts.

When connection-phase errors occur repeatedly (e.g., persistent DNS failures during initial connection), the code will continue using exponential backoff indefinitely instead of switching to the periodic recovery mode (fixed heartbeat interval) like the post-connection path does.

This should check if reconnectAttempts >= maxAttempts (when maxAttempts > 0) and switch to periodic recovery by waiting heartbeatSeconds * 1000 instead of using computeBackoff, consistent with the logic at lines 442-464.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 330cd0a797

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-25T18:45:05Z

src/web/auto-reply/monitor.ts

+        status: undefined,
+        error: errorStr,
+        loggedOut: false,


Surface logged-out failures during initial connect

This reconnect catch treats every startup failure as transient by hard-coding loggedOut: false, but monitorWebInbox() can throw a pre-open close from waitForWaConnection(sock) when WhatsApp returns DisconnectReason.loggedOut (expired/revoked session). In that case the monitor now retries forever and never reaches the existing relink flow, so users with invalid creds get stuck in a reconnect loop instead of being told to re-run login.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-02-25T18:45:05Z

src/web/auto-reply/monitor.ts

+        await sleep(delay, abortSignal);
+      } catch {
+        break;
+      }
+      continue;


Honor one-shot mode on initial connection failure

This new error path always sleeps and continues the outer while (true) loop, even when keepAlive is false. keepAlive=false is the single-run mode (the function returns after one successful connect), so callers now block in an unbounded retry loop if the first connection attempt fails (for example DNS/TLS/network startup errors) instead of getting a failure back.

Useful? React with 👍 / 👎.

When connection-phase errors (DNS, TLS, timeout) occur repeatedly but with long gaps between them, the reconnect counter kept climbing and never reset. This caused the backoff to escalate permanently even after the network recovered. Add a time-based reset that mirrors the existing "healthy stretch" logic for post-connection disconnects: if the last connection-phase failure was more than one heartbeat interval ago, reset the counter before incrementing.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 104a6fbe0d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-03T03:51:25Z

src/web/auto-reply/monitor.ts

+      const delay = computeBackoff(reconnectPolicy, reconnectAttempts);
+      try {
+        await sleep(delay, abortSignal);


Enforce max-attempt policy for initial connect failures

When listenerFactory fails before a socket is established (e.g., DNS/TLS errors during startup), this catch path always computes backoff and continues, so it never reaches the later maxAttempts handling block in monitorWebChannel. In practice, web.reconnect.maxAttempts is ignored for connection-phase failures, which can cause indefinite high-frequency retries/log churn (especially with a small maxMs) instead of switching to the intended degraded heartbeat-paced retry mode.

Useful? React with 👍 / 👎.

…ling - P1: Detect loggedOut status in connection-phase errors via getStatusCode instead of hardcoding loggedOut: false - P2: Respect keepAlive=false (single-run mode) — break instead of retrying on connection-phase failures - P2: Enforce maxAttempts policy for connection-phase errors, consistent with post-connection handling Updated e2e tests to cover all three scenarios and reflect degraded-mode behavior (continue retrying after max attempts instead of stopping).

openclaw-barnacle bot added channel: whatsapp-web Channel integration: whatsapp-web size: S labels Feb 25, 2026

greptile-apps bot reviewed Feb 25, 2026

View reviewed changes

Copilot AI reviewed Feb 25, 2026

View reviewed changes

chatgpt-codex-connector bot reviewed Feb 25, 2026

View reviewed changes

This comment was marked as spam.

Sign in to view

chatgpt-codex-connector bot reviewed Mar 3, 2026

View reviewed changes

openclaw-barnacle bot added size: M and removed size: S labels Mar 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(whatsapp): catch connection-phase errors in reconnect loop#26785

fix(whatsapp): catch connection-phase errors in reconnect loop#26785
nikolasdehor wants to merge 3 commits intoopenclaw:mainfrom
nikolasdehor:fix/whatsapp-reconnect-dns-v2

nikolasdehor commented Feb 25, 2026 •

edited by greptile-apps bot

Loading

Uh oh!

greptile-apps bot left a comment

Uh oh!

greptile-apps bot commented Feb 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 25, 2026

Uh oh!

Copilot AI Feb 25, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Feb 25, 2026

Uh oh!

chatgpt-codex-connector bot Feb 25, 2026

Uh oh!

This comment was marked as spam.

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Mar 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

nikolasdehor commented Feb 25, 2026 • edited by greptile-apps bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Testing

Prior art

Greptile Summary

Confidence Score: 4/5

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Feb 25, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Feb 25, 2026

Choose a reason for hiding this comment

Uh oh!

This comment was marked as spam.

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nikolasdehor commented Feb 25, 2026 •

edited by greptile-apps bot

Loading