Skip to content

fix(whatsapp): catch connection-phase errors in reconnect loop#26785

Open
nikolasdehor wants to merge 3 commits intoopenclaw:mainfrom
nikolasdehor:fix/whatsapp-reconnect-dns-v2
Open

fix(whatsapp): catch connection-phase errors in reconnect loop#26785
nikolasdehor wants to merge 3 commits intoopenclaw:mainfrom
nikolasdehor:fix/whatsapp-reconnect-dns-v2

Conversation

@nikolasdehor
Copy link

@nikolasdehor nikolasdehor commented Feb 25, 2026

Summary

Combines the best parts of three closed PRs (#9727 by @luizlf, #14484 by @onthway, #17487) to fix the long-standing WhatsApp reconnection bug reported in #13371.

Problem: When the WhatsApp WebSocket connection hits a DNS resolution failure (ENOTFOUND), network timeout, or TLS error during the initial connection phase, the exception propagates uncaught out of the while(true) reconnect loop in monitorWebChannel. The channel exits permanently and never reconnects — even though the network may recover seconds later.

Additionally, after reaching maxAttempts, the monitor breaks out of the loop permanently ("Stopping web monitoring"), with no way to recover short of a full gateway restart.

Root cause: The listenerFactory() call (line ~199 in monitor.ts) is not wrapped in a try-catch. The existing reconnect logic only handles disconnects that arrive after a successful connection via listener.onClose. Connection-phase errors bypass the entire backoff/retry mechanism.

Changes

  1. Catch connection-phase errors (src/web/auto-reply/monitor.ts): Wrap listener creation in try-catch. On failure, increment reconnectAttempts, update status with lastDisconnect info, apply exponential backoff, and continue the loop.

  2. Periodic recovery after max attempts (src/web/auto-reply/monitor.ts): Instead of break after reaching maxAttempts, switch to periodic recovery (one attempt per heartbeatSeconds interval). The existing reconnectAttempts reset logic (when uptime exceeds heartbeat interval) handles counter reset after a successful recovery.

  3. Socket leak prevention (src/web/inbound/monitor.ts): Wrap waitForWaConnection(sock) in try-catch that calls sock.ws?.close() on failure before re-throwing. Prevents FD/connection leaks when the connection handshake fails.

Testing

Verified on a macOS gateway (v2026.2.24) that experiences regular DNS failures after sleep/wake cycles. Before this fix, the WhatsApp channel would exit permanently after the first ENOTFOUND. After applying, the channel retries with backoff and recovers once the network is available.

Prior art

Closes #13371

Greptile Summary

Addresses a critical WhatsApp reconnection bug where DNS failures, timeouts, or TLS errors during the initial connection phase would cause the monitor to exit permanently instead of retrying.

Key changes:

  • Wrapped listenerFactory call in try-catch to treat connection-phase errors as retryable disconnects with exponential backoff (lines 193-243 in monitor.ts)
  • Changed max-attempts behavior from permanent exit (break) to periodic recovery with heartbeat-interval retries (lines 442-463 in monitor.ts)
  • Added socket cleanup (sock.ws?.close()) in monitorWebInbox when waitForWaConnection fails to prevent file descriptor leaks (lines 43-53 in inbound/monitor.ts)

Potential concern:

  • The reconnect counter reset logic (line 383) only triggers after successful connections that later disconnect, not for connection-phase failures. This means repeated connection-phase errors won't benefit from the "healthy stretch" reset, potentially causing the backoff to max out even with long delays between attempts.

Confidence Score: 4/5

  • This PR is safe to merge with one minor concern about reconnect counter behavior
  • The fix correctly addresses the root cause of permanent exit on connection-phase errors and adds proper socket cleanup. The logic is sound and well-tested according to the PR description. Score is 4 (not 5) due to the reconnect counter reset issue that could cause suboptimal backoff behavior in edge cases with intermittent connection-phase failures.
  • No files require special attention - changes are straightforward error handling improvements

Last reviewed commit: 330cd0a

(4/5) You can add custom instructions or style guidelines for the agent here!

…event permanent exit

Combines fixes from closed PRs openclaw#9727, openclaw#14484, and openclaw#17487 to address openclaw#13371.

Three changes:

1. Wrap listener creation in try-catch so DNS/timeout/TLS errors during
   the connection phase are treated as retryable disconnects instead of
   crashing the reconnect while-loop.

2. After max reconnect attempts, switch to periodic recovery (one attempt
   per heartbeat interval) instead of breaking out of the loop permanently.
   The network may recover after a transient outage.

3. Close the Baileys socket on connection failure in monitorWebInbox to
   prevent FD/connection leaks when waitForWaConnection throws.

Closes openclaw#13371
@openclaw-barnacle openclaw-barnacle bot added channel: whatsapp-web Channel integration: whatsapp-web size: S labels Feb 25, 2026
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 25, 2026

Additional Comments (1)

src/web/auto-reply/monitor.ts
The reconnect counter reset logic won't be reached when connection-phase errors occur (line 216), since those errors continue before reaching this code. This means connection-phase failures won't benefit from the "healthy stretch" reset that post-connection disconnects receive.

The result: if the listener repeatedly fails during connection (e.g., intermittent DNS), reconnectAttempts keeps climbing and never resets, even if there are long delays between failures. Consider applying the same "uptime reset" logic in the connection-phase catch block (after successful sleep but before continue).

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/web/auto-reply/monitor.ts
Line: 381-384

Comment:
The reconnect counter reset logic won't be reached when connection-phase errors occur (line 216), since those errors `continue` before reaching this code. This means connection-phase failures won't benefit from the "healthy stretch" reset that post-connection disconnects receive.

The result: if the listener repeatedly fails during connection (e.g., intermittent DNS), `reconnectAttempts` keeps climbing and never resets, even if there are long delays between failures. Consider applying the same "uptime reset" logic in the connection-phase catch block (after successful sleep but before `continue`).

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical WhatsApp reconnection bug where connection-phase errors (DNS failures, timeouts, TLS errors) during the initial connection attempt would crash the reconnect loop permanently, even though the network might recover shortly after. The fix wraps listener creation in try-catch, applies exponential backoff on connection failures, and switches to periodic recovery after exhausting max reconnection attempts.

Changes:

  • Wrap waitForWaConnection in try-catch with socket cleanup to prevent file descriptor leaks on connection failures
  • Catch connection-phase errors in the reconnect loop and treat them as retryable disconnects with backoff
  • Replace permanent exit after max attempts with periodic recovery attempts using the heartbeat interval

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/web/inbound/monitor.ts Adds socket cleanup (ws.close) when waitForWaConnection fails to prevent connection/FD leaks
src/web/auto-reply/monitor.ts Wraps listener creation in try-catch to handle connection-phase errors; switches from permanent exit to periodic recovery after max attempts

Comment on lines +456 to +463
// Instead of giving up permanently, wait one heartbeat interval and
// retry — the network may recover after a transient outage.
try {
await sleep(heartbeatSeconds * 1000, abortSignal);
} catch {
break;
}
continue;
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing test "stops after hitting max reconnect attempts" at line 191-230 expects the monitor to exit after reaching max reconnection attempts. With this change, the monitor now continues retrying indefinitely with periodic recovery attempts instead of stopping, which will cause this test to hang.

The test needs to be updated to reflect the new never-terminate behavior, or it needs to use an abort signal to stop the monitor after verifying the periodic recovery behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +212 to +244
} catch (err) {
// Connection-phase failure (DNS, timeout, TLS, etc.) — treat as a
// retryable disconnect instead of crashing the reconnect loop.
const errorStr = formatError(err);
reconnectAttempts += 1;
status.connected = false;
status.lastEventAt = Date.now();
status.lastDisconnect = {
at: status.lastEventAt,
status: undefined,
error: errorStr,
loggedOut: false,
};
status.lastError = errorStr;
status.reconnectAttempts = reconnectAttempts;
emitStatus();

reconnectLogger.warn(
{ connectionId, error: errorStr, reconnectAttempts },
"web reconnect: initial connection failed; will retry",
);
runtime.error(
`WhatsApp Web connection failed (${errorStr}). Retry ${reconnectAttempts}/${reconnectPolicy.maxAttempts || "∞"} in ${formatDurationPrecise(computeBackoff(reconnectPolicy, reconnectAttempts))}…`,
);

const delay = computeBackoff(reconnectPolicy, reconnectAttempts);
try {
await sleep(delay, abortSignal);
} catch {
break;
}
continue;
}
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The connection-phase error handler does not check if max reconnection attempts have been reached. This creates an inconsistency with the post-connection error handling (lines 442-464), which switches to periodic recovery mode after max attempts.

When connection-phase errors occur repeatedly (e.g., persistent DNS failures during initial connection), the code will continue using exponential backoff indefinitely instead of switching to the periodic recovery mode (fixed heartbeat interval) like the post-connection path does.

This should check if reconnectAttempts >= maxAttempts (when maxAttempts > 0) and switch to periodic recovery by waiting heartbeatSeconds * 1000 instead of using computeBackoff, consistent with the logic at lines 442-464.

Copilot uses AI. Check for mistakes.
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 330cd0a797

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +221 to +223
status: undefined,
error: errorStr,
loggedOut: false,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Surface logged-out failures during initial connect

This reconnect catch treats every startup failure as transient by hard-coding loggedOut: false, but monitorWebInbox() can throw a pre-open close from waitForWaConnection(sock) when WhatsApp returns DisconnectReason.loggedOut (expired/revoked session). In that case the monitor now retries forever and never reaches the existing relink flow, so users with invalid creds get stuck in a reconnect loop instead of being told to re-run login.

Useful? React with 👍 / 👎.

Comment on lines +239 to +243
await sleep(delay, abortSignal);
} catch {
break;
}
continue;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor one-shot mode on initial connection failure

This new error path always sleeps and continues the outer while (true) loop, even when keepAlive is false. keepAlive=false is the single-run mode (the function returns after one successful connect), so callers now block in an unbounded retry loop if the first connection attempt fails (for example DNS/TLS/network startup errors) instead of getting a failure back.

Useful? React with 👍 / 👎.

@nikolasdehor

This comment was marked as spam.

When connection-phase errors (DNS, TLS, timeout) occur repeatedly but
with long gaps between them, the reconnect counter kept climbing and
never reset. This caused the backoff to escalate permanently even after
the network recovered.

Add a time-based reset that mirrors the existing "healthy stretch" logic
for post-connection disconnects: if the last connection-phase failure was
more than one heartbeat interval ago, reset the counter before
incrementing.
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 104a6fbe0d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +249 to +251
const delay = computeBackoff(reconnectPolicy, reconnectAttempts);
try {
await sleep(delay, abortSignal);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enforce max-attempt policy for initial connect failures

When listenerFactory fails before a socket is established (e.g., DNS/TLS errors during startup), this catch path always computes backoff and continues, so it never reaches the later maxAttempts handling block in monitorWebChannel. In practice, web.reconnect.maxAttempts is ignored for connection-phase failures, which can cause indefinite high-frequency retries/log churn (especially with a small maxMs) instead of switching to the intended degraded heartbeat-paced retry mode.

Useful? React with 👍 / 👎.

…ling

- P1: Detect loggedOut status in connection-phase errors via getStatusCode
  instead of hardcoding loggedOut: false
- P2: Respect keepAlive=false (single-run mode) — break instead of retrying
  on connection-phase failures
- P2: Enforce maxAttempts policy for connection-phase errors, consistent
  with post-connection handling

Updated e2e tests to cover all three scenarios and reflect degraded-mode
behavior (continue retrying after max attempts instead of stopping).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: whatsapp-web Channel integration: whatsapp-web size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] WhatsApp channel exits permanently after DNS failure instead of retrying

2 participants