fix(whatsapp): catch connection-phase errors in reconnect loop#26785
fix(whatsapp): catch connection-phase errors in reconnect loop#26785nikolasdehor wants to merge 3 commits intoopenclaw:mainfrom
Conversation
…event permanent exit Combines fixes from closed PRs openclaw#9727, openclaw#14484, and openclaw#17487 to address openclaw#13371. Three changes: 1. Wrap listener creation in try-catch so DNS/timeout/TLS errors during the connection phase are treated as retryable disconnects instead of crashing the reconnect while-loop. 2. After max reconnect attempts, switch to periodic recovery (one attempt per heartbeat interval) instead of breaking out of the loop permanently. The network may recover after a transient outage. 3. Close the Baileys socket on connection failure in monitorWebInbox to prevent FD/connection leaks when waitForWaConnection throws. Closes openclaw#13371
Additional Comments (1)
The result: if the listener repeatedly fails during connection (e.g., intermittent DNS), Prompt To Fix With AIThis is a comment left during a code review.
Path: src/web/auto-reply/monitor.ts
Line: 381-384
Comment:
The reconnect counter reset logic won't be reached when connection-phase errors occur (line 216), since those errors `continue` before reaching this code. This means connection-phase failures won't benefit from the "healthy stretch" reset that post-connection disconnects receive.
The result: if the listener repeatedly fails during connection (e.g., intermittent DNS), `reconnectAttempts` keeps climbing and never resets, even if there are long delays between failures. Consider applying the same "uptime reset" logic in the connection-phase catch block (after successful sleep but before `continue`).
How can I resolve this? If you propose a fix, please make it concise. |
There was a problem hiding this comment.
Pull request overview
This PR fixes a critical WhatsApp reconnection bug where connection-phase errors (DNS failures, timeouts, TLS errors) during the initial connection attempt would crash the reconnect loop permanently, even though the network might recover shortly after. The fix wraps listener creation in try-catch, applies exponential backoff on connection failures, and switches to periodic recovery after exhausting max reconnection attempts.
Changes:
- Wrap
waitForWaConnectionin try-catch with socket cleanup to prevent file descriptor leaks on connection failures - Catch connection-phase errors in the reconnect loop and treat them as retryable disconnects with backoff
- Replace permanent exit after max attempts with periodic recovery attempts using the heartbeat interval
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/web/inbound/monitor.ts | Adds socket cleanup (ws.close) when waitForWaConnection fails to prevent connection/FD leaks |
| src/web/auto-reply/monitor.ts | Wraps listener creation in try-catch to handle connection-phase errors; switches from permanent exit to periodic recovery after max attempts |
| // Instead of giving up permanently, wait one heartbeat interval and | ||
| // retry — the network may recover after a transient outage. | ||
| try { | ||
| await sleep(heartbeatSeconds * 1000, abortSignal); | ||
| } catch { | ||
| break; | ||
| } | ||
| continue; |
There was a problem hiding this comment.
The existing test "stops after hitting max reconnect attempts" at line 191-230 expects the monitor to exit after reaching max reconnection attempts. With this change, the monitor now continues retrying indefinitely with periodic recovery attempts instead of stopping, which will cause this test to hang.
The test needs to be updated to reflect the new never-terminate behavior, or it needs to use an abort signal to stop the monitor after verifying the periodic recovery behavior.
| } catch (err) { | ||
| // Connection-phase failure (DNS, timeout, TLS, etc.) — treat as a | ||
| // retryable disconnect instead of crashing the reconnect loop. | ||
| const errorStr = formatError(err); | ||
| reconnectAttempts += 1; | ||
| status.connected = false; | ||
| status.lastEventAt = Date.now(); | ||
| status.lastDisconnect = { | ||
| at: status.lastEventAt, | ||
| status: undefined, | ||
| error: errorStr, | ||
| loggedOut: false, | ||
| }; | ||
| status.lastError = errorStr; | ||
| status.reconnectAttempts = reconnectAttempts; | ||
| emitStatus(); | ||
|
|
||
| reconnectLogger.warn( | ||
| { connectionId, error: errorStr, reconnectAttempts }, | ||
| "web reconnect: initial connection failed; will retry", | ||
| ); | ||
| runtime.error( | ||
| `WhatsApp Web connection failed (${errorStr}). Retry ${reconnectAttempts}/${reconnectPolicy.maxAttempts || "∞"} in ${formatDurationPrecise(computeBackoff(reconnectPolicy, reconnectAttempts))}…`, | ||
| ); | ||
|
|
||
| const delay = computeBackoff(reconnectPolicy, reconnectAttempts); | ||
| try { | ||
| await sleep(delay, abortSignal); | ||
| } catch { | ||
| break; | ||
| } | ||
| continue; | ||
| } |
There was a problem hiding this comment.
The connection-phase error handler does not check if max reconnection attempts have been reached. This creates an inconsistency with the post-connection error handling (lines 442-464), which switches to periodic recovery mode after max attempts.
When connection-phase errors occur repeatedly (e.g., persistent DNS failures during initial connection), the code will continue using exponential backoff indefinitely instead of switching to the periodic recovery mode (fixed heartbeat interval) like the post-connection path does.
This should check if reconnectAttempts >= maxAttempts (when maxAttempts > 0) and switch to periodic recovery by waiting heartbeatSeconds * 1000 instead of using computeBackoff, consistent with the logic at lines 442-464.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 330cd0a797
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| status: undefined, | ||
| error: errorStr, | ||
| loggedOut: false, |
There was a problem hiding this comment.
Surface logged-out failures during initial connect
This reconnect catch treats every startup failure as transient by hard-coding loggedOut: false, but monitorWebInbox() can throw a pre-open close from waitForWaConnection(sock) when WhatsApp returns DisconnectReason.loggedOut (expired/revoked session). In that case the monitor now retries forever and never reaches the existing relink flow, so users with invalid creds get stuck in a reconnect loop instead of being told to re-run login.
Useful? React with 👍 / 👎.
| await sleep(delay, abortSignal); | ||
| } catch { | ||
| break; | ||
| } | ||
| continue; |
There was a problem hiding this comment.
Honor one-shot mode on initial connection failure
This new error path always sleeps and continues the outer while (true) loop, even when keepAlive is false. keepAlive=false is the single-run mode (the function returns after one successful connect), so callers now block in an unbounded retry loop if the first connection attempt fails (for example DNS/TLS/network startup errors) instead of getting a failure back.
Useful? React with 👍 / 👎.
This comment was marked as spam.
This comment was marked as spam.
When connection-phase errors (DNS, TLS, timeout) occur repeatedly but with long gaps between them, the reconnect counter kept climbing and never reset. This caused the backoff to escalate permanently even after the network recovered. Add a time-based reset that mirrors the existing "healthy stretch" logic for post-connection disconnects: if the last connection-phase failure was more than one heartbeat interval ago, reset the counter before incrementing.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 104a6fbe0d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const delay = computeBackoff(reconnectPolicy, reconnectAttempts); | ||
| try { | ||
| await sleep(delay, abortSignal); |
There was a problem hiding this comment.
Enforce max-attempt policy for initial connect failures
When listenerFactory fails before a socket is established (e.g., DNS/TLS errors during startup), this catch path always computes backoff and continues, so it never reaches the later maxAttempts handling block in monitorWebChannel. In practice, web.reconnect.maxAttempts is ignored for connection-phase failures, which can cause indefinite high-frequency retries/log churn (especially with a small maxMs) instead of switching to the intended degraded heartbeat-paced retry mode.
Useful? React with 👍 / 👎.
…ling - P1: Detect loggedOut status in connection-phase errors via getStatusCode instead of hardcoding loggedOut: false - P2: Respect keepAlive=false (single-run mode) — break instead of retrying on connection-phase failures - P2: Enforce maxAttempts policy for connection-phase errors, consistent with post-connection handling Updated e2e tests to cover all three scenarios and reflect degraded-mode behavior (continue retrying after max attempts instead of stopping).
Summary
Combines the best parts of three closed PRs (#9727 by @luizlf, #14484 by @onthway, #17487) to fix the long-standing WhatsApp reconnection bug reported in #13371.
Problem: When the WhatsApp WebSocket connection hits a DNS resolution failure (
ENOTFOUND), network timeout, or TLS error during the initial connection phase, the exception propagates uncaught out of thewhile(true)reconnect loop inmonitorWebChannel. The channel exits permanently and never reconnects — even though the network may recover seconds later.Additionally, after reaching
maxAttempts, the monitor breaks out of the loop permanently ("Stopping web monitoring"), with no way to recover short of a full gateway restart.Root cause: The
listenerFactory()call (line ~199 inmonitor.ts) is not wrapped in a try-catch. The existing reconnect logic only handles disconnects that arrive after a successful connection vialistener.onClose. Connection-phase errors bypass the entire backoff/retry mechanism.Changes
Catch connection-phase errors (
src/web/auto-reply/monitor.ts): Wrap listener creation in try-catch. On failure, incrementreconnectAttempts, update status withlastDisconnectinfo, apply exponential backoff, andcontinuethe loop.Periodic recovery after max attempts (
src/web/auto-reply/monitor.ts): Instead ofbreakafter reachingmaxAttempts, switch to periodic recovery (one attempt perheartbeatSecondsinterval). The existingreconnectAttemptsreset logic (when uptime exceeds heartbeat interval) handles counter reset after a successful recovery.Socket leak prevention (
src/web/inbound/monitor.ts): WrapwaitForWaConnection(sock)in try-catch that callssock.ws?.close()on failure before re-throwing. Prevents FD/connection leaks when the connection handshake fails.Testing
Verified on a macOS gateway (v2026.2.24) that experiences regular DNS failures after sleep/wake cycles. Before this fix, the WhatsApp channel would exit permanently after the first
ENOTFOUND. After applying, the channel retries with backoff and recovers once the network is available.Prior art
Closes #13371
Greptile Summary
Addresses a critical WhatsApp reconnection bug where DNS failures, timeouts, or TLS errors during the initial connection phase would cause the monitor to exit permanently instead of retrying.
Key changes:
listenerFactorycall in try-catch to treat connection-phase errors as retryable disconnects with exponential backoff (lines 193-243 inmonitor.ts)break) to periodic recovery with heartbeat-interval retries (lines 442-463 inmonitor.ts)sock.ws?.close()) inmonitorWebInboxwhenwaitForWaConnectionfails to prevent file descriptor leaks (lines 43-53 ininbound/monitor.ts)Potential concern:
Confidence Score: 4/5
Last reviewed commit: 330cd0a
(4/5) You can add custom instructions or style guidelines for the agent here!