fix(whatsapp): retry reconnect loop on initial connection failure#9727

Closed
luizlf wants to merge 2 commits into openclaw:main from luizlf:fix/whatsapp-dns-reconnect

@luizlf luizlf commented Feb 5, 2026

Summary

  • Retry initial WhatsApp Web listener startup failures in monitorWebChannel using the existing reconnect backoff instead of exiting.
  • Update reconnect status/logging for startup failures and respect maxAttempts.
  • Add a regression test that simulates an initial ENOTFOUND and verifies the reconnect loop retries.

Why

Log Evidence

  • Original bug (2026-02-05 07:16:14 UTC, production OpenClaw 2026.2.3): reconnect loop did not engage; channel remained dead until manual restart at 13:06.
{"error":"Error: getaddrinfo ENOTFOUND web.whatsapp.com"},"WebSocket error"
path: "opt/homebrew/lib/node_modules/openclaw/dist/web/session.js:117"
time: "2026-02-05T07:16:14.679Z"
  • Fix working (2026-02-05 15:01:56 UTC, dev build with fix): new "will retry" log indicates the initial failure is captured and the reconnect loop continues.
{"error":"ENOTFOUND web.whatsapp.com","reconnectAttempts":0},"web reconnect: failed to establish initial connection; will retry"
path: "/Users/lsantos/Projects/openclaw/src/web/auto-reply/monitor.ts:214"
time: "2026-02-05T15:01:56.442Z"

Testing

  • pnpm vitest run --config vitest.unit.config.ts "src/web/auto-reply.reconnects" (1 test passed in 17ms)
  • New test: src/web/auto-reply.reconnects-after-initial-connection-failure.test.ts uses a mocked listenerFactory that throws ENOTFOUND on the first attempt, asserts a second attempt happens without propagating the error, then aborts and closes cleanly.
  • pnpm build && pnpm check && pnpm test
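The mocked-factory pattern used by the new test can be sketched roughly as follows. This is a minimal, dependency-free illustration rather than the actual test file: `makeEnotfound`, `createFlakyListenerFactory`, and `startWithRetries` are hypothetical names, and `startWithRetries` is only a tiny stand-in for the code under test.

```typescript
// Error shaped like a Node.js DNS failure; the `code` field mirrors Node's convention.
function makeEnotfound(host: string): Error & { code: string } {
  return Object.assign(new Error(`getaddrinfo ENOTFOUND ${host}`), { code: "ENOTFOUND" });
}

// Hypothetical mock: rejects on the first `failures` calls, resolves afterwards,
// and records how often it was called and whether the listener was closed.
function createFlakyListenerFactory(failures: number = 1) {
  let calls = 0;
  let closed = false;
  const factory = async () => {
    calls += 1;
    if (calls <= failures) {
      throw makeEnotfound("web.whatsapp.com");
    }
    return {
      close: async (): Promise<void> => {
        closed = true;
      },
    };
  };
  return { factory, calls: () => calls, wasClosed: () => closed };
}

// Stand-in for the code under test (an assumption, not monitorWebChannel itself):
// keep calling the factory until it resolves or the attempt budget is spent.
async function startWithRetries<T>(factory: () => Promise<T>, maxAttempts: number): Promise<T> {
  let lastError: unknown = new Error("no attempts made");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await factory();
    } catch (err) {
      lastError = err; // swallow and retry; the error must not propagate yet
    }
  }
  throw lastError;
}
```

A test built on this would assert that `calls()` reaches 2 and that the first rejection never reaches the caller, then close the listener.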

AI Assistance

  • AI-assisted: yes (Codex (gpt-5.2-codex xhigh) full-auto).
  • Collaboration notes:
    • Claude (Opus 4.5) analyzed logs and identified the root cause in monitorWebChannel (the initial await listenerFactory() call lacked a try/catch).
    • Codex CLI reviewed the root cause, implemented the fix and wrote the test.
    • Claude reviewed the fix and confirmed it matched the root-cause analysis.
  • Original prompt to Codex: "Fix the WhatsApp DNS reconnect bug. The issue is in src/web/auto-reply/monitor.ts around line 192 - the await listenerFactory() call needs try/catch to handle initial connection failures and continue the retry loop with backoff."
  • Understanding confirmation: I understand this change catches listener startup errors, records the failure, increments reconnect attempts, waits with backoff, and retries until maxAttempts is reached; the new test asserts that a retry happens after an initial ENOTFOUND.

Greptile Overview

Greptile Summary

This PR updates the WhatsApp Web reconnect logic so that failures during the initial listener startup are handled by the same reconnect/backoff loop as later disconnects, rather than escaping and stopping the gateway. Concretely, monitorWebChannel now wraps the initial listenerFactory/monitorWebInbox startup in a try/catch, records the error in channel status, increments reconnectAttempts, applies maxAttempts, waits using the configured backoff, and retries.

It also adds a regression test that simulates a first-attempt DNS failure (ENOTFOUND) from the listener factory and asserts that the reconnect loop performs a second startup attempt without propagating the initial error, then aborts cleanly.
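The control flow described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the code in src/web/auto-reply/monitor.ts: the types, `tryStart`, `backoffDelay`, `monitorWithRetry`, and the convention that `maxAttempts: 0` means "no limit" are all hypothetical here.

```typescript
// Hypothetical types for illustration; not the actual OpenClaw API.
type Listener = { onClose: Promise<void> };
type ListenerFactory = () => Promise<Listener>;

interface ReconnectPolicy {
  initialMs: number;
  maxMs: number;
  factor: number;
  maxAttempts: number; // assumed convention: 0 means retry without limit
}

interface ChannelStatus {
  connected: boolean;
  reconnectAttempts: number;
  lastError: string | null;
}

type StartResult = { ok: true; listener: Listener } | { ok: false; error: string };

// Converts a startup rejection (e.g. ENOTFOUND) into a value the loop can inspect.
async function tryStart(factory: ListenerFactory): Promise<StartResult> {
  try {
    return { ok: true, listener: await factory() };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}

// Exponential backoff, capped at maxMs.
function backoffDelay(policy: ReconnectPolicy, attempt: number): number {
  return Math.min(policy.maxMs, policy.initialMs * Math.pow(policy.factor, attempt - 1));
}

async function monitorWithRetry(
  listenerFactory: ListenerFactory,
  policy: ReconnectPolicy,
  sleep: (ms: number) => Promise<void>,
  shouldStop: () => boolean,
): Promise<ChannelStatus> {
  const status: ChannelStatus = { connected: false, reconnectAttempts: 0, lastError: null };

  while (!shouldStop()) {
    const result = await tryStart(listenerFactory);
    if (!result.ok) {
      // Record the failure, count the attempt, and respect the attempt budget.
      status.lastError = result.error;
      status.reconnectAttempts += 1;
      if (policy.maxAttempts > 0 && status.reconnectAttempts >= policy.maxAttempts) {
        break;
      }
      await sleep(backoffDelay(policy, status.reconnectAttempts));
      continue;
    }

    // Connected: reset the counters, then wait for the connection to drop.
    status.connected = true;
    status.reconnectAttempts = 0;
    status.lastError = null;
    await result.listener.onClose;
    status.connected = false;
  }
  return status;
}
```

Passing `sleep` and `shouldStop` as parameters is a design choice of this sketch; it lets a test drive the loop without real timers.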

Confidence Score: 4/5

  • This PR is close to merge-ready; the runtime fix looks correct, but the new regression test is likely to be flaky in CI as written.
  • The reconnect-loop change is localized and follows the existing backoff/maxAttempts flow. The main concern is the test’s dependence on a hard 200ms wall-clock polling loop with real timers, which can intermittently fail under CI load despite correct behavior.
  • src/web/auto-reply.reconnects-after-initial-connection-failure.test.ts
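One possible way to remove the wall-clock dependency is to let the test await the second attempt directly instead of polling for a fixed window. The sketch below is an assumption about how that could look, not what the PR does; `createCallTracker` is a hypothetical helper that the mocked factory would call on every attempt.

```typescript
// Hypothetical helper: counts calls and lets a test await the Nth call
// without polling or real timers.
function createCallTracker() {
  let count = 0;
  let waiters: Array<{ target: number; resolve: () => void }> = [];

  return {
    // Call this from the mocked factory on every startup attempt.
    record(): void {
      count += 1;
      const ready = waiters.filter((w) => count >= w.target);
      waiters = waiters.filter((w) => count < w.target);
      ready.forEach((w) => w.resolve());
    },
    // Resolves as soon as at least `target` calls have been recorded.
    waitForCall(target: number): Promise<void> {
      if (count >= target) {
        return Promise.resolve();
      }
      return new Promise<void>((resolve) => {
        waiters.push({ target, resolve });
      });
    },
    count: (): number => count,
  };
}
```

With this, the test would `await tracker.waitForCall(2)` and then abort. The assertion depends only on the retry happening, not on how quickly a loaded CI runner schedules it; pairing it with a zero or injected backoff delay keeps the test fast.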


Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment


When DNS or network errors occur during initial WhatsApp connection
(e.g., ENOTFOUND web.whatsapp.com), the reconnect loop now catches
the error and retries with backoff, instead of exiting entirely.

Fixes openclaw#2198
@luizlf
Author

luizlf commented Feb 10, 2026

CI Failure Note

The failing check (checks-windows) is unrelated to this PR. The failure is in src/docker-setup.test.ts ("avoids associative arrays so the script remains Bash 3.2-compatible"), which fails on Windows because bash is not available (result.status is null instead of 0).

This test was introduced by commit 6731c6a1c ("fix(docker): support Bash 3.2 in docker-setup.sh") on main — it's a pre-existing issue on upstream.

All tests from this PR pass on all platforms, including the new src/web/auto-reply.reconnects-after-initial-connection-failure.test.ts.

@luizlf
Author

luizlf commented Feb 13, 2026

Fixes #13371

@nikolasdehor

Three comments were marked as spam.

@luizlf
Author

luizlf commented Feb 16, 2026

Agree @nikolasdehor. Great job bringing more attention to this.
Thanks!
Hoping the maintainers see this soon.

onthway added a commit to onthway/openclaw that referenced this pull request Feb 17, 2026

@nikolasdehor nikolasdehor left a comment

I previously reviewed and approved PR #14484 by @onthway, which was closed as a duplicate of this PR. Both address the same root cause — #13371 (WhatsApp permanent disconnect on DNS/timeout), which I've been tracking since it was filed.

Core fix comparison:

Both PRs take the same fundamental approach: wrap the listenerFactory/monitorWebInbox() call inside the reconnect while(true) loop with a try-catch, so connection-phase errors (DNS failures, TLS handshake errors, etc.) flow into the existing backoff/retry path instead of crashing the loop. This is the correct fix.

What #14484 had that this PR is missing:

The socket leak fix in src/web/inbound/monitor.ts. When waitForWaConnection(sock) throws, the already-created socket from createWaSocket() is never closed. #14484 addressed this with:

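try {
  // Wait for the socket created by createWaSocket() to finish connecting.
  await waitForWaConnection(sock);
} catch (err) {
  // Connection failed: close the underlying WebSocket so it is not leaked.
  try {
    sock.ws?.close();
  } catch {}
  throw err;
}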

Without this, each failed connection retry accumulates a dangling socket/FD. Under sustained DNS/network failures with the reconnect loop now correctly retrying, this could mean dozens of leaked sockets before maxAttempts is reached (or unlimited leaks if maxAttempts: 0).

I flagged this in my earlier comment, and I think it should be included before merge. It's a 5-line change in a separate file so it won't conflict with anything here.
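To make the leak concern concrete, the same close-on-failure pattern can be written as a small generic helper and exercised with a fake socket. This is only a sketch: `ClosableSocket` and `connectOrClose` are hypothetical names, and the real socket type in src/web/inbound/monitor.ts may expose its close method differently (the snippet above uses `sock.ws?.close()`).

```typescript
// Minimal shape assumed for illustration.
interface ClosableSocket {
  close(): void;
}

// Creates a socket, waits for it to become ready, and closes it if readiness
// fails, so a failed attempt does not leave a dangling socket behind.
async function connectOrClose<S extends ClosableSocket>(
  create: () => S,
  waitForReady: (sock: S) => Promise<void>,
): Promise<S> {
  const sock = create();
  try {
    await waitForReady(sock);
    return sock;
  } catch (err) {
    try {
      sock.close();
    } catch {
      // Ignore secondary errors from close(); the original error matters more.
    }
    throw err; // rethrow so the reconnect loop still sees the failure
  }
}
```

Counting open fake sockets across repeated failed attempts should then stay at zero, whereas without the cleanup it would grow by one per retry.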

Minor observations on the catch block:

  1. emitStatus() is called twice in succession (lines ~210 and ~213) — the first emit is immediately superseded by the second. Could consolidate to a single emit after all status fields are set.
  2. The log level is error for the initial failure message. #14484 used warn which feels more appropriate since this is now a retryable condition, not a terminal error. Minor style point.
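Both observations could be addressed along these lines. The sketch is hypothetical: the status field names, `recordStartupFailure`, and the `emitStatus`/`log` callback signatures are assumptions for illustration, and only the log message text is taken from the PR description.

```typescript
// Hypothetical status shape; field names are assumptions for illustration.
interface WebChannelStatus {
  connected: boolean;
  reconnectAttempts: number;
  lastError: string | null;
  lastEventAt: number | null;
}

type LogLevel = "warn" | "error";

// Applies every failure-related field first, then emits a single snapshot.
function recordStartupFailure(
  status: WebChannelStatus,
  error: unknown,
  now: number,
  emitStatus: (snapshot: WebChannelStatus) => void,
  log: (level: LogLevel, message: string) => void,
): WebChannelStatus {
  const next: WebChannelStatus = {
    ...status,
    connected: false,
    reconnectAttempts: status.reconnectAttempts + 1,
    lastError: error instanceof Error ? error.message : String(error),
    lastEventAt: now,
  };
  // "warn" rather than "error": the failure is retryable, not terminal.
  log("warn", "web reconnect: failed to establish initial connection; will retry");
  emitStatus(next); // one emit, after all fields are set
  return next;
}
```

Returning a new snapshot instead of mutating in place is a choice of this sketch; the point is only that subscribers see one consistent update per failure.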

Summary: The core fix is solid and correct. I'd like to see the socket leak fix from #14484 folded in before this merges — otherwise we're solving the reconnect problem but introducing a resource leak under the exact conditions the fix enables (repeated connection failures). Happy to approve once that's added.

@steipete
Contributor

Closing as AI-assisted stale-fix triage.

Linked issue #13506 ("[Bug]: WhatsApp reconnect loop exits on initial connection failure (DNS/network errors)") is currently CLOSED and was closed on 2026-02-13T03:23:29Z with state reason NOT_PLANNED.
Given that issue state, this fix PR is no longer needed in the active queue and is being closed as stale.

If the underlying bug is still reproducible on current main, please reopen this PR (or open a new focused fix PR) and reference both #13506 and #9727 for fast re-triage.

@steipete
Contributor

Closed after AI-assisted stale-fix triage (closed issue duplicate/stale fix).

Labels

channel: whatsapp-web Channel integration: whatsapp-web

Development

Successfully merging this pull request may close these issues.

[Bug]: WhatsApp reconnect loop exits on initial connection failure (DNS/network errors)

3 participants