fix(whatsapp): enable TCP keepalive on WebSocket socket to prevent WSL2 disconnects #72735

Closed

yhyatt wants to merge 3 commits into openclaw:main from yhyatt:fix/wa-tcp-keepalive

Conversation

yhyatt (Contributor) commented Apr 27, 2026

What

Enable TCP keepalive on the underlying socket in the WhatsApp WebSocket connection, preventing idle TCP disconnects on Windows WSL2 Hyper-V NAT.

Why

Root cause: Windows Hyper-V NAT drops idle TCP connections after ~60 seconds. Baileys sends application-level WebSocket pings every 25-30 seconds, but NAT devices operate at the TCP layer and do not inspect WS frames. This causes repeated disconnect/reconnect storms on WSL2 (observed: ~70 reconnects in 70 minutes), each triggering a race condition in creds.json writes.

Mechanism: The fix calls socket.setKeepAlive(true, 15000) on every new TCP socket managed by the HTTP agent passed to Baileys. This sends OS-level TCP ACK probes well before the 60-second NAT timeout, keeping the connection alive even when the application-level activity is sparse.
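
For reference, setKeepAlive is the standard Node net.Socket API the fix relies on; a minimal standalone illustration (not code from this PR):

```ts
import { Socket } from "node:net";

const socket = new Socket();
// Enable OS-level TCP keepalive; the second argument is the idle delay
// (ms) before the first probe. 15s stays well under the ~60s Hyper-V
// NAT timeout described above.
socket.setKeepAlive(true, 15_000);
```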

How

  1. New file: extensions/whatsapp/src/tcp-keepalive-agent.ts

    • Exports wrapAgentWithTcpKeepalive(agent, opts) (see the sketch after this list)
    • Patches the agent's createConnection method to apply keepalive to every new socket
    • Handles both callback (Node.js core agent) and synchronous return (proxy-agent) paths
    • Returns undefined when no agent is configured (graceful no-op)
    • Catches and swallows socket errors (keepalive is defense-in-depth, not a blocker)
  2. Integration in extensions/whatsapp/src/session.ts

    • Resolve fetchAgent with unmutated baseAgent first
    • Then wrap agent with TCP keepalive before passing to makeWASocket
  3. Test coverage: extensions/whatsapp/src/tcp-keepalive-agent.test.ts

    • Undefined passthrough
    • Callback pattern (Node.js core agent) with default/custom delays
    • Synchronous return pattern (proxy-agent via agent-base)
    • Error path: keepalive not applied on connection errors
    • setKeepAlive throw handling (best-effort)
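
As referenced in item 1, here is a minimal sketch of what such a wrapper can look like, assuming only the behavior described above; the types, casts, and option shape are illustrative guesses, not the PR's actual code.

```ts
// Minimal sketch of an agent wrapper that applies TCP keepalive to every
// new socket. KeepaliveOpts and all type casts are assumptions.
import type { Agent } from "node:http";
import type { Duplex } from "node:stream";

interface KeepaliveOpts {
  initialDelayMs?: number; // delay before the first keepalive probe
}

export function wrapAgentWithTcpKeepalive(
  agent: Agent | undefined,
  opts: KeepaliveOpts = {},
): Agent | undefined {
  if (!agent) return undefined; // graceful no-op when no agent is configured

  const delay = opts.initialDelayMs ?? 15_000;
  const original = (agent as any).createConnection;
  if (typeof original !== "function") return agent;

  const applyKeepalive = (socket: unknown) => {
    try {
      // Best-effort: keepalive is defense-in-depth, never a blocker.
      (socket as { setKeepAlive?: (on: boolean, ms: number) => void })
        .setKeepAlive?.(true, delay);
    } catch {
      // Swallow: a socket we cannot configure is still usable.
    }
  };

  (agent as any).createConnection = function (
    options: unknown,
    callback?: (err: Error | null, socket?: Duplex) => void,
  ) {
    // Callback path (Node.js core agent): apply keepalive once the
    // socket exists; errors pass through untouched.
    const wrappedCb = callback
      ? (err: Error | null, socket?: Duplex) => {
          if (!err && socket) applyKeepalive(socket);
          callback(err, socket);
        }
      : undefined;

    const result = original.call(this, options, wrappedCb);

    // Synchronous-return path (proxy-agent via agent-base): the socket
    // comes back directly and the callback never fires.
    if (result) applyKeepalive(result);
    return result;
  };

  return agent;
}
```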

Fixes

Closes #58481

Also related to #61788 (WhatsApp WebSocket ETIMEDOUT — different root cause but same symptom area)

Test Evidence

  • All 71 WhatsApp extension tests pass
  • New tests cover: callback path, sync-return path, error path, default/custom delays, throw handling (illustrated below)
  • tsc --noEmit clean for all modified files
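
For illustration, a callback-path test along these lines would verify the default delay (vitest; the mock shape here is an assumption, not the PR's actual test file):

```ts
import { describe, expect, it, vi } from "vitest";
import type { Agent } from "node:http";
import { wrapAgentWithTcpKeepalive } from "./tcp-keepalive-agent";

describe("wrapAgentWithTcpKeepalive", () => {
  it("applies keepalive on the callback path with the default delay", async () => {
    const setKeepAlive = vi.fn();
    const mockSocket = { setKeepAlive };
    // Mock a Node core-style agent: the socket is delivered via callback.
    const agent = {
      createConnection: vi.fn(
        (_opts: unknown, cb?: (err: Error | null, socket?: unknown) => void) => {
          process.nextTick(() => cb?.(null, mockSocket));
        },
      ),
    } as unknown as Agent;

    const wrapped = wrapAgentWithTcpKeepalive(agent)!;
    await new Promise<void>((resolve) => {
      (wrapped as any).createConnection({}, (err: Error | null) => {
        expect(err).toBeNull();
        resolve();
      });
    });

    expect(setKeepAlive).toHaveBeenCalledWith(true, 15_000);
  });
});
```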

AI-Assisted: Developed with GLM-5-turbo + Claude. Fully tested and reviewed.

…L2 disconnects

On WSL2, Windows Hyper-V NAT drops idle TCP connections after ~60 seconds.
Baileys sends application-level WebSocket pings every 25-30s, but NAT devices
operate at the TCP layer and do not inspect WS frames. This causes repeated
disconnect/reconnect storms (observed: 70 reconnects in 70 minutes), each
triggering a creds.json write race.

Fix: wrap the HTTP agent passed to Baileys with a thin layer that calls
socket.setKeepAlive(true, 15000) on every new TCP socket. This sends OS-level
TCP ACK probes well before the NAT timeout, keeping the connection alive.

The wrapper:
- Returns undefined when no proxy agent is configured (no-op when not needed)
- Covers both initial connections and reconnects (via createConnection hook)
- Works with proxy-agent (wraps the tunnel socket, which is what NAT sees)
- Is environment-agnostic — harmless on Linux, macOS, Docker, bare metal

Closes openclaw#58481
Related: openclaw#61788
openclaw-barnacle (Bot) added the channel: whatsapp-web (Channel integration: whatsapp-web) and size: S labels on Apr 27, 2026
greptile-apps (Bot) commented Apr 27, 2026

Greptile Summary

This PR adds TCP keepalive support to the WhatsApp WebSocket connection to prevent idle-connection drops under Windows WSL2 Hyper-V NAT. The core mechanism — monkey-patching agent.createConnection to call socket.setKeepAlive — is sound.

  • P1: wrapAgentWithTcpKeepalive mutates baseAgent in place, so the baseAgent reference passed to resolveEnvFetchDispatcher on the very next line is already patched. The stated design goal of keeping the fetch agent unwrapped is not met.

Confidence Score: 4/5

Safe to merge with the mutation ordering fixed; keepalive on fetch connections is likely harmless but violates stated design intent.

One P1 (in-place mutation causes fetchAgent to receive the wrapped agent) and one P2 (flawed test assertions). The P1 may be practically benign but contradicts the PR's explicit design contract.

Files flagged: extensions/whatsapp/src/session.ts lines 148-150 (mutation order) and extensions/whatsapp/src/tcp-keepalive-agent.test.ts (error-path test reliability).

Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/whatsapp/src/session.ts
Line: 149-150

Comment:
**`baseAgent` is mutated before being passed to `fetchAgent`**

`wrapAgentWithTcpKeepalive` patches `baseAgent.createConnection` **in place** and returns the same object reference (see `tcp-keepalive-agent.ts` line 44–57). By the time `resolveEnvFetchDispatcher(sessionLogger, baseAgent)` is called on the next line, `baseAgent` already has the keepalive wrapper on its `createConnection` method. The PR description's claim that "fetch agent (uploads) still uses unwrapped `baseAgent`" is incorrect — both `agent` and `fetchAgent` use the same mutated object.

Whether keepalive on fetch connections is harmful is unclear, but the design intent is definitely violated. To fix, either clone the agent before wrapping, or wrap after constructing the fetch dispatcher:

```ts
const baseAgent = await resolveEnvProxyAgent(sessionLogger);
const fetchAgent = await resolveEnvFetchDispatcher(sessionLogger, baseAgent);
const agent = wrapAgentWithTcpKeepalive(baseAgent);
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: extensions/whatsapp/src/tcp-keepalive-agent.test.ts
Line: 94-112

Comment:
**Error-path test never actually exercises an error**

The shared `createMockAgent()` factory always resolves successfully — it does `process.nextTick(() => callback(null, mockSocket))`. In this test the inner assertions `expect(err).toBeInstanceOf(Error)` and `expect(socket).toBeUndefined()` run with `err = null` and `socket = <mockSocket>`, so both assertions would fail if the callback were awaited. Because the callback fires asynchronously and the test only `await`s a single `process.nextTick`, the assertions inside the callback may be executing after the test already passed, silently swallowing the failures.

To actually test error preservation, create a mock agent whose `createConnection` calls back with a real `Error`:
```ts
const errorAgent = {
  createConnection: vi.fn((_opts, cb) => {
    process.nextTick(() => cb(new Error("ECONNREFUSED"), undefined));
    return {} as NodeJS.Socket;
  }),
  destroy: vi.fn(),
} as unknown as Agent & { createConnection: ReturnType<typeof vi.fn> };
```

How can I resolve this? If you propose a fix, please make it concise.


chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 44dca906c1

Chai added 2 commits April 27, 2026 14:14
- Use Duplex type for createConnection callback (matches Agent contract)
- Import HttpsAgent type alias in session.ts for proper cast
- Guard callback invocation with optional chaining
- All 71 WhatsApp extension tests pass, tsc --noEmit clean
… test

- Handle synchronous socket return (proxy-agent via agent-base) in
  addition to the callback path. proxy-agent returns the socket directly
  without invoking the callback, so keepalive was never applied on the
  most common agent path. (chatgpt-codex-connector[bot] P1)

- Move fetchAgent resolution before wrapAgentWithTcpKeepalive so that
  fetchAgent receives the unmutated baseAgent. (greptile-apps[bot] P1)

- Add dedicated error-returning mock agent and test that keepalive is
  not applied on error callbacks. (greptile-apps[bot] P2)

- Add test for synchronous return pattern and double-apply behavior.

All 71 WhatsApp extension tests pass.
markfietje (Contributor) commented Apr 27, 2026

@yhyatt Nice implementation. I wonder though if this is the right layer — TCP keepalive is an OS-level setting, and WSL2 is a single-user environment where you have full access to sysctl:

```
net.ipv4.tcp_keepalive_time = 15
net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 3
```

That fixes it system-wide for all TCP connections, not just this one socket. The root issue is the kernel's default tcp_keepalive_time of 7200s being too high for WSL2's Hyper-V NAT — feels like that's where the fix belongs rather than per-socket in application code.

One note on #61788 — that issue is ETIMEDOUT during the initial WebSocket handshake (connection never establishes), so TCP keepalive wouldn't apply there since it only affects already-established idle connections. Probably worth tracking separately.

yhyatt (Contributor, Author) commented Apr 27, 2026

Great point — sysctl is indeed the cleaner fix for single-user environments like WSL2 where you have full control. We're applying it locally:

```
net.ipv4.tcp_keepalive_time = 15
net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 3
```

That said, the per-socket approach in this PR still has value for users who can't set sysctl — Docker containers, shared hosts, managed environments where kernel params are locked down. Application-level keepalive is the only option there.

Re: #61788 — you're right, that's ETIMEDOUT during the initial WebSocket handshake, not an idle connection drop. TCP keepalive wouldn't apply there since the connection never establishes. That's a separate issue (likely DNS resolution or Hyper-V NAT port forwarding for new outbound connections). Updated the PR body to remove the Related: #61788 link.

markfietje (Contributor) commented Apr 27, 2026

@yhyatt Good call applying the sysctl values locally — that's the right move for WSL2.

I want to push back on the Docker/shared-host argument though, and clarify the layering because I think this PR is solving the right problem at the wrong layer.

WebSocket keepalive has three distinct layers

| Layer | Mechanism | Who owns it | Config location |
| --- | --- | --- | --- |
| OS / TCP | `setKeepAlive()` / `tcp_keepalive_time` | Kernel / sysadmin | sysctl, setsockopt |
| WebSocket protocol | RFC 6455 ping/pong control frames | Application | WS client/server config |
| Application protocol | Baileys `<iq>` stanzas, custom heartbeat ticks | Application logic | Library config |
These should not be mixed. Each layer solves a different problem and belongs to a different owner.

The correct layer is WebSocket protocol ping/pong

Best practice for long-lived WebSocket connections through NATs and reverse proxies is RFC 6455 ping/pong control frames — not OS-level TCP keepalive, and not application-protocol stanzas. This is what the WebSocket protocol designed ping/pong for.

WS ping/pong control frames:

  • Generate TCP traffic that keeps NAT mappings alive — the NAT doesn't need to understand the protocol, it just sees bytes on the wire
  • Are forwarded natively by every reverse proxy (nginx, Caddy, HAProxy, AWS ALB, Tailscale Serve)
  • Work everywhere — Docker, shared hosts, managed environments — no kernel access needed
  • Include a built-in timeout mechanism: if no pong comes back, the connection is dead

A correct implementation looks like this:

```ts
// After the WebSocket handshake completes (ws-style socket with ping()
// and a "pong" event). Timer state and tunables, with example values:
let pongReceived = false;
let pongTimer: NodeJS.Timeout | null = null;
let closed = false;
const pingIntervalMs = 15_000; // how often to ping (example value)
const pongTimeoutMs = 5_000; // how long to wait for the pong (example value)
const PONG_TIMEOUT_CLOSE_CODE = 4000; // implementation-defined close code

socket.on("pong", () => {
  pongReceived = true;
  if (pongTimer) {
    clearTimeout(pongTimer);
    pongTimer = null;
  }
});

const pingTimer = setInterval(() => {
  if (closed) return;
  pongReceived = false;
  try {
    socket.ping();
  } catch {
    close(PONG_TIMEOUT_CLOSE_CODE, "ping failed");
  }
  pongTimer = setTimeout(() => {
    if (!closed && !pongReceived) {
      close(PONG_TIMEOUT_CLOSE_CODE, "pong timeout");
    }
  }, pongTimeoutMs);
}, pingIntervalMs);
// close() is the connection's own teardown; it should set `closed` and
// clearInterval(pingTimer).
```

This should be contributed to upstream as a separate, focused PR — it applies to all gateway WebSocket connections, not just WhatsApp. It's a general hardening improvement, not a WhatsApp-specific workaround.

For the WhatsApp channel specifically

Baileys already has its own keepalive mechanism via keepAliveIntervalMs — but it sends WhatsApp protocol-level <iq> stanzas, not WS ping/pong frames. Both generate TCP traffic and both keep NAT alive. If the default 30s interval isn't aggressive enough for WSL2's ~60s NAT timeout, the fix is:

```ts
const sock = makeWASocket({
  // ... existing config ...
  keepAliveIntervalMs: 15_000, // lower from 30_000 for aggressive NAT environments
});
```

That's a one-line config change in session.ts. No new files, no agent wrapper, no createConnection monkey-patching.

Why the per-socket setKeepAlive approach is the wrong layer

socket.setKeepAlive(true, 15000) reaches past two protocol layers (WebSocket, TCP) to configure kernel TCP keepalive probes from application code. This conflates OS administration with application logic:

  • In Docker: you can set sysctl via --sysctl in most cases, and WS ping/pong works without it regardless.
  • On PaaS / shared hosts (Heroku, Fly.io, Render): you can deploy application code but cannot set kernel params. This is exactly where setKeepAlive() would be an option — but WS ping/pong is a better option because it stays at the correct protocol layer, provides liveness detection (pong timeout), and doesn't depend on kernel TCP keepalive behavior varying across platforms.
  • The "application-level keepalive is the only option" argument already has a native answer: WS ping/pong control frames. That IS the application-level option, at the correct layer. No need to drop to setsockopt.

Regarding #61788

Agreed that's a separate issue — ETIMEDOUT during the initial WebSocket handshake is a connection establishment problem (DNS, routing, firewall, Baileys connectTimeoutMs), not an idle-connection problem. That needs its own investigation and PR.

Suggested path

  1. Close this PR
  2. If Baileys' 30s keepalive interval genuinely isn't enough for WSL2 NAT, open a targeted PR to lower keepAliveIntervalMs in the makeWASocket call — one config line
  3. Add a WSL2 troubleshooting section to the WhatsApp docs recommending the sysctl values for users who want additional OS-level defense-in-depth
  4. Track #61788 (WhatsApp WebSocket ETIMEDOUT on connection after login) separately as a connection-establishment timeout issue
  5. Contribute WS ping/pong support for the gateway as a separate PR — that's the proper architectural fix that benefits all channels, not just WhatsApp

velvet-shark (Member) commented

Thanks for the focused PR and for following up on the review comments.

I’m going to close this one because #73580 is now merged and is the preferred fix path for this class of WhatsApp disconnects.

The key reason is layering: Baileys already sends regular keepalive traffic on the WebSocket via keepAliveIntervalMs, and #73580 exposes that setting as web.whatsapp.keepAliveIntervalMs with docs recommending shorter intervals such as 15000 for aggressive idle-timeout networks. That keeps traffic flowing on the same TCP connection without reaching below Baileys/WebSocket into per-socket OS TCP keepalive behavior.

There is also an implementation mismatch in this PR as currently written: it wraps the env proxy baseAgent. In the normal direct-connection WSL2 case, OpenClaw does not create an env proxy agent, so the wrapper would return undefined and likely would not apply to the reported default path. That makes this too narrow to land as the product fix for #58481.

If users can still reproduce #58481 after #73580 with web.whatsapp.keepAliveIntervalMs: 15000, we should reopen the investigation with fresh logs. At that point the next step would be either docs/sysctl guidance for WSL2, WebSocket-level ping/pong support in the right owner layer, or a deliberately designed TCP keepalive implementation that actually covers non-proxy sockets and has a clear config/default story. I do not think this PR is the right shape to merge as-is.


Labels

channel: whatsapp-web (Channel integration: whatsapp-web), size: M

Development

Successfully merging this pull request may close: WhatsApp WebSocket drops on WSL2 — missing TCP keepalive on underlying socket

3 participants