fix(whatsapp): enable TCP keepalive on WebSocket socket to prevent WSL2 disconnects #72735

Closed

yhyatt wants to merge 3 commits into openclaw:main from yhyatt:fix/wa-tcp-keepalive

Conversation

yhyatt (Contributor) commented Apr 27, 2026

What

Enable TCP keepalive on the underlying socket in the WhatsApp WebSocket connection, preventing idle TCP disconnects on Windows WSL2 Hyper-V NAT.

Why

Root cause: Windows Hyper-V NAT drops idle TCP connections after ~60 seconds. Baileys sends application-level WebSocket pings every 25-30 seconds, but NAT devices operate at the TCP layer and do not inspect WS frames. This causes repeated disconnect/reconnect storms on WSL2 (observed: ~70 reconnects in 70 minutes), each triggering a race condition in creds.json writes.

Mechanism: The fix calls socket.setKeepAlive(true, 15000) on every new TCP socket managed by the HTTP agent passed to Baileys. This sends OS-level TCP ACK probes well before the 60-second NAT timeout, keeping the connection alive even when the application-level activity is sparse.
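
For reference, setKeepAlive is the standard Node net.Socket API the fix relies on; a minimal standalone illustration (not code from this PR):

```ts
import { Socket } from "node:net";

const socket = new Socket();
// Enable OS-level TCP keepalive; the second argument is the idle delay
// (ms) before the first probe. 15s stays well under the ~60s Hyper-V
// NAT timeout described above.
socket.setKeepAlive(true, 15_000);
```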

How

  1. New file: extensions/whatsapp/src/tcp-keepalive-agent.ts

    • Exports wrapAgentWithTcpKeepalive(agent, opts) (see the sketch after this list)
    • Patches the agent's createConnection method to apply keepalive to every new socket
    • Handles both callback (Node.js core agent) and synchronous return (proxy-agent) paths
    • Returns undefined when no agent is configured (graceful no-op)
    • Catches and swallows socket errors (keepalive is defense-in-depth, not a blocker)
  2. Integration in extensions/whatsapp/src/session.ts

    • Resolve fetchAgent with unmutated baseAgent first
    • Then wrap agent with TCP keepalive before passing to makeWASocket
  3. Test coverage: extensions/whatsapp/src/tcp-keepalive-agent.test.ts

    • Undefined passthrough
    • Callback pattern (Node.js core agent) with default/custom delays
    • Synchronous return pattern (proxy-agent via agent-base)
    • Error path: keepalive not applied on connection errors
    • setKeepAlive throw handling (best-effort)
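
As referenced in item 1, here is a minimal sketch of what such a wrapper can look like, assuming only the behavior described above; the types, casts, and option shape are illustrative guesses, not the PR's actual code.

```ts
// Minimal sketch of an agent wrapper that applies TCP keepalive to every
// new socket. KeepaliveOpts and all type casts are assumptions.
import type { Agent } from "node:http";
import type { Duplex } from "node:stream";

interface KeepaliveOpts {
  initialDelayMs?: number; // delay before the first keepalive probe
}

export function wrapAgentWithTcpKeepalive(
  agent: Agent | undefined,
  opts: KeepaliveOpts = {},
): Agent | undefined {
  if (!agent) return undefined; // graceful no-op when no agent is configured

  const delay = opts.initialDelayMs ?? 15_000;
  const original = (agent as any).createConnection;
  if (typeof original !== "function") return agent;

  const applyKeepalive = (socket: unknown) => {
    try {
      // Best-effort: keepalive is defense-in-depth, never a blocker.
      (socket as { setKeepAlive?: (on: boolean, ms: number) => void })
        .setKeepAlive?.(true, delay);
    } catch {
      // Swallow: a socket we cannot configure is still usable.
    }
  };

  (agent as any).createConnection = function (
    options: unknown,
    callback?: (err: Error | null, socket?: Duplex) => void,
  ) {
    // Callback path (Node.js core agent): apply keepalive once the
    // socket exists; errors pass through untouched.
    const wrappedCb = callback
      ? (err: Error | null, socket?: Duplex) => {
          if (!err && socket) applyKeepalive(socket);
          callback(err, socket);
        }
      : undefined;

    const result = original.call(this, options, wrappedCb);

    // Synchronous-return path (proxy-agent via agent-base): the socket
    // comes back directly and the callback never fires.
    if (result) applyKeepalive(result);
    return result;
  };

  return agent;
}
```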

Fixes

Closes #58481

Also related to #61788 (WhatsApp WebSocket ETIMEDOUT — different root cause but same symptom area)

Test Evidence

  • All 71 WhatsApp extension tests pass
  • New tests cover: callback path, sync-return path, error path, default/custom delays, throw handling (illustrated below)
  • tsc --noEmit clean for all modified files
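
For illustration, a callback-path test along these lines would verify the default delay (vitest; the mock shape here is an assumption, not the PR's actual test file):

```ts
import { describe, expect, it, vi } from "vitest";
import type { Agent } from "node:http";
import { wrapAgentWithTcpKeepalive } from "./tcp-keepalive-agent";

describe("wrapAgentWithTcpKeepalive", () => {
  it("applies keepalive on the callback path with the default delay", async () => {
    const setKeepAlive = vi.fn();
    const mockSocket = { setKeepAlive };
    // Mock a Node core-style agent: the socket is delivered via callback.
    const agent = {
      createConnection: vi.fn(
        (_opts: unknown, cb?: (err: Error | null, socket?: unknown) => void) => {
          process.nextTick(() => cb?.(null, mockSocket));
        },
      ),
    } as unknown as Agent;

    const wrapped = wrapAgentWithTcpKeepalive(agent)!;
    await new Promise<void>((resolve) => {
      (wrapped as any).createConnection({}, (err: Error | null) => {
        expect(err).toBeNull();
        resolve();
      });
    });

    expect(setKeepAlive).toHaveBeenCalledWith(true, 15_000);
  });
});
```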

AI-Assisted: Developed with GLM-5-turbo + Claude. Fully tested and reviewed.

…L2 disconnects

On WSL2, Windows Hyper-V NAT drops idle TCP connections after ~60 seconds.
Baileys sends application-level WebSocket pings every 25-30s, but NAT devices
operate at the TCP layer and do not inspect WS frames. This causes repeated
disconnect/reconnect storms (observed: 70 reconnects in 70 minutes), each
triggering a creds.json write race.

Fix: wrap the HTTP agent passed to Baileys with a thin layer that calls
socket.setKeepAlive(true, 15000) on every new TCP socket. This sends OS-level
TCP ACK probes well before the NAT timeout, keeping the connection alive.

The wrapper:
- Returns undefined when no proxy agent is configured (no-op when not needed)
- Covers both initial connections and reconnects (via createConnection hook)
- Works with proxy-agent (wraps the tunnel socket, which is what NAT sees)
- Is environment-agnostic — harmless on Linux, macOS, Docker, bare metal

Closes openclaw#58481
Related: openclaw#61788
openclaw-barnacle (Bot) added the channel: whatsapp-web (Channel integration: whatsapp-web) and size: S labels on Apr 27, 2026
greptile-apps (Bot) commented Apr 27, 2026

Greptile Summary

This PR adds TCP keepalive support to the WhatsApp WebSocket connection to prevent idle-connection drops under Windows WSL2 Hyper-V NAT. The core mechanism — monkey-patching agent.createConnection to call socket.setKeepAlive — is sound.

  • P1: wrapAgentWithTcpKeepalive mutates baseAgent in place, so the baseAgent reference passed to resolveEnvFetchDispatcher on the very next line is already patched. The stated design goal of keeping the fetch agent unwrapped is not met.

Confidence Score: 4/5

Safe to merge with the mutation ordering fixed; keepalive on fetch connections is likely harmless but violates stated design intent.

One P1 (in-place mutation causes fetchAgent to receive the wrapped agent) and one P2 (flawed test assertions). The P1 may be practically benign but contradicts the PR's explicit design contract.

Files flagged: extensions/whatsapp/src/session.ts lines 148-150 (mutation order) and extensions/whatsapp/src/tcp-keepalive-agent.test.ts (error-path test reliability).

Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/whatsapp/src/session.ts
Line: 149-150

Comment:
**`baseAgent` is mutated before being passed to `fetchAgent`**

`wrapAgentWithTcpKeepalive` patches `baseAgent.createConnection` **in place** and returns the same object reference (see `tcp-keepalive-agent.ts` line 44–57). By the time `resolveEnvFetchDispatcher(sessionLogger, baseAgent)` is called on the next line, `baseAgent` already has the keepalive wrapper on its `createConnection` method. The PR description's claim that "fetch agent (uploads) still uses unwrapped `baseAgent`" is incorrect — both `agent` and `fetchAgent` use the same mutated object.

Whether keepalive on fetch connections is harmful is unclear, but the design intent is definitely violated. To fix, either clone the agent before wrapping, or wrap after constructing the fetch dispatcher:

```ts
const baseAgent = await resolveEnvProxyAgent(sessionLogger);
const fetchAgent = await resolveEnvFetchDispatcher(sessionLogger, baseAgent);
const agent = wrapAgentWithTcpKeepalive(baseAgent);
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: extensions/whatsapp/src/tcp-keepalive-agent.test.ts
Line: 94-112

Comment:
**Error-path test never actually exercises an error**

The shared `createMockAgent()` factory always resolves successfully — it does `process.nextTick(() => callback(null, mockSocket))`. In this test the inner assertions `expect(err).toBeInstanceOf(Error)` and `expect(socket).toBeUndefined()` run with `err = null` and `socket = <mockSocket>`, so both assertions would fail if the callback were awaited. Because the callback fires asynchronously and the test only `await`s a single `process.nextTick`, the assertions inside the callback may be executing after the test already passed, silently swallowing the failures.

To actually test error preservation, create a mock agent whose `createConnection` calls back with a real `Error`:
```ts
const errorAgent = {
  createConnection: vi.fn((_opts, cb) => {
    process.nextTick(() => cb(new Error("ECONNREFUSED"), undefined));
    return {} as NodeJS.Socket;
  }),
  destroy: vi.fn(),
} as unknown as Agent & { createConnection: ReturnType<typeof vi.fn> };
```

How can I resolve this? If you propose a fix, please make it concise.


chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 44dca906c1

Chai added 2 commits April 27, 2026 14:14
- Use Duplex type for createConnection callback (matches Agent contract)
- Import HttpsAgent type alias in session.ts for proper cast
- Guard callback invocation with optional chaining
- All 71 WhatsApp extension tests pass, tsc --noEmit clean
… test

- Handle synchronous socket return (proxy-agent via agent-base) in
  addition to the callback path. proxy-agent returns the socket directly
  without invoking the callback, so keepalive was never applied on the
  most common agent path. (chatgpt-codex-connector[bot] P1)

- Move fetchAgent resolution before wrapAgentWithTcpKeepalive so that
  fetchAgent receives the unmutated baseAgent. (greptile-apps[bot] P1)

- Add dedicated error-returning mock agent and test that keepalive is
  not applied on error callbacks. (greptile-apps[bot] P2)

- Add test for synchronous return pattern and double-apply behavior.

All 71 WhatsApp extension tests pass.
markfietje (Contributor) commented Apr 27, 2026

@yhyatt Nice implementation. I wonder though if this is the right layer — TCP keepalive is an OS-level setting, and WSL2 is a single-user environment where you have full access to sysctl:

```
net.ipv4.tcp_keepalive_time = 15
net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 3
```

That fixes it system-wide for all TCP connections, not just this one socket. The root issue is the kernel's default tcp_keepalive_time of 7200s being too high for WSL2's Hyper-V NAT — feels like that's where the fix belongs rather than per-socket in application code.

One note on #61788 — that issue is ETIMEDOUT during the initial WebSocket handshake (connection never establishes), so TCP keepalive wouldn't apply there since it only affects already-established idle connections. Probably worth tracking separately.

yhyatt (Contributor, Author) commented Apr 27, 2026

Great point — sysctl is indeed the cleaner fix for single-user environments like WSL2 where you have full control. We're applying it locally:

```
net.ipv4.tcp_keepalive_time = 15
net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 3
```

That said, the per-socket approach in this PR still has value for users who can't set sysctl — Docker containers, shared hosts, managed environments where kernel params are locked down. Application-level keepalive is the only option there.

Re: #61788 — you're right, that's ETIMEDOUT during the initial WebSocket handshake, not an idle connection drop. TCP keepalive wouldn't apply there since the connection never establishes. That's a separate issue (likely DNS resolution or Hyper-V NAT port forwarding for new outbound connections). Updated the PR body to remove the Related: #61788 link.

markfietje (Contributor) commented Apr 27, 2026

@yhyatt Good call applying the sysctl values locally — that's the right move for WSL2.

I want to push back on the Docker/shared-host argument though, and clarify the layering because I think this PR is solving the right problem at the wrong layer.

WebSocket keepalive has three distinct layers

| Layer | Mechanism | Who owns it | Config location |
| --- | --- | --- | --- |
| OS / TCP | `setKeepAlive()` / `tcp_keepalive_time` | Kernel / sysadmin | sysctl, setsockopt |
| WebSocket protocol | RFC 6455 ping/pong control frames | Application | WS client/server config |
| Application protocol | Baileys `<iq>` stanzas, custom heartbeat ticks | Application logic | Library config |
These should not be mixed. Each layer solves a different problem and belongs to a different owner.

The correct layer is WebSocket protocol ping/pong

Best practice for long-lived WebSocket connections through NATs and reverse proxies is RFC 6455 ping/pong control frames — not OS-level TCP keepalive, and not application-protocol stanzas. This is what the WebSocket protocol designed ping/pong for.

WS ping/pong control frames:

  • Generate TCP traffic that keeps NAT mappings alive — the NAT doesn't need to understand the protocol, it just sees bytes on the wire
  • Are forwarded natively by every reverse proxy (nginx, Caddy, HAProxy, AWS ALB, Tailscale Serve)
  • Work everywhere — Docker, shared hosts, managed environments — no kernel access needed
  • Include a built-in timeout mechanism: if no pong comes back, the connection is dead

A correct implementation looks like this:

```ts
// After the WebSocket handshake completes (ws-style socket with ping()
// and a "pong" event). Timer state and tunables, with example values:
let pongReceived = false;
let pongTimer: NodeJS.Timeout | null = null;
let closed = false;
const pingIntervalMs = 15_000; // how often to ping (example value)
const pongTimeoutMs = 5_000; // how long to wait for the pong (example value)
const PONG_TIMEOUT_CLOSE_CODE = 4000; // implementation-defined close code

socket.on("pong", () => {
  pongReceived = true;
  if (pongTimer) {
    clearTimeout(pongTimer);
    pongTimer = null;
  }
});

const pingTimer = setInterval(() => {
  if (closed) return;
  pongReceived = false;
  try {
    socket.ping();
  } catch {
    close(PONG_TIMEOUT_CLOSE_CODE, "ping failed");
  }
  pongTimer = setTimeout(() => {
    if (!closed && !pongReceived) {
      close(PONG_TIMEOUT_CLOSE_CODE, "pong timeout");
    }
  }, pongTimeoutMs);
}, pingIntervalMs);
// close() is the connection's own teardown; it should set `closed` and
// clearInterval(pingTimer).
```

This should be contributed to upstream as a separate, focused PR — it applies to all gateway WebSocket connections, not just WhatsApp. It's a general hardening improvement, not a WhatsApp-specific workaround.

For the WhatsApp channel specifically

Baileys already has its own keepalive mechanism via keepAliveIntervalMs — but it sends WhatsApp protocol-level <iq> stanzas, not WS ping/pong frames. Both generate TCP traffic and both keep NAT alive. If the default 30s interval isn't aggressive enough for WSL2's ~60s NAT timeout, the fix is:

```ts
const sock = makeWASocket({
  // ... existing config ...
  keepAliveIntervalMs: 15_000, // lower from 30_000 for aggressive NAT environments
});
```

That's a one-line config change in session.ts. No new files, no agent wrapper, no createConnection monkey-patching.

Why the per-socket setKeepAlive approach is the wrong layer

socket.setKeepAlive(true, 15000) reaches past two protocol layers (WebSocket, TCP) to configure kernel TCP keepalive probes from application code. This conflates OS administration with application logic:

  • In Docker: you can set sysctl via --sysctl in most cases, and WS ping/pong works without it regardless.
  • On PaaS / shared hosts (Heroku, Fly.io, Render): you can deploy application code but cannot set kernel params. This is exactly where setKeepAlive() would be an option — but WS ping/pong is a better option because it stays at the correct protocol layer, provides liveness detection (pong timeout), and doesn't depend on kernel TCP keepalive behavior varying across platforms.
  • The "application-level keepalive is the only option" argument already has a native answer: WS ping/pong control frames. That IS the application-level option, at the correct layer. No need to drop to setsockopt.

Regarding #61788

Agreed that's a separate issue — ETIMEDOUT during the initial WebSocket handshake is a connection establishment problem (DNS, routing, firewall, Baileys connectTimeoutMs), not an idle-connection problem. That needs its own investigation and PR.

Suggested path

  1. Close this PR
  2. If Baileys' 30s keepalive interval genuinely isn't enough for WSL2 NAT, open a targeted PR to lower keepAliveIntervalMs in the makeWASocket call — one config line
  3. Add a WSL2 troubleshooting section to the WhatsApp docs recommending the sysctl values for users who want additional OS-level defense-in-depth
  4. Track #61788 (WhatsApp WebSocket ETIMEDOUT on connection after login) separately as a connection-establishment timeout issue
  5. Contribute WS ping/pong support for the gateway as a separate PR — that's the proper architectural fix that benefits all channels, not just WhatsApp

velvet-shark (Member) commented

Thanks for the focused PR and for following up on the review comments.

I’m going to close this one because #73580 is now merged and is the preferred fix path for this class of WhatsApp disconnects.

The key reason is layering: Baileys already sends regular keepalive traffic on the WebSocket via keepAliveIntervalMs, and #73580 exposes that setting as web.whatsapp.keepAliveIntervalMs with docs recommending shorter intervals such as 15000 for aggressive idle-timeout networks. That keeps traffic flowing on the same TCP connection without reaching below Baileys/WebSocket into per-socket OS TCP keepalive behavior.

There is also an implementation mismatch in this PR as currently written: it wraps the env proxy baseAgent. In the normal direct-connection WSL2 case, OpenClaw does not create an env proxy agent, so the wrapper would return undefined and likely would not apply to the reported default path. That makes this too narrow to land as the product fix for #58481.

If users can still reproduce #58481 after #73580 with web.whatsapp.keepAliveIntervalMs: 15000, we should reopen the investigation with fresh logs. At that point the next step would be either docs/sysctl guidance for WSL2, WebSocket-level ping/pong support in the right owner layer, or a deliberately designed TCP keepalive implementation that actually covers non-proxy sockets and has a clear config/default story. I do not think this PR is the right shape to merge as-is.


Labels

channel: whatsapp-web (Channel integration: whatsapp-web), size: M

Development

Successfully merging this pull request may close: WhatsApp WebSocket drops on WSL2 — missing TCP keepalive on underlying socket

3 participants