fix(whatsapp): enable TCP keepalive on WebSocket socket to prevent WSL2 disconnects #72735
yhyatt wants to merge 3 commits into openclaw:main
Conversation
…L2 disconnects

On WSL2, Windows Hyper-V NAT drops idle TCP connections after ~60 seconds. Baileys sends application-level WebSocket pings every 25-30s, but NAT devices operate at the TCP layer and do not inspect WS frames. This causes repeated disconnect/reconnect storms (observed: 70 reconnects in 70 minutes), each triggering a creds.json write race.

Fix: wrap the HTTP agent passed to Baileys with a thin layer that calls socket.setKeepAlive(true, 15000) on every new TCP socket. This sends OS-level TCP ACK probes well before the NAT timeout, keeping the connection alive.

The wrapper:
- Returns undefined when no proxy agent is configured (no-op when not needed)
- Covers both initial connections and reconnects (via createConnection hook)
- Works with proxy-agent (wraps the tunnel socket, which is what NAT sees)
- Is environment-agnostic — harmless on Linux, macOS, Docker, bare metal

Closes openclaw#58481
Related: openclaw#61788
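In sketch form, the wrapper described above could look like the following. This is an illustration using this PR's names, not the PR's actual code; the real implementation in extensions/whatsapp/src/tcp-keepalive-agent.ts also handles agents that return the socket synchronously (see the later commits in this thread):

```ts
import type { Agent } from "node:http";
import type { Socket } from "node:net";
import type { Duplex } from "node:stream";

// Illustrative sketch; the real implementation lives in
// extensions/whatsapp/src/tcp-keepalive-agent.ts.
export function wrapAgentWithTcpKeepalive(
  agent: Agent | undefined,
  keepAliveDelayMs = 15_000,
): Agent | undefined {
  if (!agent) return undefined; // no proxy agent configured: graceful no-op

  const patchable = agent as Agent & {
    createConnection?: (
      options: object,
      callback?: (err: Error | null, socket?: Duplex) => void,
    ) => Duplex | undefined;
  };
  const original = patchable.createConnection?.bind(agent);
  if (!original) return agent;

  // Hook createConnection so every new socket (initial connect and every
  // reconnect) gets OS-level TCP keepalive before Baileys uses it.
  patchable.createConnection = (options, callback) =>
    original(options, (err, socket) => {
      if (!err && socket) {
        (socket as Socket).setKeepAlive(true, keepAliveDelayMs);
      }
      callback?.(err, socket);
    });

  return agent;
}
```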
Greptile Summary

This PR adds TCP keepalive support to the WhatsApp WebSocket connection to prevent idle-connection drops under Windows WSL2 Hyper-V NAT. The core mechanism is monkey-patching the agent's createConnection method to call socket.setKeepAlive on every new socket.
Confidence Score: 4/5

Safe to merge with the mutation ordering fixed; keepalive on fetch connections is likely harmless but violates stated design intent. One P1 (in-place mutation causes fetchAgent to receive the wrapped agent) and one P2 (flawed test assertions). The P1 may be practically benign, but it contradicts the PR's explicit design contract.

Files to review: extensions/whatsapp/src/session.ts lines 148-150 (mutation order) and extensions/whatsapp/src/tcp-keepalive-agent.test.ts (error-path test reliability).

Prompt To Fix All With AI

This is a comment left during a code review.
Path: extensions/whatsapp/src/session.ts
Line: 149-150
Comment:
**`baseAgent` is mutated before being passed to `fetchAgent`**
`wrapAgentWithTcpKeepalive` patches `baseAgent.createConnection` **in place** and returns the same object reference (see `tcp-keepalive-agent.ts` line 44–57). By the time `resolveEnvFetchDispatcher(sessionLogger, baseAgent)` is called on the next line, `baseAgent` already has the keepalive wrapper on its `createConnection` method. The PR description's claim that "fetch agent (uploads) still uses unwrapped `baseAgent`" is incorrect — both `agent` and `fetchAgent` use the same mutated object.
Whether keepalive on fetch connections is harmful is unclear, but the design intent is definitely violated. To fix, either clone the agent before wrapping, or wrap after constructing the fetch dispatcher:
```ts
const baseAgent = await resolveEnvProxyAgent(sessionLogger);
const fetchAgent = await resolveEnvFetchDispatcher(sessionLogger, baseAgent);
const agent = wrapAgentWithTcpKeepalive(baseAgent);
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: extensions/whatsapp/src/tcp-keepalive-agent.test.ts
Line: 94-112
Comment:
**Error-path test never actually exercises an error**
The shared `createMockAgent()` factory always resolves successfully — it does `process.nextTick(() => callback(null, mockSocket))`. In this test the inner assertions `expect(err).toBeInstanceOf(Error)` and `expect(socket).toBeUndefined()` run with `err = null` and `socket = <mockSocket>`, so both assertions would fail if the callback were awaited. Because the callback fires asynchronously and the test only `await`s a single `process.nextTick`, the assertions inside the callback may be executing after the test already passed, silently swallowing the failures.
To actually test error preservation, create a mock agent whose `createConnection` calls back with a real `Error`:
```ts
const errorAgent = {
createConnection: vi.fn((_opts, cb) => {
process.nextTick(() => cb(new Error("ECONNREFUSED"), undefined));
return {} as NodeJS.Socket;
}),
destroy: vi.fn(),
} as unknown as Agent & { createConnection: ReturnType<typeof vi.fn> };
```
How can I resolve this? If you propose a fix, please make it concise.
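A corrected test along those lines would await the callback before asserting; a sketch, assuming vitest, the errorAgent mock above, and the PR's wrapAgentWithTcpKeepalive export:

```ts
import { expect, it } from "vitest";

it("passes agent errors through without applying keepalive", async () => {
  const wrapped = wrapAgentWithTcpKeepalive(errorAgent)!;

  // Promisify the callback so the assertions run before the test returns.
  const { err, socket } = await new Promise<{
    err: Error | null;
    socket?: unknown;
  }>((resolve) => {
    (wrapped as unknown as {
      createConnection: (
        opts: object,
        cb: (err: Error | null, socket?: unknown) => void,
      ) => unknown;
    }).createConnection({}, (err, socket) => resolve({ err, socket }));
  });

  expect(err).toBeInstanceOf(Error);
  expect(socket).toBeUndefined();
});
```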
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 44dca906c1
- Use Duplex type for createConnection callback (matches Agent contract)
- Import HttpsAgent type alias in session.ts for proper cast
- Guard callback invocation with optional chaining
- All 71 WhatsApp extension tests pass, tsc --noEmit clean
… test

- Handle synchronous socket return (proxy-agent via agent-base) in addition to the callback path. proxy-agent returns the socket directly without invoking the callback, so keepalive was never applied on the most common agent path. (chatgpt-codex-connector[bot] P1)
- Move fetchAgent resolution before wrapAgentWithTcpKeepalive so that fetchAgent receives the unmutated baseAgent. (greptile-apps[bot] P1)
- Add dedicated error-returning mock agent and test that keepalive is not applied on error callbacks. (greptile-apps[bot] P2)
- Add test for synchronous return pattern and double-apply behavior.

All 71 WhatsApp extension tests pass.
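In sketch form, the dual-path handling that commit describes might look like this (withKeepalive and applyKeepalive are hypothetical names; the real logic lives in tcp-keepalive-agent.ts):

```ts
import type { Socket } from "node:net";
import type { Duplex } from "node:stream";

type CreateConnection = (
  options: object,
  callback?: (err: Error | null, socket?: Duplex) => void,
) => Duplex | undefined;

const applyKeepalive = (socket: Duplex, delayMs = 15_000): void => {
  const s = socket as Socket;
  if (typeof s.setKeepAlive === "function") s.setKeepAlive(true, delayMs);
};

function withKeepalive(original: CreateConnection): CreateConnection {
  return (options, callback) => {
    // Callback path: a plain http.Agent hands the socket to the callback.
    const result = original(options, (err, socket) => {
      if (!err && socket) applyKeepalive(socket);
      callback?.(err, socket);
    });
    // Synchronous path: proxy-agent (via agent-base) returns the socket
    // directly and never invokes the callback.
    if (result) applyKeepalive(result);
    return result;
  };
}
```

An agent that both returns the socket and invokes the callback would apply keepalive twice, which is harmless; the commit adds a test for that double-apply behavior.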
@yhyatt Nice implementation. I wonder though if this is the right layer — TCP keepalive is an OS-level setting, and WSL2 is a single-user environment where you have full access to sysctl; wouldn't the net.ipv4.tcp_keepalive_* knobs solve this without code changes? Also, does this really cover #61788? That one looks like an ETIMEDOUT during the initial handshake rather than an idle-connection drop.
Great point — sysctl is indeed the cleaner fix for single-user environments like WSL2 where you have full control. We're applying the net.ipv4.tcp_keepalive_* overrides locally.

That said, the per-socket approach in this PR still has value for users who can't set sysctl — Docker containers, shared hosts, managed environments where kernel params are locked down. Application-level keepalive is the only option there.

Re: #61788 — you're right, that's ETIMEDOUT during the initial WebSocket handshake, not an idle connection drop. TCP keepalive wouldn't apply there since the connection never establishes. That's a separate issue (likely DNS resolution or Hyper-V NAT port forwarding for new outbound connections). Updated the PR body to remove the Closes #61788 reference.
@yhyatt Good call applying the sysctl values locally — that's the right move for WSL2. I want to push back on the Docker/shared-host argument though, and clarify the layering, because I think this PR is solving the right problem at the wrong layer.

**WebSocket keepalive has three distinct layers**

1. TCP keepalive: OS-level probes on the socket, owned by the kernel (sysctl or setKeepAlive)
2. WebSocket ping/pong: RFC 6455 control frames, owned by the WebSocket library or gateway
3. Application-protocol keepalive: e.g. Baileys' own keepalive stanzas, owned by the channel implementation
These should not be mixed. Each layer solves a different problem and belongs to a different owner.

**The correct layer is WebSocket protocol ping/pong**

Best practice for long-lived WebSocket connections through NATs and reverse proxies is RFC 6455 ping/pong control frames — not OS-level TCP keepalive, and not application-protocol stanzas. This is what the WebSocket protocol designed ping/pong for. WS ping/pong control frames:

- are real TCP traffic, so they refresh NAT and reverse-proxy idle timers
- give you dead-peer detection: a missed pong is an actionable signal to close and reconnect
A correct implementation looks like this:

```ts
// After WebSocket handshake completes
socket.on("pong", () => {
  pongReceived = true;
  if (pongTimer) {
    clearTimeout(pongTimer);
    pongTimer = null;
  }
});

pingTimer = setInterval(() => {
  if (closed) return;
  pongReceived = false;
  try {
    socket.ping();
  } catch {
    close(PONG_TIMEOUT_CLOSE_CODE, "ping failed");
  }
  pongTimer = setTimeout(() => {
    if (!closed && !pongReceived) {
      close(PONG_TIMEOUT_CLOSE_CODE, "pong timeout");
    }
  }, pongTimeoutMs);
}, pingIntervalMs);
```

This should be contributed upstream as a separate, focused PR — it applies to all gateway WebSocket connections, not just WhatsApp. It's a general hardening improvement, not a WhatsApp-specific workaround.

**For the WhatsApp channel specifically**

Baileys already has its own keepalive mechanism via keepAliveIntervalMs:

```ts
const sock = makeWASocket({
  // ... existing config ...
  keepAliveIntervalMs: 15_000, // lower from 30_000 for aggressive NAT environments
});
```

That's a one-line config change in session.ts.

**Why the per-socket setKeepAlive is the wrong layer**

It only takes effect when an env proxy agent is configured (wrapAgentWithTcpKeepalive returns undefined otherwise), so the default no-proxy path, which is how most users run, gets no keepalive at all.
Thanks for the focused PR and for following up on the review comments. I’m going to close this one because #73580 is now merged and is the preferred fix path for this class of WhatsApp disconnects.

The key reason is layering: Baileys already sends regular keepalive traffic on the WebSocket via keepAliveIntervalMs, and tuning that interval addresses the NAT-idle problem at the layer that owns the connection.

There is also an implementation mismatch in this PR as currently written: it wraps the env proxy Agent, so keepalive only applies when a proxy is configured; on the default no-proxy path the wrapper returns undefined and nothing changes.

If users can still reproduce #58481 after #73580 with the lowered keepalive interval, please comment there and we can revisit an OS-level approach.
What
Enable TCP keepalive on the underlying socket in the WhatsApp WebSocket connection, preventing idle TCP disconnects on Windows WSL2 Hyper-V NAT.
Why
Root cause: Windows Hyper-V NAT drops idle TCP connections after ~60 seconds. Baileys sends application-level WebSocket pings every 25-30 seconds, but NAT devices operate at the TCP layer and do not inspect WS frames. This causes repeated disconnect/reconnect storms on WSL2 (observed: ~70 reconnects in 70 minutes), each triggering a race condition in creds.json writes.

Mechanism: The fix calls socket.setKeepAlive(true, 15000) on every new TCP socket managed by the HTTP agent passed to Baileys. This sends OS-level TCP ACK probes well before the 60-second NAT timeout, keeping the connection alive even when application-level activity is sparse.
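As a plain-Node illustration of the mechanism (not code from this PR; the endpoint is a placeholder), enabling keepalive on a socket looks like this:

```ts
import net from "node:net";

// Placeholder endpoint; the 15_000 ms delay matches this PR's setting.
const socket = net.connect({ host: "example.com", port: 443 }, () => {
  // The OS sends the first keepalive probe after 15s of idle, well inside
  // the ~60s Hyper-V NAT timeout described above.
  socket.setKeepAlive(true, 15_000);
});
socket.on("error", (err) => console.error("socket error:", err));
```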
How

New file: extensions/whatsapp/src/tcp-keepalive-agent.ts

- wrapAgentWithTcpKeepalive(agent, opts) patches the agent's createConnection method to apply keepalive to every new socket
- Returns undefined when no agent is configured (graceful no-op)
Integration in extensions/whatsapp/src/session.ts

- Wraps the HTTP agent passed to makeWASocket (see the sketch below)
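A sketch of that wiring, assuming the helper names from the review thread (resolveEnvProxyAgent, resolveEnvFetchDispatcher) and Baileys' agent/fetchAgent options; the post-review ordering resolves the fetch dispatcher before wrapping:

```ts
const baseAgent = await resolveEnvProxyAgent(sessionLogger);
// Resolve the fetch dispatcher first so it sees the unwrapped agent...
const fetchAgent = await resolveEnvFetchDispatcher(sessionLogger, baseAgent);
// ...then patch keepalive onto the agent used for the WebSocket.
const agent = wrapAgentWithTcpKeepalive(baseAgent);

const sock = makeWASocket({
  agent,
  fetchAgent,
  // ...existing config...
});
```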
Test coverage: extensions/whatsapp/src/tcp-keepalive-agent.test.ts

Fixes
Closes #58481
Also related to #61788 (WhatsApp WebSocket ETIMEDOUT — different root cause but same symptom area)
Test Evidence
- All 71 WhatsApp extension tests pass
- tsc --noEmit clean for all modified files

AI-Assisted: Developed with GLM-5-turbo + Claude. Fully tested and reviewed.