Skip to content

fix(whatsapp): preserve replies across reconnects#62892

Merged
mcaxtr merged 1 commit into
mainfrom
fix/whatsapp-reconnect-reply-path
Apr 9, 2026
Merged

fix(whatsapp): preserve replies across reconnects#62892
mcaxtr merged 1 commit into
mainfrom
fix/whatsapp-reconnect-reply-path

Conversation

@mcaxtr

@mcaxtr mcaxtr commented Apr 8, 2026

Copy link
Copy Markdown
Member

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: WhatsApp inbound reply closures captured the socket that existed when the inbound message object was created, so a reconnect could leave later replies sending on a dead socket.
  • Why it matters: long-running replies could be lost when the socket disconnected and reconnected before the send happened.
  • What changed: the reconnect loop now owns a shared current-socket ref, and the inbound send path resolves reply/media/composing sends against that current socket and waits through reconnect gaps using the existing reconnect policy.
  • What changed after review: timeout-only send failures still retry, but they no longer clear socketRef.current; only confirmed disconnect-style failures clear the ref.
  • What did NOT change (scope boundary): this does not add a broader reconnect retry subsystem and does not change deliver-reply.ts, config, or user-facing defaults.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #
  • Related #
  • This PR fixes a bug or regression

Root Cause (if applicable)

For bug fixes or regressions, explain why this happened, not just what changed. Otherwise write N/A. If the cause is unclear, write Unknown.

  • Root cause: monitorWebInbox() built reply / sendMedia / sendComposing closures against a specific Baileys socket, while monitorWebChannel() replaces that socket during reconnect.
  • Missing detection / guardrail: there was no regression test proving that an inbound message created on socket A still sends successfully after reconnect swaps to socket B or during the reconnect gap.
  • Contributing context (if known): the existing short retry behavior lived above the transport seam, so it could keep retrying stale send closures instead of re-resolving the active socket.

Regression Test Plan (if applicable)

For bug fixes or regressions, name the smallest reliable test coverage that should catch this. Otherwise write N/A.

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: extensions/whatsapp/src/monitor-inbox.streams-inbound-messages.test.ts
  • Scenario the test should lock in: an inbound reply created before reconnect should use the replacement socket, a reply sent during a reconnect gap should wait and succeed once a socket is available again, and a timed-out send should retry on the same socket without clearing the ref.
  • Why this is the smallest reliable guardrail: the bug is born at the WhatsApp inbound transport seam, so this test exercises the exact closure and reconnect behavior without needing a full live WhatsApp setup.
  • Existing test that already covers this (if any): None.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

  • WhatsApp auto-replies that finish after a reconnect should now send on the replacement socket instead of being lost on the stale one.
  • When the reply send lands during the reconnect gap, the send path waits through reconnect using the existing reconnect policy instead of failing immediately on the obsolete socket.
  • Timeout-only send failures keep retrying on the active socket instead of forcing the reconnect-gap path.
  • No config or default behavior changes outside this reconnect path.

Diagram (if applicable)

For UI changes or non-trivial logic flows, include a small ASCII diagram reviewers can scan quickly. Otherwise write N/A.

Before:
[inbound message on socket A] -> [reply closure captures socket A] -> [socket A closes, socket B connects] -> [reply send uses socket A] -> [reply can be lost]

After:
[inbound message on socket A] -> [reply send resolves current socket at send time] -> [socket A closes, socket B connects] -> [reply send uses socket B or waits for reconnect] -> [reply delivered]

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: Node 22 + pnpm workspace, local worktree
  • Model/provider: N/A
  • Integration/channel (if any): WhatsApp
  • Relevant config (redacted): default WhatsApp reconnect policy; no config changes required

Steps

  1. Start a WhatsApp auto-reply flow and let the inbound message be accepted on socket A.
  2. Force-close the active Meta/WhatsApp HTTPS socket, then temporarily block outbound :443 for the gateway user to create a reconnect gap.
  3. Let WhatsApp reconnect to socket B and wait for the reply to finish sending.

Expected

  • On the fixed branch, replies created before reconnect send successfully after reconnect, and timeout-only send failures do not strand the reply on a cleared socket ref.

Actual

  • On main, the real repro hit WhatsApp Web connected. and later still logged Retrying text ... Connection Closed, which dropped the reply.
  • On this branch, the same reconnect window produced repeated auto-reply sent (text) logs after reconnect, and the reply arrived successfully.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Verification logs:

  • Real repro on main: reconnect succeeded, then later send attempts logged Retrying text ... Connection Closed
  • Real repro on this branch: reconnect succeeded, then later sends logged repeated auto-reply sent (text)
  • pnpm test extensions/whatsapp/src/monitor-inbox.streams-inbound-messages.test.ts -> 1 file, 17 tests passed
  • pnpm build -> passed
  • follow-up reviewer fix commit: eac0982b17

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: ran the focused WhatsApp inbox regression tests for replacement-socket sends, reconnect-gap waits, and timeout retry behavior; ran pnpm build; and manually reproduced the broken behavior on main versus the fixed behavior on this branch with a forced WhatsApp reconnect.
  • Edge cases checked: reconnect-gap wait, replacement socket after reconnect, composing/media paths bound to the replacement socket, and timeout-only send retry behavior.
  • What you did not verify: non-WhatsApp channels, or any broader retry-system behavior outside this transport seam.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps:

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk: send retries now live in the WhatsApp inbound transport seam, so a reconnect policy with a very long wait could delay final failure longer than before.
    • Mitigation: the wait reuses the existing reconnect policy and stays scoped to the reconnect path; normal sends are unchanged.
  • Risk: active-listener send APIs still use their existing path and are not changed by this PR.
    • Mitigation: scope is intentionally limited to the stale inbound reply closure bug addressed here.

@openclaw-barnacle openclaw-barnacle Bot added channel: whatsapp-web Channel integration: whatsapp-web size: M maintainer Maintainer-authored PR labels Apr 8, 2026
@greptile-apps

greptile-apps Bot commented Apr 8, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a socket-staleness bug in the WhatsApp inbound path: reply, sendMedia, and sendComposing closures previously captured the socket that existed when an inbound message was created, so a reconnect before the send would silently route replies through a dead socket. The fix introduces a shared socketRef owned by the outer monitorWebChannel reconnect loop, a sendTrackedMessage helper that resolves the current socket at send time, and a bounded retry/wait path for the reconnect gap. The change is backward-compatible (the new options are all optional) and is covered by three focused regression tests.

Confidence Score: 5/5

Safe to merge — the fix is well-scoped, backward-compatible, and covered by focused regression tests.

All remaining findings are P2 or informational. The core logic in sendTrackedMessage, socket-ref lifecycle management in monitorWebChannel, and the terminal-exit stopDisconnectRetries calls are all correct. The three new regression tests directly lock in the repaired behavior.

No files require special attention.

Vulnerabilities

No security concerns identified. The change routes sends through a shared socket reference, does not introduce new network calls, does not expose secrets, and does not widen any trust boundary.

Reviews (2): Last reviewed commit: "fix: keep whatsapp socket ref on send ti..." | Re-trigger Greptile

Comment thread extensions/whatsapp/src/inbound/monitor.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f7d60d7219

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread extensions/whatsapp/src/inbound/monitor.ts Outdated
@mcaxtr

mcaxtr commented Apr 8, 2026

Copy link
Copy Markdown
Member Author

Addressed the review feedback in follow-up commit eac0982b17.

Changes made:

  • merged the duplicate ../session.js import in extensions/whatsapp/src/inbound/monitor.ts
  • tightened the retry path so timeout-only send failures do not clear socketRef.current
  • added a regression in extensions/whatsapp/src/monitor-inbox.streams-inbound-messages.test.ts covering a timed-out send that retries on the same socket

Validation run:

  • pnpm test extensions/whatsapp/src/monitor-inbox.streams-inbound-messages.test.ts
  • pnpm build
  • commit hook pnpm check

@greptile review
@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eac0982b17

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread extensions/whatsapp/src/inbound/monitor.ts Outdated
@mcaxtr mcaxtr force-pushed the fix/whatsapp-reconnect-reply-path branch from eac0982 to 2524abb Compare April 8, 2026 04:20
@mcaxtr

mcaxtr commented Apr 8, 2026

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Keep them coming!

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@aisle-research-bot

aisle-research-bot Bot commented Apr 8, 2026

Copy link
Copy Markdown

🔒 Aisle Security Analysis

We found 3 potential security issue(s) in this PR:

# Severity Title
1 🟡 Medium PII (WhatsApp JID) and error details logged during send retry backoff
2 🟡 Medium Unbounded concurrent send retry loops can exhaust resources during WhatsApp reconnect
3 🟡 Medium Potential duplicate WhatsApp sends due to retrying on broad regex-matched errors (includes timeouts)
1. 🟡 PII (WhatsApp JID) and error details logged during send retry backoff
Property Value
Severity Medium
CWE CWE-532
Location extensions/whatsapp/src/inbound/monitor.ts:224-227

Description

The sendTrackedMessage reconnect-gap retry loop logs the destination WhatsApp jid together with the formatted error message.

  • WhatsApp JIDs often embed phone numbers (E.164) and are PII.
  • logVerbose() writes to the structured logger at debug when shouldLogVerbose() is true (either interactive verbose mode or file log level debug is enabled), so this can end up in persisted/centralized logs.
  • formatError(err) returns err.message for Error instances, which may still include sensitive operational/session details depending on upstream error messages.

Vulnerable code:

logVerbose(
  `Waiting ${delayMs}ms for WhatsApp reconnect before retrying send to ${jid}: ${formatError(lastErr)}`,
);

Recommendation

Avoid logging raw identifiers and full error strings in retry loops.

  • Redact/hash the JID before logging (e.g., last 4 digits or a stable hash).
  • Log only an error code/class (or a fixed reason) rather than arbitrary err.message.

Example:

const redactedJid = jid.replace(/^\d+(?=@)/, (m) => `***${m.slice(-4)}`);
const reason = lastErr instanceof Error ? lastErr.name : "send_failed";
logVerbose(`Waiting ${delayMs}ms for reconnect before retrying send to ${redactedJid}: ${reason}`);

If you need full error detail, emit it only to secured traces with access controls, not general logs.

2. 🟡 Unbounded concurrent send retry loops can exhaust resources during WhatsApp reconnect
Property Value
Severity Medium
CWE CWE-400
Location extensions/whatsapp/src/inbound/monitor.ts:196-233

Description

sendTrackedMessage now performs a retry-with-backoff loop when the socket is disconnected and shouldRetryDisconnect() is true.

Because this function is called per outbound operation (reply/media/etc.) and there is no global throttling/queueing, a burst of inbound messages (or a single message that triggers many outbound sends) during a disconnect/reconnect window can create many concurrent retry loops. Each loop can:

  • allocate a long-lived promise chain
  • schedule repeated sleeps/timers via sleepWithAbort
  • repeat up to disconnectRetryPolicy.maxAttempts times (default 12)

If disconnectRetryAbortSignal is not always provided/wired in all call sites, these loops may persist until max attempts are reached even during shutdown, compounding the risk.

Vulnerable code:

for (let attempt = 1; ; attempt++) {
  const currentSock = getCurrentSock();
  ...
  const delayMs = computeBackoff(disconnectRetryPolicy, attempt);
  await sleepWithAbort(delayMs, options.disconnectRetryAbortSignal);
}

Recommendation

Add global backpressure for outbound sends during reconnect, so many concurrent sends do not each start their own retry loop.

Options (any one is fine):

  1. Centralized reconnect barrier: expose a single awaitReconnect() promise shared by all senders; while disconnected, all sends await the same promise instead of each doing their own backoff loop.

  2. Outbound send queue + concurrency limit: enqueue outbound sends and process with a small fixed concurrency. If disconnected, pause the queue.

  3. Fail-fast when socket is null (or after 1 attempt) and let higher layers decide whether to retry, plus add rate limiting.

Example sketch (shared barrier):

let reconnectWait: Promise<void> | null = null;
async function waitForReconnect(signal?: AbortSignal) {
  if (!reconnectWait) {
    reconnectWait = (async () => {
      for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        if (getCurrentSock()) return;
        await sleepWithAbort(computeBackoff(policy, attempt), signal);
      }
      throw new Error("reconnect timeout");
    })().finally(() => { reconnectWait = null; });
  }
  return reconnectWait;
}

async function sendTrackedMessage(jid: string, content: AnyMessageContent) {
  if (!getCurrentSock()) await waitForReconnect(options.disconnectRetryAbortSignal);
  return getCurrentSock()!.sendMessage(jid, content);
}

Also ensure all call sites provide disconnectRetryAbortSignal tied to shutdown, and consider a hard cap on total time waited.

3. 🟡 Potential duplicate WhatsApp sends due to retrying on broad regex-matched errors (includes timeouts)
Property Value
Severity Medium
CWE CWE-703
Location extensions/whatsapp/src/inbound/monitor.ts:38-40

Description

sendTrackedMessage retries sendMessage when the error text matches a broad regex that includes timed out. This can lead to duplicate outbound messages/media:

  • isRetryableSendDisconnectError treats any error whose formatted text contains closed|reset|timed out|disconnect|no active socket as retryable.
  • A timeout does not reliably imply the message was not accepted server-side; the first send may succeed but the client times out waiting for an ack/response.
  • Retrying sendMessage typically generates a new WhatsApp message ID, resulting in a second message being delivered.
  • The existing dedupe (rememberRecentOutboundMessage / isRecentOutboundMessage) only filters inbound echoes and does not prevent sending duplicates to recipients.

Vulnerable logic:

function isRetryableSendDisconnectError(err: unknown): boolean {
  return /closed|reset|timed\s*out|disconnect|no active socket/i.test(formatError(err));
}
...
const result = await currentSock.sendMessage(jid, content);
...
} catch (err) {
  if (!shouldRetryDisconnect() || !isRetryableSendDisconnectError(err)) {
    throw err;
  }// retry loop continues
}

Impact:

  • Under transient network issues (or induced latency), the gateway may send the same reply multiple times (spam/amplification), especially for non-idempotent sends (media, interactive messages).

Recommendation

Make retries idempotent and/or substantially narrow the retry conditions.

Suggested changes:

  1. Do not retry on generic timeouts unless you can prove the send did not reach WhatsApp:
function isRetryableSendDisconnectError(err: unknown): boolean {
  const msg = formatError(err);// Only retry on clear transport/socket replacement conditions
  return /no active socket|EPIPE|ECONNRESET|socket closed|disconnect/i.test(msg);
}
  1. If you must retry after uncertain outcomes (timeouts), implement an idempotency mechanism:
  • Generate a client-side idempotency key per logical outbound message (e.g., hash of jid + content + quoted context + timestamp bucket)
  • Persist it briefly and avoid re-sending when a retry is triggered unless you can correlate that WhatsApp did not accept the first attempt.
  1. Prefer using Baileys/transport-specific error codes (where available) instead of regex matching formatError(err) strings, to avoid accidentally classifying unrelated errors as retryable.

Analyzed PR: #62892 at commit baefbfe

Last updated on: 2026-04-09T03:06:21Z

@mcaxtr mcaxtr force-pushed the fix/whatsapp-reconnect-reply-path branch 3 times, most recently from 9ebcfc1 to 13454a9 Compare April 9, 2026 01:32

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13454a9aca

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +89 to +91
disconnectRetryPolicy.maxAttempts > 0
? disconnectRetryPolicy.maxAttempts
: DEFAULT_RECONNECT_POLICY.maxAttempts;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Allow at least one reconnect wait at maxAttempts=1

sendRetryMaxAttempts is taken directly from disconnectRetryPolicy.maxAttempts when it is positive, so a config of web.reconnect.maxAttempts: 1 causes sendTrackedMessage to fail on the first reconnect-gap iteration before any backoff sleep occurs. In that case, a reply that lands while socketRef.current is temporarily null throws no active socket - reconnection in progress immediately, even though the single reconnect attempt may still complete moments later; this undermines the new “preserve replies across reconnects” behavior for a valid finite-retry configuration.

Useful? React with 👍 / 👎.

@mcaxtr mcaxtr force-pushed the fix/whatsapp-reconnect-reply-path branch from 13454a9 to baefbfe Compare April 9, 2026 03:04
@mcaxtr mcaxtr merged commit da1da61 into main Apr 9, 2026
27 checks passed
@mcaxtr mcaxtr deleted the fix/whatsapp-reconnect-reply-path branch April 9, 2026 03:05
lovewanwan pushed a commit to lovewanwan/openclaw that referenced this pull request Apr 28, 2026
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 24, 2026
jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: whatsapp-web Channel integration: whatsapp-web maintainer Maintainer-authored PR size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant