Skip to content

fix(whatsapp): route captured replies through successor controller after restart#85823

Merged
mcaxtr merged 4 commits into
openclaw:mainfrom
itsuzef:fix/whatsapp-outbound-stale-socket-after-restart
Jun 9, 2026
Merged

fix(whatsapp): route captured replies through successor controller after restart#85823
mcaxtr merged 4 commits into
openclaw:mainfrom
itsuzef:fix/whatsapp-outbound-stale-socket-after-restart

Conversation

@itsuzef

@itsuzef itsuzef commented May 23, 2026

Copy link
Copy Markdown
Contributor

Summary

When the gateway's channel-health-monitor tears down a WhatsApp ConnectionController mid-run, any in-flight inbound message's captured reply() closure is left bound to the now-stale socketRef of the shutdown controller. sendTrackedMessage's retry loop only knows how to wait for that same controller's socketRef to be repopulated, but the successor controller constructs its own fresh socketRef object — so the captured reply throws RECONNECT_IN_PROGRESS_ERROR indefinitely.

Visible symptom: the agent silently fails to reply to a WhatsApp message after a transient 408 disconnect that triggered a health-monitor restart. The user's message shows "read", then nothing.

Forensic reproducer

Observed in production 2026-05-22:

12:00:39.242  [whatsapp] inbound message in group <redacted>@g.us
12:02:27.820  [whatsapp] Web connection closed (status 408 Request Time-out). Retry 1/12 in 2.14s…
12:02:27.934  [health-monitor] [whatsapp:default] health-monitor: restarting (reason: disconnected)
                    ↑ controller A shutdown ~114ms after the 408 close
12:02:34.182  → 12:02:35.650  controller B boots on a fresh socket
12:04:44.419  auto-reply delivery failed (correlationId=<redacted>)
12:04:44.421  auto-reply delivery failed (correlationId=<redacted>)
                    ↑ A's captured reply finally times out: no active socket - reconnection in progress

Fix

  • Extend WhatsAppConnectionControllerHandle (extensions/whatsapp/src/connection-controller-registry.ts) with getCurrentSock() and getSelfIdentity().
  • WhatsAppConnectionController implements both from its socketRef.
  • attachWebInboxToSocket's getCurrentSock() helper falls back to the registered successor's socket when the local socketRef.current is null, AND identitiesOverlap() confirms the successor's {jid, lid} self-identity matches the original socket's self-identity captured at attach time. If the registered controller is for a different WhatsApp identity (in-place relink), it's not authenticated, or the original socket was never authenticated, the fallback returns null and the caller fails closed.

The fallback fires only when the local socket is gone, so the steady-state hot path is unchanged. All current getCurrentSock() callers — sendTrackedMessage, groupMetadata, readMessages — benefit, so post-restart group mentions and read receipts also recover, not just plain text sends.

Real behavior proof

Two regression tests demonstrate the fix end-to-end through the production code path (attachWebInboxToSocketsendTrackedMessagegetCurrentSock → registry → sock.sendMessage):

$ pnpm test extensions/whatsapp/src/monitor-inbox.behavior.test.ts -t "successor"

 ✓ web monitor inbox > routes the captured reply through a successor controller when the original controller's socket is gone (14ms)
 ✓ web monitor inbox > refuses successor-controller fallback when the registered controller's self JID does not match (8ms)

      Tests  2 passed | 61 skipped (63)

The first test reproduces the full lifecycle of the production forensic above (controller A handles inbound → shutdown → B registered → captured reply routes through B's socket). The second test asserts the session-safety guard: when B is registered under the same accountId but logged in as a different number, the captured reply throws RECONNECT_IN_PROGRESS_ERROR and B's sendMessage is never called.

These are not mocks of the changed function — they exercise the real attachWebInboxToSocketsendTrackedMessagegetCurrentSockgetRegisteredWhatsAppConnectionController chain.

Trade-off worth maintainer eyes

identitiesOverlap() compares {jid, lid, e164} directly without consulting the auth-dir-backed PN↔LID reverse mapping. If the original socket exposed only one form (PN JID) and the successor exposes only the other (LID), the overlap would be false-negative — captured replies would still throw rather than route through the new socket. The common case is that sock.user has both id and lid populated, so the overlap succeeds.

If maintainers want belt-and-suspenders here, I can wire resolveComparableIdentity(authDir) into the handle so the e164 derivation is identical on both sides — happy to do that in a follow-up.

Test plan

  • pnpm test extensions/whatsapp/src/monitor-inbox.behavior.test.ts — 63/63 passed (2 new VE-513 tests)
  • pnpm test extensions/whatsapp/src/connection-controller-registry.test.ts — 1/1 passed
  • pnpm exec oxfmt --check — clean
  • pnpm lint:extensions — clean
  • pnpm check:changed — full lane green (typecheck, lint, import cycles, deps guards)
  • codex review --base upstream/main — 5 iterations addressed. Iter 1 (group meta + read receipts) fixed by lifting fallback into getCurrentSock. Iter 2 (authDir match insufficient) fixed by switching to self-identity. Iter 3 (self id raw string compare) fixed by switching to identitiesOverlap() helper. Iter 4 + Iter 5 (P2 PN/LID edge case) documented as trade-off above.

Note on existing CI failures

checks-node-core-fast and inbound.media.test.ts failures are preexisting on upstream/main (fcb9c46af0), not caused by this PR. The inbound.media failures (WhatsApp runtime not initialized) reproduce on a clean stash of upstream/main with no diff applied. The Real behavior proof and checks-node-agentic-control-plane-runtime failures are the standard fork-PR secret-availability issues (also seen on PR #85777).

AI disclosure

Authored with Claude Code (Opus 4.7) assistance. Fully tested. Author confirms understanding of the change.

@openclaw-barnacle openclaw-barnacle Bot added channel: whatsapp-web Channel integration: whatsapp-web size: S triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 23, 2026
@clawsweeper

clawsweeper Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge. Reviewed June 8, 2026, 10:55 PM ET / 02:55 UTC.

Summary
The PR extends the WhatsApp connection-controller registry with current-socket and self-identity access, then routes captured inbound reply/media sends through a same-identity successor controller after restart with regression tests.

PR surface: Source +283, Tests +2. Total +285 across 5 files.

Reproducibility: Do we have a high-confidence way to reproduce the issue? Source inspection and the PR's production before-logs make the failure path credible, but I did not reproduce it in a live WhatsApp setup; current main shows the captured reply closure remains tied to the old socketRef.

Review metrics: 2 noteworthy metrics.

  • Socket consumers affected: 3 existing consumers. sendTrackedMessage, group metadata, and read receipt paths all consult getCurrentSock, so the successor fallback changes more than plain text replies.
  • Registry handle methods: 2 added. The internal WhatsApp controller handle now exposes live socket and self-identity state that restart delivery depends on.

Merge readiness
Overall: 🦪 silver shellfish
Proof: 🦪 silver shellfish
Patch quality: 🐚 platinum hermit
Result: blocked until real behavior proof from a real setup is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P1] Add redacted live WhatsApp proof showing a health-monitor restart followed by successful reply delivery through the successor controller.
  • Include media or terminal/log evidence if available, with private phone numbers, endpoints, and credentials redacted.

Proof guidance:

  • [P1] Needs real behavior proof before merge: The PR has production before-logs and after-fix Vitest coverage only; add redacted live logs, terminal output, or a recording showing restart followed by successful WhatsApp reply delivery, then update the PR body for automatic re-review or ask a maintainer to comment @clawsweeper re-review.

Mantis proof suggestion
A real transport or visible desktop proof would materially increase confidence that the restart handoff works outside the mocked harness. A maintainer can ask Mantis to capture proof by posting a new PR comment that starts with the OpenClaw Mantis account mention, followed by:

visual task: verify WhatsApp Web reply and media delivery still reach the same chat after a health-monitor restart replaces the controller.

Risk before merge

  • [P1] The changed behavior is user-visible WhatsApp message delivery after a restart, but after-fix proof is still limited to Vitest/harness coverage rather than a real Baileys/WhatsApp Web session.
  • [P1] The successor fallback depends on same-account self-identity overlap; if real socket user identity differs from the harness assumptions, replies could still fail closed or route through the wrong successor.

Maintainer options:

  1. Require live restart proof (recommended)
    Before merge, capture redacted real WhatsApp proof showing a health-monitor restart followed by successful reply and media delivery through the successor controller.
  2. Maintainer-owned proof override
    A maintainer can intentionally own the risk by running the live proof path or equivalent Mantis/Crabbox validation and recording the result before merge.
  3. Pause if proof stays unavailable
    If no real WhatsApp setup can validate the restart handoff, keep the PR paused rather than merging a transport-level fallback on harness proof alone.

Next step before merge

  • [P1] No automated code repair is indicated; the remaining blocker is contributor or maintainer-provided real WhatsApp behavior proof before merge.

Security
Cleared: No supply-chain, dependency, workflow, or secret-handling regression was found; the auth-sensitive socket handoff is guarded by same-identity checks.

Review details

Best possible solution:

Land the focused fallback after live or maintainer-owned WhatsApp restart proof shows same-identity reply/media delivery through the successor controller and keeps the mismatch guard covered.

Do we have a high-confidence way to reproduce the issue?

Do we have a high-confidence way to reproduce the issue? Source inspection and the PR's production before-logs make the failure path credible, but I did not reproduce it in a live WhatsApp setup; current main shows the captured reply closure remains tied to the old socketRef.

Is this the best way to solve the issue?

Is this the best way to solve the issue? Yes pending proof: the PR keeps the repair inside the WhatsApp plugin's existing controller registry and centralizes the handoff for all current socket consumers, with same-identity guards rather than a broad fallback.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 9f48254f099a.

Label changes

Label changes:

  • add rating: 🦪 silver shellfish: Overall readiness is 🦪 silver shellfish; proof is 🦪 silver shellfish and patch quality is 🐚 platinum hermit.
  • remove rating: 🧂 unranked krab: Current PR rating is rating: 🦪 silver shellfish, so this older rating label is no longer current.

Label justifications:

  • P1: The PR addresses a real user-facing WhatsApp channel failure where messages can be read but not answered after a health-monitor restart.
  • merge-risk: 🚨 message-delivery: Merging changes which socket sends captured replies, media, group metadata requests, and read receipts after a restart.
  • merge-risk: 🚨 auth-provider: The handoff is guarded by WhatsApp self-identity/auth-dir normalization, so incorrect real identity behavior could affect account routing.
  • rating: 🦪 silver shellfish: Overall readiness is 🦪 silver shellfish; proof is 🦪 silver shellfish and patch quality is 🐚 platinum hermit.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR has production before-logs and after-fix Vitest coverage only; add redacted live logs, terminal output, or a recording showing restart followed by successful WhatsApp reply delivery, then update the PR body for automatic re-review or ask a maintainer to comment @clawsweeper re-review.
Evidence reviewed

PR surface:

Source +283, Tests +2. Total +285 across 5 files.

View PR surface stats
Area Files Added Removed Net
Source 4 285 2 +283
Tests 1 2 0 +2
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 5 287 2 +285

What I checked:

Likely related people:

  • mcaxtr: Git history ties the current reconnect socketRef behavior to da1da61 and the controller registry/lifecycle path to aa023e4; the latest PR head also includes a test commit from this author. (role: feature-history owner and recent contributor; confidence: high; commits: da1da6110209, aa023e428306, db97b76ecea7; files: extensions/whatsapp/src/inbound/monitor.ts, extensions/whatsapp/src/connection-controller.ts, extensions/whatsapp/src/connection-controller-registry.ts)
  • steipete: Shortlog and history show repeated work on WhatsApp monitor, identity, and plugin seam refactors adjacent to this restart handoff path. (role: adjacent area contributor; confidence: medium; commits: 3b6d980c52, 66743b84fa, c09031f15a; files: extensions/whatsapp/src/inbound/monitor.ts, extensions/whatsapp/src/identity.ts)
  • vincentkoc: Current main blame and recent history show release-baseline and retry/auth-state stabilization commits touching these WhatsApp files after the original lifecycle work. (role: recent area contributor; confidence: medium; commits: 3a2176267c75, 2e08f0f4221f, d70e6b13d7; files: extensions/whatsapp/src/inbound/monitor.ts, extensions/whatsapp/src/connection-controller.ts, extensions/whatsapp/src/connection-controller-registry.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 auth-provider 🚨 May break OAuth, tokens, provider routing, model choice, or credentials. labels May 23, 2026
@clawsweeper

clawsweeper Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@itsuzef

itsuzef commented May 23, 2026

Copy link
Copy Markdown
Contributor Author

Force-pushed v2 (commit 60edb36a48) addressing ClawSweeper's P1 auth-state concern. Successor-socket fallback now gated by identitiesOverlap(originalSelfIdentity, successor.getSelfIdentity()) — refuses fallback when registered controller is logged in as a different WhatsApp number. New regression test covers the mismatch path. PR body updated with real-behavior proof (two vitest cases exercising the full production code path).

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@itsuzef

itsuzef commented May 23, 2026

Copy link
Copy Markdown
Contributor Author

v3 (commit aed79fa18b) addresses the P2 PN/LID normalization concern.

getSelfIdentity() now pre-resolves via resolveComparableIdentity(identity, authDir) on both sides, so e164 derives from the auth-state PN↔LID mapping. identitiesOverlap() matches on the normalized form even when the original socket exposes only the PN JID and the successor exposes only the LID (the codebase's existing identity-comparison helper handles this).

New regression test "accepts successor-controller fallback when only the LID-vs-PN form differs for the same account e164" covers the case. All 3 successor tests pass; 64/64 in the full behavior suite.

On real-behavior proof: I don't have a paired WhatsApp account on this machine. The Vitest tests exercise the actual production code path (attachWebInboxToSocketsendTrackedMessagegetCurrentSockgetRegisteredWhatsAppConnectionControllersock.sendMessage), not mocks of the changed function. If a maintainer wants live restart proof, the Mantis suggestion in your previous comment is a clean way to capture it.

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@mcaxtr mcaxtr self-assigned this Jun 9, 2026
@mcaxtr mcaxtr force-pushed the fix/whatsapp-outbound-stale-socket-after-restart branch from aed79fa to db97b76 Compare June 9, 2026 02:48
@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. labels Jun 9, 2026
@mcaxtr mcaxtr force-pushed the fix/whatsapp-outbound-stale-socket-after-restart branch from db97b76 to 260ba6b Compare June 9, 2026 03:03
itsuzef and others added 4 commits June 9, 2026 00:04
…ter restart

When a channel-health-monitor restart cycle shuts down a WhatsApp
ConnectionController mid-run (closeCurrentConnection nulls the
controller's socketRef, stopDisconnectRetries aborts its retry
signal, and shutdown unregisters it from the
connection-controller registry), any in-flight inbound's captured
reply closure is left bound to the now-stale socketRef.
sendTrackedMessage's retry loop relies on the same controller's
socketRef being repopulated by a reconnect, but the successor
controller constructs its own fresh socketRef object, so the
captured reply throws RECONNECT_IN_PROGRESS_ERROR indefinitely.
Visible symptom: the agent silently fails to reply to a WhatsApp
message after a transient 408 disconnect that triggered a
health-monitor restart.

Extend WhatsAppConnectionControllerHandle with getCurrentSock() and
let attachWebInboxToSocket's getCurrentSock helper fall back to a
successor controller's socket via the existing
connection-controller registry. The fallback fires only when the
local socketRef is null, so the steady-state path is unchanged. All
current getCurrentSock callers (sendTrackedMessage, groupMetadata,
readMessages) benefit.

The fallback only swaps the socket object, not the listener's auth
state. For the typical health-monitor restart (same accountId, same
authDir, fresh socket), this is correct. For a hypothetical relink
scenario where the same accountId is re-paired to a different
authDir between restarts, JID resolution would still use the
listener's original authDir; happy to add an authDir-match guard on
the handle if maintainers want it.

Test added at monitor-inbox.streams-inbound-messages.test-support.ts
exercises the full lifecycle: controller A handles inbound, A shuts
down (socketRef null + retries aborted + unregistered), B registers
with its own live socket, captured A.reply routes through B's
socket.

AI-assisted: Claude Code (Opus 4.7) authored, codex review locally
addressed.
…ter restart

Address review feedback: gate the successor-socket fallback by the
session's self-identity so an accountId relinked to a different
WhatsApp number cannot be silently accepted.

* WhatsAppConnectionControllerHandle exposes getSelfIdentity() returning
  { jid, lid } from the currently-authenticated socket.
* attachWebInboxToSocket captures the original socket's self identity
  at attach time and only accepts the successor when
  identitiesOverlap() returns true. The fallback returns null (caller
  fails closed) when the successor is unauthenticated, missing, or
  logged in as a different number.
* New regression test covers the mismatch path: the successor's
  sendMessage is never called when the registered controller's
  self identity does not overlap the original.

Known edge case left for maintainer judgment: identitiesOverlap()
compares jid + lid + e164 directly. If the original socket exposes
only one form (PN JID) and the successor exposes only the other
(LID), the overlap can be false-negative because the auth-dir-backed
PN<->LID reverse mapping is not consulted here. Happy to wire
resolveComparableIdentity(authDir) in if maintainers prefer
belt-and-suspenders.

AI-assisted: Claude Code (Opus 4.7) authored, codex review locally
addressed.
Address review feedback on the session-identity guard.
identitiesOverlap() compares { jid, lid, e164 } directly, so a
same-account restart where the original socket exposes only the PN
JID form and the successor exposes only the LID form would
false-negative without an e164 bridge.

Pre-resolve both sides via resolveComparableIdentity(identity,
authDir) so e164 derives from the auth-state PN<->LID mapping.
identitiesOverlap then matches on the normalized form regardless of
which surface form sock.user happens to expose at each end.

* Controller.getSelfIdentity() pre-resolves via this.authDir.
* attachWebInboxToSocket pre-resolves originalSelfIdentity via
  options.authDir.
* New regression test: successor exposes only LID + shared e164,
  original exposes only PN with e164 derived from JID; overlap
  succeeds and the captured reply routes through the successor.

AI-assisted: Claude Code (Opus 4.7) authored, codex review locally
addressed.
@mcaxtr mcaxtr force-pushed the fix/whatsapp-outbound-stale-socket-after-restart branch from 260ba6b to 5df8c79 Compare June 9, 2026 03:05
@mcaxtr mcaxtr merged commit 9210d8f into openclaw:main Jun 9, 2026
30 of 31 checks passed
@mcaxtr

mcaxtr commented Jun 9, 2026

Copy link
Copy Markdown
Member

Merged via squash.

Thanks @itsuzef!

vincentkoc pushed a commit that referenced this pull request Jun 9, 2026
…ter restart (#85823)

Merged via squash.

Prepared head SHA: 5df8c79
Co-authored-by: itsuzef <53057646+itsuzef@users.noreply.github.com>
Co-authored-by: mcaxtr <7562095+mcaxtr@users.noreply.github.com>
Reviewed-by: @mcaxtr

(cherry picked from commit 9210d8f)
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request Jun 9, 2026
…ter restart (openclaw#85823)

Merged via squash.

Prepared head SHA: 5df8c79
Co-authored-by: itsuzef <53057646+itsuzef@users.noreply.github.com>
Co-authored-by: mcaxtr <7562095+mcaxtr@users.noreply.github.com>
Reviewed-by: @mcaxtr
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: whatsapp-web Channel integration: whatsapp-web merge-risk: 🚨 auth-provider 🚨 May break OAuth, tokens, provider routing, model choice, or credentials. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. P1 High-priority user-facing bug, regression, or broken workflow. rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. size: M status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants