fix(whatsapp): route captured replies through successor controller after restart#85823
Conversation
|
Codex review: needs real behavior proof before merge. Reviewed June 8, 2026, 10:55 PM ET / 02:55 UTC. Summary PR surface: Source +283, Tests +2. Total +285 across 5 files. Reproducibility: Do we have a high-confidence way to reproduce the issue? Source inspection and the PR's production before-logs make the failure path credible, but I did not reproduce it in a live WhatsApp setup; current main shows the captured reply closure remains tied to the old socketRef. Review metrics: 2 noteworthy metrics.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Proof guidance:
Mantis proof suggestion Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Land the focused fallback after live or maintainer-owned WhatsApp restart proof shows same-identity reply/media delivery through the successor controller and keeps the mismatch guard covered. Do we have a high-confidence way to reproduce the issue? Do we have a high-confidence way to reproduce the issue? Source inspection and the PR's production before-logs make the failure path credible, but I did not reproduce it in a live WhatsApp setup; current main shows the captured reply closure remains tied to the old socketRef. Is this the best way to solve the issue? Is this the best way to solve the issue? Yes pending proof: the PR keeps the repair inside the WhatsApp plugin's existing controller registry and centralizes the handoff for all current socket consumers, with same-identity guards rather than a broad fallback. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 9f48254f099a. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +283, Tests +2. Total +285 across 5 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
ClawSweeper PR egg 🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat. Where did the egg go?
|
|
Force-pushed v2 (commit @clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
v3 (commit
New regression test "accepts successor-controller fallback when only the LID-vs-PN form differs for the same account e164" covers the case. All 3 successor tests pass; 64/64 in the full behavior suite. On real-behavior proof: I don't have a paired WhatsApp account on this machine. The Vitest tests exercise the actual production code path ( @clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
aed79fa to
db97b76
Compare
db97b76 to
260ba6b
Compare
…ter restart When a channel-health-monitor restart cycle shuts down a WhatsApp ConnectionController mid-run (closeCurrentConnection nulls the controller's socketRef, stopDisconnectRetries aborts its retry signal, and shutdown unregisters it from the connection-controller registry), any in-flight inbound's captured reply closure is left bound to the now-stale socketRef. sendTrackedMessage's retry loop relies on the same controller's socketRef being repopulated by a reconnect, but the successor controller constructs its own fresh socketRef object, so the captured reply throws RECONNECT_IN_PROGRESS_ERROR indefinitely. Visible symptom: the agent silently fails to reply to a WhatsApp message after a transient 408 disconnect that triggered a health-monitor restart. Extend WhatsAppConnectionControllerHandle with getCurrentSock() and let attachWebInboxToSocket's getCurrentSock helper fall back to a successor controller's socket via the existing connection-controller registry. The fallback fires only when the local socketRef is null, so the steady-state path is unchanged. All current getCurrentSock callers (sendTrackedMessage, groupMetadata, readMessages) benefit. The fallback only swaps the socket object, not the listener's auth state. For the typical health-monitor restart (same accountId, same authDir, fresh socket), this is correct. For a hypothetical relink scenario where the same accountId is re-paired to a different authDir between restarts, JID resolution would still use the listener's original authDir; happy to add an authDir-match guard on the handle if maintainers want it. Test added at monitor-inbox.streams-inbound-messages.test-support.ts exercises the full lifecycle: controller A handles inbound, A shuts down (socketRef null + retries aborted + unregistered), B registers with its own live socket, captured A.reply routes through B's socket. AI-assisted: Claude Code (Opus 4.7) authored, codex review locally addressed.
…ter restart
Address review feedback: gate the successor-socket fallback by the
session's self-identity so an accountId relinked to a different
WhatsApp number cannot be silently accepted.
* WhatsAppConnectionControllerHandle exposes getSelfIdentity() returning
{ jid, lid } from the currently-authenticated socket.
* attachWebInboxToSocket captures the original socket's self identity
at attach time and only accepts the successor when
identitiesOverlap() returns true. The fallback returns null (caller
fails closed) when the successor is unauthenticated, missing, or
logged in as a different number.
* New regression test covers the mismatch path: the successor's
sendMessage is never called when the registered controller's
self identity does not overlap the original.
Known edge case left for maintainer judgment: identitiesOverlap()
compares jid + lid + e164 directly. If the original socket exposes
only one form (PN JID) and the successor exposes only the other
(LID), the overlap can be false-negative because the auth-dir-backed
PN<->LID reverse mapping is not consulted here. Happy to wire
resolveComparableIdentity(authDir) in if maintainers prefer
belt-and-suspenders.
AI-assisted: Claude Code (Opus 4.7) authored, codex review locally
addressed.
Address review feedback on the session-identity guard.
identitiesOverlap() compares { jid, lid, e164 } directly, so a
same-account restart where the original socket exposes only the PN
JID form and the successor exposes only the LID form would
false-negative without an e164 bridge.
Pre-resolve both sides via resolveComparableIdentity(identity,
authDir) so e164 derives from the auth-state PN<->LID mapping.
identitiesOverlap then matches on the normalized form regardless of
which surface form sock.user happens to expose at each end.
* Controller.getSelfIdentity() pre-resolves via this.authDir.
* attachWebInboxToSocket pre-resolves originalSelfIdentity via
options.authDir.
* New regression test: successor exposes only LID + shared e164,
original exposes only PN with e164 derived from JID; overlap
succeeds and the captured reply routes through the successor.
AI-assisted: Claude Code (Opus 4.7) authored, codex review locally
addressed.
260ba6b to
5df8c79
Compare
|
Merged via squash.
Thanks @itsuzef! |
…ter restart (openclaw#85823) Merged via squash. Prepared head SHA: 5df8c79 Co-authored-by: itsuzef <53057646+itsuzef@users.noreply.github.com> Co-authored-by: mcaxtr <7562095+mcaxtr@users.noreply.github.com> Reviewed-by: @mcaxtr
Summary
When the gateway's
channel-health-monitortears down a WhatsAppConnectionControllermid-run, any in-flight inbound message's capturedreply()closure is left bound to the now-stalesocketRefof the shutdown controller.sendTrackedMessage's retry loop only knows how to wait for that same controller'ssocketRefto be repopulated, but the successor controller constructs its own freshsocketRefobject — so the captured reply throwsRECONNECT_IN_PROGRESS_ERRORindefinitely.Visible symptom: the agent silently fails to reply to a WhatsApp message after a transient 408 disconnect that triggered a health-monitor restart. The user's message shows "read", then nothing.
Forensic reproducer
Observed in production 2026-05-22:
Fix
WhatsAppConnectionControllerHandle(extensions/whatsapp/src/connection-controller-registry.ts) withgetCurrentSock()andgetSelfIdentity().WhatsAppConnectionControllerimplements both from itssocketRef.attachWebInboxToSocket'sgetCurrentSock()helper falls back to the registered successor's socket when the localsocketRef.currentis null, ANDidentitiesOverlap()confirms the successor's{jid, lid}self-identity matches the original socket's self-identity captured at attach time. If the registered controller is for a different WhatsApp identity (in-place relink), it's not authenticated, or the original socket was never authenticated, the fallback returns null and the caller fails closed.The fallback fires only when the local socket is gone, so the steady-state hot path is unchanged. All current
getCurrentSock()callers —sendTrackedMessage,groupMetadata,readMessages— benefit, so post-restart group mentions and read receipts also recover, not just plain text sends.Real behavior proof
Two regression tests demonstrate the fix end-to-end through the production code path (
attachWebInboxToSocket→sendTrackedMessage→getCurrentSock→ registry →sock.sendMessage):The first test reproduces the full lifecycle of the production forensic above (controller A handles inbound → shutdown → B registered → captured reply routes through B's socket). The second test asserts the session-safety guard: when B is registered under the same
accountIdbut logged in as a different number, the captured reply throwsRECONNECT_IN_PROGRESS_ERRORand B'ssendMessageis never called.These are not mocks of the changed function — they exercise the real
attachWebInboxToSocket→sendTrackedMessage→getCurrentSock→getRegisteredWhatsAppConnectionControllerchain.Trade-off worth maintainer eyes
identitiesOverlap()compares{jid, lid, e164}directly without consulting the auth-dir-backed PN↔LID reverse mapping. If the original socket exposed only one form (PN JID) and the successor exposes only the other (LID), the overlap would be false-negative — captured replies would still throw rather than route through the new socket. The common case is thatsock.userhas bothidandlidpopulated, so the overlap succeeds.If maintainers want belt-and-suspenders here, I can wire
resolveComparableIdentity(authDir)into the handle so the e164 derivation is identical on both sides — happy to do that in a follow-up.Test plan
pnpm test extensions/whatsapp/src/monitor-inbox.behavior.test.ts— 63/63 passed (2 new VE-513 tests)pnpm test extensions/whatsapp/src/connection-controller-registry.test.ts— 1/1 passedpnpm exec oxfmt --check— cleanpnpm lint:extensions— cleanpnpm check:changed— full lane green (typecheck, lint, import cycles, deps guards)codex review --base upstream/main— 5 iterations addressed. Iter 1 (group meta + read receipts) fixed by lifting fallback intogetCurrentSock. Iter 2 (authDir match insufficient) fixed by switching to self-identity. Iter 3 (self id raw string compare) fixed by switching toidentitiesOverlap()helper. Iter 4 + Iter 5 (P2 PN/LID edge case) documented as trade-off above.Note on existing CI failures
checks-node-core-fastandinbound.media.test.tsfailures are preexisting onupstream/main(fcb9c46af0), not caused by this PR. Theinbound.mediafailures (WhatsApp runtime not initialized) reproduce on a clean stash of upstream/main with no diff applied. TheReal behavior proofandchecks-node-agentic-control-plane-runtimefailures are the standard fork-PR secret-availability issues (also seen on PR #85777).AI disclosure
Authored with Claude Code (Opus 4.7) assistance. Fully tested. Author confirms understanding of the change.