fix(whatsapp): report transport activity so stale-socket health detection works#72656

Merged
mcaxtr merged 4 commits into openclaw:main from Sathvik-1007:fix/whatsapp-stale-socket-detection
Apr 29, 2026
Conversation

@Sathvik-1007
Contributor

Summary

  • Problem: WhatsApp never sets lastTransportActivityAt in its status snapshot. The gateway health policy skips stale-socket detection when this field is null (channel-health-policy.ts:117), so a wedged WhatsApp socket is invisible to the health monitor indefinitely.
  • Why it matters: This is the root cause of [Bug]: Gateway wedges silently mid-session after 2026.4.15 — only recovers on WhatsApp 408 + health monitor restart #67986 — the gateway can wedge silently mid-session and only recovers when WhatsApp's own 408 timeout fires (up to 30 min). The health monitor, which checks every 5 minutes, can't help because it thinks WhatsApp is healthy.
  • What changed: WhatsApp monitor-state.ts now calls createTransportActivityStatusPatch(at) on noteConnected and noteInbound, matching the pattern already used by Discord, Matrix, Slack, and Telegram. Added lastTransportActivityAt to the WebChannelStatus type.
  • What did NOT change (scope boundary): No gateway core changes. No health policy changes. No threshold tuning. Extension-only.
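
The change described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the status shape is simplified, the bodies of createConnectedChannelStatusPatch and createTransportActivityStatusPatch are assumptions based on the field names mentioned in this description, and createMonitorStateSketch is a hypothetical name.

```typescript
// Simplified, hypothetical status shape; the real WebChannelStatus has more fields.
interface WebChannelStatus {
  connected: boolean;
  lastConnectedAt?: number;
  lastEventAt?: number;
  lastTransportActivityAt?: number;
}

type StatusPatch = Partial<WebChannelStatus>;

// Stand-in for the shared helper. Per the PR text it sets
// connected/lastConnectedAt/lastEventAt; the exact body is an assumption.
function createConnectedChannelStatusPatch(at: number): StatusPatch {
  return { connected: true, lastConnectedAt: at, lastEventAt: at };
}

// Stand-in for the shared helper; the body is an assumption.
function createTransportActivityStatusPatch(at: number): StatusPatch {
  return { lastTransportActivityAt: at };
}

// Hypothetical miniature of the WhatsApp monitor state.
function createMonitorStateSketch() {
  let status: WebChannelStatus = { connected: false };

  const apply = (...patches: StatusPatch[]): void => {
    for (const patch of patches) {
      status = { ...status, ...patch };
    }
  };

  return {
    // On connect: existing connected patch plus the new transport-activity patch.
    noteConnected(at: number): void {
      apply(createConnectedChannelStatusPatch(at), createTransportActivityStatusPatch(at));
    },
    // On inbound traffic: refresh the event time and the transport-activity time.
    noteInbound(at: number): void {
      apply({ lastEventAt: at }, createTransportActivityStatusPatch(at));
    },
    snapshot(): WebChannelStatus {
      return { ...status };
    },
  };
}
```

In this sketch the initial snapshot carries no lastTransportActivityAt key at all, so a channel that has never connected stays outside the stale-socket check.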

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Integrations

Linked Issue/PR

Root Cause (if applicable)

  • Root cause: d8d0380297 ("fix: use transport activity for stale health", Apr 22) added createTransportActivityStatusPatch calls to Discord, Matrix, Slack, and Telegram — but WhatsApp was not updated. WhatsApp's monitor-state.ts only called createConnectedChannelStatusPatch, which sets connected/lastConnectedAt/lastEventAt but not lastTransportActivityAt.
  • Missing detection / guardrail: The health policy intentionally skips stale-socket checks when lastTransportActivityAt is null (to avoid false positives on channels without transport tracking). This is correct behavior — the gap was WhatsApp not opting in.
  • Contributing context (if known): WhatsApp's monitor-state.ts was created in a separate refactor (66743b84fa, Mar 22) before transport activity was introduced, so it was easy to miss during the Apr 22 rollout.
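
To illustrate why a missing timestamp hides a wedged socket, here is a hedged sketch of the opt-in guard described above. It is not the contents of channel-health-policy.ts: evaluateStaleSocketSketch, TransportSnapshot, and the staleAfterMs parameter are hypothetical names, and the real policy checks more conditions than this.

```typescript
type HealthVerdict = "healthy" | "stale-socket";

// Hypothetical, reduced view of a channel status snapshot.
interface TransportSnapshot {
  connected: boolean;
  lastTransportActivityAt?: number | null;
}

// Sketch of the guard: channels that never report transport activity
// are skipped, so they can never be flagged as stale.
function evaluateStaleSocketSketch(
  snapshot: TransportSnapshot,
  now: number,
  staleAfterMs: number,
): HealthVerdict {
  if (snapshot.lastTransportActivityAt == null) {
    // No transport tracking: skip the check to avoid false positives.
    return "healthy";
  }
  const silentForMs = now - snapshot.lastTransportActivityAt;
  return silentForMs > staleAfterMs ? "stale-socket" : "healthy";
}

// Illustrative values only; 30 minutes mirrors the threshold quoted in this PR.
const THIRTY_MINUTES_MS = 30 * 60 * 1000;
const now = 10 * 60 * 60 * 1000;

// Before the fix: WhatsApp never sets the field, so the verdict is always "healthy".
const beforeFix = evaluateStaleSocketSketch({ connected: true }, now, THIRTY_MINUTES_MS);

// After the fix: a timestamp older than the threshold is detected as stale.
const afterFix = evaluateStaleSocketSketch(
  { connected: true, lastTransportActivityAt: now - 45 * 60 * 1000 },
  now,
  THIRTY_MINUTES_MS,
);
```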

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
  • Target test or file: extensions/whatsapp/src/auto-reply/monitor-state.test.ts
  • Scenario the test should lock in: noteConnected and noteInbound both set lastTransportActivityAt to a non-null timestamp, enabling evaluateChannelHealth to run the stale-socket check path.
  • Why this is the smallest reliable guardrail: The test directly asserts the status snapshot fields that the health policy checks. No integration test needed — the health policy itself is already well-tested (16 cases in channel-health-policy.test.ts).

User-visible / Behavior Changes

WhatsApp channels now participate in the gateway's stale-socket health detection. If the WhatsApp WebSocket goes silent, the health monitor will detect it and restart the channel instead of waiting for WhatsApp's own 408 timeout (which can take up to 30 minutes).

Diagram (if applicable)

Before:
WhatsApp socket wedges → lastTransportActivityAt = null → health policy skips check → healthy forever → waits for 408

After:
WhatsApp socket wedges → lastTransportActivityAt = stale timestamp → health policy detects stale → restart

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: Linux (CachyOS)
  • Runtime/container: Node 22
  • Integration/channel (if any): WhatsApp

Steps

  1. Run pnpm test extensions/whatsapp/src/auto-reply/monitor-state.test.ts — 4/4 pass
  2. Run pnpm test src/gateway/channel-health-policy.test.ts — 16/16 pass
  3. Run pnpm test:extension whatsapp — 537/537 pass
  4. Run pnpm build — clean
  5. Run pnpm tsgo:prod — clean
  6. Run pnpm check:changed — 113/113 pass

Expected

noteConnected and noteInbound set lastTransportActivityAt so the health monitor can detect stale WhatsApp sockets.

Actual

Confirmed via unit tests.

Evidence

  • Failing test/log before + passing after

New test file monitor-state.test.ts verifies lastTransportActivityAt is set. Before this change, the field was never present in the status snapshot.

Human Verification (required)

  • Verified scenarios: noteConnected sets transport activity, noteInbound updates it, noteWatchdogStale does NOT update it (correct — watchdog means the socket is idle)
  • Edge cases checked: Verified noteWatchdogStale preserves old lastTransportActivityAt instead of refreshing it. Verified initial status has no lastTransportActivityAt key (null-ish, won't trigger stale check before first connection).
  • What you did not verify: Live WhatsApp connection test (no WhatsApp account available). The commenter's separate "Disconnected: gateway stopping" issue when reading config — that's a different code path (server-close.ts:108) unrelated to transport activity.
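
The watchdog edge case above can be illustrated with a small sketch. It assumes that the watchdog path only records its own marker and leaves the transport timestamp alone; the watchdogStale field and the noteWatchdogStaleSketch function are hypothetical, and the real noteWatchdogStale may record different fields.

```typescript
// Hypothetical, reduced status shape for this illustration.
interface WatchdogStatusSketch {
  connected: boolean;
  watchdogStale?: boolean; // hypothetical marker, not a confirmed field
  lastTransportActivityAt?: number;
}

// Sketch: a watchdog-stale event means the socket has been idle,
// so it must not count as fresh transport activity.
function noteWatchdogStaleSketch(status: WatchdogStatusSketch): WatchdogStatusSketch {
  return { ...status, watchdogStale: true };
}

const connectedStatus: WatchdogStatusSketch = {
  connected: true,
  lastTransportActivityAt: 1_000,
};
const afterWatchdog = noteWatchdogStaleSketch(connectedStatus);

const neverConnected: WatchdogStatusSketch = { connected: false };
const neverConnectedAfterWatchdog = noteWatchdogStaleSketch(neverConnected);
```

If the watchdog refreshed the timestamp instead, an idle socket would look active again and the stale-socket check would be pushed back on every watchdog tick.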

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: WhatsApp channels will now be restarted by the health monitor when the socket goes stale (previously they were silently ignored). This is the intended fix, but could surface as unexpected restarts if the stale threshold is too aggressive for WhatsApp's polling pattern.
    • Mitigation: The stale threshold is 30 minutes, which is well above WhatsApp's normal heartbeat interval. The same threshold works for Discord, Matrix, Slack, and Telegram without issues.

@openclaw-barnacle openclaw-barnacle Bot added the channel: whatsapp-web and size: S labels Apr 27, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Apr 27, 2026

Greptile Summary

This PR fixes a gap where WhatsApp's monitor-state.ts never set lastTransportActivityAt in its status snapshot, causing the gateway health policy to skip stale-socket detection for WhatsApp entirely. The fix adds createTransportActivityStatusPatch(at) calls to noteConnected and noteInbound, matching the pattern already used by Telegram and Matrix, and adds the corresponding field to WebChannelStatus.

Confidence Score: 5/5

Safe to merge — minimal, targeted fix with full test coverage for the affected paths.

The change is small (3 files, ~10 lines of logic), follows an established pattern already proven by other extensions, the new unit tests directly verify all critical behaviors (connected sets activity, inbound updates it, watchdog-stale preserves it), and no existing tests are affected.

No files require special attention.

Reviews (1): Last reviewed commit: "fix(whatsapp): report transport activity..."

@clawsweeper
Contributor

clawsweeper Bot commented Apr 29, 2026

Codex review: keeping this open for maintainer follow-up; there is still a little grit to resolve.

Keep open. Current main still does not publish WhatsApp lastTransportActivityAt into the channel status snapshot consumed by the generic gateway health policy, so this PR's intended behavior remains unimplemented. The existing main branch only tracks WhatsApp transport activity inside the WhatsApp connection controller and logs it from heartbeat snapshots; it does not expose that timestamp through WebChannelStatus or createWebChannelStatusController.

Best possible solution:

Keep this PR open for maintainer review. The best path is to review and land or revise the narrow WhatsApp-owned status propagation: publish real transport activity through WebChannelStatus, keep the generic gateway policy unchanged, add focused monitor-state/controller regression tests, and decide whether the new transport timeout default is the right product threshold for WhatsApp.

What I checked:

Likely related people:

  • steipete: The provided PR context and prior ClawSweeper review identify this maintainer with the WhatsApp monitor-state refactor and generic transport-activity health contract; current shallow blame also routes the central status and health-policy files through Peter Steinberger's boundary commit in this checkout. (role: introduced adjacent health/status behavior; confidence: high; commits: 66743b84fac2, 2e775fb03e3a, d8d0380297f4; files: extensions/whatsapp/src/auto-reply/monitor-state.ts, extensions/whatsapp/src/auto-reply/types.ts, src/gateway/channel-health-policy.ts)
  • vincentkoc: The provided related-context thread identifies recent merged WhatsApp quiet-socket/watchdog work by this maintainer, and the current changelog credits @vincentkoc on WhatsApp/Web watchdog reliability work in the same liveness area. (role: recent WhatsApp liveness maintainer; confidence: high; commits: e672b61417af, 2377a3a266f8; files: extensions/whatsapp/src/connection-controller.ts, extensions/whatsapp/src/auto-reply/monitor.ts, extensions/whatsapp/src/auto-reply.web-auto-reply.connection-and-logging.e2e.test.ts)
  • mcaxtr: The timeline shows this maintainer assigned the PR, and the PR commit list includes maintainer hardening and test-fix commits on the current branch. This is routing context for review rather than current-main authorship. (role: current PR follow-up maintainer; confidence: medium; commits: 1a3d2f9ef174, 7b2f0a8ebe5a; files: extensions/whatsapp/src/connection-controller.ts, extensions/whatsapp/src/auto-reply/monitor.ts, extensions/whatsapp/src/auto-reply/monitor-state.ts)

Remaining risk / open question:

Codex review notes: model gpt-5.5, reasoning high; reviewed against 64533ed7b1e7.

@mcaxtr mcaxtr force-pushed the fix/whatsapp-stale-socket-detection branch 3 times, most recently from a389d5c to abc3c18 on April 29, 2026 03:41
@mcaxtr mcaxtr force-pushed the fix/whatsapp-stale-socket-detection branch from abc3c18 to 1b19207 on April 29, 2026 03:46
@mcaxtr mcaxtr merged commit 7ddd815 into openclaw:main Apr 29, 2026
11 checks passed
@mcaxtr
Member

mcaxtr commented Apr 29, 2026

Merged via squash.

Thanks @Sathvik-1007!

Labels

channel: whatsapp-web, docs, size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Gateway wedges silently mid-session after 2026.4.15 — only recovers on WhatsApp 408 + health monitor restart

2 participants