
fix: stop treating idle WhatsApp sessions as stale sockets #47513

Closed
jeffrey4341 wants to merge 2 commits into openclaw:main from jeffrey4341:fix/whatsapp-idle-stale-socket

Conversation

@jeffrey4341

Summary

  • stop the gateway health monitor from restarting connected WhatsApp sessions just because the inbox stays quiet
  • disable the default WhatsApp "no inbound messages in 30m" watchdog, which makes the same invalid assumption after the first inbound message
  • keep the explicit watchdog path available via tuning for tests / diagnostics and add regression coverage for both layers

Fixes #34155.
Helps with #46372 by preventing the false restart loop that was dropping outbound replies during unnecessary reconnects.

Root cause

lastEventAt for WhatsApp currently tracks inbound message flow, which is not a trustworthy socket-liveness signal. On low-traffic accounts that means:

  1. the gateway health monitor sees connected=true + lastEventAt older than 30m
  2. it classifies the channel as stale-socket
  3. it restarts the provider every ~35 minutes (30m threshold + 5m monitor interval)
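The cadence in the steps above can be reproduced with a small simulation. This is only a sketch: the `ChannelSnapshot` shape, the strict greater-than comparison, and the assumption that a restart refreshes the timestamp are all hypothetical and are not taken from the gateway source.

```typescript
// Hypothetical snapshot shape; the real gateway type may differ.
interface ChannelSnapshot {
  connected: boolean;
  lastEventAt: number | null; // ms timestamp of the last inbound message
}

const STALE_THRESHOLD_MS = 30 * 60 * 1000; // 30m threshold from the PR text
const MONITOR_INTERVAL_MS = 5 * 60 * 1000; // 5m monitor interval from the PR text

// The flawed inference: connected, but no inbound event for longer than the threshold.
function looksStale(snapshot: ChannelSnapshot, now: number): boolean {
  if (!snapshot.connected || snapshot.lastEventAt === null) return false;
  return now - snapshot.lastEventAt > STALE_THRESHOLD_MS;
}

// Simulate a healthy but completely idle account and record each restart time.
function simulateIdleRestarts(durationMs: number): number[] {
  const restarts: number[] = [];
  let snapshot: ChannelSnapshot = { connected: true, lastEventAt: 0 };
  for (let now = MONITOR_INTERVAL_MS; now <= durationMs; now += MONITOR_INTERVAL_MS) {
    if (looksStale(snapshot, now)) {
      restarts.push(now);
      // Assumption: a restart refreshes the timestamp, so the clock starts over.
      snapshot = { connected: true, lastEventAt: now };
    }
  }
  return restarts;
}

// Two idle hours produce restarts at minutes 35, 70 and 105.
const restartMinutes = simulateIdleRestarts(120 * 60 * 1000).map((t) => t / 60_000);
console.log(restartMinutes);
```

Under these assumptions the first check that can exceed a 30-minute threshold on a 5-minute grid is the one at 35 minutes, which matches the observed restart period.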

The local gateway logs from the affected deployment showed repeated health-monitor: restarting (reason: stale-socket) entries with no corresponding listener-side close / disconnect errors, which points to an idle false positive rather than a real dead-socket event.

The WhatsApp monitor also had a second false-positive path: after the first inbound message, a quiet inbox for 30 minutes could trigger the internal watchdog even if the socket was otherwise healthy.

Why this approach

A quiet WhatsApp inbox is normal. Until the channel exposes a real liveness proof, treating missing inbound messages as proof of socket death is worse than the failure mode it tries to catch: it creates deterministic restart churn and can drop outbound replies during the restart window.

This change takes the conservative route:

  • gateway layer: skip stale-socket inference for WhatsApp idle periods
  • channel layer: make the no-message watchdog opt-in instead of always-on by default

That preserves real close/disconnect handling and reconnect logic, while removing the deterministic false-positive restart loop.
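A minimal sketch of the gateway-layer skip, assuming a simplified policy input. The names `HealthPolicyInput`, `HealthVerdict`, `evaluateChannelHealth`, and the `"disconnected"` verdict are illustrative and not the actual contents of channel-health-policy.ts; the sketch also uses a set of channel ids where the PR adds a direct `channelId !== "whatsapp"` comparison.

```typescript
// Hypothetical input and verdict types for illustration only.
interface HealthPolicyInput {
  channelId: string;
  connected: boolean;
  lastEventAt: number | null; // ms timestamp of the last inbound event
  now: number;
}

type HealthVerdict = "healthy" | "stale-socket" | "disconnected";

const STALE_EVENT_THRESHOLD_MS = 30 * 60 * 1000;

// Channels where lastEventAt reflects inbound traffic, not socket liveness.
const IDLE_TOLERANT_CHANNELS = new Set<string>(["telegram", "whatsapp"]);

function evaluateChannelHealth(input: HealthPolicyInput): HealthVerdict {
  // Real disconnects are still reported; only the idle heuristic is skipped.
  if (!input.connected) return "disconnected";

  const skipStaleInference = IDLE_TOLERANT_CHANNELS.has(input.channelId);
  const idleTooLong =
    input.lastEventAt !== null &&
    input.now - input.lastEventAt > STALE_EVENT_THRESHOLD_MS;

  return !skipStaleInference && idleTooLong ? "stale-socket" : "healthy";
}
```

With this shape, a connected WhatsApp channel that has been quiet for hours is reported as healthy, while a channel outside the exclusion set keeps the existing stale-socket behavior.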

Test plan

  • corepack pnpm exec vitest run src/gateway/channel-health-policy.test.ts src/gateway/channel-health-monitor.test.ts
  • corepack pnpm exec vitest run --config <temp config> extensions/whatsapp/src/auto-reply.web-auto-reply.connection-and-logging.e2e.test.ts
  • PATH="/tmp/openclaw-pr/node_modules/.bin:$PATH" corepack pnpm check

@greptile-apps
Contributor

greptile-apps Bot commented Mar 15, 2026

Greptile Summary

This PR fixes a deterministic false-positive reconnect loop affecting low-traffic WhatsApp accounts by removing two places where "inbox quiet" was incorrectly treated as "socket dead." The fix is conservative and well-scoped: it skips the stale-socket heuristic for WhatsApp at the gateway layer and makes the channel-layer no-message watchdog opt-in (defaulting to Number.POSITIVE_INFINITY instead of 30 minutes). Real disconnect/close handling is untouched. Both layers receive regression test coverage.

  • channel-health-policy.ts: Added policy.channelId !== "whatsapp" alongside the existing Telegram exclusion; the updated comment correctly documents that lastEventAt tracks inbound message flow, not socket liveness.
  • monitor.ts: MESSAGE_TIMEOUT_MS default changed from 30 * 60 * 1000 to Number.POSITIVE_INFINITY; the watchdog interval is still created but always short-circuits when the timeout is Infinity (minor overhead). The > 30 condition that gates minutesSinceLastMessage in the heartbeat logData is a leftover from the now-removed warn branch and is slightly inconsistent with the new "idle is normal" semantics — see inline comment.
  • Tests: New unit tests in both policy/monitor test files, and a new e2e test that drives ~60 000 fake-timer watchdog ticks with no reconnect, correctly exercising the new default.
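The opt-in watchdog described for monitor.ts can be sketched as below. The `WatchdogTuning` shape and the `shouldForceReconnect` helper are assumptions made for illustration; only the `Number.POSITIVE_INFINITY` default and the previous 30-minute value come from the review text.

```typescript
// Hypothetical tuning object; the real tuning surface may be shaped differently.
interface WatchdogTuning {
  messageTimeoutMs?: number;
}

// Default from the PR: the no-message watchdog is disabled unless tuned.
const DEFAULT_MESSAGE_TIMEOUT_MS = Number.POSITIVE_INFINITY;

function shouldForceReconnect(
  lastInboundAt: number | null,
  now: number,
  tuning: WatchdogTuning = {},
): boolean {
  const timeoutMs = tuning.messageTimeoutMs ?? DEFAULT_MESSAGE_TIMEOUT_MS;
  // An infinite timeout short-circuits every tick: idle is treated as normal.
  if (!Number.isFinite(timeoutMs)) return false;
  // The watchdog only arms after the first inbound message.
  if (lastInboundAt === null) return false;
  return now - lastInboundAt > timeoutMs;
}

const DAY_MS = 24 * 60 * 60 * 1000;
// Default tuning: a full idle day does not trigger a reconnect.
console.log(shouldForceReconnect(0, DAY_MS));
// Explicit opt-in (for tests or diagnostics) restores the 30-minute behavior.
console.log(shouldForceReconnect(0, 31 * 60 * 1000, { messageTimeoutMs: 30 * 60 * 1000 }));
```

Keeping the check inside the tick rather than skipping timer creation mirrors the "interval is still created but always short-circuits" observation above, at the cost of a small amount of idle work.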

Confidence Score: 4/5

  • Safe to merge; the fix is narrowly scoped and real disconnect/close handling is preserved.
  • The root cause analysis is sound and the two-layer fix is correct. No logic bugs found. The only nit is a leftover > 30 threshold in heartbeat logData that was tied to the removed warn-level log — it doesn't affect runtime behavior, just log observability.
  • No files require special attention; the style suggestion in extensions/whatsapp/src/auto-reply/monitor.ts is optional cleanup.

Comments Outside Diff (1)

  1. extensions/whatsapp/src/auto-reply/monitor.ts, lines 285-288

    Leftover > 30 threshold in logData no longer meaningful

    The minutesSinceLastMessage field is only injected into the structured log entry when it exceeds 30 minutes. That threshold was purpose-built to accompany the warn-level log that this PR removes. Now that long idle periods are expected and normal, the > 30 guard suppresses the field from the heartbeat log for "somewhat quiet" sessions (0–30 m silence) while surfacing it only for "very quiet" ones (> 30 m). Given the semantic shift — idle is now routine — it would be more consistent to include the field whenever a message has been received at all (i.e. when minutesSinceLastMessage is non-null), so operators always get the timing signal.

```suggestion
        ...(minutesSinceLastMessage !== null
          ? { minutesSinceLastMessage }
          : {}),
```


Last reviewed commit: 5709858

@jeffrey4341
Author

Updated this PR branch to current origin/main and pushed a fresh head to rerun CI on top of the latest baseline.

Local verification before push:

  • npx vitest run extensions/telegram/src/bot-message-context.topic-agentid.test.ts
  • npx vitest run src/gateway/channel-health-policy.test.ts src/gateway/channel-health-monitor.test.ts

This matters because the prior failing test:channels run was still red on an older base due to a Telegram topic-agentId failure outside the WhatsApp stale-socket scope. That test now passes locally on the refreshed branch.

I’ll watch the rerun and only keep patching if the updated branch still shows failures that look specific to this PR.

@mcaxtr
Member

mcaxtr commented Apr 3, 2026

Thanks for the PR and for working on this. We checked the current main branch, and this fix is already in main via #60007, which landed in commit ff62705.

The WhatsApp watchdog now resets lastInboundAt to null on each reconnect and uses the connection start time as the baseline, so quiet channels no longer enter a tight reconnect loop from stale message timestamps carried across connection runs.
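The behavior described here can be sketched as follows. This is not the code from #60007: the `ConnectionRun` shape and the helper names are hypothetical, and the sketch only illustrates why resetting `lastInboundAt` and falling back to the connection start time avoids a tight loop.

```typescript
// Hypothetical per-connection state; names are illustrative only.
interface ConnectionRun {
  startedAt: number; // ms timestamp when this connection run began
  lastInboundAt: number | null; // reset to null on every reconnect
}

function beginConnectionRun(now: number): ConnectionRun {
  return { startedAt: now, lastInboundAt: null };
}

function recordInbound(run: ConnectionRun, now: number): ConnectionRun {
  return { ...run, lastInboundAt: now };
}

// Baseline is the last inbound message in THIS run, else the run's start time.
function isWatchdogExpired(run: ConnectionRun, now: number, timeoutMs: number): boolean {
  const baseline = run.lastInboundAt ?? run.startedAt;
  return now - baseline > timeoutMs;
}

const MINUTE_MS = 60 * 1000;
const TIMEOUT_MS = 30 * MINUTE_MS;

// First run receives one message at t=0, then goes quiet and expires at 31m.
const firstRun = recordInbound(beginConnectionRun(0), 0);
console.log(isWatchdogExpired(firstRun, 31 * MINUTE_MS, TIMEOUT_MS));

// After reconnecting at 31m the stale timestamp is gone, so the next tick is fine.
const secondRun = beginConnectionRun(31 * MINUTE_MS);
console.log(isWatchdogExpired(secondRun, 32 * MINUTE_MS, TIMEOUT_MS));
```

If the old timestamp were carried into the second run, the check at 32 minutes would fire again immediately, which is the tight reconnect loop the comment refers to.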

Because that has landed, I'm closing this PR as superseded by #60007.

Thanks again for the work here. If you think this closure is mistaken and your PR still fixes something meaningfully different on current main, feel free to open a new PR with that explanation.


Labels

channel: whatsapp-web (Channel integration: whatsapp-web), gateway (Gateway runtime), size: S


Development

Successfully merging this pull request may close these issues.

WhatsApp provider: stale-socket every ~35 minutes on idle connections (keepalive regression)

2 participants