
[Bug]: WhatsApp gateway flapping — status 499 reconnect cascade with aggressive heartbeat timeout #60378

@GTSH-beyond-Gaming

Description

The WhatsApp gateway enters a recurring flapping loop where status 499 disconnects trigger a heartbeat-timeout → reconnect → 499 cascade that lasts 20–60 minutes before self-healing.

Environment

  • OS: Windows 11, x64
  • OpenClaw: v2026.4.3 (49936f6)
  • Node.js: v24.12.0
  • Connection: WhatsApp Multi-Device (Web)

Observed Pattern

Every ~60 seconds (sketched in code after this list):

  1. web-heartbeat detects a message timeout (minutesSinceLastMessage increments each cycle)
  2. The gateway forces a reconnect
  3. WhatsApp responds with status 499 (server-side disconnect)
  4. web-session treats creds.json as corrupted and restores it from .bak
  5. A new connectionId is assigned and the connection is briefly established
  6. Back to step 1
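
A minimal sketch of the cycle as it appears from the logs (TypeScript; every identifier is hypothetical, written from the log output above rather than from OpenClaw's actual source):

```ts
// loop-sketch.ts — hypothetical reconstruction of the flapping cycle.
import { randomUUID } from "node:crypto";

interface Session {
  connectionId: string;
  lastInboundAt: number; // epoch ms of the last inbound message
}

const MESSAGE_TIMEOUT_MS = 60_000; // observed: >60s of silence forces a reconnect

// Step 1: the heartbeat only looks at inbound-message age, so an
// idle-but-healthy connection still trips the timeout every cycle.
function heartbeatTick(session: Session): void {
  if (Date.now() - session.lastInboundAt >= MESSAGE_TIMEOUT_MS) {
    forceReconnect(session); // step 2
  }
}

function forceReconnect(session: Session): void {
  // Step 3: the rapid re-dial is answered with a status-499 server disconnect.
  // Step 4: the 499 close is then misread as credential corruption.
  restoreCredsFromBackup(); // creds.json <- creds.json.bak
  // Step 5: fresh connectionId, briefly connected, but lastInboundAt is
  // still stale, so the next heartbeat tick re-enters the loop (step 6).
  session.connectionId = randomUUID();
}

function restoreCredsFromBackup(): void {
  /* stub; the real restore path lives in web-session */
}
```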

Metrics from 2026-04-03 (32-minute flapping window: 15:58–16:30)

  • 62 status 499 disconnects
  • 42 creds.json restored from backup
  • 31 heartbeat timeout forced reconnects
  • 31 unique connectionIds generated
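
These numbers are internally consistent: 31 forced reconnects in a 32-minute window is one per ~60s heartbeat cycle; 62 status-499 closes over 31 cycles averages exactly two per cycle, consistent with each forced close being followed by one more 499 on the fast ~2s retry; and the 31 unique connectionIds confirm that every cycle minted a fresh session. The 42 creds restores (more than one per cycle) suggest the restore path sometimes fired on the retry as well.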

Context

  • Flapping started ~42 minutes after the last inbound message (idle-timeout trigger?)
  • creds.json is only 1.9 KB — no pre-key bloat
  • registered: False in creds.json after flapping (unclear if cause or effect)
  • Pattern occurs almost daily (observed 2026-03-30, 2x on 2026-04-02, 2026-04-03)

Log Excerpt

2026-04-03T16:10:16 | web-heartbeat: minutesSinceLastMessage: 42, forcing reconnect
2026-04-03T16:10:16 | WhatsApp Web connection closed (status 499). Retry 1/12 in 2.41s
2026-04-03T16:10:20 | restored corrupted WhatsApp creds.json from backup
2026-04-03T16:11:20 | web-heartbeat: minutesSinceLastMessage: 43, forcing reconnect  ← 60s later, loop continues

Root Cause Analysis

  1. Heartbeat timeout too aggressive: 60s silence → immediate forced reconnect. During idle periods (no inbound messages), this triggers unnecessarily.
  2. Reconnect backoff too short: ~2s initial retry is too fast when WhatsApp returns 499 repeatedly.
  3. Creds falsely flagged as corrupted: Every reconnect cycle restores creds.json from backup, even though the file is intact (1.9 KB, no bloat). This may destabilize the session further.
  4. Self-healing: after ~30 minutes the cascade stops, possibly because a server-side backoff window expires.

Suggested Improvements

  1. Increase the heartbeat message-timeout threshold to 3–5 minutes before forcing a reconnect (60s is too aggressive for idle connections)
  2. Exponential backoff on 499: instead of a flat 2s retry, back off exponentially (2s → 4s → 8s → …) specifically for status 499
  3. Don't treat creds as corrupted on 499: status 499 is a server disconnect, not a credential issue; skip the creds restore for this error code
  4. Separate idle-timeout from connection-failure: a growing minutesSinceLastMessage is normal during idle periods and should not trigger a reconnect by itself (all four suggestions are sketched in code after this list)
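
A minimal sketch of how suggestions 1–4 could fit together (TypeScript; the function and constant names are illustrative assumptions, not OpenClaw's actual API):

```ts
// reconnect-policy.ts — illustrative sketch only; all identifiers are hypothetical.

const IDLE_TIMEOUT_MS = 5 * 60_000; // suggestion 1: 3–5 min grace, not 60s
const BASE_RETRY_MS = 2_000;
const MAX_RETRY_MS = 5 * 60_000;

// Suggestion 2: back off exponentially on consecutive 499s (2s -> 4s -> 8s ...),
// capped so a long outage never produces unbounded delays.
function retryDelayMs(consecutive499s: number): number {
  return Math.min(BASE_RETRY_MS * 2 ** consecutive499s, MAX_RETRY_MS);
}

// Suggestion 3: a 499 is a server-side disconnect, not credential corruption;
// only fall back to creds.json.bak when the file actually fails to parse.
function shouldRestoreCreds(statusCode: number, credsParseOk: boolean): boolean {
  if (statusCode === 499) return false;
  return !credsParseOk;
}

// Suggestion 4: distinguish "idle" (no inbound traffic, socket healthy) from
// "dead" (socket unresponsive); only the latter justifies a forced reconnect.
function shouldForceReconnect(msSinceLastMessage: number, socketAlive: boolean): boolean {
  if (socketAlive) return false;                // idle is not a failure
  return msSinceLastMessage >= IDLE_TIMEOUT_MS; // dead and past the grace window
}
```

Under a policy like this the cascade would throttle itself quickly: by the fifth consecutive 499 the retry delay is already 64s, longer than an entire heartbeat cycle.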

Workaround

None required: the loop self-heals after 20–60 minutes, no messages are lost, and no user action is needed.
