BlueBubbles: replay missed webhook messages after gateway restart (cursor + fetchBlueBubblesHistory + processMessage) #66721

@omarshahine

Problem

When the OpenClaw gateway is down, wedged, or restarting, inbound BlueBubbles messages delivered during the outage window are permanently lost. The underlying iMessages are intact (they remain in Messages.app and in BB Server's DB), but the agent never sees them, never replies, and no recovery happens when the gateway comes back up. This is the BlueBubbles analog of #50093 (WhatsApp) and is partially related to #38307 (stale-socket restarts).

Validated by a controlled experiment (2026-04-14)

I stopped the gateway cleanly, sent three distinct test iMessages to a monitored handle, waited, then started the gateway — instrumenting both ~/Library/Logs/bluebubbles-server/main.log and ~/.openclaw/logs/gateway.log.

Timeline:

Time (local)  Event
11:05:14      Gateway stopped (openclaw gateway stop, pgrep-clean, healthz refused)
11:08:35      BB dispatches msg 1 → connect ECONNREFUSED 127.0.0.1:18789, no retry logged
11:08:53      BB dispatches msg 2 → ECONNREFUSED, no retry
11:09:15      BB dispatches msg 3 → ECONNREFUSED, no retry
11:09:56      Gateway start issued
11:10:29      Plugin bootstrap complete

BB-server log (edited, redacted):

[2026-04-14 11:08:35] [WebhookService] Failed to dispatch "new-message" event → connect ECONNREFUSED 127.0.0.1:18789
[2026-04-14 11:08:53] [WebhookService] Failed to dispatch "new-message" event → connect ECONNREFUSED 127.0.0.1:18789
[2026-04-14 11:09:15] [WebhookService] Failed to dispatch "new-message" event → connect ECONNREFUSED 127.0.0.1:18789
# ... nothing more about msgs 1/2/3 — BB never re-dispatches

Findings:

  1. BB Server's WebhookService is fire-and-forget on failure. Every Dispatching event with a failed POST is logged once and never retried, regardless of whether the failure is ECONNRESET (gateway wedged) or ECONNREFUSED (gateway stopped).
  2. BB Server's MessagePoller does NOT replay missed webhooks on webhook-receiver reconnection. After the gateway came back up and registered its webhook target, there were zero fresh Dispatching lines for msgs 1/2/3; the only new dispatches were for the replies the agent eventually sent. The ~1-week MessagePoller lookback that #66230's design relies on is driven by BB's own reconnection events (to Messages.app / APNs), not by webhook-target HTTP reachability.
  3. Without external recovery, all three messages would have been permanently lost from the agent's perspective.

Proof the proposed fix works: bb-catchup.sh (Lobster workspace)

My Lobster install has been running a workspace script (openclaw-agents/lobster/scripts/bb-catchup.sh) that implements exactly this proposal. It's been in production for ~4 weeks and recovered all three messages in the experiment above. Its design:

  • Cursor: ~/.openclaw/bb-last-seen-ms (epoch ms), updated after every successful replay pass.
  • Query: POST /api/v1/message/query?password=... with body {"limit":50,"sort":"ASC","after":<cursor>,"with":["chat","chat.participants","attachment"]}.
  • Filter: drop isFromMe, drop own-handle senders, drop pre-cursor messages (defense in depth).
  • Replay: wrap each message in {"type":"new-message","data":<message>} and POST to the gateway's BB webhook endpoint — same path BB itself uses, so processMessage() handles it identically.
  • Bounds: 2-hour max lookback, 50-message cap, 0.5s between POSTs.
  • Trigger: invoked from BOOT.md as boot task #1 on gateway startup.
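
The filter and replay-envelope steps in the list above can be sketched in TypeScript as follows. This is a minimal illustration, not the script's actual code: the BBMessage shape lists only the fields the sketch reads, and the field names (isFromMe, dateCreated, handle.address) as well as the helper names selectReplayable and wrapAsWebhook are assumptions made for the example.

```typescript
// Hypothetical message shape: only the fields this sketch reads.
// Field names are assumptions about the BB payload, not a verified schema.
interface BBMessage {
  guid: string;
  text?: string | null;
  isFromMe: boolean;
  dateCreated: number; // epoch ms
  handle?: { address: string } | null;
}

interface WebhookEnvelope {
  type: "new-message";
  data: BBMessage;
}

// Compare handles case-insensitively and ignore surrounding whitespace.
function normalizeHandle(address: string): string {
  return address.trim().toLowerCase();
}

// Keep only inbound messages that are not older than the cursor and were
// not sent from one of our own handles.
function selectReplayable(
  messages: BBMessage[],
  cursorMs: number,
  ownHandles: string[],
): BBMessage[] {
  const own = new Set(ownHandles.map(normalizeHandle));
  return messages.filter((m) => {
    if (m.isFromMe) return false; // our own sends
    if (m.dateCreated < cursorMs) return false; // pre-cursor (defense in depth)
    const sender = m.handle?.address;
    if (sender && own.has(normalizeHandle(sender))) return false; // self-handle
    return true;
  });
}

// Wrap a message in the same envelope a BB "new-message" webhook carries.
function wrapAsWebhook(message: BBMessage): WebhookEnvelope {
  return { type: "new-message", data: message };
}
```

A message whose timestamp equals the cursor is kept here; whether that boundary is inclusive in the real script is not stated, and the persistent dedupe would absorb a repeat either way.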

Experimental result from that script in the run above:

bb-catchup: found 3 missed message(s)
  replayed: [<chat>] from=<handle> text=dive test 1
  replayed: [<chat>] from=<handle> text=dive test 2
  replayed: [<chat>] from=<handle> text=dive test 3
bb-catchup: replayed=3 failed=0

The agent then produced inbound-session entries for all three, matching a clean webhook delivery. This shows the pattern is sound.

What I want to land upstream

Port the bb-catchup pattern into the BlueBubbles channel itself so every OpenClaw install gets message recovery for free, and the workspace script can be retired.

The BB extension already has all the primitives:

  • fetchBlueBubblesHistory(chatGuid, limit, opts) in extensions/bluebubbles/src/history.ts already speaks /api/v1/chat/{guid}/messages. For catchup we want the flat /api/v1/message/query?after=<ts> endpoint that bb-catchup.sh uses (cross-chat in a single call, server-side cursor filter) — this needs a small new helper fetchBlueBubblesMessagesSince(sinceMs, limit, opts) next to it.
  • processMessage in monitor-processing.ts is already the canonical inbound handler. The catchup path can call it directly with the normalized payload — no need for the HTTP re-POST hop bb-catchup.sh does (the re-POST only exists because the workspace script can't reach into the gateway process).
  • monitor-reply-cache.ts + the persistent inbound dedupe from #66230 already protect against double-processing if a BB webhook and a catchup replay of the same GUID both arrive.
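
As a rough sketch of what the new query helper might look like, the snippet below builds the POST request for /api/v1/message/query using the body that bb-catchup.sh sends. The names buildMessagesSinceRequest and fetchMessagesSinceSketch, the { baseUrl, password } options shape, and the assumption that the response wraps rows in a data array are all hypothetical; the real helper would reuse whatever options and URL-variant fallbacks fetchBlueBubblesHistory already has.

```typescript
// Hypothetical options shape; the real opts type in history.ts may differ.
interface QueryOpts {
  baseUrl: string;
  password: string;
}

interface MessageQueryRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Build the request separately from sending it so it can be unit-tested.
function buildMessagesSinceRequest(
  sinceMs: number,
  limit: number,
  opts: QueryOpts,
): MessageQueryRequest {
  // An absolute path replaces any path component already on baseUrl.
  const url = new URL("/api/v1/message/query", opts.baseUrl);
  url.searchParams.set("password", opts.password);
  return {
    url: url.toString(),
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        limit,
        sort: "ASC",
        after: sinceMs,
        with: ["chat", "chat.participants", "attachment"],
      }),
    },
  };
}

// Thin async wrapper. Assumes the server responds with { data: [...] }.
async function fetchMessagesSinceSketch(
  sinceMs: number,
  limit: number,
  opts: QueryOpts,
): Promise<unknown[]> {
  const { url, init } = buildMessagesSinceRequest(sinceMs, limit, opts);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`BB message query failed: HTTP ${res.status}`);
  const payload = (await res.json()) as { data?: unknown };
  return Array.isArray(payload.data) ? payload.data : [];
}
```

Splitting request construction from the network call keeps the cursor and limit handling testable without a running BB Server.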

Implementation plan

New files

  • extensions/bluebubbles/src/catchup.ts (~150 LoC)
    • fetchBlueBubblesMessagesSince(sinceMs, limit, opts) — POST /api/v1/message/query with after: sinceMs, sort: "ASC", with: ["chat","chat.participants","attachment"], bounded by limit; resilient to the same URL-variant fallbacks as fetchBlueBubblesHistory.
    • loadCursor(accountId) / saveCursor(accountId, ms) — file-backed state at ~/.openclaw/bluebubbles/catchup-cursor/<accountId>.json (matches the layout #66230 introduces for persistent dedupe). Atomic write via tmp+rename.
    • runBlueBubblesCatchup(account, deps) — orchestrator: loads cursor (fall back to now - 30min on first run), clamps lookback to MAX_AGE_MS (default 2h), calls the query helper, filters isFromMe and self-handles, normalizes each row through the same path webhook POSTs use (normalizeWebhookMessage etc.), and invokes processMessage(...) for each. Updates cursor on success.
  • extensions/bluebubbles/src/catchup.test.ts (~200 LoC)
    • Cursor persistence round-trip, first-run default, atomic-write survival across simulated crash mid-write.
    • Filter correctness: isFromMe, pre-cursor timestamp, self-handle address match.
    • Clamp math: MAX_AGE_MS boundary, identical timestamps, monotonic-clock skew.
    • End-to-end: stub the BB API, stub processMessage, assert call count and argument shape.
    • Interaction with #66230's inbound dedupe: replayed GUID already in dedupe file → processMessage called but early-exits.
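
The cursor persistence described above could be sketched like this. It is a minimal illustration under stated assumptions: the JSON field name lastSeenMs, the extra stateDir parameter (added so tests can point at a temp directory), and the temp-file naming are all hypothetical; only the path layout and the tmp+rename approach come from the plan.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const DEFAULT_STATE_DIR = path.join(os.homedir(), ".openclaw");

// <stateDir>/bluebubbles/catchup-cursor/<accountId>.json
function cursorPath(accountId: string, stateDir: string = DEFAULT_STATE_DIR): string {
  return path.join(stateDir, "bluebubbles", "catchup-cursor", `${accountId}.json`);
}

// Returns null when the file is missing, unreadable, or malformed, so the
// caller can fall back to its first-run lookback.
function loadCursor(accountId: string, stateDir: string = DEFAULT_STATE_DIR): number | null {
  try {
    const raw = fs.readFileSync(cursorPath(accountId, stateDir), "utf8");
    const parsed = JSON.parse(raw) as { lastSeenMs?: unknown };
    const value = parsed?.lastSeenMs;
    return typeof value === "number" && Number.isFinite(value) ? value : null;
  } catch {
    return null;
  }
}

// Write to a sibling temp file, then rename over the target. A rename within
// one directory replaces the file in a single step, so a crash mid-write
// leaves either the old cursor or the new one, never a truncated file.
function saveCursor(accountId: string, ms: number, stateDir: string = DEFAULT_STATE_DIR): void {
  const target = cursorPath(accountId, stateDir);
  fs.mkdirSync(path.dirname(target), { recursive: true });
  const tmp = `${target}.${process.pid}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify({ lastSeenMs: ms }));
  fs.renameSync(tmp, target);
}
```

Treating a corrupt file the same as a missing one means a damaged cursor degrades to the first-run window instead of blocking startup. A real implementation would also want to validate accountId before using it as a file name.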

Modified files

  • extensions/bluebubbles/src/monitor.ts — in registerBlueBubblesWebhookTarget, after successful route registration, fire-and-forget runBlueBubblesCatchup(account, deps) on a microtask. Log one INFO summary line per account: bluebubbles catchup: account=<id> replayed=N skipped=M window_ms=.... Errors are caught, logged at WARN, and never block the target registration.
  • extensions/bluebubbles/src/monitor-processing.ts — thread a new optional origin: "webhook" | "catchup" through processMessage so telemetry can distinguish replays. Default "webhook" preserves existing callers.
  • extensions/bluebubbles/src/config-schema.ts — add optional catchup block under the BB channel entry:
    catchup?: {
      enabled?: boolean;         // default true
      maxAgeMinutes?: number;    // default 120, hard cap 720
      perRunLimit?: number;      // default 50, hard cap 500
      firstRunLookbackMinutes?: number; // default 30
    }
  • CHANGELOG.md, under ## Unreleased > ### Fixes, add the bullet: "BlueBubbles: replay missed webhook messages after gateway restart via a persistent cursor and /api/v1/message/query?after=<ts> pass (fixes #66721)."
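
One way the optional catchup block could be resolved into effective values is sketched below, applying the defaults and hard caps from the schema comments. The resolveCatchupConfig and clampInt names, the lower bound of 1, and the choice to cap firstRunLookbackMinutes at the resolved maxAgeMinutes are assumptions for illustration, not part of the plan.

```typescript
interface CatchupConfig {
  enabled?: boolean;
  maxAgeMinutes?: number;
  perRunLimit?: number;
  firstRunLookbackMinutes?: number;
}

interface ResolvedCatchupConfig {
  enabled: boolean;
  maxAgeMs: number;
  perRunLimit: number;
  firstRunLookbackMs: number;
}

// Use the fallback for missing or non-finite input, otherwise floor the
// value and clamp it into [min, max].
function clampInt(value: number | undefined, fallback: number, min: number, max: number): number {
  if (typeof value !== "number" || !Number.isFinite(value)) return fallback;
  return Math.min(max, Math.max(min, Math.floor(value)));
}

function resolveCatchupConfig(cfg: CatchupConfig = {}): ResolvedCatchupConfig {
  const maxAgeMinutes = clampInt(cfg.maxAgeMinutes, 120, 1, 720); // default 120, hard cap 720
  // Assumption: a first-run lookback longer than the max age makes no sense,
  // so it is capped at the resolved max age.
  const firstRunMinutes = clampInt(cfg.firstRunLookbackMinutes, 30, 1, maxAgeMinutes);
  return {
    enabled: cfg.enabled ?? true, // default on
    maxAgeMs: maxAgeMinutes * 60_000,
    perRunLimit: clampInt(cfg.perRunLimit, 50, 1, 500), // default 50, hard cap 500
    firstRunLookbackMs: firstRunMinutes * 60_000,
  };
}
```

Resolving once at startup keeps the orchestrator free of default handling and makes the hard caps enforceable in one place.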

Safety / invariants

  • Default on, bounded. enabled: true out of the box because the downside of no-recovery is loud and user-visible; maxAgeMinutes and perRunLimit clamp the blast radius.
  • Never processes isFromMe — agent's own sends cannot be mistaken for inbound.
  • Cursor is persisted only on success. A failed run leaves the cursor at its previous value so the next run retries; the maxAgeMinutes clamp keeps the retried window from growing without bound.
  • Idempotent with #66230. If a webhook delivery and a catchup pass both surface the same GUID, the persistent dedupe drops the second. Catchup can therefore be aggressive without risk of double-reply.
  • No new network surface. Only existing BB REST endpoints (same as fetchBlueBubblesHistory and bb-catchup.sh).
  • No new inbound code path. Catchup goes through processMessage — the exact same handler webhooks already use.
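
The window clamp and the persist-only-on-success rule can be sketched together as below. This is an illustration under assumptions, not the planned runBlueBubblesCatchup: all I/O is injected through a hypothetical CatchupDeps interface and kept synchronous for brevity (the real calls would be async), and the policy of advancing the cursor only to the newest returned row when the page is full is one possible way to avoid skipping messages beyond perRunLimit.

```typescript
// Hypothetical row shape: only the fields this sketch reads.
interface Row {
  guid: string;
  dateCreated: number; // epoch ms
  isFromMe: boolean;
}

// Side effects are injected so the policy can be tested in isolation.
interface CatchupDeps {
  now: () => number;
  loadCursor: () => number | null;
  saveCursor: (ms: number) => void;
  query: (sinceMs: number, limit: number) => Row[];
  processMessage: (row: Row, origin: "catchup") => void;
}

interface CatchupLimits {
  maxAgeMs: number;
  firstRunLookbackMs: number;
  perRunLimit: number;
}

// Clamp the window start into [now - maxAgeMs, now]. The upper bound guards
// against a cursor written while the clock was ahead of the current time.
function computeWindowStart(nowMs: number, cursorMs: number | null, limits: CatchupLimits): number {
  const candidate = cursorMs ?? (nowMs - limits.firstRunLookbackMs);
  const oldestAllowed = nowMs - limits.maxAgeMs;
  return Math.min(nowMs, Math.max(oldestAllowed, candidate));
}

function runCatchupOnce(deps: CatchupDeps, limits: CatchupLimits) {
  const nowMs = deps.now();
  const sinceMs = computeWindowStart(nowMs, deps.loadCursor(), limits);
  // If query or processMessage throws, saveCursor below is never reached,
  // so the next run retries from the previous cursor.
  const rows = deps.query(sinceMs, limits.perRunLimit);
  let replayed = 0;
  let skipped = 0;
  for (const row of rows) {
    if (row.isFromMe) {
      skipped += 1;
      continue;
    }
    deps.processMessage(row, "catchup");
    replayed += 1;
  }
  // A full page may have more rows behind it: advance only to the newest
  // row seen. Otherwise the window is drained and the cursor moves to now.
  const pageFull = rows.length > 0 && rows.length >= limits.perRunLimit;
  deps.saveCursor(pageFull ? rows[rows.length - 1].dateCreated : nowMs);
  return { replayed, skipped, windowMs: nowMs - sinceMs };
}
```

The returned summary carries the same three numbers as the proposed INFO log line (replayed, skipped, window_ms).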

Test plan

  • Unit tests in catchup.test.ts as listed above (pass pnpm test extensions/bluebubbles/src/catchup.test.ts).
  • Full BB suite passes (pnpm test extensions/bluebubbles/).
  • pnpm check green.
  • Live repro on macOS using the same protocol as the 2026-04-14 experiment: stop gateway, send N messages, start gateway, assert: (a) processMessage called N times with origin: "catchup", (b) cursor file updated, (c) inbound dedupe file contains N new GUIDs, (d) re-running catchup is a no-op.
  • Regression: with #66230's dedupe active, send a message while gateway is up (webhook delivers normally), restart gateway, assert catchup sees it in the query window but processMessage early-exits on the dedupe hit, with no double reply.
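
The regression scenario in the last bullet can be modeled with a small stub. The sketch below stands in for processMessage and the persistent dedupe with an in-memory Set; makeDedupedProcessor and its fields are hypothetical names, and the real dedupe from #66230 is file-backed, so this only illustrates the expected call pattern.

```typescript
type Origin = "webhook" | "catchup";

interface Inbound {
  guid: string;
  text: string;
}

// Stub processor: records every call, but only hands a GUID to the reply
// handler the first time it is seen.
function makeDedupedProcessor(onHandled: (m: Inbound, origin: Origin) => void) {
  const seen = new Set<string>();
  const calls: Array<{ guid: string; origin: Origin }> = [];
  return {
    calls,
    // origin defaults to "webhook" so existing callers keep working.
    processMessage(m: Inbound, origin: Origin = "webhook"): boolean {
      calls.push({ guid: m.guid, origin });
      if (seen.has(m.guid)) return false; // dedupe hit: early exit, no reply
      seen.add(m.guid);
      onHandled(m, origin);
      return true;
    },
  };
}

// Scenario: g1 arrives by webhook while the gateway is up; after a restart,
// catchup finds g1 (already handled) and g2 (missed) in its query window.
const replies: string[] = [];
const processor = makeDedupedProcessor((m, origin) => replies.push(`${m.guid}:${origin}`));
processor.processMessage({ guid: "g1", text: "hello" });
for (const m of [
  { guid: "g1", text: "hello" },
  { guid: "g2", text: "missed" },
]) {
  processor.processMessage(m, "catchup");
}
```

In this run processMessage is invoked three times, but only two replies are produced: g1 from the webhook and g2 from catchup.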

Order of operations with #66230

#66230 (persistent inbound dedupe) is a prerequisite for catchup to be safe to turn on by default. Recommend landing #66230 first, then this issue's fix.

Retirement of workspace script

Once this ships in a released OpenClaw, openclaw-agents/lobster/scripts/bb-catchup.sh and its BOOT.md invocation should be removed. Keeping both would replay the same messages twice during the window at gateway startup when catchup runs. #66230's dedupe handles that correctly, but the duplicate pass is unnecessary overhead.
