feat(bluebubbles): replay missed webhook messages after gateway restart (#66721)#66857
feat(bluebubbles): replay missed webhook messages after gateway restart (#66721)#66857omarshahine merged 1 commit intomainfrom
Conversation
🔒 Aisle Security AnalysisWe found 4 potential security issue(s) in this PR:
1. 🟠 BlueBubbles API password embedded in URL query string (leaks via logs/redirects/telemetry)
Description
In this diff, the new catchup path ( Vulnerable code: if (params.password) {
url.searchParams.set("password", params.password);
}RecommendationAvoid putting secrets in URLs. Prefer an Example approach:
export function buildBlueBubblesApiUrl(params: { baseUrl: string; path: string }): string {
const normalized = normalizeBlueBubblesServerUrl(params.baseUrl);
const url = new URL(params.path, `${normalized}/`);
return url.toString();
}
const url = buildBlueBubblesApiUrl({ baseUrl, path: "/api/v1/message/query" });
const res = await blueBubblesFetchWithTimeout(
url,
{
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": `Bearer ${password}`, // or server-required scheme
},
body,
redirect: "manual", // optionally prevent leaking credentials on redirects
},
timeoutMs,
ssrfPolicy,
);If the upstream BlueBubbles API only supports query-string authentication, implement explicit redaction wherever URLs are logged/returned in errors, and consider setting 2. 🟠 Unbounded response buffering and JSON parsing in BlueBubbles catchup can cause memory/CPU DoS
Description
This creates an unbounded memory/CPU risk:
Vulnerable flow:
RecommendationEnforce a maximum response size and avoid unbounded buffering/parsing. Options (prefer streaming limits):
Example: enforce max bytes before parsing: const MAX_BYTES = 2 * 1024 * 1024; // 2MB
const len = Number(response.headers.get("content-length") ?? "0");
if (len && len > MAX_BYTES) throw new Error("Response too large");
const reader = response.body?.getReader();
let received = 0;
const chunks: Uint8Array[] = [];
while (reader) {
const { value, done } = await reader.read();
if (done) break;
received += value.byteLength;
if (received > MAX_BYTES) throw new Error("Response too large");
chunks.push(value);
}
const bodyBytes = chunks.length ? Buffer.concat(chunks) : null;
return new Response(bodyBytes, { status: response.status, headers: response.headers });Also ensure catchup handles oversize responses gracefully (log + treat as unresolved) rather than crashing. 3. 🟡 BlueBubbles password may leak to logs via unsanitized error strings (URL query contains password)
Description
This combination can leak credentials to logs because many network/HTTP errors (and wrapped errors) commonly include the full request URL in their message/stack. Relevant points:
Vulnerable log sites added/changed in this diff: error?.(`[${accountId}] BlueBubbles catchup: processMessage failed: ${String(err)}`);
runtime.error?.(`[${account.accountId}] BlueBubbles catchup: unexpected failure: ${String(err)}`);RecommendationAvoid ever placing secrets in URLs, and ensure all logged errors are scrubbed. Preferred fix (remove secret from URL):
Defense-in-depth (redact before logging): function redactBlueBubblesSecrets(text: string): string {
// redact password query param
return text.replace(/([?&]password=)[^&#\s]*/gi, "$1<redacted>");
}
// usage
error?.(
`[${accountId}] BlueBubbles catchup: processMessage failed: ${redactBlueBubblesSecrets(String(err))}`,
);Apply the same redaction to the 4. 🔵 Log injection via unsanitized BlueBubbles accountId in catchup/monitor logging
DescriptionThe BlueBubbles extension logs the configured Evidence that account IDs may contain filesystem-unsafe characters (and thus are not guaranteed to be strictly alphanumeric) exists in the new catchup tests and cursor-path logic:
Problematic logging patterns include:
Vulnerable code examples: error?.(`[${accountId}] BlueBubbles catchup: cannot resolve server account: ${String(err)}`);runtime.error?.(`[${account.accountId}] BlueBubbles catchup: unexpected failure: ${String(err)}`);RecommendationNeutralize untrusted values before writing them to logs (CWE-117), e.g. replace control characters and optionally cap length. Example: function sanitizeForLog(value: string): string {
// Replace CR/LF/TAB and other ASCII control chars.
return value.replace(/[\u0000-\u001F\u007F]/g, "_").slice(0, 128);
}
const accountTag = sanitizeForLog(accountId);
error?.(`[${accountTag}] BlueBubbles catchup: cannot resolve server account: ${String(err)}`);Apply the same sanitization anywhere Analyzed PR: #66857 at commit Last updated on: 2026-04-15T02:22:20Z |
Greptile SummaryAdds a per-account startup catchup pass to the BlueBubbles channel that queries Confidence Score: 5/5Safe to merge; all remaining findings are P2 style/hardening suggestions that do not affect correctness. No P0 or P1 issues found. The cursor logic, security surface (processMessage path unchanged, isFromMe double-filter, dedupe integration), and edge-case handling are well-implemented and well-tested. Two P2 observations: a changelog attribution capitalization nit, and a design note that a persistently-failing message permanently wedges the catchup cursor — a known, documented tradeoff that doesn't introduce incorrect behavior on the normal path. No files require special attention. Prompt To Fix All With AIThis is a comment left during a code review.
Path: CHANGELOG.md
Line: 15
Comment:
**Changelog attribution capitalization**
The entry uses `thanks @omarshahine` (lowercase `t`) while every other attributed entry in this file uses `Thanks @author` (uppercase). Per the repo convention in CLAUDE.md: `prefer \`Thanks @author\``.
```suggestion
- BlueBubbles/webhook-catchup: replay missed webhook messages after gateway restart via a persistent per-account cursor and `/api/v1/message/query?after=<ts>` pass, so messages delivered while the gateway was down no longer disappear. Uses the existing `processMessage` path and is deduped by #66816's inbound GUID cache. (#66721) Thanks @omarshahine
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: extensions/bluebubbles/src/catchup.ts
Line: 338-345
Comment:
**Poison message permanently wedges cursor**
If `processMessage` throws consistently for a specific message (e.g., a malformed payload that never normalizes correctly), the cursor is pinned at `failureTs - 1` indefinitely. Every subsequent gateway startup re-fetches and retries the same message, emitting an error log each time with no backoff or escape hatch.
The PR's design rationale ("never lose a message") is sound, but an operator with a stuck message would have no automatic relief short of manually advancing or deleting the cursor file. A simple safeguard — such as logging the affected GUID prominently or noting the cursor file path in the error message — would reduce the time-to-diagnose when this situation occurs.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "feat(bluebubbles): replay missed webhook..." | Re-trigger Greptile |
e750e70 to
e33dd80
Compare
Review responsePushed Fixed in this push
Deferred to separate issues
Tests
|
7a60224 to
9d1754c
Compare
…rt (#66721) Adds an in-process startup catchup pass to the BlueBubbles channel that queries BB Server for messages delivered since a persisted per-account cursor and re-feeds each through the existing processMessage pipeline. Fixes the missed-message hole documented in #66721: BB's WebhookService is fire-and-forget on POST failure (no retries), and BB's MessagePoller only re-fires webhooks on BB-side reconnection events (Messages.app / APNs), not on webhook-receiver recovery. So inbound messages delivered while the gateway was down, restarting, or wedged were permanently lost. Design - New extensions/bluebubbles/src/catchup.ts with fetchBlueBubblesMessagesSince (POSTs /api/v1/message/query with {after, sort:"ASC", with:[chat, chat.participants, attachment]}), load/saveBlueBubblesCatchupCursor (file-backed {lastSeenMs, updatedAt} per account under <stateDir>/bluebubbles/catchup/<accountId>__<hash>.json using the plugin-sdk's atomic JSON helpers, same state-dir root as inbound-dedupe via the canonical SDK resolver, and resolvePreferredOpenClawTmpDir for test isolation to satisfy the messaging-tmpdir and temp-path-guard lints), and runBlueBubblesCatchup orchestrator. - monitor.ts: fire catchup as a background task after the webhook target registers; errors are logged but never block the channel-ready signal. - config-schema.ts: new optional `catchup` block (enabled, maxAgeMinutes, perRunLimit, firstRunLookbackMinutes); defaults on with 2h lookback / 50 msg cap / 30-min first-run lookback. - accounts.ts: adds `catchup` to nestedObjectKeys so per-account overrides deep-merge on top of channel-level defaults (mirroring the existing `network` precedent). Safety - Goes through the same processMessage path webhooks use, so auth, allowlist, pairing, and downstream agent dispatch apply unchanged. - Dedupes against #66816's persistent inbound GUID cache. - Never dispatches isFromMe records (checked before and after normalization). - Runs once per gateway startup and does NOT skip on rapid restarts - skipping would permanently lose any messages that arrived during the brief downtime between two startups. - Cursor advances to nowMs on full success, held at min(earliestFailureTs - 1, previousCursor) on any processMessage failure so retries pick up exactly the failed records, or at latestFetchedTs on truncation (fetchedCount === perRunLimit) so the next gateway startup picks up the unfetched tail. - Future-dated cursor (NTP rollback, manual clock adjust) treated as unusable and recovered via firstRunLookback; cursor is repaired at end of run. - First-run lookback clamped to the maxAge ceiling. - Hard ceilings: 12h max lookback, 500 messages per run. - Loud WARNING on perRunLimit truncation pointing at the config knob to raise. Why this approach The fix mirrors a workspace-level shell script that's been running on a real OpenClaw install for ~4 weeks (~100 LoC of bash + python doing the same query/filter/POST flow). Porting it into the BB channel itself means every install gets recovery for free, calls processMessage directly (no re-POST hop), and benefits from #66816's persistent dedupe automatically. Validation - 21 scoped tests in extensions/bluebubbles/src/catchup.test.ts. - Full BB suite 410/410. - pnpm check green. - src/security/temp-path-guard.test.ts and lint:tmp:no-random-messaging both pass (use resolvePreferredOpenClawTmpDir + string concatenation instead of os.tmpdir + template literal). - Live E2E on macOS 26.3 / BB Server 1.9.x: 3/3 messages replayed. Closes #66721. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
9d1754c to
3d225e7
Compare
…rt (openclaw#66857) Adds an in-process startup catchup pass to the BlueBubbles channel that queries BB Server for messages delivered since a persisted per-account cursor and re-feeds each through the existing processMessage pipeline. Fixes the missed-message hole documented in openclaw#66721: BB's WebhookService is fire-and-forget on POST failure, and MessagePoller only re-fires webhooks on BB-side reconnection events, not on webhook-receiver recovery. - New extensions/bluebubbles/src/catchup.ts with singleflight per accountId, cursor persistence via the canonical state-paths resolver, bounded query (perRunLimit + maxAgeMinutes), failure-held cursor, truncation-aware page-boundary advancement, future-cursor recovery, isFromMe filter (pre- and post-normalization). - monitor.ts fires catchup as a background task after the webhook target registers. - config-schema.ts adds optional catchup block; accounts.ts adds catchup to nestedObjectKeys for deep-merge per-account overrides. - Dedupes against openclaw#66816's persistent inbound GUID cache. - 22 scoped tests; full BB suite 411/411; pnpm check green; live E2E on macOS 26.3 / BB Server 1.9.x recovered 3/3 missed messages. Closes openclaw#66721. Co-authored-by: Omar Shahine <omar@shahine.com>
…rt (openclaw#66857) Adds an in-process startup catchup pass to the BlueBubbles channel that queries BB Server for messages delivered since a persisted per-account cursor and re-feeds each through the existing processMessage pipeline. Fixes the missed-message hole documented in openclaw#66721: BB's WebhookService is fire-and-forget on POST failure, and MessagePoller only re-fires webhooks on BB-side reconnection events, not on webhook-receiver recovery. - New extensions/bluebubbles/src/catchup.ts with singleflight per accountId, cursor persistence via the canonical state-paths resolver, bounded query (perRunLimit + maxAgeMinutes), failure-held cursor, truncation-aware page-boundary advancement, future-cursor recovery, isFromMe filter (pre- and post-normalization). - monitor.ts fires catchup as a background task after the webhook target registers. - config-schema.ts adds optional catchup block; accounts.ts adds catchup to nestedObjectKeys for deep-merge per-account overrides. - Dedupes against openclaw#66816's persistent inbound GUID cache. - 22 scoped tests; full BB suite 411/411; pnpm check green; live E2E on macOS 26.3 / BB Server 1.9.x recovered 3/3 missed messages. Closes openclaw#66721. Co-authored-by: Omar Shahine <omar@shahine.com>
…rt (openclaw#66857) Adds an in-process startup catchup pass to the BlueBubbles channel that queries BB Server for messages delivered since a persisted per-account cursor and re-feeds each through the existing processMessage pipeline. Fixes the missed-message hole documented in openclaw#66721: BB's WebhookService is fire-and-forget on POST failure, and MessagePoller only re-fires webhooks on BB-side reconnection events, not on webhook-receiver recovery. - New extensions/bluebubbles/src/catchup.ts with singleflight per accountId, cursor persistence via the canonical state-paths resolver, bounded query (perRunLimit + maxAgeMinutes), failure-held cursor, truncation-aware page-boundary advancement, future-cursor recovery, isFromMe filter (pre- and post-normalization). - monitor.ts fires catchup as a background task after the webhook target registers. - config-schema.ts adds optional catchup block; accounts.ts adds catchup to nestedObjectKeys for deep-merge per-account overrides. - Dedupes against openclaw#66816's persistent inbound GUID cache. - 22 scoped tests; full BB suite 411/411; pnpm check green; live E2E on macOS 26.3 / BB Server 1.9.x recovered 3/3 missed messages. Closes openclaw#66721. Co-authored-by: Omar Shahine <omar@shahine.com>
…rt (openclaw#66857) Adds an in-process startup catchup pass to the BlueBubbles channel that queries BB Server for messages delivered since a persisted per-account cursor and re-feeds each through the existing processMessage pipeline. Fixes the missed-message hole documented in openclaw#66721: BB's WebhookService is fire-and-forget on POST failure, and MessagePoller only re-fires webhooks on BB-side reconnection events, not on webhook-receiver recovery. - New extensions/bluebubbles/src/catchup.ts with singleflight per accountId, cursor persistence via the canonical state-paths resolver, bounded query (perRunLimit + maxAgeMinutes), failure-held cursor, truncation-aware page-boundary advancement, future-cursor recovery, isFromMe filter (pre- and post-normalization). - monitor.ts fires catchup as a background task after the webhook target registers. - config-schema.ts adds optional catchup block; accounts.ts adds catchup to nestedObjectKeys for deep-merge per-account overrides. - Dedupes against openclaw#66816's persistent inbound GUID cache. - 22 scoped tests; full BB suite 411/411; pnpm check green; live E2E on macOS 26.3 / BB Server 1.9.x recovered 3/3 missed messages. Closes openclaw#66721. Co-authored-by: Omar Shahine <omar@shahine.com>
Summary
Fixes #66721. Adds an in-process startup catchup pass to the BlueBubbles channel that queries BB Server for messages delivered while the gateway was unreachable and replays them through the existing
processMessagepipeline.The hole this closes: BB Server's
WebhookServiceis fire-and-forget on POST failure (no retries) and BB'sMessagePolleronly re-fires webhooks on BB-side reconnection events (Messages.app / APNs), not on webhook-receiver recovery. Messages delivered while the gateway was down, restarting, or wedged were permanently lost — verified with a controlled experiment on macOS.This PR supersedes #66853 (which was the stacked follow-up to #66760 / dedupe PR #66816). Same diff, collapsed to a single commit for cleaner review. History of review feedback is preserved in the superseded PR trail; all P1 and P2 findings from Greptile / Codex / Aisle were addressed in-branch before this squash.
Design
extensions/bluebubbles/src/catchup.ts:fetchBlueBubblesMessagesSince(sinceMs, limit, opts)calls/api/v1/message/querywith{after, sort:"ASC", with:["chat","chat.participants","attachment"]}so replays carry the same shapenormalizeWebhookMessagealready handles on live dispatch.loadBlueBubblesCatchupCursor/saveBlueBubblesCatchupCursorpersist a single{lastSeenMs, updatedAt}per account under<stateDir>/bluebubbles/catchup/<accountId>__<hash>.json, using the plugin-sdk's atomic JSON helpers. File layout mirrors the inbound-dedupe store from fix(bluebubbles): dedupe inbound webhooks across restarts (#19176, #12053) #66816, and the resolver is the canonicalopenclaw/plugin-sdk/state-paths.resolveStateDir(same helper dedupe uses) so the two stores share a single root.runBlueBubblesCatchup(target)orchestrates: clamp config, fetch, filterisFromMeand pre-cursor records, dispatch toprocessMessage, advance cursor.monitor.ts: fire catchup as a background task after webhook target registers; errors are logged but never block the channel-ready signal.config-schema.ts: new optionalcatchupblock (enabled,maxAgeMinutes,perRunLimit,firstRunLookbackMinutes); defaults on with 2h lookback / 50 msg cap / 30-min first-run lookback.accounts.ts: addscatchupto the account-mergenestedObjectKeyslist so per-account overrides deep-merge on top of channel-level defaults, mirroring the existingnetworkprecedent.Why this approach
The fix mirrors a workspace-level shell script that's been running on a real OpenClaw install for ~4 weeks (~100 LoC of bash + python doing the same query/filter/POST flow). Porting it into the BB channel itself means every install gets recovery for free, calls
processMessagedirectly (no re-POST hop), and benefits from #66816's persistent dedupe automatically.Safety
processMessagepath webhooks use, so auth, allowlist, pairing, and downstream agent dispatch all apply unchanged.isFromMerecords (double-checked before and after normalization) so the agent's own sends cannot enter the inbound path.nowMson fully-successful runs. OnprocessMessagefailure, the cursor is held just before the earliest failure timestamp so the next run retries from there. On truncation (fetchedCount === perRunLimit), the cursor advances only to the last-fetched timestamp so the next gateway startup picks up the unfetched tail.maxAgeMinutes: 5, firstRunLookbackMinutes: 30cannot exceed the operator's stated cap.fetchedCounthitsperRunLimitso operators know a single startup didn't drain the full backlog.Validation
Automated
extensions/bluebubbles/src/catchup.test.ts(21 cases): cursor round-trip, per-account scoping, filesystem-unsafe account IDs, firstRunLookback default and maxAge clamp,enabled: false, rapid-restart-still-runs,isFromMefilter (pre- and post-normalization), query-failure-preserves-cursor, per-message failure isolation, held-cursor-on-retryable-failure, clamp-to-prior-cursor, future-cursor recovery, pre-cursor defense-in-depth, perRunLimit warn / no-warn, and truncation-cursor advances only to page boundary.pnpm checkgreen (madge, tsgo, oxlint, webhook-auth-body-order, no-pairing-store-group, pairing-account-scope).Live end-to-end (macOS, BB Server 1.9.x, 2026-04-14)
Repeating the original repro from #66721's issue body with the new in-process catchup:
connect ECONNREFUSED 127.0.0.1:18789and never retried.~/.openclaw/bluebubbles/catchup/<accountId>__<hash>.json. Subsequent gateway restart with no new inbound activity loggedreplayed=0 fetched=0(no-op).Test plan
pnpm test extensions/bluebubbles/src/catchup.test.ts— 21/21pnpm test extensions/bluebubbles/— 410/410pnpm check— greenHistory (for reviewer context)
This PR carries ~11 hours of iterative bot review that happened on the prior PRs (#66760 → #66853). Squashing here for clean review; the findings addressed were:
catchupoverrides at account level