feat(bluebubbles): replay missed webhook messages after gateway restart (#66721)#66760
feat(bluebubbles): replay missed webhook messages after gateway restart (#66721)#66760omarshahine wants to merge 9 commits intolobster/bb-inbound-dedupefrom
Conversation
…2053) BlueBubbles MessagePoller replays its ~1-week lookback window as new-message webhooks after BB Server restart or reconnect. Without persistent dedup, the gateway re-replies to messages it already handled before the restart. Add a persistent file-backed GUID dedupe (TTL=7d, matching BB's lookback window) at the top of processMessage, using the same createClaimableDedupe SDK primitive as Feishu. The on-disk store at ~/.openclaw/bluebubbles/inbound-dedupe/<account>.json survives gateway restarts. Claim/finalize/release semantics ensure transient delivery failures release the GUID so a later replay can retry, while successful deliveries are committed and block future replays. Fixes #19176, #12053.
🔒 Aisle Security AnalysisWe found 3 potential security issue(s) in this PR:
1. 🟡 BlueBubbles catchup sends API password in URL query string
Description
Putting credentials in the URL is sensitive because:
Vulnerable code: const url = buildBlueBubblesApiUrl({
baseUrl: opts.baseUrl,
path: "/api/v1/message/query",
password: opts.password,
});This newly added catchup path increases the surface area where the password-in-URL pattern is used (additional endpoint and new network call on startup). RecommendationAvoid placing secrets in URLs. Prefer an HTTP header (e.g., Example (using a header instead of a query param): const url = buildBlueBubblesApiUrl({
baseUrl: opts.baseUrl,
path: "/api/v1/message/query",
});
const res = await blueBubblesFetchWithTimeout(
url,
{
method: "POST",
headers: {
"Content-Type": "application/json",
// Prefer a header; exact scheme depends on BB server support
"Authorization": `Bearer ${opts.password}`,
},
body,
},
opts.timeoutMs ?? FETCH_TIMEOUT_MS,
ssrfPolicy,
);If BlueBubbles only supports
2. 🟡 Insecure test-state directory under /tmp allows symlink/hardlink overwrite of persisted JSON state
DescriptionThe BlueBubbles extension switches its persistence root to a predictable per-PID directory under If this code path is reachable in a non-test deployment (e.g., environment misconfiguration, untrusted env injection, or accidentally shipping with
This enables link-following writes outside the intended state directory, potentially overwriting arbitrary files writable by the gateway process. Vulnerable code (state dir selection): if (env.VITEST || env.NODE_ENV === "test") {
const name = "openclaw-vitest-" + process.pid;
return path.join(os.tmpdir(), name);
}Affected persistence sinks include:
RecommendationAvoid predictable, attacker-precreatable directories under Options (prefer A): A. Always require an explicit per-test state dir override (e.g., B. Use C. Add symlink defenses before writing: resolve Example using import fs from "node:fs/promises";
import os from "node:os";
import path from "node:path";
let testStateDir: string | null = null;
async function resolveStateDirFromEnv(env: NodeJS.ProcessEnv = process.env): Promise<string> {
if (env.OPENCLAW_STATE_DIR?.trim()) return resolveStateDir(env);
if (env.VITEST || env.NODE_ENV === "test") {
if (!testStateDir) {
testStateDir = await fs.mkdtemp(path.join(os.tmpdir(), "openclaw-vitest-"));
}
return testStateDir;
}
return resolveStateDir(env);
}Additionally, consider hard-failing startup if 3. 🟡 Local file overwrite via symlinked state subdirectories when persisting BlueBubbles cursor/dedupe JSON
DescriptionThe BlueBubbles extension persists two new pieces of state to disk (catchup cursor and inbound dedupe cache) under the OpenClaw state directory:
These paths are ultimately written using On POSIX and Windows, this directory creation / open / rename sequence does not defend against pre-existing symlinks/junctions in parent directory components (e.g., Vulnerable write sites include: // extensions/bluebubbles/src/catchup.ts
await writeJsonFileAtomically(filePath, cursor);
// extensions/bluebubbles/src/inbound-dedupe.ts (via createClaimableDedupe)
resolveFilePath: resolveNamespaceFilePath,This is primarily a local attack scenario, but is relevant in multi-tenant/shared-host deployments or when environment variables / state directory permissions are not tightly controlled. RecommendationHarden state-file writes against link-following in parent directories and against unsafe state dir overrides. Recommended defense-in-depth (choose what is feasible cross-platform):
Example (conceptual) path-component validation: import fs from "node:fs/promises";
import path from "node:path";
async function assertNoSymlinksUnder(root: string, targetDir: string) {
const rel = path.relative(root, targetDir);
const parts = rel.split(path.sep).filter(Boolean);
let cur = root;
for (const p of parts) {
cur = path.join(cur, p);
const st = await fs.lstat(cur);
if (st.isSymbolicLink()) throw new Error(`refusing to write under symlink: ${cur}`);
}
}Apply this check (or equivalent) before calling Analyzed PR: #66760 at commit Last updated on: 2026-04-14T21:36:39Z |
…rt (#66721) Adds a per-account startup catchup pass that queries BB Server for messages delivered since a persisted cursor and re-feeds each through the existing processMessage pipeline. Fixes the missed-message hole documented in #66721: BB's WebhookService is fire-and-forget on POST failure and MessagePoller does not replay webhook-receiver recovery, so inbound messages delivered while the gateway was down/wedged/restarting were permanently lost. Design - New `extensions/bluebubbles/src/catchup.ts`: - `fetchBlueBubblesMessagesSince(sinceMs, limit, opts)` calls `/api/v1/message/query` with `{after, sort:"ASC", with:[chat, chat.participants, attachment]}` so replays carry the same shape `normalizeWebhookMessage` already handles on live dispatch. - `loadBlueBubblesCatchupCursor` / `saveBlueBubblesCatchupCursor` persist a single `{lastSeenMs, updatedAt}` per account at `<stateDir>/bluebubbles/catchup/<accountId>__<hash>.json`, using the plugin-sdk's atomic JSON helpers. File layout mirrors the inbound-dedupe store from #66230. - `runBlueBubblesCatchup(target)` orchestrates: clamp config, skip recent runs (<30s), clamp window to `maxAgeMinutes`, fetch, filter `isFromMe` and pre-cursor records, dispatch to `processMessage`, advance cursor to `nowMs` on success. - `monitor.ts`: after the webhook target registers, fire catchup as a background task; errors are logged but never block readiness. - `config-schema.ts`: new optional `catchup` block (`enabled`, `maxAgeMinutes`, `perRunLimit`, `firstRunLookbackMinutes`), defaults on with 2h lookback / 50 msg cap / 30-min first-run lookback. Safety - Goes through the same `processMessage` path webhooks use, so auth, allowlist, pairing, and downstream agent dispatch all apply unchanged. - Dedupes against #66230's persistent inbound GUID cache, so a webhook delivery that already succeeded cannot be reprocessed by catchup. - Never dispatches `isFromMe` records (double-checked before and after normalization) so the agent's own sends cannot enter the inbound path. - Cursor is only advanced on a successful query; a failed fetch leaves the prior cursor in place so the next run retries the same window. - 30s minimum interval between runs prevents churn on rolling restarts. - Hard ceilings: 12h max lookback, 500 messages per run. Validation - New scoped tests in `extensions/bluebubbles/src/catchup.test.ts` (14 cases covering cursor round-trip, account scoping, FS-unsafe account IDs, firstRunLookback, maxAge clamp, enabled:false, rapid restart skip, isFromMe filter, query failure preserving cursor, per-message failure isolation, and pre-cursor defense-in-depth). - Full BlueBubbles suite (403/403) and `pnpm check` green. - End-to-end verified on macOS with a live BB Server 1.9.x: stop gateway, send N messages to monitored threads (verified 3/3 ECONNREFUSED in BB's `main.log`), restart gateway; catchup queried and re-POSTed all N missed messages through `processMessage`, and subsequent boots were no-ops (empty delta). Retires the reference implementation at `openclaw-agents/lobster/scripts/bb-catchup.sh` in the Lobster workspace, which has been running this exact shape for ~4 weeks. Closes #66721. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8cd511d to
4a89afb
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8cd511dae7
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Greptile SummaryAdds an in-process startup catchup pass to the BlueBubbles channel that queries BB Server for messages missed while the gateway was unreachable and replays them through the existing Confidence Score: 5/5Safe to merge; all findings are P2 style/observability suggestions that don't affect correctness. No P0/P1 issues found. The two P2 items are: (1) resolveStateDirFromEnv is duplicated between catchup.ts and inbound-dedupe.ts — a drift risk but both copies are currently identical; and (2) no warning is emitted when fetchedCount hits perRunLimit, making silent truncation during long outages hard to diagnose. Core logic, dedupe integration, cursor semantics, and test coverage are solid. extensions/bluebubbles/src/catchup.ts — see comments on resolveStateDirFromEnv duplication and the perRunLimit truncation warning. Prompt To Fix All With AIThis is a comment left during a code review.
Path: extensions/bluebubbles/src/catchup.ts
Line: 50-59
Comment:
**Duplicated `resolveStateDirFromEnv` across catchup.ts and inbound-dedupe.ts**
This function is byte-for-byte identical to the one in `inbound-dedupe.ts`. If either copy drifts (e.g., adding a new env-var override), the two stores will resolve different directories and cursor/dedupe files will be written to different roots, silently breaking the catchup ↔ dedupe pairing. Extract to a shared local utility (e.g. `state-dir.ts` inside the plugin) and import it from both files.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: extensions/bluebubbles/src/catchup.ts
Line: 308-319
Comment:
**No warning logged when fetch results are at the `perRunLimit` cap**
When BB returns exactly `perRunLimit` messages it has likely truncated the result set — the remaining messages (with `dateCreated` between the last fetched message and `nowMs`) are permanently unreachable because the cursor advances to `nowMs`. The existing log line shows `fetched=N` but gives no indication that `N` equals the configured cap. Emitting a distinct warning when `messages.length >= perRunLimit` would help operators know to raise the limit before a long outage silently drops messages.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "feat(bluebubbles): replay missed webhook..." | Re-trigger Greptile |
…rn on perRunLimit truncation Addresses review feedback on PR #66760: - Greptile P2: catchup.ts and inbound-dedupe.ts each had their own `resolveStateDirFromEnv`. Catchup's was a half-baked local copy that missed `~` expansion and the legacy/new state-dir fallback that the canonical `openclaw/plugin-sdk/state-paths.resolveStateDir` provides. Switch catchup to the canonical resolver so the catchup cursor and the inbound-dedupe state are guaranteed to land under the same root, preserving the dedupe-after-replay invariant the catchup design depends on. Test isolation preserved by honoring an explicit `OPENCLAW_STATE_DIR` override before the VITEST/test-mode default. - Greptile P2 + Aisle CWE-532-adjacent: when the BB result hits `perRunLimit`, emit a distinct WARNING through the runtime `error` channel pointing at the config knob to raise. Previously the run summary just showed `fetched=N` with no signal that older messages in the window may have been silently truncated (the cursor still advances to nowMs, so they're unreachable on the next sweep). - Tests: 2 new cases covering "warn on cap-hit" and "no warn below cap"; full BB suite remains 405/405 (was 403/403; +2 new). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0e8ddbaa61
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…kew gate Addresses Codex review findings on PR #66760: - P1: Avoid committing catchup cursor after replay failures Previously the cursor advanced to nowMs unconditionally, so a transient processMessage failure during startup catchup permanently dropped the failing record — the next sweep queried only `after nowMs` and never retried it. Now we track the earliest timestamp where processMessage threw and hold the cursor just before it (still clamped to >= the prior cursor and <= nowMs), so retries pick up exactly the failed records on the next run. The inbound-dedupe cache from #66230 keeps successfully-replayed messages from being re-processed on retry. Normalize failures (record didn't yield a usable NormalizedWebhookMessage) are still treated as permanent skips, so a malformed payload doesn't wedge catchup forever. - P2: Skip min-interval gate when stored cursor is in the future If the host clock jumped backwards (NTP correction, manual adjust), `nowMs - existing.lastSeenMs` was negative, which still satisfies `< MIN_INTERVAL_MS`, so catchup returned null repeatedly until wall time caught up. Added `nowMs >= existing.lastSeenMs` precondition so future-dated cursors stop disabling catchup. Tests: 3 new cases (cursor-held-on-failure, clamp-to-prior-cursor, clock-skew-bypass-gate); BB suite remains 408/408. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e catchup overrides Addresses two more Codex P2 findings on PR #66760: - Enforce maxAgeMinutes on first catchup run Previously the no-cursor branch used `nowMs - firstRunLookbackMs` directly without applying `earliestAllowed`, so a config like `maxAgeMinutes: 5, firstRunLookbackMinutes: 30` silently scanned 30 minutes on first startup. Now both branches clamp against the earliestAllowed ceiling. - Preserve channel-level `catchup.enabled` in account overrides BlueBubbles account merging only deep-merges `network`. With catchup outside that list, an override like `accounts.work.catchup: { perRunLimit: 20 }` would replace the whole channel-level catchup block and silently drop a global `catchup.enabled: false`, re-enabling replay for that account. Adding `catchup` to `nestedObjectKeys` deep-merges the override on top of inherited defaults, matching the existing `network` precedent. Tests: 1 new case for the first-run maxAge clamp; BB suite 409/409. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 00de3cc1b7
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…covers via firstRunLookback Addresses Codex P2 finding on PR #66760 (cursor in future causes silent skip). Previously, after a host clock rollback (NTP correction or manual adjust), `existing.lastSeenMs` could be ahead of `nowMs`. The earlier `Math.max(existing.lastSeenMs, earliestAllowed)` then computed `windowStartMs` in the future, so the BB query ran with `after=future`, returned zero records, and we still saved cursor = nowMs at the end of the run — permanently skipping every real message delivered in the [earliestAllowed, nowMs] window. Now: a cursor strictly in the future is treated as if no cursor exists at all. The window falls through to the firstRunLookback path (clamped to maxAge), so catchup looks back a sensible distance and replays through the existing dedupe-protected pipeline. Cursor save at the end of the run repairs the future timestamp. Test: extends the previous "skips rate-limit gate" case to assert that the recovery window is firstRunLookback-clamped (not nowMs) and that the cursor is repaired to nowMs after the run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7990aeb7da
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 80d2393528
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…ge boundary on truncation Addresses two more Codex P1 findings on PR #66760: - Run catchup even when cursor age is under MIN_INTERVAL Catchup is the *only* mechanism that recovers messages dropped during the gateway-down window, and it runs once per gateway startup. The 30s MIN_INTERVAL_MS gate was protecting against ~nothing (a few extra BB queries on rolling restarts) at the cost of permanent message loss when restarts happened within that window. A real failure: t0 startup, cursor=t0; t0+10s gateway down, webhook ECONNREFUSED at t0+15s; t0+20s restart skips catchup entirely. Gate removed; bounded by perRunLimit/maxAge + dedupe-protected so the cost of always running is capped. - Keep cursor behind unfetched pages when limit is hit Previously a long outage with >perRunLimit messages would fetch the oldest perRunLimit (sort:ASC), process them, and then advance the cursor to nowMs — permanently skipping the unfetched newer tail. Now we track the latest fetched timestamp regardless of fate, and on truncation (fetchedCount === perRunLimit) the cursor advances only to that page boundary so the next gateway startup picks up the rest. Updated the existing perRunLimit warn to reflect the new recoverable-on-next-startup semantics. Tests: replaced obsolete "skips rapid restart" with "runs catchup even on rapid restarts"; added "advances cursor only to last fetched ts when result is truncated". BB suite 410/410. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Review feedback round-upThanks for the thorough reviews. Pushed 5 follow-up commits since the original Greptile (✅ both addressed)
Codex (✅ all 7 addressed)
Tests grew from 14 → 21 cases; full BlueBubbles suite remains 410/410; Aisle (5 findings, summary)
Live end-to-end repro of the original failure mode is in the PR body and reproduced post-fix on macOS 26.3 / BB Server 1.9.x. Ready for another look. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ed4e403d7c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| } else if (isTruncated) { | ||
| // Use latestFetchedTs (clamped to >= prior cursor and <= nowMs) so the | ||
| // next run starts where this page ended. | ||
| nextCursorMs = Math.min(Math.max(latestFetchedTs, cursorBefore ?? windowStartMs), nowMs); |
There was a problem hiding this comment.
Avoid dropping same-timestamp tail at truncation boundary
When a catchup run hits perRunLimit, this branch advances the cursor to latestFetchedTs. If there are additional unfetched messages with that exact same dateCreated value (which is realistic when timestamps are coarse or bursts occur), the next run starts from that same timestamp and the loop’s ts <= windowStartMs guard drops them, so they are never replayed. In outage backlogs this causes permanent message loss even though catchup is enabled; truncated pagination needs a stable tie-breaker (or a cursor rewind below the boundary) to avoid skipping equal-timestamp records.
Useful? React with 👍 / 👎.
4d63af6 to
82bce26
Compare
…66810 The inbound-dedupe PR was reopened as #66810 (#66230 closed without merging at 21:10:28Z, #66810 created 7s later on the same lobster/bb-inbound-dedupe branch). Updating code comments and the catchup CHANGELOG entry to point at the live PR. The branch dependency chain (this PR stacked on lobster/bb-inbound-dedupe) is unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…66816 The inbound-dedupe PR was reopened again as #66816 (closed-without-merge trail: #66230 → #66810 → #66816). The branch was force-pushed and the new PR uses the parallel `fix/bb-inbound-dedupe` branch. Updating code comments and the catchup CHANGELOG entry to point at the live PR. Stacking on top of the dedupe branch will be addressed in a follow-up rebase. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rn on perRunLimit truncation Addresses review feedback on PR #66760: - Greptile P2: catchup.ts and inbound-dedupe.ts each had their own `resolveStateDirFromEnv`. Catchup's was a half-baked local copy that missed `~` expansion and the legacy/new state-dir fallback that the canonical `openclaw/plugin-sdk/state-paths.resolveStateDir` provides. Switch catchup to the canonical resolver so the catchup cursor and the inbound-dedupe state are guaranteed to land under the same root, preserving the dedupe-after-replay invariant the catchup design depends on. Test isolation preserved by honoring an explicit `OPENCLAW_STATE_DIR` override before the VITEST/test-mode default. - Greptile P2 + Aisle CWE-532-adjacent: when the BB result hits `perRunLimit`, emit a distinct WARNING through the runtime `error` channel pointing at the config knob to raise. Previously the run summary just showed `fetched=N` with no signal that older messages in the window may have been silently truncated (the cursor still advances to nowMs, so they're unreachable on the next sweep). - Tests: 2 new cases covering "warn on cap-hit" and "no warn below cap"; full BB suite remains 405/405 (was 403/403; +2 new). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…kew gate Addresses Codex review findings on PR #66760: - P1: Avoid committing catchup cursor after replay failures Previously the cursor advanced to nowMs unconditionally, so a transient processMessage failure during startup catchup permanently dropped the failing record — the next sweep queried only `after nowMs` and never retried it. Now we track the earliest timestamp where processMessage threw and hold the cursor just before it (still clamped to >= the prior cursor and <= nowMs), so retries pick up exactly the failed records on the next run. The inbound-dedupe cache from #66230 keeps successfully-replayed messages from being re-processed on retry. Normalize failures (record didn't yield a usable NormalizedWebhookMessage) are still treated as permanent skips, so a malformed payload doesn't wedge catchup forever. - P2: Skip min-interval gate when stored cursor is in the future If the host clock jumped backwards (NTP correction, manual adjust), `nowMs - existing.lastSeenMs` was negative, which still satisfies `< MIN_INTERVAL_MS`, so catchup returned null repeatedly until wall time caught up. Added `nowMs >= existing.lastSeenMs` precondition so future-dated cursors stop disabling catchup. Tests: 3 new cases (cursor-held-on-failure, clamp-to-prior-cursor, clock-skew-bypass-gate); BB suite remains 408/408. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e catchup overrides Addresses two more Codex P2 findings on PR #66760: - Enforce maxAgeMinutes on first catchup run Previously the no-cursor branch used `nowMs - firstRunLookbackMs` directly without applying `earliestAllowed`, so a config like `maxAgeMinutes: 5, firstRunLookbackMinutes: 30` silently scanned 30 minutes on first startup. Now both branches clamp against the earliestAllowed ceiling. - Preserve channel-level `catchup.enabled` in account overrides BlueBubbles account merging only deep-merges `network`. With catchup outside that list, an override like `accounts.work.catchup: { perRunLimit: 20 }` would replace the whole channel-level catchup block and silently drop a global `catchup.enabled: false`, re-enabling replay for that account. Adding `catchup` to `nestedObjectKeys` deep-merges the override on top of inherited defaults, matching the existing `network` precedent. Tests: 1 new case for the first-run maxAge clamp; BB suite 409/409. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…covers via firstRunLookback Addresses Codex P2 finding on PR #66760 (cursor in future causes silent skip). Previously, after a host clock rollback (NTP correction or manual adjust), `existing.lastSeenMs` could be ahead of `nowMs`. The earlier `Math.max(existing.lastSeenMs, earliestAllowed)` then computed `windowStartMs` in the future, so the BB query ran with `after=future`, returned zero records, and we still saved cursor = nowMs at the end of the run — permanently skipping every real message delivered in the [earliestAllowed, nowMs] window. Now: a cursor strictly in the future is treated as if no cursor exists at all. The window falls through to the firstRunLookback path (clamped to maxAge), so catchup looks back a sensible distance and replays through the existing dedupe-protected pipeline. Cursor save at the end of the run repairs the future timestamp. Test: extends the previous "skips rate-limit gate" case to assert that the recovery window is firstRunLookback-clamped (not nowMs) and that the cursor is repaired to nowMs after the run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ge boundary on truncation Addresses two more Codex P1 findings on PR #66760: - Run catchup even when cursor age is under MIN_INTERVAL Catchup is the *only* mechanism that recovers messages dropped during the gateway-down window, and it runs once per gateway startup. The 30s MIN_INTERVAL_MS gate was protecting against ~nothing (a few extra BB queries on rolling restarts) at the cost of permanent message loss when restarts happened within that window. A real failure: t0 startup, cursor=t0; t0+10s gateway down, webhook ECONNREFUSED at t0+15s; t0+20s restart skips catchup entirely. Gate removed; bounded by perRunLimit/maxAge + dedupe-protected so the cost of always running is capped. - Keep cursor behind unfetched pages when limit is hit Previously a long outage with >perRunLimit messages would fetch the oldest perRunLimit (sort:ASC), process them, and then advance the cursor to nowMs — permanently skipping the unfetched newer tail. Now we track the latest fetched timestamp regardless of fate, and on truncation (fetchedCount === perRunLimit) the cursor advances only to that page boundary so the next gateway startup picks up the rest. Updated the existing perRunLimit warn to reflect the new recoverable-on-next-startup semantics. Tests: replaced obsolete "skips rapid restart" with "runs catchup even on rapid restarts"; added "advances cursor only to last fetched ts when result is truncated". BB suite 410/410. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Fixes #66721. Adds an in-process startup catchup pass to the BlueBubbles channel that queries BB Server for messages delivered while the gateway was unreachable and replays them through the existing
processMessagepipeline.The hole this closes: BB Server's
WebhookServiceis fire-and-forget on POST failure (no retries) and BB'sMessagePolleronly re-fires webhooks on BB-side reconnection events (Messages.app / APNs), not on webhook-receiver recovery. So inbound messages delivered while the OpenClaw gateway was down, restarting, or wedged were permanently lost.This PR is stacked on top of #66810 (persistent inbound dedupe). The dedupe makes catchup safe to be aggressive: if a webhook delivery already succeeded for a GUID, the catchup replay of that same GUID is dropped at the dedupe boundary. Recommend landing #66810 first; this PR will rebase cleanly onto main once it does.
Design
extensions/bluebubbles/src/catchup.ts(new, ~290 LoC):fetchBlueBubblesMessagesSince(sinceMs, limit, opts)calls/api/v1/message/querywith{after, sort:\"ASC\", with:[chat, chat.participants, attachment]}so replays carry the same shapenormalizeWebhookMessagealready handles on live dispatch.loadBlueBubblesCatchupCursor/saveBlueBubblesCatchupCursorpersist a single{lastSeenMs, updatedAt}per account at<stateDir>/bluebubbles/catchup/<accountId>__<hash>.json, using the plugin-sdk's atomic JSON helpers. File layout mirrors the inbound-dedupe store from fix(bluebubbles): dedupe inbound webhooks across restarts (#19176, #12053) #66810.runBlueBubblesCatchup(target)orchestrates: clamp config, skip recent runs (<30s), clamp window tomaxAgeMinutes, fetch, filterisFromMeand pre-cursor records, dispatch toprocessMessage, advance cursor tonowMson success.monitor.ts: after the webhook target registers, fire catchup as a background task; errors are logged but never block the channel-ready signal.config-schema.ts: new optionalcatchupblock (enabled,maxAgeMinutes,perRunLimit,firstRunLookbackMinutes); defaults are on with 2h lookback / 50 msg cap / 30-min first-run lookback.Why this approach
The fix mirrors a workspace-level shell script that's been running on a real OpenClaw install for ~4 weeks (~100 LoC of bash + python doing the same query/filter/POST flow). Porting it into the BB channel itself means every install gets recovery for free, calls
processMessagedirectly (no re-POST hop), and benefits from #66810's persistent dedupe automatically.Safety
processMessagepath webhooks use, so auth, allowlist, pairing, and downstream agent dispatch all apply unchanged.isFromMerecords (double-checked before and after normalization) so the agent's own sends cannot enter the inbound path.nowMsrather than the latest observed message timestamp, so messages withdateCreated > nowMs(BB host vs gateway host clock skew) are not silently skipped — they replay on the next sweep, where dedupe handles any duplicate.Validation
Automated
extensions/bluebubbles/src/catchup.test.ts(14 cases): cursor round-trip, per-account scoping, FS-unsafe account IDs,firstRunLookbackMinutes,maxAgeMinutesclamp,enabled: false, rapid-restart skip,isFromMefilter (pre- and post-normalization), query-failure-preserves-cursor, per-message failure isolation, pre-cursor defense-in-depth.pnpm checkgreen (madge, tsgo, oxlint, webhook-auth-body-order, no-pairing-store-group, pairing-account-scope).Live end-to-end (macOS, BB Server 1.9.x, 2026-04-14)
Repeating the original repro from #66721's issue body, this time against the new in-process catchup with the workspace shim disabled:
connect ECONNREFUSED 127.0.0.1:18789and never retried (confirms the hole this PR fixes is real and that BB does not replay on receiver-side reachability).~/.openclaw/bluebubbles/catchup/<accountId>__<hash>.json. A subsequent gateway restart with no new inbound activity loggedreplayed=0 fetched=0(no-op, as expected).Test plan
pnpm test extensions/bluebubbles/src/catchup.test.ts— 14/14pnpm test extensions/bluebubbles/— 403/403pnpm check— greenOrder of operations
#66810 (persistent inbound dedupe) is a prerequisite. Once it lands, this branch will rebase onto main and the PR will retarget automatically.