Skip to content

feat(bluebubbles): replay missed webhook messages after gateway restart (#66721)#66760

Closed
omarshahine wants to merge 9 commits intolobster/bb-inbound-dedupefrom
lobster/bb-catchup
Closed

feat(bluebubbles): replay missed webhook messages after gateway restart (#66721)#66760
omarshahine wants to merge 9 commits intolobster/bb-inbound-dedupefrom
lobster/bb-catchup

Conversation

@omarshahine
Copy link
Copy Markdown
Contributor

@omarshahine omarshahine commented Apr 14, 2026

Summary

Fixes #66721. Adds an in-process startup catchup pass to the BlueBubbles channel that queries BB Server for messages delivered while the gateway was unreachable and replays them through the existing processMessage pipeline.

The hole this closes: BB Server's WebhookService is fire-and-forget on POST failure (no retries) and BB's MessagePoller only re-fires webhooks on BB-side reconnection events (Messages.app / APNs), not on webhook-receiver recovery. So inbound messages delivered while the OpenClaw gateway was down, restarting, or wedged were permanently lost.

This PR is stacked on top of #66810 (persistent inbound dedupe). The dedupe makes catchup safe to be aggressive: if a webhook delivery already succeeded for a GUID, the catchup replay of that same GUID is dropped at the dedupe boundary. Recommend landing #66810 first; this PR will rebase cleanly onto main once it does.

Design

  • extensions/bluebubbles/src/catchup.ts (new, ~290 LoC):
    • fetchBlueBubblesMessagesSince(sinceMs, limit, opts) calls /api/v1/message/query with {after, sort:\"ASC\", with:[chat, chat.participants, attachment]} so replays carry the same shape normalizeWebhookMessage already handles on live dispatch.
    • loadBlueBubblesCatchupCursor / saveBlueBubblesCatchupCursor persist a single {lastSeenMs, updatedAt} per account at <stateDir>/bluebubbles/catchup/<accountId>__<hash>.json, using the plugin-sdk's atomic JSON helpers. File layout mirrors the inbound-dedupe store from fix(bluebubbles): dedupe inbound webhooks across restarts (#19176, #12053) #66810.
    • runBlueBubblesCatchup(target) orchestrates: clamp config, skip recent runs (<30s), clamp window to maxAgeMinutes, fetch, filter isFromMe and pre-cursor records, dispatch to processMessage, advance cursor to nowMs on success.
  • monitor.ts: after the webhook target registers, fire catchup as a background task; errors are logged but never block the channel-ready signal.
  • config-schema.ts: new optional catchup block (enabled, maxAgeMinutes, perRunLimit, firstRunLookbackMinutes); defaults are on with 2h lookback / 50 msg cap / 30-min first-run lookback.

Why this approach

The fix mirrors a workspace-level shell script that's been running on a real OpenClaw install for ~4 weeks (~100 LoC of bash + python doing the same query/filter/POST flow). Porting it into the BB channel itself means every install gets recovery for free, calls processMessage directly (no re-POST hop), and benefits from #66810's persistent dedupe automatically.

Safety

  • Goes through the same processMessage path webhooks use, so auth, allowlist, pairing, and downstream agent dispatch all apply unchanged.
  • Dedupes against fix(bluebubbles): dedupe inbound webhooks across restarts (#19176, #12053) #66810's persistent inbound GUID cache: a webhook delivery that already succeeded cannot be reprocessed by catchup.
  • Never dispatches isFromMe records (double-checked before and after normalization) so the agent's own sends cannot enter the inbound path.
  • Cursor is only advanced on a successful query; a failed fetch leaves the prior cursor in place so the next run retries the same window.
  • 30s minimum interval between runs prevents churn on rolling restarts.
  • Hard ceilings: 12h max lookback, 500 messages per run.
  • Cursor uses nowMs rather than the latest observed message timestamp, so messages with dateCreated > nowMs (BB host vs gateway host clock skew) are not silently skipped — they replay on the next sweep, where dedupe handles any duplicate.

Validation

Automated

  • New scoped tests in extensions/bluebubbles/src/catchup.test.ts (14 cases): cursor round-trip, per-account scoping, FS-unsafe account IDs, firstRunLookbackMinutes, maxAgeMinutes clamp, enabled: false, rapid-restart skip, isFromMe filter (pre- and post-normalization), query-failure-preserves-cursor, per-message failure isolation, pre-cursor defense-in-depth.
  • Full BlueBubbles suite passes: 403/403.
  • pnpm check green (madge, tsgo, oxlint, webhook-auth-body-order, no-pairing-store-group, pairing-account-scope).

Live end-to-end (macOS, BB Server 1.9.x, 2026-04-14)

Repeating the original repro from #66721's issue body, this time against the new in-process catchup with the workspace shim disabled:

  1. Stopped gateway cleanly. Verified port refused, no process.
  2. Sent 3 distinct iMessages from a second device. BB-server log shows all 3 dispatches failed with connect ECONNREFUSED 127.0.0.1:18789 and never retried (confirms the hole this PR fixes is real and that BB does not replay on receiver-side reachability).
  3. Started gateway. Webhook target registered; catchup fired in the background.
  4. Gateway log:
    [bluebubbles] [default] BlueBubbles catchup:
      replayed=3 skipped_fromMe=0 skipped_preCursor=0 failed=0 fetched=3 window_ms=517184
    
  5. All 3 messages produced agent replies that were delivered back via BB outbound. Persistent cursor file appeared at ~/.openclaw/bluebubbles/catchup/<accountId>__<hash>.json. A subsequent gateway restart with no new inbound activity logged replayed=0 fetched=0 (no-op, as expected).

Test plan

  • pnpm test extensions/bluebubbles/src/catchup.test.ts — 14/14
  • pnpm test extensions/bluebubbles/ — 403/403
  • pnpm check — green
  • Live macOS end-to-end repro: stop gateway, send N messages (verified N×ECONNREFUSED in BB log), start gateway, assert N messages replayed via the catchup path with cursor advanced and dedupe state populated, and no-op on subsequent restart
  • Maintainer review

Order of operations

#66810 (persistent inbound dedupe) is a prerequisite. Once it lands, this branch will rebase onto main and the PR will retarget automatically.

…2053)

BlueBubbles MessagePoller replays its ~1-week lookback window as new-message
webhooks after BB Server restart or reconnect. Without persistent dedup, the
gateway re-replies to messages it already handled before the restart.

Add a persistent file-backed GUID dedupe (TTL=7d, matching BB's lookback
window) at the top of processMessage, using the same createClaimableDedupe
SDK primitive as Feishu. The on-disk store at
~/.openclaw/bluebubbles/inbound-dedupe/<account>.json survives gateway
restarts. Claim/finalize/release semantics ensure transient delivery failures
release the GUID so a later replay can retry, while successful deliveries are
committed and block future replays.

Fixes #19176, #12053.
@omarshahine omarshahine requested a review from a team as a code owner April 14, 2026 19:29
@aisle-research-bot
Copy link
Copy Markdown

aisle-research-bot Bot commented Apr 14, 2026

🔒 Aisle Security Analysis

We found 3 potential security issue(s) in this PR:

# Severity Title
1 🟡 Medium BlueBubbles catchup sends API password in URL query string
2 🟡 Medium Insecure test-state directory under /tmp allows symlink/hardlink overwrite of persisted JSON state
3 🟡 Medium Local file overwrite via symlinked state subdirectories when persisting BlueBubbles cursor/dedupe JSON
1. 🟡 BlueBubbles catchup sends API password in URL query string
Property Value
Severity Medium
CWE CWE-598
Location extensions/bluebubbles/src/catchup.ts:120-125

Description

fetchBlueBubblesMessagesSince() constructs the BlueBubbles API URL using buildBlueBubblesApiUrl(..., password) which appends the password as a ?password=... query parameter.

Putting credentials in the URL is sensitive because:

  • URLs are commonly logged by reverse proxies, HTTP access logs, observability/telemetry tooling, and error reports
  • URLs may be stored in browser/network caches or intermediate components
  • If any redirect occurs, the full URL can be propagated (and may end up in Referer headers in other contexts)

Vulnerable code:

const url = buildBlueBubblesApiUrl({
  baseUrl: opts.baseUrl,
  path: "/api/v1/message/query",
  password: opts.password,
});

This newly added catchup path increases the surface area where the password-in-URL pattern is used (additional endpoint and new network call on startup).

Recommendation

Avoid placing secrets in URLs. Prefer an HTTP header (e.g., Authorization) or a dedicated header expected by BlueBubbles.

Example (using a header instead of a query param):

const url = buildBlueBubblesApiUrl({
  baseUrl: opts.baseUrl,
  path: "/api/v1/message/query",
});

const res = await blueBubblesFetchWithTimeout(
  url,
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",// Prefer a header; exact scheme depends on BB server support
      "Authorization": `Bearer ${opts.password}`,
    },
    body,
  },
  opts.timeoutMs ?? FETCH_TIMEOUT_MS,
  ssrfPolicy,
);

If BlueBubbles only supports password as a query parameter, mitigate by:

  • ensuring request/URL logging is disabled or redacts query strings
  • redacting password from any error reporting paths
  • using TLS and restricting network access to the BB server
2. 🟡 Insecure test-state directory under /tmp allows symlink/hardlink overwrite of persisted JSON state
Property Value
Severity Medium
CWE CWE-59
Location extensions/bluebubbles/src/inbound-dedupe.ts:23-29

Description

The BlueBubbles extension switches its persistence root to a predictable per-PID directory under os.tmpdir() whenever VITEST is set or NODE_ENV === "test".

If this code path is reachable in a non-test deployment (e.g., environment misconfiguration, untrusted env injection, or accidentally shipping with NODE_ENV=test), a local attacker on the same host can pre-create /tmp/openclaw-vitest-<pid> (or subpaths) as a symlink to another location. Subsequent writes via writeJsonFileAtomically() / createClaimableDedupe() ultimately call writeJsonAtomic()/writeTextAtomic(), which:

  • creates directories with fs.mkdir(..., { recursive: true }) (follows symlinks)
  • writes a temp file and then rename()s/copies it to the final path
  • does not perform lstat/realpath validation on the full path components
  • does not use O_NOFOLLOW (not available portably in Node, but symlink-safe patterns exist)

This enables link-following writes outside the intended state directory, potentially overwriting arbitrary files writable by the gateway process.

Vulnerable code (state dir selection):

if (env.VITEST || env.NODE_ENV === "test") {
  const name = "openclaw-vitest-" + process.pid;
  return path.join(os.tmpdir(), name);
}

Affected persistence sinks include:

  • extensions/bluebubbles/src/inbound-dedupe.ts → persistent dedupe JSON file
  • extensions/bluebubbles/src/catchup.ts → catchup cursor JSON file

Recommendation

Avoid predictable, attacker-precreatable directories under /tmp for any code path that can run outside a fully controlled test harness.

Options (prefer A):

A. Always require an explicit per-test state dir override (e.g., OPENCLAW_STATE_DIR set by the test harness) and remove the implicit /tmp fallback.

B. Use fs.mkdtemp to create an unguessable directory, and store it in a process-scoped global so all module instances share it during the process lifetime.

C. Add symlink defenses before writing: resolve realpath of the parent directory and verify it is within the intended root; reject if any path component is a symlink (via lstat).

Example using mkdtemp (process-scoped):

import fs from "node:fs/promises";
import os from "node:os";
import path from "node:path";

let testStateDir: string | null = null;

async function resolveStateDirFromEnv(env: NodeJS.ProcessEnv = process.env): Promise<string> {
  if (env.OPENCLAW_STATE_DIR?.trim()) return resolveStateDir(env);
  if (env.VITEST || env.NODE_ENV === "test") {
    if (!testStateDir) {
      testStateDir = await fs.mkdtemp(path.join(os.tmpdir(), "openclaw-vitest-"));
    }
    return testStateDir;
  }
  return resolveStateDir(env);
}

Additionally, consider hard-failing startup if NODE_ENV==="test" is detected outside an explicit test runner context.

3. 🟡 Local file overwrite via symlinked state subdirectories when persisting BlueBubbles cursor/dedupe JSON
Property Value
Severity Medium
CWE CWE-59
Location extensions/bluebubbles/src/catchup.ts:94-101

Description

The BlueBubbles extension persists two new pieces of state to disk (catchup cursor and inbound dedupe cache) under the OpenClaw state directory:

  • catchup.ts writes ${stateDir}/bluebubbles/catchup/<prefix>__<hash>.json
  • inbound-dedupe.ts writes ${stateDir}/bluebubbles/inbound-dedupe/<prefix>__<hash>.json

These paths are ultimately written using writeJsonFileAtomically()writeJsonAtomic()writeTextAtomic(), which creates directories with fs.mkdir(path.dirname(filePath), { recursive: true }) and then renames/copies a temp file into place.

On POSIX and Windows, this directory creation / open / rename sequence does not defend against pre-existing symlinks/junctions in parent directory components (e.g., ${stateDir}/bluebubbles being a symlink to /etc). If an attacker can write within the state directory tree (or can cause OPENCLAW_STATE_DIR to point at a shared/writable location), they can pre-create a symlinked path so the gateway will write these JSON files outside the intended state directory, potentially overwriting arbitrary files writable by the gateway process.

Vulnerable write sites include:

// extensions/bluebubbles/src/catchup.ts
await writeJsonFileAtomically(filePath, cursor);// extensions/bluebubbles/src/inbound-dedupe.ts (via createClaimableDedupe)
resolveFilePath: resolveNamespaceFilePath,

This is primarily a local attack scenario, but is relevant in multi-tenant/shared-host deployments or when environment variables / state directory permissions are not tightly controlled.

Recommendation

Harden state-file writes against link-following in parent directories and against unsafe state dir overrides.

Recommended defense-in-depth (choose what is feasible cross-platform):

  1. Ensure the state root is not attacker-writable and is created with 0700 (already attempted via ensureDirMode) and document this requirement.

  2. Reject symlinks/junctions in the state path components before writing. For example, walk each path segment under stateDir and lstat() it; if any component is a symbolic link (or on Windows, a reparse point), refuse to write.

  3. Prefer opening the destination securely (POSIX): use open() with O_NOFOLLOW (where available) and write via file-descriptor-based APIs, or use a library that supports O_NOFOLLOW and secure atomic replace.

  4. Constrain OPENCLAW_STATE_DIR: if running in a context where untrusted parties can influence env vars, ignore OPENCLAW_STATE_DIR or require it to be an absolute path under an allowlisted base.

Example (conceptual) path-component validation:

import fs from "node:fs/promises";
import path from "node:path";

async function assertNoSymlinksUnder(root: string, targetDir: string) {
  const rel = path.relative(root, targetDir);
  const parts = rel.split(path.sep).filter(Boolean);
  let cur = root;
  for (const p of parts) {
    cur = path.join(cur, p);
    const st = await fs.lstat(cur);
    if (st.isSymbolicLink()) throw new Error(`refusing to write under symlink: ${cur}`);
  }
}

Apply this check (or equivalent) before calling writeJsonFileAtomically / persistent dedupe writes.


Analyzed PR: #66760 at commit f53bcc6

Last updated on: 2026-04-14T21:36:39Z

@openclaw-barnacle openclaw-barnacle Bot added channel: bluebubbles Channel integration: bluebubbles channel: slack Channel integration: slack gateway Gateway runtime scripts Repository scripts agents Agent runtime and tooling extensions: qa-lab size: XL maintainer Maintainer-authored PR labels Apr 14, 2026
…rt (#66721)

Adds a per-account startup catchup pass that queries BB Server for messages
delivered since a persisted cursor and re-feeds each through the existing
processMessage pipeline. Fixes the missed-message hole documented in #66721:
BB's WebhookService is fire-and-forget on POST failure and MessagePoller
does not replay webhook-receiver recovery, so inbound messages delivered
while the gateway was down/wedged/restarting were permanently lost.

Design

- New `extensions/bluebubbles/src/catchup.ts`:
  - `fetchBlueBubblesMessagesSince(sinceMs, limit, opts)` calls
    `/api/v1/message/query` with `{after, sort:"ASC", with:[chat,
    chat.participants, attachment]}` so replays carry the same shape
    `normalizeWebhookMessage` already handles on live dispatch.
  - `loadBlueBubblesCatchupCursor` / `saveBlueBubblesCatchupCursor`
    persist a single `{lastSeenMs, updatedAt}` per account at
    `<stateDir>/bluebubbles/catchup/<accountId>__<hash>.json`, using
    the plugin-sdk's atomic JSON helpers. File layout mirrors the
    inbound-dedupe store from #66230.
  - `runBlueBubblesCatchup(target)` orchestrates: clamp config, skip
    recent runs (<30s), clamp window to `maxAgeMinutes`, fetch, filter
    `isFromMe` and pre-cursor records, dispatch to `processMessage`,
    advance cursor to `nowMs` on success.
- `monitor.ts`: after the webhook target registers, fire catchup as a
  background task; errors are logged but never block readiness.
- `config-schema.ts`: new optional `catchup` block (`enabled`,
  `maxAgeMinutes`, `perRunLimit`, `firstRunLookbackMinutes`), defaults
  on with 2h lookback / 50 msg cap / 30-min first-run lookback.

Safety

- Goes through the same `processMessage` path webhooks use, so auth,
  allowlist, pairing, and downstream agent dispatch all apply unchanged.
- Dedupes against #66230's persistent inbound GUID cache, so a webhook
  delivery that already succeeded cannot be reprocessed by catchup.
- Never dispatches `isFromMe` records (double-checked before and after
  normalization) so the agent's own sends cannot enter the inbound path.
- Cursor is only advanced on a successful query; a failed fetch leaves
  the prior cursor in place so the next run retries the same window.
- 30s minimum interval between runs prevents churn on rolling restarts.
- Hard ceilings: 12h max lookback, 500 messages per run.

Validation

- New scoped tests in `extensions/bluebubbles/src/catchup.test.ts`
  (14 cases covering cursor round-trip, account scoping, FS-unsafe
  account IDs, firstRunLookback, maxAge clamp, enabled:false, rapid
  restart skip, isFromMe filter, query failure preserving cursor,
  per-message failure isolation, and pre-cursor defense-in-depth).
- Full BlueBubbles suite (403/403) and `pnpm check` green.
- End-to-end verified on macOS with a live BB Server 1.9.x: stop
  gateway, send N messages to monitored threads (verified 3/3
  ECONNREFUSED in BB's `main.log`), restart gateway; catchup queried
  and re-POSTed all N missed messages through `processMessage`, and
  subsequent boots were no-ops (empty delta).

Retires the reference implementation at
`openclaw-agents/lobster/scripts/bb-catchup.sh` in the Lobster
workspace, which has been running this exact shape for ~4 weeks.

Closes #66721.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openclaw-barnacle openclaw-barnacle Bot added size: L and removed channel: slack Channel integration: slack gateway Gateway runtime scripts Repository scripts agents Agent runtime and tooling extensions: qa-lab size: XL labels Apr 14, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8cd511dae7

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread extensions/bluebubbles/src/catchup.ts Outdated
Comment thread extensions/bluebubbles/src/catchup.ts Outdated
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 14, 2026

Greptile Summary

Adds an in-process startup catchup pass to the BlueBubbles channel that queries BB Server for messages missed while the gateway was unreachable and replays them through the existing processMessage pipeline. The implementation is well-designed: it integrates cleanly with #66230's persistent inbound-dedupe (which drops already-handled GUIDs), double-filters isFromMe records, uses atomic cursor persistence, and is fire-and-forget from monitor.ts so it never blocks the channel-ready signal.

Confidence Score: 5/5

Safe to merge; all findings are P2 style/observability suggestions that don't affect correctness.

No P0/P1 issues found. The two P2 items are: (1) resolveStateDirFromEnv is duplicated between catchup.ts and inbound-dedupe.ts — a drift risk but both copies are currently identical; and (2) no warning is emitted when fetchedCount hits perRunLimit, making silent truncation during long outages hard to diagnose. Core logic, dedupe integration, cursor semantics, and test coverage are solid.

extensions/bluebubbles/src/catchup.ts — see comments on resolveStateDirFromEnv duplication and the perRunLimit truncation warning.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/bluebubbles/src/catchup.ts
Line: 50-59

Comment:
**Duplicated `resolveStateDirFromEnv` across catchup.ts and inbound-dedupe.ts**

This function is byte-for-byte identical to the one in `inbound-dedupe.ts`. If either copy drifts (e.g., adding a new env-var override), the two stores will resolve different directories and cursor/dedupe files will be written to different roots, silently breaking the catchup ↔ dedupe pairing. Extract to a shared local utility (e.g. `state-dir.ts` inside the plugin) and import it from both files.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: extensions/bluebubbles/src/catchup.ts
Line: 308-319

Comment:
**No warning logged when fetch results are at the `perRunLimit` cap**

When BB returns exactly `perRunLimit` messages it has likely truncated the result set — the remaining messages (with `dateCreated` between the last fetched message and `nowMs`) are permanently unreachable because the cursor advances to `nowMs`. The existing log line shows `fetched=N` but gives no indication that `N` equals the configured cap. Emitting a distinct warning when `messages.length >= perRunLimit` would help operators know to raise the limit before a long outage silently drops messages.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "feat(bluebubbles): replay missed webhook..." | Re-trigger Greptile

Comment thread extensions/bluebubbles/src/catchup.ts
Comment thread extensions/bluebubbles/src/catchup.ts Outdated
…rn on perRunLimit truncation

Addresses review feedback on PR #66760:

- Greptile P2: catchup.ts and inbound-dedupe.ts each had their own
  `resolveStateDirFromEnv`. Catchup's was a half-baked local copy that
  missed `~` expansion and the legacy/new state-dir fallback that the
  canonical `openclaw/plugin-sdk/state-paths.resolveStateDir` provides.
  Switch catchup to the canonical resolver so the catchup cursor and the
  inbound-dedupe state are guaranteed to land under the same root,
  preserving the dedupe-after-replay invariant the catchup design depends
  on. Test isolation preserved by honoring an explicit
  `OPENCLAW_STATE_DIR` override before the VITEST/test-mode default.
- Greptile P2 + Aisle CWE-532-adjacent: when the BB result hits
  `perRunLimit`, emit a distinct WARNING through the runtime `error`
  channel pointing at the config knob to raise. Previously the run
  summary just showed `fetched=N` with no signal that older messages in
  the window may have been silently truncated (the cursor still advances
  to nowMs, so they're unreachable on the next sweep).
- Tests: 2 new cases covering "warn on cap-hit" and "no warn below cap";
  full BB suite remains 405/405 (was 403/403; +2 new).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0e8ddbaa61

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread extensions/bluebubbles/src/catchup.ts Outdated
Comment thread extensions/bluebubbles/src/catchup.ts
Lobster and others added 2 commits April 14, 2026 13:09
…kew gate

Addresses Codex review findings on PR #66760:

- P1: Avoid committing catchup cursor after replay failures
  Previously the cursor advanced to nowMs unconditionally, so a
  transient processMessage failure during startup catchup permanently
  dropped the failing record — the next sweep queried only `after
  nowMs` and never retried it. Now we track the earliest timestamp
  where processMessage threw and hold the cursor just before it (still
  clamped to >= the prior cursor and <= nowMs), so retries pick up
  exactly the failed records on the next run. The inbound-dedupe cache
  from #66230 keeps successfully-replayed messages from being
  re-processed on retry. Normalize failures (record didn't yield a
  usable NormalizedWebhookMessage) are still treated as permanent
  skips, so a malformed payload doesn't wedge catchup forever.

- P2: Skip min-interval gate when stored cursor is in the future
  If the host clock jumped backwards (NTP correction, manual adjust),
  `nowMs - existing.lastSeenMs` was negative, which still satisfies
  `< MIN_INTERVAL_MS`, so catchup returned null repeatedly until wall
  time caught up. Added `nowMs >= existing.lastSeenMs` precondition so
  future-dated cursors stop disabling catchup.

Tests: 3 new cases (cursor-held-on-failure, clamp-to-prior-cursor,
clock-skew-bypass-gate); BB suite remains 408/408.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e catchup overrides

Addresses two more Codex P2 findings on PR #66760:

- Enforce maxAgeMinutes on first catchup run
  Previously the no-cursor branch used `nowMs - firstRunLookbackMs`
  directly without applying `earliestAllowed`, so a config like
  `maxAgeMinutes: 5, firstRunLookbackMinutes: 30` silently scanned 30
  minutes on first startup. Now both branches clamp against the
  earliestAllowed ceiling.

- Preserve channel-level `catchup.enabled` in account overrides
  BlueBubbles account merging only deep-merges `network`. With catchup
  outside that list, an override like `accounts.work.catchup:
  { perRunLimit: 20 }` would replace the whole channel-level catchup
  block and silently drop a global `catchup.enabled: false`,
  re-enabling replay for that account. Adding `catchup` to
  `nestedObjectKeys` deep-merges the override on top of inherited
  defaults, matching the existing `network` precedent.

Tests: 1 new case for the first-run maxAge clamp; BB suite 409/409.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00de3cc1b7

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread extensions/bluebubbles/src/catchup.ts Outdated
…covers via firstRunLookback

Addresses Codex P2 finding on PR #66760 (cursor in future causes silent skip).

Previously, after a host clock rollback (NTP correction or manual
adjust), `existing.lastSeenMs` could be ahead of `nowMs`. The earlier
`Math.max(existing.lastSeenMs, earliestAllowed)` then computed
`windowStartMs` in the future, so the BB query ran with `after=future`,
returned zero records, and we still saved cursor = nowMs at the end of
the run — permanently skipping every real message delivered in the
[earliestAllowed, nowMs] window.

Now: a cursor strictly in the future is treated as if no cursor exists
at all. The window falls through to the firstRunLookback path (clamped
to maxAge), so catchup looks back a sensible distance and replays
through the existing dedupe-protected pipeline. Cursor save at the end
of the run repairs the future timestamp.

Test: extends the previous "skips rate-limit gate" case to assert that
the recovery window is firstRunLookback-clamped (not nowMs) and that
the cursor is repaired to nowMs after the run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7990aeb7da

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread extensions/bluebubbles/src/catchup.ts Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 80d2393528

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread extensions/bluebubbles/src/catchup.ts
…ge boundary on truncation

Addresses two more Codex P1 findings on PR #66760:

- Run catchup even when cursor age is under MIN_INTERVAL
  Catchup is the *only* mechanism that recovers messages dropped during
  the gateway-down window, and it runs once per gateway startup. The
  30s MIN_INTERVAL_MS gate was protecting against ~nothing (a few
  extra BB queries on rolling restarts) at the cost of permanent
  message loss when restarts happened within that window. A real
  failure: t0 startup, cursor=t0; t0+10s gateway down, webhook
  ECONNREFUSED at t0+15s; t0+20s restart skips catchup entirely. Gate
  removed; bounded by perRunLimit/maxAge + dedupe-protected so the
  cost of always running is capped.

- Keep cursor behind unfetched pages when limit is hit
  Previously a long outage with >perRunLimit messages would fetch the
  oldest perRunLimit (sort:ASC), process them, and then advance the
  cursor to nowMs — permanently skipping the unfetched newer tail. Now
  we track the latest fetched timestamp regardless of fate, and on
  truncation (fetchedCount === perRunLimit) the cursor advances only
  to that page boundary so the next gateway startup picks up the rest.
  Updated the existing perRunLimit warn to reflect the new
  recoverable-on-next-startup semantics.

Tests: replaced obsolete "skips rapid restart" with "runs catchup even
on rapid restarts"; added "advances cursor only to last fetched ts when
result is truncated". BB suite 410/410.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@omarshahine
Copy link
Copy Markdown
Contributor Author

Review feedback round-up

Thanks for the thorough reviews. Pushed 5 follow-up commits since the original 4a89afbe. Summary of what's in / not in:

Greptile (✅ both addressed)

  • Duplicated resolveStateDirFromEnv0e8ddbaa61 switches catchup to the canonical openclaw/plugin-sdk/state-paths.resolveStateDir that inbound-dedupe already uses, so the two stores can never drift apart. (Note: catchup tests use OPENCLAW_STATE_DIR for per-test isolation, so the test-mode default short-circuit now only fires when that env var is unset; explicit overrides win.)
  • No warning on perRunLimit truncation0e8ddbaa61 adds a distinct WARN line; ed4e403d7c then made truncation actually recoverable (cursor advances only to the page boundary), and the warn message was updated to reflect the new "next startup picks up the rest" semantics.

Codex (✅ all 7 addressed)

  • P1: cursor advances after replay failure00de3cc1b7 tracks the earliest processMessage failure timestamp and holds the cursor just before it. Successful replays from this run are dedupe-protected by fix(bluebubbles): dedupe inbound webhooks across restarts (#19176, #12053) #66230. Normalize failures (record didn't yield a NormalizedWebhookMessage) are still treated as permanent skips so a malformed payload doesn't wedge catchup forever.
  • P2: clock-skew negative-delta hits gate → addressed inside the broader fix in ed4e403d7c (see below).
  • P2: maxAgeMinutes not enforced on first run7990aeb7da clamps both the existing-cursor and first-run branches to earliestAllowed.
  • P2: account override drops catchup.enabled7990aeb7da adds catchup to nestedObjectKeys in accounts.ts, mirroring the existing network precedent so deep-merge semantics apply.
  • P2: future cursor causes silent skip80d2393528 treats a strictly-future cursor as if no cursor exists, falling through to firstRunLookback so the [earliestAllowed, nowMs] window does get queried; cursor save at the end repairs the future timestamp.
  • P1: run catchup even when cursor age is under MIN_INTERVALed4e403d7c removes the MIN_INTERVAL_MS gate entirely. Catchup runs once per gateway startup and is the only recovery mechanism, so skipping the second startup's pass after a quick restart loses any messages that arrived during the brief downtime. The bounded query (perRunLimit, maxAge) and fix(bluebubbles): dedupe inbound webhooks across restarts (#19176, #12053) #66230's dedupe cap the cost of running every restart. (This also obviates the clock-skew gate that the P2 above asked for — solved by removal rather than >= precondition.)
  • P1: keep cursor behind unfetched pages when limit is hited4e403d7c tracks latestFetchedTs regardless of message fate, and on truncation (fetchedCount === perRunLimit) the cursor advances only to the page boundary so the next gateway startup picks up the unfetched tail. Previously the unfetched tail past the cap would have been permanently unreachable — exactly the failure mode the feature is meant to prevent.

Tests grew from 14 → 21 cases; full BlueBubbles suite remains 410/410; pnpm check green.

Aisle (5 findings, summary)

Live end-to-end repro of the original failure mode is in the PR body and reproduced post-fix on macOS 26.3 / BB Server 1.9.x. Ready for another look.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ed4e403d7c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +367 to +370
} else if (isTruncated) {
// Use latestFetchedTs (clamped to >= prior cursor and <= nowMs) so the
// next run starts where this page ended.
nextCursorMs = Math.min(Math.max(latestFetchedTs, cursorBefore ?? windowStartMs), nowMs);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid dropping same-timestamp tail at truncation boundary

When a catchup run hits perRunLimit, this branch advances the cursor to latestFetchedTs. If there are additional unfetched messages with that exact same dateCreated value (which is realistic when timestamps are coarse or bursts occur), the next run starts from that same timestamp and the loop’s ts <= windowStartMs guard drops them, so they are never replayed. In outage backlogs this causes permanent message loss even though catchup is enabled; truncated pagination needs a stable tie-breaker (or a cursor rewind below the boundary) to avoid skipping equal-timestamp records.

Useful? React with 👍 / 👎.

@omarshahine omarshahine force-pushed the lobster/bb-inbound-dedupe branch from 4d63af6 to 82bce26 Compare April 14, 2026 21:02
Lobster and others added 2 commits April 14, 2026 14:13
…66810

The inbound-dedupe PR was reopened as #66810 (#66230 closed without
merging at 21:10:28Z, #66810 created 7s later on the same
lobster/bb-inbound-dedupe branch). Updating code comments and the
catchup CHANGELOG entry to point at the live PR. The branch dependency
chain (this PR stacked on lobster/bb-inbound-dedupe) is unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…66816

The inbound-dedupe PR was reopened again as #66816 (closed-without-merge
trail: #66230#66810#66816). The branch was force-pushed and the
new PR uses the parallel `fix/bb-inbound-dedupe` branch. Updating code
comments and the catchup CHANGELOG entry to point at the live PR.
Stacking on top of the dedupe branch will be addressed in a follow-up
rebase.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@omarshahine omarshahine deleted the branch lobster/bb-inbound-dedupe April 14, 2026 22:48
omarshahine pushed a commit that referenced this pull request Apr 14, 2026
…rn on perRunLimit truncation

Addresses review feedback on PR #66760:

- Greptile P2: catchup.ts and inbound-dedupe.ts each had their own
  `resolveStateDirFromEnv`. Catchup's was a half-baked local copy that
  missed `~` expansion and the legacy/new state-dir fallback that the
  canonical `openclaw/plugin-sdk/state-paths.resolveStateDir` provides.
  Switch catchup to the canonical resolver so the catchup cursor and the
  inbound-dedupe state are guaranteed to land under the same root,
  preserving the dedupe-after-replay invariant the catchup design depends
  on. Test isolation preserved by honoring an explicit
  `OPENCLAW_STATE_DIR` override before the VITEST/test-mode default.
- Greptile P2 + Aisle CWE-532-adjacent: when the BB result hits
  `perRunLimit`, emit a distinct WARNING through the runtime `error`
  channel pointing at the config knob to raise. Previously the run
  summary just showed `fetched=N` with no signal that older messages in
  the window may have been silently truncated (the cursor still advances
  to nowMs, so they're unreachable on the next sweep).
- Tests: 2 new cases covering "warn on cap-hit" and "no warn below cap";
  full BB suite remains 405/405 (was 403/403; +2 new).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
omarshahine pushed a commit that referenced this pull request Apr 14, 2026
…kew gate

Addresses Codex review findings on PR #66760:

- P1: Avoid committing catchup cursor after replay failures
  Previously the cursor advanced to nowMs unconditionally, so a
  transient processMessage failure during startup catchup permanently
  dropped the failing record — the next sweep queried only `after
  nowMs` and never retried it. Now we track the earliest timestamp
  where processMessage threw and hold the cursor just before it (still
  clamped to >= the prior cursor and <= nowMs), so retries pick up
  exactly the failed records on the next run. The inbound-dedupe cache
  from #66230 keeps successfully-replayed messages from being
  re-processed on retry. Normalize failures (record didn't yield a
  usable NormalizedWebhookMessage) are still treated as permanent
  skips, so a malformed payload doesn't wedge catchup forever.

- P2: Skip min-interval gate when stored cursor is in the future
  If the host clock jumped backwards (NTP correction, manual adjust),
  `nowMs - existing.lastSeenMs` was negative, which still satisfies
  `< MIN_INTERVAL_MS`, so catchup returned null repeatedly until wall
  time caught up. Added `nowMs >= existing.lastSeenMs` precondition so
  future-dated cursors stop disabling catchup.

Tests: 3 new cases (cursor-held-on-failure, clamp-to-prior-cursor,
clock-skew-bypass-gate); BB suite remains 408/408.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
omarshahine pushed a commit that referenced this pull request Apr 14, 2026
…e catchup overrides

Addresses two more Codex P2 findings on PR #66760:

- Enforce maxAgeMinutes on first catchup run
  Previously the no-cursor branch used `nowMs - firstRunLookbackMs`
  directly without applying `earliestAllowed`, so a config like
  `maxAgeMinutes: 5, firstRunLookbackMinutes: 30` silently scanned 30
  minutes on first startup. Now both branches clamp against the
  earliestAllowed ceiling.

- Preserve channel-level `catchup.enabled` in account overrides
  BlueBubbles account merging only deep-merges `network`. With catchup
  outside that list, an override like `accounts.work.catchup:
  { perRunLimit: 20 }` would replace the whole channel-level catchup
  block and silently drop a global `catchup.enabled: false`,
  re-enabling replay for that account. Adding `catchup` to
  `nestedObjectKeys` deep-merges the override on top of inherited
  defaults, matching the existing `network` precedent.

Tests: 1 new case for the first-run maxAge clamp; BB suite 409/409.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
omarshahine pushed a commit that referenced this pull request Apr 14, 2026
…covers via firstRunLookback

Addresses Codex P2 finding on PR #66760 (cursor in future causes silent skip).

Previously, after a host clock rollback (NTP correction or manual
adjust), `existing.lastSeenMs` could be ahead of `nowMs`. The earlier
`Math.max(existing.lastSeenMs, earliestAllowed)` then computed
`windowStartMs` in the future, so the BB query ran with `after=future`,
returned zero records, and we still saved cursor = nowMs at the end of
the run — permanently skipping every real message delivered in the
[earliestAllowed, nowMs] window.

Now: a cursor strictly in the future is treated as if no cursor exists
at all. The window falls through to the firstRunLookback path (clamped
to maxAge), so catchup looks back a sensible distance and replays
through the existing dedupe-protected pipeline. Cursor save at the end
of the run repairs the future timestamp.

Test: extends the previous "skips rate-limit gate" case to assert that
the recovery window is firstRunLookback-clamped (not nowMs) and that
the cursor is repaired to nowMs after the run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
omarshahine pushed a commit that referenced this pull request Apr 14, 2026
…ge boundary on truncation

Addresses two more Codex P1 findings on PR #66760:

- Run catchup even when cursor age is under MIN_INTERVAL
  Catchup is the *only* mechanism that recovers messages dropped during
  the gateway-down window, and it runs once per gateway startup. The
  30s MIN_INTERVAL_MS gate was protecting against ~nothing (a few
  extra BB queries on rolling restarts) at the cost of permanent
  message loss when restarts happened within that window. A real
  failure: t0 startup, cursor=t0; t0+10s gateway down, webhook
  ECONNREFUSED at t0+15s; t0+20s restart skips catchup entirely. Gate
  removed; bounded by perRunLimit/maxAge + dedupe-protected so the
  cost of always running is capped.

- Keep cursor behind unfetched pages when limit is hit
  Previously a long outage with >perRunLimit messages would fetch the
  oldest perRunLimit (sort:ASC), process them, and then advance the
  cursor to nowMs — permanently skipping the unfetched newer tail. Now
  we track the latest fetched timestamp regardless of fate, and on
  truncation (fetchedCount === perRunLimit) the cursor advances only
  to that page boundary so the next gateway startup picks up the rest.
  Updated the existing perRunLimit warn to reflect the new
  recoverable-on-next-startup semantics.

Tests: replaced obsolete "skips rapid restart" with "runs catchup even
on rapid restarts"; added "advances cursor only to last fetched ts when
result is truncated". BB suite 410/410.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: bluebubbles Channel integration: bluebubbles maintainer Maintainer-authored PR size: XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant