BlueBubbles/catchup: per-message retry cap for wedged messages (#66870) #67426
omarshahine merged 7 commits into main from
Conversation
🔒 Aisle Security Analysis

We found 4 potential security issue(s) in this PR:
1. 🟡 Unbounded per-file promise write queue can cause memory/event-loop DoS
Description
Because the queue is implemented as an unbounded promise chain, an attacker (or simply a high-throughput workload) can trigger arbitrarily many concurrent pending writes against the same file.
This is reachable in typical usage patterns where many different keys map to the same namespace file (e.g., deduping many unique message IDs under a single namespace).

Vulnerable code:

```ts
const fileWriteQueues = new Map<string, Promise<unknown>>();

function enqueueFileWrite<T>(filePath: string, fn: () => Promise<T>): Promise<T> {
  const prev = fileWriteQueues.get(filePath) ?? Promise.resolve();
  const next = prev.then(fn, fn);
  fileWriteQueues.set(filePath, next);
  // ...
  return next;
}
```

Recommendation
Add explicit backpressure and bounded queueing per file, and ensure stalled operations cannot block the queue indefinitely. Suggested mitigations (combine as appropriate):
Example: bounded semaphore (pseudo-code):

```ts
import { Semaphore } from "async-mutex";

const semaphores = new Map<string, { sem: Semaphore; queued: number }>();
const MAX_WAITERS = 1000;

async function withBoundedFileMutex<T>(filePath: string, fn: () => Promise<T>): Promise<T> {
  const entry = semaphores.get(filePath) ?? { sem: new Semaphore(1), queued: 0 };
  semaphores.set(filePath, entry);
  if (entry.queued >= MAX_WAITERS) {
    throw new Error(`dedupe backlog exceeded for ${filePath}`);
  }
  entry.queued++;
  // async-mutex's acquire() resolves to [value, release]; calling
  // release() frees the slot.
  const [, release] = await entry.sem.acquire();
  try {
    return await fn();
  } finally {
    entry.queued--;
    release();
    if (entry.queued === 0) semaphores.delete(filePath);
  }
}
```

2. 🟡 Unsafe legacy dedupe file migration allows TOCTOU/symlink-hardlink abuse in state directory
Description
If an attacker can write into the OpenClaw state directory (or can influence the legacy path before the rename), the exists-check-then-rename sequence can be raced (TOCTOU) or pointed at a symlink/hardlink, moving or deleting files the migration never intended to touch.

Vulnerable code:

```ts
if (!fs.existsSync(newPath)) {
  fs.renameSync(legacyPath, newPath);
} else {
  fs.unlinkSync(legacyPath);
}
```

Recommendation
Harden the migration so it only operates on expected regular files within the dedupe directory, and avoid TOCTOU patterns. Recommended changes:
Example (sketch):

```ts
const dedupeDir = path.join(resolveStateDirFromEnv(), "bluebubbles", "inbound-dedupe");
const legacyPath = resolveLegacyNamespaceFilePath(namespace);

// Refuse to touch anything whose real parent directory is outside the
// dedupe directory (defeats symlinked parents). Files directly inside
// dedupeDir have realParent === realDedupe, so allow that case too.
for (const p of [legacyPath, newPath]) {
  const realParent = await fs.promises.realpath(path.dirname(p));
  const realDedupe = await fs.promises.realpath(dedupeDir);
  if (realParent !== realDedupe && !realParent.startsWith(realDedupe + path.sep)) {
    return; // refuse
  }
}

let st: fs.Stats;
try {
  st = await fs.promises.lstat(legacyPath);
} catch {
  return;
}
if (!st.isFile()) return; // rejects symlink and non-file
if (st.nlink > 1) return; // optional hardlink defense

try {
  // Note: on POSIX, rename() silently overwrites an existing target, so
  // check for newPath first if overwriting must be avoided.
  await fs.promises.rename(legacyPath, newPath);
} catch (e: any) {
  if (e.code === "EEXIST") {
    await fs.promises.unlink(legacyPath);
  }
}
```

Also ensure the state directory is created with restrictive permissions (e.g., mode 0o700).

3. 🟡 Catchup replay can silently drop messages based on untrusted associatedMessage* / balloonBundleId fields
Description
In the catchup replay path, the skip decision is driven entirely by server-supplied fields (associatedMessageType / associated_message_type, balloonBundleId) that the gateway does not validate.
If a malicious/compromised BlueBubbles server (or a buggy upstream) marks a normal text message with these fields, the gateway will silently drop it during catchup, potentially bypassing downstream automations/moderation that rely on replay integrity.

Vulnerable code:

```ts
const assocType = rec.associatedMessageType ?? rec.associated_message_type;
const balloonId = typeof rec.balloonBundleId === "string" ? rec.balloonBundleId.trim() : "";
if (assocGuid && (assocType != null || balloonId)) {
  continue;
}
```

Recommendation
Harden the skip logic so only known non-message events are skipped, and avoid permanently advancing the cursor past potentially-real messages. Suggested approaches (pick one):
Example hardening:

```ts
const assocType = rec.associatedMessageType ?? rec.associated_message_type;
const balloonId = typeof rec.balloonBundleId === "string" ? rec.balloonBundleId.trim() : "";
const isKnownReaction = typeof assocType === "number" && KNOWN_REACTION_TYPES.has(assocType);
const isBalloon = Boolean(balloonId);
if (assocGuid && (isKnownReaction || isBalloon)) {
  // optionally: increment a skipped counter + debug log
  continue;
}
```

Additionally, consider tracking skipped items in the summary/logs (and/or not advancing the cursor past them) to avoid silent data loss if upstream mislabels events.

4. 🔵 Improper neutralization of GUID and error text in BlueBubbles catchup logs (log injection/PII leakage)
Description
The catchup retry/give-up logic logs retryKey (derived from the message GUID) and the stringified error verbatim, both of which can contain attacker-influenced content.

Vulnerable code:

```ts
error?.(
  `[${accountId}] BlueBubbles catchup: giving up on guid=${retryKey} ` +
    `after ${nextCount} consecutive failures; future sweeps will skip ` +
    `this message. timestamp=${ts}: ${String(err)}`,
);
// ...
error?.(
  `[${accountId}] BlueBubbles catchup: processMessage failed (retry ` +
    `${nextCount}/${maxFailureRetries}): ${String(err)}`,
);
```

Note: elsewhere in the codebase there is already a sanitization helper for this purpose.

Recommendation
Sanitize/neutralize untrusted strings before writing to logs, and bound their length.
Example:

```ts
function sanitizeForLog(value: unknown, maxLen = 200): string {
  const cleaned = String(value).replace(/[\r\n\t\p{C}]/gu, " ");
  return cleaned.length > maxLen ? cleaned.slice(0, maxLen) + "..." : cleaned;
}

const safeKey = sanitizeForLog(retryKey, 120);
const safeErr = sanitizeForLog(err, 200);
error?.(
  `[${accountId}] BlueBubbles catchup: giving up on guid=${safeKey} ` +
    `after ${nextCount} consecutive failures; timestamp=${ts}: ${safeErr}`,
);
```

This prevents log forging/splitting and reduces the chance of persisting sensitive payload fragments in logs.

Analyzed PR: #67426 at commit
Last updated on: 2026-04-16T05:13:31Z
Greptile Summary

Adds a per-message retry ceiling (catchup.maxFailureRetries) so a persistently-failing message can no longer wedge the catchup cursor.

The core logic is well-designed: two count regimes in one map (still-retrying vs. given-up), natural per-run pruning of stale entries, a defense-in-depth size cap, and a distinct WARN on the give-up transition. All 14 new test cases pass and the edge cases from the PR description are covered.

Confidence Score: 5/5

Safe to merge — the implementation is correct, backward-compatible, and comprehensively tested. No P0 or P1 issues found. The cursor arithmetic is sound across all edge cases (truncation, give-up + still-retrying mixed, GUID-less messages, clock rollback, first-run, legacy cursor files). All 14 new test scenarios verify the intended semantics. CHANGELOG entry is correctly placed at the end of the Fixes section per repo conventions. No files require special attention.

Reviews (1): Last reviewed commit: "BlueBubbles/catchup: per-message retry c..."
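The retry ceiling summarized above can be sketched as follows. This is an illustrative sketch based on the PR description (default 10, clamped to [1, 1000]); the function names are hypothetical, not the PR's actual identifiers.

```typescript
// Sketch of the give-up decision: once a GUID accumulates
// maxFailureRetries consecutive failures, catchup skips it on sight
// (no further processMessage attempt) and the cursor advances past it.
const DEFAULT_MAX_FAILURE_RETRIES = 10;

function clampMaxFailureRetries(configured?: number): number {
  const v = configured ?? DEFAULT_MAX_FAILURE_RETRIES;
  // Clamp to the documented [1, 1000] range.
  return Math.min(1000, Math.max(1, Math.floor(v)));
}

function isGivenUp(consecutiveFailures: number, maxFailureRetries: number): boolean {
  return consecutiveFailures >= maxFailureRetries;
}
```

A message at exactly the cap transitions to the given-up regime, which is why the WARN fires once on that boundary.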
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ac69fb1694
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```ts
// Success clears any accumulated retries for this GUID. Since we
// build `nextRetries` from scratch rather than mutating
// `prevRetries`, simply NOT copying the entry is the clear. (We
// still need this branch so readers understand the lifecycle.)
```
Drop stale retry count after same-GUID success in one sweep
If the BlueBubbles query returns the same GUID more than once in a single run, this success path does not clear a prior failure entry already written to nextRetries earlier in the loop. Because prevCount is read from prevRetries (not the in-run state), a fail -> success sequence for the same GUID still persists a retry count, so later sweeps can incorrectly keep retrying or eventually mark that message as given up even though it succeeded. This is reproducible whenever duplicate rows for a GUID appear in one fetch response.
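A minimal sketch of the fix this comment suggests, assuming the `nextRetries`/`prevRetries` map names from the snippet above and a hypothetical `recordOutcome` helper: explicitly deleting the in-run entry on success covers the fail -> success sequence for a GUID duplicated within one fetch.

```typescript
// On success, clear any retry entry written earlier in this same sweep,
// so a fail -> success sequence for a duplicated GUID ends clean.
function recordOutcome(
  nextRetries: Map<string, number>,
  prevRetries: Map<string, number>,
  guid: string,
  succeeded: boolean,
): void {
  if (succeeded) {
    // Not copying from prevRetries is the normal "clear"; deleting also
    // covers a failure recorded earlier in this run for the same GUID.
    nextRetries.delete(guid);
    return;
  }
  // Read the in-run state first so duplicate failures accumulate.
  const prior = nextRetries.get(guid) ?? prevRetries.get(guid) ?? 0;
  nextRetries.set(guid, prior + 1);
}
```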
💡 Codex Review
Reviewed commit: 24ef78e406
```ts
// Next-run retry map. Built from scratch each run so entries for GUIDs
// that didn't appear in this fetch are dropped (the cursor has
// advanced past them and they will never be queried again). Entries we
```
Preserve unseen retry entries on truncated catchup pages
The run rebuilds nextRetries exclusively from GUIDs present in the current fetch and then persists that map, but a truncated fetch (fetchedCount >= perRunLimit) does not imply unseen GUIDs are stale. If a previously tracked GUID falls beyond the current page boundary, its retry/give-up state is dropped even though the cursor has not advanced past it; when it appears on a later page, it is retried from zero again, which can reintroduce avoidable cursor holding and defeats sticky give-up behavior.
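One way to implement this suggestion (a sketch; `perRunLimit`, `fetchedCount`, and `seenGuids` are names assumed from the review text, not confirmed identifiers): after the sweep, carry forward prior entries for unseen GUIDs only when the fetch was truncated, since only then might tracked GUIDs sit beyond the page boundary.

```typescript
// Merge back retry entries for GUIDs not seen this run when the fetch
// was truncated; on a full (untruncated) page, unseen entries really
// are stale and are dropped as before.
function finalizeRetries(
  prevRetries: Map<string, number>,
  nextRetries: Map<string, number>,
  seenGuids: Set<string>,
  fetchedCount: number,
  perRunLimit: number,
): Map<string, number> {
  const truncated = fetchedCount >= perRunLimit;
  if (!truncated) return nextRetries;
  for (const [guid, count] of prevRetries) {
    if (!seenGuids.has(guid) && !nextRetries.has(guid)) {
      nextRetries.set(guid, count); // preserve state beyond the page boundary
    }
  }
  return nextRetries;
}
```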
💡 Codex Review
Reviewed commit: 11328b7110
```ts
if (assocGuid) {
  continue;
```
Keep reply messages with associatedMessageGuid in catchup
The new if (assocGuid) continue filter drops every message that has associatedMessageGuid, but BlueBubbles uses that field for more than just tapbacks/balloons (regular replies can carry it too). This means catchup can silently skip legitimate inbound replies during downtime recovery and still advance the cursor past them, causing permanent message loss for those users. The existing debouncer/dedupe logic only treats it as balloon metadata when paired with balloonBundleId, so this broader filter is too aggressive.
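A sketch of the narrower filter this comment argues for; the record shape and camelCase field names are assumptions based on the fields discussed in this thread, not the actual type from the PR.

```typescript
// Hypothetical subset of a BlueBubbles catchup record.
interface CatchupRecord {
  associatedMessageGuid?: string;
  associatedMessageType?: number | string;
  balloonBundleId?: string;
  threadOriginatorGuid?: string;
}

// Skip only when the record looks like a tapback/balloon: an associated
// GUID *plus* an associated type or balloon bundle id. Plain records and
// threaded replies (threadOriginatorGuid) are kept.
function shouldSkipAsNonMessage(rec: CatchupRecord): boolean {
  const assocGuid = rec.associatedMessageGuid?.trim() ?? "";
  if (!assocGuid) return false;
  const hasAssocType = rec.associatedMessageType != null;
  const hasBalloon =
    typeof rec.balloonBundleId === "string" && rec.balloonBundleId.trim() !== "";
  return hasAssocType || hasBalloon;
}
```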
```ts
void next.finally(() => {
  if (fileWriteQueues.get(filePath) === next) {
    fileWriteQueues.delete(filePath);
  }
```
Handle enqueue cleanup promise rejection explicitly
enqueueFileWrite calls void next.finally(...), which creates a second promise that will reject whenever next rejects, but that rejection is never observed. In Node 22, that becomes an unhandled rejection even though callers await next, so a disk error in readJsonFileWithFallback/writeJsonFileAtomically can surface as a process-level unhandled rejection. The cleanup hook should consume that rejection (or use a non-creating pattern) so error paths stay contained.
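A sketch of the contained pattern, mirroring the queue shape from the snippets above: the internal cleanup chain gets a terminal `.catch(() => {})` so its copy of any rejection is consumed, while callers still observe errors through the returned promise.

```typescript
const fileWriteQueues = new Map<string, Promise<unknown>>();

function enqueueFileWrite<T>(filePath: string, fn: () => Promise<T>): Promise<T> {
  const prev = fileWriteQueues.get(filePath) ?? Promise.resolve();
  // Run after the previous job regardless of its outcome.
  const next = prev.then(fn, fn);
  fileWriteQueues.set(filePath, next);
  next
    .finally(() => {
      // Only clear the map entry if no later job replaced it.
      if (fileWriteQueues.get(filePath) === next) {
        fileWriteQueues.delete(filePath);
      }
    })
    .catch(() => {}); // consume this chain's copy of any rejection
  return next;
}
```

Without the `.catch`, the promise returned by `.finally()` rejects alongside `next` and nothing observes it, which modern Node reports as an unhandled rejection even when the caller awaits `next`.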
…ger pins the cursor forever (#66870)
… catchup does not re-dispatch already-handled messages after upgrade
…race on inbound GUID commit

The re-entrant file lock allowed two concurrent checkAndRecordInner calls (e.g., inbound user message + outbound agent reply) to read the same stale file contents, then the last writer silently overwrote the first. The in-memory cache masked this within a process lifetime, but after restart the lost GUID caused catchup to re-dispatch already-handled messages.

Add an in-process write queue per file path so read-modify-write cycles targeting the same dedupe file are serialized. Also filter associated-message events (balloons, tapbacks) in catchup since they bypass the live path's debouncer and have distinct GUIDs from their parent message.
… write queue
Address Codex P1 review feedback:
- Balloon filter: only skip when associatedMessageType OR balloonBundleId
is set alongside associatedMessageGuid. Threaded replies use
threadOriginatorGuid and are unaffected.
- Write queue: .catch(() => {}) on the cleanup promise so a rejected
next doesn't surface as an unhandled rejection in Node 22+.
9a34ffa to d9ed931
💡 Codex Review
Reviewed commit: d9ed931147
```ts
} else {
  // Both exist: new file is authoritative; remove the stale legacy.
  fs.unlinkSync(legacyPath);
```
Preserve legacy dedupe data when hashed file already exists
When both legacy (${safe}.json) and hashed (${safe}__${hash}.json) files are present, this branch deletes the legacy file instead of merging its GUIDs. That can happen for users who already ran an intermediate build that created the hashed file but never migrated prior history, so removing the legacy file drops still-valid dedupe entries and allows previously handled messages to be replayed after restart.
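A sketch of the merge-instead-of-delete migration this comment asks for. The assumption that each dedupe file stores a JSON array of GUID strings is mine for illustration, not confirmed by the PR.

```typescript
import * as fs from "node:fs";

// Merge legacy dedupe GUIDs into the hashed file instead of discarding
// them, so previously handled messages are not replayed after restart.
function mergeLegacyDedupe(legacyPath: string, newPath: string): void {
  const read = (p: string): string[] => {
    try {
      const parsed = JSON.parse(fs.readFileSync(p, "utf8"));
      return Array.isArray(parsed)
        ? parsed.filter((g): g is string => typeof g === "string")
        : [];
    } catch {
      return []; // missing or corrupt file contributes nothing
    }
  };
  // New file stays authoritative; legacy entries are unioned in.
  const merged = [...new Set([...read(newPath), ...read(legacyPath)])];
  fs.writeFileSync(newPath, JSON.stringify(merged));
  fs.rmSync(legacyPath, { force: true });
}
```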
Merged via squash.
Thanks @omarshahine!
…law#66870) (openclaw#67426) Merged via squash. Prepared head SHA: 39e3cf1 Co-authored-by: omarshahine <10343873+omarshahine@users.noreply.github.com> Reviewed-by: @omarshahine
Summary
What started as a retry cap for #66870 uncovered and fixed two latent bugs in the catchup/dedupe plumbing from #66857 and #66230 that would have caused duplicate replies on every gateway restart for any user with catchup enabled.
1. Per-message retry cap (#66870)
Adds `catchup.maxFailureRetries` (default 10, clamped to [1, 1000]) so a persistently-failing message no longer wedges the catchup cursor forever. `count >= max` marks the GUID as "given up": catchup skips it on sight without another `processMessage` attempt, and the cursor advances past it.

2. Lost-update race in persistent dedupe (found during live testing)
The re-entrant lock in `file-lock.ts` gave concurrent callers for the same file immediate access instead of serializing them. Two `checkAndRecordInner` calls (inbound user message + outbound agent reply) would both read the same stale file, then the last writer silently overwrote the first writer's additions. The in-memory cache masked this within a process lifetime, but after restart the lost GUID caused catchup to replay already-handled messages — producing duplicate replies. Added an in-process write queue in `persistent-dedupe.ts` so read-modify-write cycles targeting the same dedupe file are serialized. The file lock continues to guard cross-process contention.

3. Dedupe file naming migration gap (found during live testing)
The dedupe file was renamed from `${safe}.json` to `${safe}__${hash}.json` between beta iterations. Upgrading started with an empty dedupe file and replayed the entire catchup window, producing duplicate replies for every recently-handled message. Added a migration in `inbound-dedupe.ts` that renames the legacy file on first access. Also added a `warmupBlueBubblesInboundDedupe` call in catchup before the fetch so the migration and memory warmup run eagerly, not only when `processMessage` happens to be called.

4. Balloon events bypassing debouncer (found during live testing)
A balloon event carried `balloonBundleId` in the query API response, so catchup replayed it as a standalone message — producing a duplicate reply. Catchup now skips messages with `associatedMessageGuid` set (tapbacks, reactions, balloons). Threaded replies use `threadOriginatorGuid` instead and are unaffected.

Fixes #66870.
Live testing
Dogfooded on a live BlueBubbles install with real iMessage traffic across multiple stop/restart cycles:
- `openclaw doctor` — clean after upgrade from 2026.4.14 to beta+retry-cap
- `replayed=0` (dedupe correctly recognizes the live-handled message)
- `default.json` renamed to `default__37a8eec1ce19.json` on first startup after migration fix
- `replayed=0 fetched=0` on a clean bounce with no intervening messages (cursor fully caught up, no stale leftovers)
- `associatedMessageGuid` messages are tapbacks/reactions only (checked 200 messages); threaded replies use `threadOriginatorGuid` and are not filtered

Automated tests
- `pnpm test extensions/bluebubbles/` — 425 passed
- `pnpm tsgo` — green
- `pnpm check` — 0 warnings, 0 errors
- `pnpm config:docs:check` / `pnpm plugin-sdk:api:check` — baselines match

🤖 Generated with Claude Code