BlueBubbles/catchup: per-message retry cap for wedged messages (#66870) #67426
omarshahine merged 7 commits into main from
Conversation
🔒 Aisle Security Analysis

We found 4 potential security issue(s) in this PR:
1. 🟡 Unbounded per-file promise write queue can cause memory/event-loop DoS
Description
Because the queue is implemented as an unbounded promise chain, an attacker (or simply a high-throughput workload) can trigger arbitrarily many concurrent pending writes against the same file.
This is reachable in typical usage patterns where many different keys map to the same namespace file (e.g., deduping many unique message IDs under a single namespace).

Vulnerable code:

```ts
const fileWriteQueues = new Map<string, Promise<unknown>>();

function enqueueFileWrite<T>(filePath: string, fn: () => Promise<T>): Promise<T> {
  const prev = fileWriteQueues.get(filePath) ?? Promise.resolve();
  const next = prev.then(fn, fn);
  fileWriteQueues.set(filePath, next);
  // ...
  return next;
}
```

Recommendation
Add explicit backpressure and bounded queueing per file, and ensure stalled operations cannot block the queue indefinitely. Suggested mitigations (combine as appropriate):
Example: bounded semaphore (pseudo-code):

```ts
import { Semaphore } from "async-mutex";

const semaphores = new Map<string, { sem: Semaphore; queued: number }>();
const MAX_WAITERS = 1000;

async function withBoundedFileMutex<T>(filePath: string, fn: () => Promise<T>): Promise<T> {
  const entry = semaphores.get(filePath) ?? { sem: new Semaphore(1), queued: 0 };
  semaphores.set(filePath, entry);
  if (entry.queued >= MAX_WAITERS) {
    throw new Error(`dedupe backlog exceeded for ${filePath}`);
  }
  entry.queued++;
  // async-mutex's acquire() resolves to [value, release]; calling
  // release() frees the slot.
  const [, release] = await entry.sem.acquire();
  try {
    return await fn();
  } finally {
    entry.queued--;
    release();
    if (entry.queued === 0) semaphores.delete(filePath);
  }
}
```

2. 🟡 Unsafe legacy dedupe file migration allows TOCTOU/symlink-hardlink abuse in state directory
Description
If an attacker can write into the OpenClaw state directory (or can influence the legacy path before the rename), the exists-check-then-rename sequence can be raced (TOCTOU) or pointed at a symlink/hardlink, moving or deleting files the migration never intended to touch.

Vulnerable code:

```ts
if (!fs.existsSync(newPath)) {
  fs.renameSync(legacyPath, newPath);
} else {
  fs.unlinkSync(legacyPath);
}
```

Recommendation
Harden the migration so it only operates on expected regular files within the dedupe directory, and avoid TOCTOU patterns. Recommended changes:
Example (sketch):

```ts
const dedupeDir = path.join(resolveStateDirFromEnv(), "bluebubbles", "inbound-dedupe");
const legacyPath = resolveLegacyNamespaceFilePath(namespace);

// Refuse to touch anything whose real parent directory is outside the
// dedupe directory (defeats symlinked parents). Files directly inside
// dedupeDir have realParent === realDedupe, so allow that case too.
for (const p of [legacyPath, newPath]) {
  const realParent = await fs.promises.realpath(path.dirname(p));
  const realDedupe = await fs.promises.realpath(dedupeDir);
  if (realParent !== realDedupe && !realParent.startsWith(realDedupe + path.sep)) {
    return; // refuse
  }
}

let st: fs.Stats;
try {
  st = await fs.promises.lstat(legacyPath);
} catch {
  return;
}
if (!st.isFile()) return; // rejects symlink and non-file
if (st.nlink > 1) return; // optional hardlink defense

try {
  // Note: on POSIX, rename() silently overwrites an existing target, so
  // check for newPath first if overwriting must be avoided.
  await fs.promises.rename(legacyPath, newPath);
} catch (e: any) {
  if (e.code === "EEXIST") {
    await fs.promises.unlink(legacyPath);
  }
}
```

Also ensure the state directory is created with restrictive permissions (e.g., mode 0o700).

3. 🟡 Catchup replay can silently drop messages based on untrusted associatedMessage* / balloonBundleId fields
Description
In the catchup replay path, the skip decision is driven entirely by server-supplied fields (associatedMessageType / associated_message_type, balloonBundleId) that the gateway does not validate.
If a malicious/compromised BlueBubbles server (or a buggy upstream) marks a normal text message with these fields, the gateway will silently drop it during catchup, potentially bypassing downstream automations/moderation that rely on replay integrity.

Vulnerable code:

```ts
const assocType = rec.associatedMessageType ?? rec.associated_message_type;
const balloonId = typeof rec.balloonBundleId === "string" ? rec.balloonBundleId.trim() : "";
if (assocGuid && (assocType != null || balloonId)) {
  continue;
}
```

Recommendation
Harden the skip logic so only known non-message events are skipped, and avoid permanently advancing the cursor past potentially-real messages. Suggested approaches (pick one):
Example hardening:

```ts
const assocType = rec.associatedMessageType ?? rec.associated_message_type;
const balloonId = typeof rec.balloonBundleId === "string" ? rec.balloonBundleId.trim() : "";
const isKnownReaction = typeof assocType === "number" && KNOWN_REACTION_TYPES.has(assocType);
const isBalloon = Boolean(balloonId);
if (assocGuid && (isKnownReaction || isBalloon)) {
  // optionally: increment a skipped counter + debug log
  continue;
}
```

Additionally, consider tracking skipped items in the summary/logs (and/or not advancing the cursor past them) to avoid silent data loss if upstream mislabels events.

4. 🔵 Improper neutralization of GUID and error text in BlueBubbles catchup logs (log injection/PII leakage)
Description
The catchup retry/give-up logic logs retryKey (derived from the message GUID) and the stringified error verbatim, both of which can contain attacker-influenced content.

Vulnerable code:

```ts
error?.(
  `[${accountId}] BlueBubbles catchup: giving up on guid=${retryKey} ` +
    `after ${nextCount} consecutive failures; future sweeps will skip ` +
    `this message. timestamp=${ts}: ${String(err)}`,
);
// ...
error?.(
  `[${accountId}] BlueBubbles catchup: processMessage failed (retry ` +
    `${nextCount}/${maxFailureRetries}): ${String(err)}`,
);
```

Note: elsewhere in the codebase there is already a sanitization helper for this purpose.

Recommendation
Sanitize/neutralize untrusted strings before writing to logs, and bound their length.
Example:

```ts
function sanitizeForLog(value: unknown, maxLen = 200): string {
  const cleaned = String(value).replace(/[\r\n\t\p{C}]/gu, " ");
  return cleaned.length > maxLen ? cleaned.slice(0, maxLen) + "..." : cleaned;
}

const safeKey = sanitizeForLog(retryKey, 120);
const safeErr = sanitizeForLog(err, 200);
error?.(
  `[${accountId}] BlueBubbles catchup: giving up on guid=${safeKey} ` +
    `after ${nextCount} consecutive failures; timestamp=${ts}: ${safeErr}`,
);
```

This prevents log forging/splitting and reduces the chance of persisting sensitive payload fragments in logs.

Analyzed PR: #67426 at commit
Last updated on: 2026-04-16T05:13:31Z
Greptile Summary

Adds a per-message retry ceiling (catchup.maxFailureRetries) so a persistently-failing message can no longer wedge the catchup cursor.

The core logic is well-designed: two count regimes in one map (still-retrying vs. given-up), natural per-run pruning of stale entries, a defense-in-depth size cap, and a distinct WARN on the give-up transition. All 14 new test cases pass and the edge cases from the PR description are covered.

Confidence Score: 5/5

Safe to merge — the implementation is correct, backward-compatible, and comprehensively tested. No P0 or P1 issues found. The cursor arithmetic is sound across all edge cases (truncation, give-up + still-retrying mixed, GUID-less messages, clock rollback, first-run, legacy cursor files). All 14 new test scenarios verify the intended semantics. CHANGELOG entry is correctly placed at the end of the Fixes section per repo conventions. No files require special attention.

Reviews (1): Last reviewed commit: "BlueBubbles/catchup: per-message retry c..."
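The retry ceiling summarized above can be sketched as follows. This is an illustrative sketch based on the PR description (default 10, clamped to [1, 1000]); the function names are hypothetical, not the PR's actual identifiers.

```typescript
// Sketch of the give-up decision: once a GUID accumulates
// maxFailureRetries consecutive failures, catchup skips it on sight
// (no further processMessage attempt) and the cursor advances past it.
const DEFAULT_MAX_FAILURE_RETRIES = 10;

function clampMaxFailureRetries(configured?: number): number {
  const v = configured ?? DEFAULT_MAX_FAILURE_RETRIES;
  // Clamp to the documented [1, 1000] range.
  return Math.min(1000, Math.max(1, Math.floor(v)));
}

function isGivenUp(consecutiveFailures: number, maxFailureRetries: number): boolean {
  return consecutiveFailures >= maxFailureRetries;
}
```

A message at exactly the cap transitions to the given-up regime, which is why the WARN fires once on that boundary.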
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ac69fb1694
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```ts
// Success clears any accumulated retries for this GUID. Since we
// build `nextRetries` from scratch rather than mutating
// `prevRetries`, simply NOT copying the entry is the clear. (We
// still need this branch so readers understand the lifecycle.)
```
Drop stale retry count after same-GUID success in one sweep
If the BlueBubbles query returns the same GUID more than once in a single run, this success path does not clear a prior failure entry already written to nextRetries earlier in the loop. Because prevCount is read from prevRetries (not the in-run state), a fail -> success sequence for the same GUID still persists a retry count, so later sweeps can incorrectly keep retrying or eventually mark that message as given up even though it succeeded. This is reproducible whenever duplicate rows for a GUID appear in one fetch response.
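A minimal sketch of the fix this comment suggests, assuming the `nextRetries`/`prevRetries` map names from the snippet above and a hypothetical `recordOutcome` helper: explicitly deleting the in-run entry on success covers the fail -> success sequence for a GUID duplicated within one fetch.

```typescript
// On success, clear any retry entry written earlier in this same sweep,
// so a fail -> success sequence for a duplicated GUID ends clean.
function recordOutcome(
  nextRetries: Map<string, number>,
  prevRetries: Map<string, number>,
  guid: string,
  succeeded: boolean,
): void {
  if (succeeded) {
    // Not copying from prevRetries is the normal "clear"; deleting also
    // covers a failure recorded earlier in this run for the same GUID.
    nextRetries.delete(guid);
    return;
  }
  // Read the in-run state first so duplicate failures accumulate.
  const prior = nextRetries.get(guid) ?? prevRetries.get(guid) ?? 0;
  nextRetries.set(guid, prior + 1);
}
```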
💡 Codex Review
Reviewed commit: 24ef78e406
```ts
// Next-run retry map. Built from scratch each run so entries for GUIDs
// that didn't appear in this fetch are dropped (the cursor has
// advanced past them and they will never be queried again). Entries we
```
Preserve unseen retry entries on truncated catchup pages
The run rebuilds nextRetries exclusively from GUIDs present in the current fetch and then persists that map, but a truncated fetch (fetchedCount >= perRunLimit) does not imply unseen GUIDs are stale. If a previously tracked GUID falls beyond the current page boundary, its retry/give-up state is dropped even though the cursor has not advanced past it; when it appears on a later page, it is retried from zero again, which can reintroduce avoidable cursor holding and defeats sticky give-up behavior.
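One way to implement this suggestion (a sketch; `perRunLimit`, `fetchedCount`, and `seenGuids` are names assumed from the review text, not confirmed identifiers): after the sweep, carry forward prior entries for unseen GUIDs only when the fetch was truncated, since only then might tracked GUIDs sit beyond the page boundary.

```typescript
// Merge back retry entries for GUIDs not seen this run when the fetch
// was truncated; on a full (untruncated) page, unseen entries really
// are stale and are dropped as before.
function finalizeRetries(
  prevRetries: Map<string, number>,
  nextRetries: Map<string, number>,
  seenGuids: Set<string>,
  fetchedCount: number,
  perRunLimit: number,
): Map<string, number> {
  const truncated = fetchedCount >= perRunLimit;
  if (!truncated) return nextRetries;
  for (const [guid, count] of prevRetries) {
    if (!seenGuids.has(guid) && !nextRetries.has(guid)) {
      nextRetries.set(guid, count); // preserve state beyond the page boundary
    }
  }
  return nextRetries;
}
```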
💡 Codex Review
Reviewed commit: 11328b7110
```ts
if (assocGuid) {
  continue;
```
Keep reply messages with associatedMessageGuid in catchup
The new if (assocGuid) continue filter drops every message that has associatedMessageGuid, but BlueBubbles uses that field for more than just tapbacks/balloons (regular replies can carry it too). This means catchup can silently skip legitimate inbound replies during downtime recovery and still advance the cursor past them, causing permanent message loss for those users. The existing debouncer/dedupe logic only treats it as balloon metadata when paired with balloonBundleId, so this broader filter is too aggressive.
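A sketch of the narrower filter this comment argues for; the record shape and camelCase field names are assumptions based on the fields discussed in this thread, not the actual type from the PR.

```typescript
// Hypothetical subset of a BlueBubbles catchup record.
interface CatchupRecord {
  associatedMessageGuid?: string;
  associatedMessageType?: number | string;
  balloonBundleId?: string;
  threadOriginatorGuid?: string;
}

// Skip only when the record looks like a tapback/balloon: an associated
// GUID *plus* an associated type or balloon bundle id. Plain records and
// threaded replies (threadOriginatorGuid) are kept.
function shouldSkipAsNonMessage(rec: CatchupRecord): boolean {
  const assocGuid = rec.associatedMessageGuid?.trim() ?? "";
  if (!assocGuid) return false;
  const hasAssocType = rec.associatedMessageType != null;
  const hasBalloon =
    typeof rec.balloonBundleId === "string" && rec.balloonBundleId.trim() !== "";
  return hasAssocType || hasBalloon;
}
```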
```ts
void next.finally(() => {
  if (fileWriteQueues.get(filePath) === next) {
    fileWriteQueues.delete(filePath);
  }
```
Handle enqueue cleanup promise rejection explicitly
enqueueFileWrite calls void next.finally(...), which creates a second promise that will reject whenever next rejects, but that rejection is never observed. In Node 22, that becomes an unhandled rejection even though callers await next, so a disk error in readJsonFileWithFallback/writeJsonFileAtomically can surface as a process-level unhandled rejection. The cleanup hook should consume that rejection (or use a non-creating pattern) so error paths stay contained.
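A sketch of the contained pattern, mirroring the queue shape from the snippets above: the internal cleanup chain gets a terminal `.catch(() => {})` so its copy of any rejection is consumed, while callers still observe errors through the returned promise.

```typescript
const fileWriteQueues = new Map<string, Promise<unknown>>();

function enqueueFileWrite<T>(filePath: string, fn: () => Promise<T>): Promise<T> {
  const prev = fileWriteQueues.get(filePath) ?? Promise.resolve();
  // Run after the previous job regardless of its outcome.
  const next = prev.then(fn, fn);
  fileWriteQueues.set(filePath, next);
  next
    .finally(() => {
      // Only clear the map entry if no later job replaced it.
      if (fileWriteQueues.get(filePath) === next) {
        fileWriteQueues.delete(filePath);
      }
    })
    .catch(() => {}); // consume this chain's copy of any rejection
  return next;
}
```

Without the `.catch`, the promise returned by `.finally()` rejects alongside `next` and nothing observes it, which modern Node reports as an unhandled rejection even when the caller awaits `next`.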
…ger pins the cursor forever (#66870)
… catchup does not re-dispatch already-handled messages after upgrade
…race on inbound GUID commit

The re-entrant file lock allowed two concurrent checkAndRecordInner calls (e.g., inbound user message + outbound agent reply) to read the same stale file contents, then the last writer silently overwrote the first. The in-memory cache masked this within a process lifetime, but after restart the lost GUID caused catchup to re-dispatch already-handled messages.

Add an in-process write queue per file path so read-modify-write cycles targeting the same dedupe file are serialized. Also filter associated-message events (balloons, tapbacks) in catchup since they bypass the live path's debouncer and have distinct GUIDs from their parent message.
… write queue
Address Codex P1 review feedback:
- Balloon filter: only skip when associatedMessageType OR balloonBundleId
is set alongside associatedMessageGuid. Threaded replies use
threadOriginatorGuid and are unaffected.
- Write queue: .catch(() => {}) on the cleanup promise so a rejected
next doesn't surface as an unhandled rejection in Node 22+.
9a34ffa to d9ed931
💡 Codex Review
Reviewed commit: d9ed931147
```ts
} else {
  // Both exist: new file is authoritative; remove the stale legacy.
  fs.unlinkSync(legacyPath);
```
Preserve legacy dedupe data when hashed file already exists
When both legacy (${safe}.json) and hashed (${safe}__${hash}.json) files are present, this branch deletes the legacy file instead of merging its GUIDs. That can happen for users who already ran an intermediate build that created the hashed file but never migrated prior history, so removing the legacy file drops still-valid dedupe entries and allows previously handled messages to be replayed after restart.
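A sketch of the merge-instead-of-delete migration this comment asks for. The assumption that each dedupe file stores a JSON array of GUID strings is mine for illustration, not confirmed by the PR.

```typescript
import * as fs from "node:fs";

// Merge legacy dedupe GUIDs into the hashed file instead of discarding
// them, so previously handled messages are not replayed after restart.
function mergeLegacyDedupe(legacyPath: string, newPath: string): void {
  const read = (p: string): string[] => {
    try {
      const parsed = JSON.parse(fs.readFileSync(p, "utf8"));
      return Array.isArray(parsed)
        ? parsed.filter((g): g is string => typeof g === "string")
        : [];
    } catch {
      return []; // missing or corrupt file contributes nothing
    }
  };
  // New file stays authoritative; legacy entries are unioned in.
  const merged = [...new Set([...read(newPath), ...read(legacyPath)])];
  fs.writeFileSync(newPath, JSON.stringify(merged));
  fs.rmSync(legacyPath, { force: true });
}
```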
Merged via squash.
Thanks @omarshahine!
…law#66870) (openclaw#67426) Merged via squash. Prepared head SHA: 39e3cf1 Co-authored-by: omarshahine <10343873+omarshahine@users.noreply.github.com> Reviewed-by: @omarshahine
Summary
What started as a retry cap for #66870 uncovered and fixed two latent bugs in the catchup/dedupe plumbing from #66857 and #66230 that would have caused duplicate replies on every gateway restart for any user with catchup enabled.
1. Per-message retry cap (#66870)
Adds `catchup.maxFailureRetries` (default 10, clamped to [1, 1000]) so a persistently-failing message no longer wedges the catchup cursor forever. `count >= max` marks the GUID as "given up": catchup skips it on sight without another `processMessage` attempt, and the cursor advances past it.

2. Lost-update race in persistent dedupe (found during live testing)
The re-entrant lock in `file-lock.ts` gave concurrent callers for the same file immediate access instead of serializing them. Two `checkAndRecordInner` calls (inbound user message + outbound agent reply) would both read the same stale file, then the last writer silently overwrote the first writer's additions. The in-memory cache masked this within a process lifetime, but after restart the lost GUID caused catchup to replay already-handled messages — producing duplicate replies. Added an in-process write queue in `persistent-dedupe.ts` so read-modify-write cycles targeting the same dedupe file are serialized. The file lock continues to guard cross-process contention.

3. Dedupe file naming migration gap (found during live testing)
The dedupe file was renamed from `${safe}.json` to `${safe}__${hash}.json` between beta iterations. Upgrading started with an empty dedupe file and replayed the entire catchup window, producing duplicate replies for every recently-handled message. Added a migration in `inbound-dedupe.ts` that renames the legacy file on first access. Also added a `warmupBlueBubblesInboundDedupe` call in catchup before the fetch so the migration and memory warmup run eagerly, not only when `processMessage` happens to be called.

4. Balloon events bypassing debouncer (found during live testing)
A balloon event carried `balloonBundleId` in the query API response, so catchup replayed it as a standalone message — producing a duplicate reply. Catchup now skips messages with `associatedMessageGuid` set (tapbacks, reactions, balloons). Threaded replies use `threadOriginatorGuid` instead and are unaffected.

Fixes #66870.
Live testing
Dogfooded on a live BlueBubbles install with real iMessage traffic across multiple stop/restart cycles:
- `openclaw doctor` — clean after upgrade from 2026.4.14 to beta+retry-cap
- `replayed=0` (dedupe correctly recognizes the live-handled message)
- `default.json` renamed to `default__37a8eec1ce19.json` on first startup after migration fix
- `replayed=0 fetched=0` on a clean bounce with no intervening messages (cursor fully caught up, no stale leftovers)
- `associatedMessageGuid` messages are tapbacks/reactions only (checked 200 messages); threaded replies use `threadOriginatorGuid` and are not filtered

Automated tests
- `pnpm test extensions/bluebubbles/` — 425 passed
- `pnpm tsgo` — green
- `pnpm check` — 0 warnings, 0 errors
- `pnpm config:docs:check` / `pnpm plugin-sdk:api:check` — baselines match

🤖 Generated with Claude Code