Problem
runBlueBubblesCatchup in extensions/bluebubbles/src/catchup.ts holds the cursor just before the earliest failed message's timestamp so retries pick up where they stopped. This is correct for transient failures (e.g., disk full, network blip, downstream plugin hiccup). But for a persistently-failing message — one whose processMessage call throws every time due to a malformed payload or a schema mismatch — the cursor stays wedged forever at that message's timestamp minus 1ms. Every subsequent gateway startup re-queries the same window, hits the same failure, and advances no further.
This was flagged as a Greptile P2 on PR #66857 ("design note that a persistently-failing message permanently wedges the catchup cursor — a known, documented tradeoff that doesn't introduce incorrect behavior on the normal path").
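The wedge can be illustrated with a minimal sketch. It is not the code in catchup.ts: the SweepResult shape, the nextCursorMs helper, and the windowEndMs parameter are hypothetical names introduced here, and the sketch only assumes the "hold at the earliest failure minus 1ms" rule described above.

```typescript
// Hypothetical shape: one entry per message seen during a catchup sweep.
interface SweepResult {
  guid: string;
  timestampMs: number;
  ok: boolean; // false when processMessage threw for this message
}

// Returns the cursor value to persist after a sweep (illustrative helper).
// If anything failed, hold just before the earliest failure; otherwise
// advance to the end of the queried window.
function nextCursorMs(results: SweepResult[], windowEndMs: number): number {
  const failedTs = results.filter((r) => !r.ok).map((r) => r.timestampMs);
  if (failedTs.length === 0) return windowEndMs;
  return Math.min(...failedTs) - 1;
}

// Message "b" fails on every run (e.g. a malformed payload).
const sweep: SweepResult[] = [
  { guid: "a", timestampMs: 1_000, ok: true },
  { guid: "b", timestampMs: 2_000, ok: false },
  { guid: "c", timestampMs: 3_000, ok: true },
];

// First run: the cursor is held at 1999.
const firstRun = nextCursorMs(sweep, 5_000);

// Next startup re-queries everything after the held cursor, hits "b" again,
// and persists the same value: the cursor never moves.
const secondRun = nextCursorMs(
  sweep.filter((r) => r.timestampMs > firstRun),
  9_000,
);
```

Under these assumptions both runs yield the same cursor value, which is the behavior the issue describes as "wedged".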
Why the current design is intentional
Without holding the cursor, a transient failure permanently drops the failed message. That's the more severe failure mode (silent message loss vs. loud replay loop). The current tradeoff favors visibility:
- Every run logs processMessage failed: for the wedged message
- The catchup log line shows non-zero failed count every restart
- Operators see the pattern and can intervene
But "operators see the pattern" is a human-in-the-loop assumption. For unattended installs, the wedge can sit for a long time.
Proposed fix (Option C from #66721's implementation plan)
Add a per-message retry counter. The cursor state evolves from { lastSeenMs, updatedAt } to { lastSeenMs, updatedAt, failureRetries: { [messageGuid]: count } }. On each run:
- Count failed processMessage attempts per message GUID.
- After N consecutive failures on the same GUID (default: 10, configurable), force-advance the cursor past that message and log a WARN (catchup: giving up on guid=<X> after N retries; advancing cursor past timestamp=<T>).
- On successful processing of a message, clear its retry counter.
This preserves the "retry transient failures" behavior while putting a ceiling on "keep retrying forever" behavior.
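A rough sketch of how the steps above might fit together follows. The cursor state fields and the WARN text come from the proposal. Everything else (applySweep, SweepResult, windowEndMs, the injectable warn callback) is a hypothetical name chosen for illustration, not the existing catchup.ts API.

```typescript
// Cursor state as proposed above.
interface CursorState {
  lastSeenMs: number;
  updatedAt: number;
  failureRetries: Record<string, number>; // messageGuid -> consecutive failures
}

// Hypothetical shape: outcome of processMessage for one message in a sweep.
interface SweepResult {
  guid: string;
  timestampMs: number;
  ok: boolean;
}

const DEFAULT_MAX_RETRIES = 10; // default suggested in the proposal

// Illustrative helper: folds one sweep's outcomes into the next cursor state.
function applySweep(
  state: CursorState,
  results: SweepResult[],
  windowEndMs: number,
  nowMs: number,
  maxRetries: number = DEFAULT_MAX_RETRIES,
  warn: (msg: string) => void = console.warn,
): CursorState {
  const failureRetries: Record<string, number> = { ...state.failureRetries };
  let earliestHeldFailureMs: number | undefined;

  for (const r of results) {
    if (r.ok) {
      delete failureRetries[r.guid]; // success clears the retry counter
      continue;
    }
    const count = (failureRetries[r.guid] ?? 0) + 1;
    if (count >= maxRetries) {
      // Ceiling reached: this message no longer holds the cursor back.
      warn(
        `catchup: giving up on guid=${r.guid} after ${count} retries; ` +
          `advancing cursor past timestamp=${r.timestampMs}`,
      );
      delete failureRetries[r.guid];
      continue;
    }
    failureRetries[r.guid] = count;
    if (earliestHeldFailureMs === undefined || r.timestampMs < earliestHeldFailureMs) {
      earliestHeldFailureMs = r.timestampMs;
    }
  }

  // Hold just before the earliest failure that is still being retried;
  // if none remain, advance to the end of the queried window.
  const lastSeenMs =
    earliestHeldFailureMs === undefined ? windowEndMs : earliestHeldFailureMs - 1;
  return { lastSeenMs, updatedAt: nowMs, failureRetries };
}
```

In this sketch a message that keeps failing holds the cursor for maxRetries - 1 runs and is skipped on the run where its counter reaches maxRetries. Counters for GUIDs that never reappear in a later sweep would linger in failureRetries, so a real implementation would likely need to prune them.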
Alternative (simpler)
Add a maxTotalFailuresPerRun ceiling: if any single sweep produces more than N failures, force-advance to nowMs (treating the run as "too broken to hold"). Less granular but easier to reason about.
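For comparison, the simpler alternative could look roughly like the sketch below. Only maxTotalFailuresPerRun and nowMs come from the text above. SweepResult, nextCursorWithCeiling, and windowEndMs are hypothetical names, and the existing hold rule is assumed to be "earliest failure minus 1ms".

```typescript
// Hypothetical shape: outcome of processMessage for one message in a sweep.
interface SweepResult {
  guid: string;
  timestampMs: number;
  ok: boolean;
}

// Illustrative helper: cursor value to persist after one sweep.
function nextCursorWithCeiling(
  results: SweepResult[],
  windowEndMs: number,
  nowMs: number,
  maxTotalFailuresPerRun: number,
): number {
  const failedTs = results.filter((r) => !r.ok).map((r) => r.timestampMs);
  if (failedTs.length === 0) return windowEndMs;
  // Too many failures in a single sweep: treat the run as "too broken to
  // hold" and jump straight to nowMs.
  if (failedTs.length > maxTotalFailuresPerRun) return nowMs;
  // Otherwise keep the existing behavior and hold before the earliest failure.
  return Math.min(...failedTs) - 1;
}
```

Read literally, the ceiling counts failures within one sweep, so a single poisoned message produces one failure per run and would only trip it with a threshold of 0. This variant therefore seems better suited to runs where many messages fail at once than to the single-message wedge described in the Problem section.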
Out of scope here
The current behavior is a documented tradeoff and not a correctness bug. This issue is a hardening follow-up. It should not block PR #66857 from merging.
Related
- extensions/bluebubbles/src/catchup.ts: runBlueBubblesCatchupInner (earliestProcessFailureTs tracking)