BlueBubbles catchup: persistently-failing message wedges cursor (Option C: per-message retry cap) #66870

## Problem

`runBlueBubblesCatchup` in `extensions/bluebubbles/src/catchup.ts` holds the cursor just before the earliest failed message's timestamp so that the next run retries from that point. This is correct for transient failures (e.g., disk full, network blip, downstream plugin hiccup). But for a **persistently-failing message**, one whose `processMessage` call throws every time due to a malformed payload or a schema mismatch, the cursor stays wedged forever at that message's timestamp minus 1 ms. Every subsequent gateway startup re-queries the same window, hits the same failure, and advances no further.
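To make the wedge concrete, here is a simplified model of the hold-the-cursor rule described above. It is a sketch, not the code in `catchup.ts`: `CatchupMessage`, `nextCursorMs`, and the assumption that a clean sweep advances the cursor to `nowMs` are all hypothetical and used only for illustration.

```typescript
// Simplified model of the current behavior. All names are hypothetical;
// this is NOT the actual implementation in catchup.ts.
interface CatchupMessage {
  guid: string;
  timestampMs: number;
}

function nextCursorMs(
  messages: CatchupMessage[],
  nowMs: number,
  processMessage: (m: CatchupMessage) => void,
): number {
  let earliestFailureTs: number | undefined;

  for (const message of messages) {
    try {
      processMessage(message);
    } catch {
      // Track the earliest failure so the next run retries from there.
      if (earliestFailureTs === undefined || message.timestampMs < earliestFailureTs) {
        earliestFailureTs = message.timestampMs;
      }
    }
  }

  // Assumption: a clean sweep advances to nowMs. Otherwise the cursor is
  // held 1 ms before the earliest failed message.
  return earliestFailureTs === undefined ? nowMs : earliestFailureTs - 1;
}
```

With a message that throws on every attempt, this function returns the same value no matter how large `nowMs` gets, which is the wedge.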

This was flagged as a Greptile P2 on PR #66857 ("design note that a persistently-failing message permanently wedges the catchup cursor — a known, documented tradeoff that doesn't introduce incorrect behavior on the normal path").

## Why the current design is intentional

Without holding the cursor, a transient failure permanently drops the failed message. That's the more severe failure mode (silent message loss vs. loud replay loop). The current tradeoff favors visibility:

- Every run logs `processMessage failed:` for the wedged message
- The catchup log line shows non-zero `failed` count every restart
- Operators see the pattern and can intervene

But "operators see the pattern" is a human-in-the-loop assumption. On unattended installs, where nobody is watching the logs, the wedge can persist indefinitely.

## Proposed fix (Option C from #66721's implementation plan)

Add a per-message retry counter. The cursor state evolves from `{ lastSeenMs, updatedAt }` to `{ lastSeenMs, updatedAt, failureRetries: { [messageGuid]: count } }`. On each run:

1. Count failed `processMessage` attempts per message GUID.
2. After N consecutive failures on the same GUID (default: 10, configurable), force-advance the cursor past that message and log a WARN (`catchup: giving up on guid=<X> after N retries; advancing cursor past timestamp=<T>`).
3. On successful processing of a message, clear its retry counter.

This preserves the "retry transient failures" behavior while putting a ceiling on "keep retrying forever" behavior.
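The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a proposed patch: only `lastSeenMs`, `updatedAt`, and `failureRetries` come from this issue, while `CursorState`, `CatchupMessage`, `runSweep`, the `warn` callback, and the choice to give up on the Nth failed attempt are hypothetical.

```typescript
// Sketch of Option C. Field names lastSeenMs, updatedAt, and failureRetries
// come from the issue; every other name here is hypothetical.
interface CursorState {
  lastSeenMs: number;
  updatedAt: number;
  failureRetries: Record<string, number>;
}

interface CatchupMessage {
  guid: string;
  timestampMs: number;
}

const DEFAULT_MAX_RETRIES = 10;

function runSweep(
  state: CursorState,
  messages: CatchupMessage[],
  nowMs: number,
  processMessage: (m: CatchupMessage) => void,
  warn: (line: string) => void = console.warn,
  maxRetries: number = DEFAULT_MAX_RETRIES,
): CursorState {
  // Copy so the previous state is not mutated.
  const failureRetries: Record<string, number> = { ...state.failureRetries };
  let earliestHeldTs: number | undefined;

  for (const message of messages) {
    try {
      processMessage(message);
      delete failureRetries[message.guid]; // Step 3: success clears the counter.
    } catch {
      // Step 1: count this failed attempt for the GUID.
      const count = (failureRetries[message.guid] ?? 0) + 1;
      if (count >= maxRetries) {
        // Step 2: give up. This message no longer holds the cursor.
        warn(
          `catchup: giving up on guid=${message.guid} after ${count} retries; ` +
            `advancing cursor past timestamp=${message.timestampMs}`,
        );
        delete failureRetries[message.guid]; // Assumption: drop the stale counter.
        continue;
      }
      failureRetries[message.guid] = count;
      if (earliestHeldTs === undefined || message.timestampMs < earliestHeldTs) {
        earliestHeldTs = message.timestampMs;
      }
    }
  }

  // Assumption: with nothing left to hold, the cursor advances to nowMs.
  const lastSeenMs = earliestHeldTs === undefined ? nowMs : earliestHeldTs - 1;
  return { lastSeenMs, updatedAt: nowMs, failureRetries };
}
```

Because the counters live in the persisted cursor state, they survive gateway restarts, which is what lets the cap apply across runs rather than within a single sweep.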

## Alternative (simpler)

Add a `maxTotalFailuresPerRun` ceiling: if any single sweep produces more than N failures, force-advance to `nowMs` (treating the run as "too broken to hold"). Less granular but easier to reason about.
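For comparison, the ceiling variant reduces to a single decision at the end of the sweep. The sketch below assumes the sweep has already collected the timestamps of its failed messages; `cursorAfterSweep` and `failedTimestampsMs` are hypothetical names, and only `maxTotalFailuresPerRun` and `nowMs` come from the text above.

```typescript
// Sketch of the simpler alternative. cursorAfterSweep and failedTimestampsMs
// are hypothetical names used only for illustration.
function cursorAfterSweep(
  failedTimestampsMs: number[],
  nowMs: number,
  maxTotalFailuresPerRun: number,
): number {
  // Clean sweep (assumed to advance to nowMs), or "too broken to hold".
  if (
    failedTimestampsMs.length === 0 ||
    failedTimestampsMs.length > maxTotalFailuresPerRun
  ) {
    return nowMs;
  }
  // Otherwise hold 1 ms before the earliest failure, as today.
  return Math.min(...failedTimestampsMs) - 1;
}
```

One thing to check when weighing this option: read literally, a lone poison message yields exactly one failure per sweep, so the ceiling would only release that wedge if it were set to 0.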

## Out of scope here

The current behavior is a documented tradeoff and not a correctness bug. This issue is a hardening follow-up. It should not block PR #66857 from merging.

## Related

- Greptile P2 on PR #66857 — deferred here
- `extensions/bluebubbles/src/catchup.ts:runBlueBubblesCatchupInner` — `earliestProcessFailureTs` tracking
- Original design discussion in #66721's issue body (Option C)
