delivery-queue: head-of-line blocking in recoverPendingDeliveries prevents retry of all queued entries #27638

@tony-freedomology

Bug

recoverPendingDeliveries in src/infra/outbound/delivery-queue.ts sorts pending entries oldest-first and iterates over them, applying exponential backoff per entry. If any entry's computed backoff exceeds the remaining recovery budget (maxRecoveryMs, default 60s), the code breaks out of the loop entirely, starving every subsequent entry regardless of its retry count.

Reproduction

  1. Have a delivery entry at retryCount >= 2 (backoff = 120s)
  2. Gateway restarts and runs recoverPendingDeliveries with default 60s budget
  3. Recovery hits that entry first (oldest-first sort), computes backoff > budget, logs "Recovery time budget exceeded", and breaks
  4. All remaining entries (even retryCount=0 or retryCount=1 with 5s/25s backoff) are never attempted
  5. This repeats on every subsequent restart, leaving the queue permanently stuck
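
The steps above can be modelled with a small simulation. This is only a sketch of the described behaviour, not the code in delivery-queue.ts: the PendingEntry shape, the assumedBackoffMs schedule (built from the 5s/25s/120s figures quoted in this issue), and the simulated clock are all assumptions made for illustration.

```typescript
// Hypothetical model of the behaviour described in this issue;
// none of these names come from the real delivery-queue.ts.
interface PendingEntry {
  id: string;
  enqueuedAt: number; // ms timestamp, used for the oldest-first sort
  retryCount: number;
}

// Assumed backoff schedule, taken from the figures quoted in this issue.
function assumedBackoffMs(retryCount: number): number {
  if (retryCount <= 0) return 5000;
  if (retryCount === 1) return 25000;
  return 120000;
}

// Returns the ids that get attempted when the loop breaks
// on the first entry whose backoff does not fit in the budget.
function attemptedWithBreak(entries: PendingEntry[], budgetMs: number): string[] {
  const pending = [...entries].sort((a, b) => a.enqueuedAt - b.enqueuedAt);
  const attempted: string[] = [];
  let now = 0; // simulated clock: ms since recovery started
  const deadline = now + budgetMs;
  for (const entry of pending) {
    const backoff = assumedBackoffMs(entry.retryCount);
    if (now + backoff >= deadline) {
      break; // head-of-line blocking: later entries are never examined
    }
    now += backoff; // stand-in for waiting out the backoff
    attempted.push(entry.id);
  }
  return attempted;
}

const queue: PendingEntry[] = [
  { id: "old", enqueuedAt: 1, retryCount: 2 },
  { id: "mid", enqueuedAt: 2, retryCount: 1 },
  { id: "new", enqueuedAt: 3, retryCount: 0 },
];

console.log(attemptedWithBreak(queue, 60000).length); // prints 0
```

With the oldest entry at retryCount 2, nothing is attempted, even though the two newer entries would fit in the 60s budget on their own.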

Evidence from logs

2026-02-26T00:18:03.340Z [delivery-recovery] Recovery time budget exceeded — 7 entries deferred to next restart
2026-02-26T00:59:26.896Z [delivery-recovery] Recovery time budget exceeded — 7 entries deferred to next restart
... (repeated every ~40 min for 14+ hours)
2026-02-26T14:12:45.505Z [delivery-recovery] Recovery time budget exceeded — 7 entries deferred to next restart

Each run:

Found 7 pending delivery entries — starting recovery
Delivery recovery complete: 0 recovered, 0 failed, 0 skipped (max retries)

Proposed Fix

Change the break to continue so entries whose backoff exceeds the remaining budget are skipped individually rather than blocking the entire loop:

 if (backoff > 0) {
   if (now + backoff >= deadline) {
-    const deferred = pending.length - recovered - failed - skipped;
-    opts.log.warn(`Recovery time budget exceeded — ${deferred} entries deferred to next restart`);
-    break;
+    opts.log.info(`Backoff ${backoff}ms exceeds budget for ${entry.id} — skipping to next entry`);
+    continue;
   }

This ensures newer entries with shorter backoffs still get retried even when older entries need longer waits. The skipped entries will be retried on the next recovery pass when their backoff period has elapsed.
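
A minimal sketch of the skip-and-continue policy, again using a simulated clock rather than the real recovery loop. QueuedDelivery, RecoveryPlan, and planRecovery are hypothetical names, and each entry's backoff is passed in directly instead of being computed from a retry count.

```typescript
// Hypothetical types for illustration; not taken from delivery-queue.ts.
interface QueuedDelivery {
  id: string;
  enqueuedAt: number; // ms timestamp, used for the oldest-first sort
  backoffMs: number;  // backoff already computed for this entry
}

interface RecoveryPlan {
  attempted: string[]; // entries whose backoff fits in the remaining budget
  deferred: string[];  // entries left for a later recovery pass
}

function planRecovery(entries: QueuedDelivery[], budgetMs: number): RecoveryPlan {
  const pending = [...entries].sort((a, b) => a.enqueuedAt - b.enqueuedAt);
  const plan: RecoveryPlan = { attempted: [], deferred: [] };
  let now = 0; // simulated clock: ms since recovery started
  const deadline = now + budgetMs;
  for (const entry of pending) {
    if (entry.backoffMs > 0 && now + entry.backoffMs >= deadline) {
      plan.deferred.push(entry.id);
      continue; // skip only this entry; keep examining the rest
    }
    now += entry.backoffMs; // stand-in for waiting out the backoff
    plan.attempted.push(entry.id);
  }
  return plan;
}

const result = planRecovery(
  [
    { id: "old", enqueuedAt: 1, backoffMs: 120000 },
    { id: "mid", enqueuedAt: 2, backoffMs: 25000 },
    { id: "new", enqueuedAt: 3, backoffMs: 5000 },
  ],
  60000,
);

console.log(result.attempted.join(", ")); // prints "mid, new"
console.log(result.deferred.join(", "));  // prints "old"
```

Collecting the deferred ids is one way to keep the count that the removed warning used to report; whether the real summary log should include it is left open here.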

Impact

Any installation with a delivery queue entry at retryCount >= 2 will have its entire queue permanently stuck. This is a production-impacting bug for anyone using the delivery queue with announce/cron delivery.
