Description
Bug
recoverPendingDeliveries in src/infra/outbound/delivery-queue.ts sorts pending entries oldest-first and iterates with exponential backoff. If ANY entry's computed backoff exceeds the remaining recovery budget (maxRecoveryMs, default 60s), the code breaks the entire loop — starving every subsequent entry, regardless of their retry count.
Reproduction
- Have a delivery entry at `retryCount >= 2` (backoff = 120s)
- Gateway restarts and runs `recoverPendingDeliveries` with the default 60s budget
- Recovery hits that entry first (oldest-first sort), computes a backoff larger than the budget, logs `Recovery time budget exceeded`, and `break`s
- All remaining entries (even `retryCount = 0` or `retryCount = 1`, with 5s/25s backoffs) are never attempted
- This repeats on every subsequent restart — permanently stuck queue
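The starvation sequence above can be reproduced with a minimal sketch. The `Entry` shape, the `recover` helper, and the `BACKOFF_MS` table are hypothetical stand-ins (not the actual code in `delivery-queue.ts`); the 5s/25s/120s backoffs and the 60s budget are the values quoted in this report:

```typescript
// Minimal reproduction of the starvation described above.
// Entry, recover, and BACKOFF_MS are hypothetical stand-ins; the
// 5s/25s/120s backoffs and the 60s budget are taken from this report.
interface Entry {
  id: string;
  retryCount: number;
  createdAt: number;
}

const BACKOFF_MS = [5_000, 25_000, 120_000]; // retryCount 0 / 1 / >= 2

function recover(pending: Entry[], maxRecoveryMs = 60_000): string[] {
  const attempted: string[] = [];
  const now = 0;
  const deadline = now + maxRecoveryMs;
  // Oldest-first, as in recoverPendingDeliveries.
  const sorted = [...pending].sort((a, b) => a.createdAt - b.createdAt);
  for (const entry of sorted) {
    const backoff = BACKOFF_MS[Math.min(entry.retryCount, BACKOFF_MS.length - 1)];
    if (backoff > 0 && now + backoff >= deadline) {
      break; // the bug: one over-budget entry aborts the whole pass
    }
    attempted.push(entry.id);
  }
  return attempted;
}

// The oldest entry needs a 120s backoff, which exceeds the 60s budget,
// so the two newer entries are never even considered:
const queue: Entry[] = [
  { id: "old", retryCount: 2, createdAt: 1 },
  { id: "fresh-a", retryCount: 0, createdAt: 2 },
  { id: "fresh-b", retryCount: 1, createdAt: 3 },
];
console.log(recover(queue)); // []
```

With the oldest entry at `retryCount = 2`, zero entries are attempted, matching the `0 recovered, 0 failed, 0 skipped` log lines below.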
Evidence from logs
```
2026-02-26T00:18:03.340Z [delivery-recovery] Recovery time budget exceeded — 7 entries deferred to next restart
2026-02-26T00:59:26.896Z [delivery-recovery] Recovery time budget exceeded — 7 entries deferred to next restart
... (repeated every ~40 min for 14+ hours)
2026-02-26T14:12:45.505Z [delivery-recovery] Recovery time budget exceeded — 7 entries deferred to next restart
```
Each run: `Found 7 pending delivery entries — starting recovery` → `Delivery recovery complete: 0 recovered, 0 failed, 0 skipped (max retries)`
Proposed Fix
Change the `break` to a `continue` so entries whose backoff exceeds the remaining budget are skipped individually rather than blocking the entire loop:
```diff
 if (backoff > 0) {
   if (now + backoff >= deadline) {
-    const deferred = pending.length - recovered - failed - skipped;
-    opts.log.warn(`Recovery time budget exceeded — ${deferred} entries deferred to next restart`);
-    break;
+    opts.log.info(`Backoff ${backoff}ms exceeds budget for ${entry.id} — skipping to next entry`);
+    continue;
   }
 }
```

This ensures newer entries with shorter backoffs still get retried even when older entries need longer waits. The skipped entries will be picked up on a later recovery pass once their backoff period has elapsed.
Impact
Any installation with a delivery queue entry at `retryCount >= 2` will have its entire queue permanently stuck. This is a production-impacting bug for anyone using the delivery queue with announce/cron delivery.