fix(delivery-queue): increment retryCount on entries deferred when time budget exceeded#31192
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dba6aa848b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
src/infra/outbound/delivery-queue.ts
Outdated
| for (const r of remaining) { | ||
| try { | ||
| await failDelivery(r.id, "Recovery time budget exceeded — deferred", opts.stateDir); |
There was a problem hiding this comment.
Stop processing deferred entries after recovery deadline
Once now >= deadline, this loop performs a failDelivery read/write/rename for every remaining queue item before returning, which can take arbitrarily long on large queues or slow disks and effectively nullifies the documented maxRecoveryMs wall-clock cap for startup recovery. The same unbounded work pattern is also present in the now + backoff >= deadline branch, so recovery can still block startup well past the configured budget.
Useful? React with 👍 / 👎.
Greptile SummaryFixed critical bug where delivery queue entries deferred due to time budget exhaustion would loop forever without incrementing
The implementation correctly reuses existing Confidence Score: 5/5
Last reviewed commit: dba6aa8 |
Prevents crash when totals is undefined in byModel/byProvider/byAgent sort comparators. Fixes 'Cannot read properties of undefined (reading totalTokens)' crash that causes context overflow in active sessions.
…me budget exceeded When delivery recovery ran out of the 60s time budget, remaining pending entries were silently deferred to the next restart with no retryCount increment. This caused them to loop forever across restarts, never hitting MAX_RETRIES and never moving to failed/. Fix: call failDelivery() on each remaining entry before breaking out of the recovery loop (both the deadline check and the backoff-exceeds-deadline check). This increments retryCount so that entries eventually exhaust MAX_RETRIES and are permanently skipped. Fixes openclaw#24353
dba6aa8 to
098d276
Compare
… all remaining entries P1-C: After now >= deadline, the old code would iterate all remaining queue entries and call failDelivery() on each — O(n) work that nullified the maxRecoveryMs wall-clock cap on large queues. Fix: break out of the recovery loop immediately when the deadline is exceeded. Remaining entries are picked up on next startup unchanged (retryCount not incremented). The deadline means 'stop here', not 'fail everything remaining'.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d8876ec76b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (now >= deadline) { | ||
| const deferred = pending.length - recovered - failed - skippedMaxRetries - deferredBackoff; | ||
| opts.log.warn(`Recovery time budget exceeded — ${deferred} entries deferred to next restart`); | ||
| opts.log.warn(`Recovery time budget exceeded — remaining entries deferred to next startup`); |
There was a problem hiding this comment.
Keep recovery warning text consistent with existing assertions
This changes the timeout warning from deferred to next restart to deferred to next startup, but recoverPendingDeliveries is still asserted against the old wording in src/infra/outbound/outbound.test.ts:448 (expect.stringContaining("deferred to next restart")). When maxRecoveryMs is exhausted, that test path now fails deterministically, so this commit will break the current recovery test suite unless the assertion or warning text is updated together.
Useful? React with 👍 / 👎.
|
The This PR only touches |
Problem
When delivery recovery exceeded the 60s time budget, remaining pending entries were silently skipped with a log warning but no
retryCountincrement. This caused them to loop forever across restarts — they'd always defer, never hitMAX_RETRIES, and never move tofailed/.Two code paths were affected:
now >= deadline)now + backoff >= deadline)Fix
Call
failDelivery()on each remaining entry before breaking out of the loop. This incrementsretryCountso that entries eventually exhaustMAX_RETRIESand are permanently skipped instead of spinning forever.The existing
failDeliveryhelper already handles the retryCount increment and stateDir write — no new logic needed.Fixes #24353