
fix(delivery-queue): increment retryCount on entries deferred when time budget exceeded#31192

Open
stephenschoettler wants to merge 5 commits into openclaw:main from stephenschoettler:fix/delivery-queue-retry-count

Conversation

@stephenschoettler

Problem

When delivery recovery exceeded the 60s time budget, remaining pending entries were silently skipped with a log warning but no retryCount increment. This caused them to loop forever across restarts — they'd always defer, never hit MAX_RETRIES, and never move to failed/.

Two code paths were affected:

  1. The deadline check (now >= deadline)
  2. The backoff-exceeds-deadline check (now + backoff >= deadline)

Fix

Call failDelivery() on each remaining entry before breaking out of the loop. This increments retryCount so that entries eventually exhaust MAX_RETRIES and are permanently skipped instead of spinning forever.

The existing failDelivery helper already handles the retryCount increment and stateDir write — no new logic needed.
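As a minimal sketch of the fix's shape (hypothetical, simplified names: the real failDelivery(id, reason, stateDir) also performs the atomic stateDir write, and MAX_RETRIES handling lives elsewhere in the queue):

```typescript
// Simplified model of the deferred-entry handling; persistence is elided.
interface QueueEntry {
  id: string;
  retryCount: number;
}

// Stand-in for the repo's failDelivery helper: bumps retryCount so the entry
// eventually exhausts MAX_RETRIES instead of deferring forever.
async function failDelivery(entry: QueueEntry, reason: string): Promise<void> {
  entry.retryCount += 1;
}

// When the recovery time budget is exceeded, mark each remaining entry as
// failed (best-effort) before breaking out of the recovery loop.
async function deferRemaining(remaining: QueueEntry[]): Promise<void> {
  for (const entry of remaining) {
    try {
      await failDelivery(entry, "Recovery time budget exceeded — deferred");
    } catch {
      // Best-effort: a persistence error should not abort startup recovery.
    }
  }
}
```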

Fixes #24353

@openclaw-barnacle openclaw-barnacle bot added gateway Gateway runtime size: XS labels Mar 2, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dba6aa848b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +238 to +240

    for (const r of remaining) {
      try {
        await failDelivery(r.id, "Recovery time budget exceeded — deferred", opts.stateDir);


P1 Badge Stop processing deferred entries after recovery deadline

Once now >= deadline, this loop performs a failDelivery read/write/rename for every remaining queue item before returning, which can take arbitrarily long on large queues or slow disks and effectively nullifies the documented maxRecoveryMs wall-clock cap for startup recovery. The same unbounded work pattern is also present in the now + backoff >= deadline branch, so recovery can still block startup well past the configured budget.

Useful? React with 👍 / 👎.

@greptile-apps
Contributor

greptile-apps bot commented Mar 2, 2026

Greptile Summary

Fixed a critical bug where delivery queue entries deferred due to time budget exhaustion would loop forever without incrementing retryCount. The fix calls failDelivery() on remaining entries at both deadline checkpoints (immediate deadline and backoff-exceeds-deadline), ensuring they eventually reach MAX_RETRIES and move to failed/ instead of spinning indefinitely across restarts.

  • delivery-queue.ts: Added retry count increments for deferred entries at both time budget checkpoints, preventing infinite retry loops
  • chrome.ts: Changed stdio from "pipe" to explicit ignore array to prevent buffer blocking in Docker environments
  • usage.ts & session-cost-usage.ts: Added null guards to sort comparators to handle undefined totals gracefully

The implementation correctly reuses existing failDelivery() helper which handles atomic file updates, and wraps calls in try/catch for best-effort resilience.
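The null-guard change in the sort comparators can be sketched as follows (hypothetical, simplified types: the actual UsageRow shape and field names in usage.ts are assumptions, not the repo's real definitions):

```typescript
// Hypothetical row shape: totals may be missing on some aggregates.
interface UsageRow {
  model: string;
  totals?: { totalTokens: number };
}

// Sort descending by totalTokens, treating a missing totals object as 0 so
// that rows with undefined totals sort last instead of throwing
// "Cannot read properties of undefined (reading 'totalTokens')".
function byTotalTokensDesc(a: UsageRow, b: UsageRow): number {
  const aTokens = a.totals?.totalTokens ?? 0;
  const bTokens = b.totals?.totalTokens ?? 0;
  return bTokens - aTokens;
}

const rows: UsageRow[] = [
  { model: "a", totals: { totalTokens: 10 } },
  { model: "b" }, // undefined totals: previously crashed the comparator
  { model: "c", totals: { totalTokens: 30 } },
];
const sorted = [...rows].sort(byTotalTokensDesc);
```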

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The fix addresses a well-defined bug with a straightforward solution that reuses existing, tested helpers (failDelivery, moveToFailed). The logic correctly handles both deadline scenarios and maintains consistency with existing error handling patterns. Additional changes are defensive improvements with low risk.
  • No files require special attention

Last reviewed commit: dba6aa8

Prevents crash when totals is undefined in byModel/byProvider/byAgent
sort comparators. Fixes 'Cannot read properties of undefined (reading
totalTokens)' crash that causes context overflow in active sessions.
…me budget exceeded

When delivery recovery ran out of the 60s time budget, remaining pending
entries were silently deferred to the next restart with no retryCount
increment. This caused them to loop forever across restarts, never hitting
MAX_RETRIES and never moving to failed/.

Fix: call failDelivery() on each remaining entry before breaking out of
the recovery loop (both the deadline check and the backoff-exceeds-deadline
check). This increments retryCount so that entries eventually exhaust
MAX_RETRIES and are permanently skipped.

Fixes openclaw#24353
@stephenschoettler stephenschoettler force-pushed the fix/delivery-queue-retry-count branch from dba6aa8 to 098d276 on March 2, 2026 at 04:04
… all remaining entries

P1-C: After now >= deadline, the old code would iterate all remaining queue
entries and call failDelivery() on each — O(n) work that nullified the
maxRecoveryMs wall-clock cap on large queues.

Fix: break out of the recovery loop immediately when the deadline is exceeded.
Remaining entries are picked up on next startup unchanged (retryCount not
incremented). The deadline means 'stop here', not 'fail everything remaining'.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d8876ec76b


    if (now >= deadline) {
    -  const deferred = pending.length - recovered - failed - skippedMaxRetries - deferredBackoff;
    -  opts.log.warn(`Recovery time budget exceeded — ${deferred} entries deferred to next restart`);
    +  opts.log.warn(`Recovery time budget exceeded — remaining entries deferred to next startup`);


P2 Badge Keep recovery warning text consistent with existing assertions

This changes the timeout warning from deferred to next restart to deferred to next startup, but recoverPendingDeliveries is still asserted against the old wording in src/infra/outbound/outbound.test.ts:448 (expect.stringContaining("deferred to next restart")). When maxRecoveryMs is exhausted, that test path now fails deterministically, so this commit will break the current recovery test suite unless the assertion or warning text is updated together.


@stephenschoettler
Author

The install-smoke failure is not caused by changes in this PR. The error is a pre-existing TypeScript type mismatch in src/browser/chrome.ts (line 319):

    TS2322: Type 'ChildProcessByStdio<null, null, null>' is not assignable to type 'ChildProcessWithoutNullStreams'.
      Types of property 'stdin' are incompatible. Type 'null' is not assignable to type 'Writable'.

This PR only touches src/infra/outbound/delivery-queue.ts — no browser code at all. The chrome.ts issue is addressed in PR #27899. Merging #27899 first should clear the install-smoke check for this PR as well.
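The mismatch can be reproduced in isolation (a sketch, not the chrome.ts code: the command spawned here is just a trivially exiting Node process):

```typescript
// With every stdio stream set to "ignore", Node's typings narrow the result
// to ChildProcessByStdio<null, null, null>, whose stdin is null rather than
// a Writable, so it is not assignable to ChildProcessWithoutNullStreams.
import { spawn } from "node:child_process";

const child = spawn(process.execPath, ["-e", ""], {
  stdio: ["ignore", "ignore", "ignore"],
});

// At runtime all three streams are null, matching the narrowed type.
console.log(child.stdin === null); // true
```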


Labels

gateway (Gateway runtime), size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Missing delivery queue monitoring leads to periodic gateway hangs

1 participant