feat(outbox): add write-ahead outbox, recovery worker, and plugin compat layer by nohat · Pull Request #29148 · openclaw/openclaw

nohat · 2026-02-27T21:44:30Z

Summary

Problem: Deliveries are fire-and-forget — if the gateway crashes mid-send, the reply is lost; plugin channels lack durable delivery guarantees; one-shot recovery only runs at startup
Why it matters: Users lose replies on crashes; plugin channels have weaker delivery semantics than built-in channels; no continuous retry for transient failures
What changed: Write-ahead pattern (enqueue outbox entry before sending, ack on success, retry on failure); continuous outbox worker replaces one-shot startup recovery; startup cutoff filter prevents double-delivery; v1/v2 plugin adapter compat layer gives plugin channels the same durability
What did NOT change (scope boundary): No inbound turn tracking yet (comes in PR 3); no deduplication of inbound messages; no orphan turn recovery
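The write-ahead flow described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `InMemoryOutbox`, `deliverWriteAhead`, and the `Send` type are hypothetical names, and a `Map` stands in for the SQLite table.

```typescript
// Hypothetical sketch of write-ahead delivery; names are illustrative only.
type OutboxStatus = "queued" | "delivered" | "failed_retryable";

interface OutboxEntry {
  id: number;
  payload: string;
  status: OutboxStatus;
  attemptCount: number;
  lastError?: string;
}

// In-memory stand-in for the SQLite-backed outbox table.
class InMemoryOutbox {
  private nextId = 1;
  readonly entries = new Map<number, OutboxEntry>();

  enqueue(payload: string): number {
    const id = this.nextId++;
    this.entries.set(id, { id, payload, status: "queued", attemptCount: 0 });
    return id;
  }

  ack(id: number): void {
    const entry = this.entries.get(id);
    if (entry) entry.status = "delivered";
  }

  fail(id: number, error: string): void {
    const entry = this.entries.get(id);
    if (!entry) return;
    entry.status = "failed_retryable";
    entry.attemptCount += 1;
    entry.lastError = error;
  }
}

type Send = (payload: string) => Promise<void>;

// Persist first, then send; ack on success, record the failure otherwise.
async function deliverWriteAhead(
  outbox: InMemoryOutbox,
  send: Send,
  payload: string,
): Promise<number> {
  const id = outbox.enqueue(payload); // the record exists before any network call
  try {
    await send(payload);
    outbox.ack(id);
  } catch (err) {
    outbox.fail(id, String(err)); // left for a recovery worker to retry
  }
  return id;
}
```

If the process dies between `enqueue` and `ack`, the row is still in the outbox, which is what makes recovery possible.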

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

User-visible / Behavior Changes

Replies are now persisted to SQLite before delivery — crash recovery is automatic via continuous worker loop
Plugin channels with outboundContract: "v2" get write-ahead delivery; v1 plugins continue to work via compat layer
outboundContract field appears in channel status/schema

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? No
Data access scope changed? No

Repro + Verification

Environment

OS: macOS
Runtime/container: Node 22+ / Bun
Model/provider: N/A
Integration/channel (if any): All outbound channels + plugin channels
Relevant config (redacted): N/A

Steps

Start gateway — verify outbox worker starts and logs recovery passes
Send a reply — verify outbox entry is created before delivery, acked after
Kill gateway mid-delivery — verify entry is recovered on restart
Test with a v1 plugin channel — verify compat layer normalizes adapter

Expected

Outbox worker runs continuously, recovering pending entries
Entries enqueued during current instance's lifetime are skipped (startup cutoff)
v1/v2 plugin adapters both work through compat layer

Actual

Verified via test suite (948 tests pass)

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Human Verification (required)

Verified scenarios: pnpm build, pnpm test (948 pass), pnpm check
Edge cases checked: Startup cutoff filtering, permanent error detection, v1/v2 adapter detection, worker shutdown
What you did not verify: Live gateway crash recovery, real plugin channel delivery

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No
If yes, exact upgrade steps: N/A

Failure Recovery (if this breaks)

How to disable/revert this change quickly: Revert commit; previous one-shot recovery still works from PR 1
Files/config to restore: N/A
Known bad symptoms reviewers should watch for: Duplicate deliveries (startup cutoff not filtering correctly); worker not stopping cleanly on shutdown

Risks and Mitigations

Risk: Startup cutoff filter too aggressive — skipping entries that should be recovered
- Mitigation: Only entries with no prior attempt AND enqueued after startup are skipped; entries with attempt_count > 0 or last_attempt_at IS NOT NULL are always included
Risk: Plugin compat layer misdetects adapter version
- Mitigation: Explicit outboundContract field on channel config; fallback to v1 wrapping when absent
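As a rough illustration of the startup cutoff mitigation, the eligibility rule could be written as a single predicate. The field names below mirror the `attempt_count` and `last_attempt_at` columns mentioned above, but `PendingRow` and `isEligibleForRecovery` are hypothetical, and treating an entry enqueued exactly at the cutoff as "after startup" is an assumption.

```typescript
// Hypothetical sketch of the startup cutoff filter; not the PR's actual code.
interface PendingRow {
  id: string;
  enqueuedAt: number; // epoch milliseconds
  attemptCount: number; // mirrors attempt_count
  lastAttemptAt: number | null; // mirrors last_attempt_at
}

function isEligibleForRecovery(row: PendingRow, startupCutoff: number): boolean {
  // Entries with any prior attempt are always included.
  const hasPriorAttempt = row.attemptCount > 0 || row.lastAttemptAt !== null;
  if (hasPriorAttempt) return true;
  // Fresh entries are skipped only when enqueued at or after startup,
  // because the direct delivery path is presumed to be handling them.
  return row.enqueuedAt < startupCutoff;
}
```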

Part 2 of 3: #29147 (SQLite outbox) → write-ahead outbox + worker → #29149 (turn tracking)
Merge after #29147. Incremental diff: git diff lifecycle/sqlite-outbox...lifecycle/write-ahead-outbox

E2E Test Results

All 4 tests pass ✅

| # | Test | Result |
|---|------|--------|
| 1 | Write-Ahead Delivery: send Telegram message → verify outbox entry created with `status='delivered'` | ✅ `before=0, after=1` |
| 2 | Continuous Recovery Worker: seed pending outbox entries → verify worker picks them up without restart | ✅ `attempt_count: 1→2` (both entries) |
| 3 | Startup Cutoff Filter: Entry A (fresh, no attempts) skipped; Entry B (prior attempt) recovered | ✅ `A=queued, B=failed_retryable (attempts: 1→2)` |
| 4 | Outbox Pruning: age delivered rows by 49h → verify pruned; recent rows retained | ✅ old entries pruned, recent retained |

Test script: https://gist.github.com/nohat/d98c213adfeaebf196e4137ed7d693c0

Replace unbounded file-based delivery queue with queryable SQLite message_outbox table. Adds TTL/expiry for stale entries, delivery outcome retention, and one-time legacy file queue import on startup. Closes openclaw#23777, openclaw#16555, openclaw#29128

…pat layer Write-ahead delivery pattern: enqueue outbox entry before sending, ack on success, retry on failure. Continuous outbox worker replaces one-shot recovery. Plugin channels get durable delivery guarantees via v1/v2 adapter compat layer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

greptile-apps · 2026-02-28T00:27:26Z

Greptile Summary

This PR implements a write-ahead outbox pattern for durable message delivery, replacing the previous fire-and-forget approach with persistent, crash-recoverable delivery semantics.

Key Changes

Added SQLite-backed outbox (message_outbox table in message-lifecycle.db) with proper indexing for recovery queries and turn tracking
Implemented write-ahead delivery pattern: entries are persisted to the outbox before sending, then acked on success or marked for retry on failure
Replaced one-shot startup recovery with continuous outbox worker that runs every 1s, recovering pending entries with exponential backoff (5s → 25s → 2m → 10m)
Added startup cutoff filter to prevent double-delivery: entries enqueued during the current instance's lifetime with no prior attempts are skipped by the recovery worker (they're actively being delivered on the direct path)
Created plugin compatibility layer (compat.ts) that normalizes v1 adapters (sendText/sendMedia) and v2 adapters (sendFinal) to a unified interface, giving plugin channels the same durability as built-in channels
Extended channel adapter types with outboundContract field and ChannelOutboundDeliveryMetadata for tracking turn IDs, outbox IDs, and retry attempts
Integrated lifecycle worker into gateway startup/shutdown (server-message-lifecycle.ts) with proper cleanup on close
Added permanent error detection patterns (chat not found, user blocked, etc.) that immediately mark entries as terminal rather than retrying
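The backoff schedule and permanent-error classification listed above might look something like the sketch below. The delays come from the summary (5s, 25s, 2m, 10m); everything else is an assumption: the function names are hypothetical, the regular expressions are guesses at how "chat not found" and "user blocked" could be matched, and clamping to the last delay beyond four attempts is illustrative only (the PR enforces a max retry count separately).

```typescript
// Hypothetical sketch; delays are from the summary, everything else is assumed.
const BACKOFF_MS: number[] = [5_000, 25_000, 120_000, 600_000];

// Delay to wait after the Nth failed attempt (1-based), clamped to the last step.
function backoffDelayMs(attemptCount: number): number {
  if (attemptCount <= 0) return 0;
  const index = Math.min(attemptCount - 1, BACKOFF_MS.length - 1);
  return BACKOFF_MS[index];
}

// An entry is eligible once its backoff window has elapsed.
function isEligibleForRetry(
  attemptCount: number,
  lastAttemptAt: number | null,
  now: number,
): boolean {
  if (lastAttemptAt === null) return true;
  return now - lastAttemptAt >= backoffDelayMs(attemptCount);
}

// Assumed patterns; the PR's real list is not shown in this thread.
const PERMANENT_ERROR_PATTERNS: RegExp[] = [/chat not found/i, /user blocked/i];

function classifyFailure(error: unknown): "failed_retryable" | "failed_terminal" {
  const message = error instanceof Error ? error.message : String(error);
  return PERMANENT_ERROR_PATTERNS.some((pattern) => pattern.test(message))
    ? "failed_terminal"
    : "failed_retryable";
}
```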

Architecture

The outbox worker runs continuously with a time-budgeted recovery pass (75% of interval). Entries go through states: queued → delivered (success) or failed_retryable (transient) or failed_terminal (permanent). The startup cutoff timestamp is captured before the worker starts, ensuring entries enqueued after that point with no prior attempts are recognized as "live deliveries in progress" and skipped to avoid duplication.
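A time-budgeted pass of the kind described here could be sketched as below. `runRecoveryPass` and its injected clock are hypothetical, and the loop is synchronous only to keep the example short; a real worker would await each delivery.

```typescript
// Hypothetical sketch of a time-budgeted recovery pass (75% of the interval).
interface RecoveryPassResult {
  processed: number;
  deferred: number; // entries left for the next tick
}

function runRecoveryPass<T>(
  entries: T[],
  recoverOne: (entry: T) => void,
  intervalMs: number,
  now: () => number = Date.now,
): RecoveryPassResult {
  const deadline = now() + intervalMs * 0.75;
  let processed = 0;
  for (const entry of entries) {
    // Stop once the budget is spent so passes do not overlap the next tick.
    if (now() >= deadline) break;
    recoverOne(entry);
    processed += 1;
  }
  return { processed, deferred: entries.length - processed };
}
```

Injecting `now` makes the budget testable with a fake clock instead of real timers.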

Test Coverage

Comprehensive test suite in outbound.test.ts covers:

Enqueue/ack lifecycle
Retry counting and backoff
Startup cutoff filtering (regression test for double-delivery)
Max retry enforcement
Permanent error detection
Partial failure handling in bestEffort mode
Recovery time budget enforcement
Backoff eligibility checks

All 948 existing tests pass, and E2E tests verify write-ahead delivery, continuous recovery, startup cutoff filtering, and outbox pruning.

Confidence Score: 5/5

This PR is safe to merge with high confidence - the implementation is well-architected, thoroughly tested, and handles edge cases correctly
The implementation demonstrates excellent engineering practices: proper transaction handling with BEGIN IMMEDIATE, comprehensive error classification (permanent vs transient), clean shutdown handling, startup cutoff logic to prevent double-delivery, extensive test coverage (948 tests pass, includes regression tests), and backward compatibility with legacy file queue import. The write-ahead pattern is correctly implemented with proper cleanup on both success and failure paths. The continuous recovery worker has time-budgeted passes to prevent overlap. No logical errors, race conditions, or security vulnerabilities were identified during review.
No files require special attention

Last reviewed commit: 8db9185

chatgpt-codex-connector

💡 Codex Review

openclaw/src/infra/outbound/deliver.ts

Lines 268 to 271 in 8db9185

    
    const wrappedParams = params.onError
      ? {
          ...params,
          onError: (err: unknown, payload: NormalizedOutboundPayload) => {

Track best-effort send failures without relying on onError

This wrapper only marks hadPartialFailure when a caller provides onError, but deliverOutboundPayloadsCore can still swallow per-payload failures when bestEffort is true and onError is absent. In that case the outbox row is acked as delivered even though one or more sends failed, so recovery will never retry those failed payloads (for example, current bestEffort: true call sites that do not pass onError).

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

src/gateway/server.impl.ts

…nclaw#29148)

nohat · 2026-02-28T00:41:09Z

Addressed the P1 finding "Track best-effort send failures without relying on onError" from the Codex review.

Change: When bestEffort is true we now always pass a wrapper that sets hadPartialFailure on per-payload failure, whether or not the caller provides onError. So outbox rows are correctly failed (and retried by the recovery worker) when some payloads fail, even for call sites that only pass bestEffort: true (e.g. server-restart-sentinel, server-node-events, delivery-dispatch).

deliver.ts: Wrap when params.bestEffort || params.onError; wrapper calls params.onError?.(err, payload) so it’s a no-op when onError is absent.
deliver.test.ts: New test that bestEffort partial failure without onError still calls failDelivery and does not ack.
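The shape of this fix can be illustrated with a small sketch. Only `bestEffort`, `onError`, and `hadPartialFailure` are taken from the thread; `wrapForPartialFailureTracking`, `deliverCore`, and the simplified `Payload` type are hypothetical stand-ins, and the send loop is synchronous for brevity.

```typescript
// Hypothetical sketch: track per-payload failures whether or not onError is set.
type Payload = { text: string };
type OnError = (err: unknown, payload: Payload) => void;

interface DeliverParams {
  payloads: Payload[];
  bestEffort?: boolean;
  onError?: OnError;
}

interface WrappedDelivery {
  params: DeliverParams;
  hadPartialFailure: () => boolean;
}

function wrapForPartialFailureTracking(params: DeliverParams): WrappedDelivery {
  let failed = false;
  if (!params.bestEffort && !params.onError) {
    return { params, hadPartialFailure: () => failed };
  }
  const wrapped: DeliverParams = {
    ...params,
    onError: (err, payload) => {
      failed = true;
      params.onError?.(err, payload); // no-op when the caller gave no onError
    },
  };
  return { params: wrapped, hadPartialFailure: () => failed };
}

// Stand-in for the core loop: in bestEffort mode failures are reported, not thrown.
function deliverCore(params: DeliverParams, send: (payload: Payload) => void): void {
  for (const payload of params.payloads) {
    try {
      send(payload);
    } catch (err) {
      if (!params.bestEffort) throw err;
      params.onError?.(err, payload);
    }
  }
}
```

A caller would check `hadPartialFailure()` after the loop and fail the outbox row instead of acking it when it returns true.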

Commit: 2c37968

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2c379683c4

src/gateway/server-message-lifecycle.ts

src/infra/outbound/deliver.ts

…(P2 openclaw#29148)

…low v2 sendFinal-only plugin adapters (Codex P1+P2 openclaw#29148)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e5d3dfe1c

src/infra/outbound/delivery-queue.ts

…Delivery

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2b34946507

chatgpt-codex-connector · 2026-02-28T03:53:46Z

src/infra/outbound/deliver.ts

+    sendPayload: outbound.sendFinal
      ? async (payload, overrides) =>
-          outbound.sendPayload!({
+          outbound.sendFinal({
            ...resolveCtx(overrides),


Preserve chunking/media fan-out for inferred sendFinal adapters

createPluginHandler now enables sendPayload whenever outbound.sendFinal exists, but normalizeChannelOutboundAdapter synthesizes sendFinal for plain sendText/sendMedia adapters too. That means any payload with channelData now takes the sendPayload fast path and skips the normal chunking and per-media loop in deliverOutboundPayloadsCore; with the inferred implementation this can drop all but the first mediaUrl (and bypass text chunking), so channelData messages with attachments or long text can be delivered incorrectly.
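To make the concern concrete, here is one way an inferred `sendFinal` could end up dropping media, based only on the behavior described in this comment. `normalizeAdapter`, `AdapterLike`, and `SendContext` are hypothetical simplifications, not the real `normalizeChannelOutboundAdapter` types. The sketch also assumes an adapter counts as v2 only when it both declares `outboundContract: "v2"` and implements `sendFinal`.

```typescript
// Hypothetical sketch of v1/v2 normalization; simplified and synchronous.
interface SendContext {
  to: string;
  text?: string;
  mediaUrls?: string[];
}

interface AdapterLike {
  outboundContract?: "v1" | "v2";
  sendText?: (ctx: { to: string; text: string }) => string;
  sendMedia?: (ctx: { to: string; text?: string; mediaUrl: string }) => string;
  sendFinal?: (ctx: SendContext) => string;
}

interface NormalizedAdapter {
  contract: "v1" | "v2";
  sendFinal: (ctx: SendContext) => string;
}

function normalizeAdapter(adapter: AdapterLike): NormalizedAdapter {
  if (adapter.outboundContract === "v2" && adapter.sendFinal) {
    return { contract: "v2", sendFinal: adapter.sendFinal };
  }
  return {
    contract: "v1",
    // Inferred sendFinal: forwards only the first media URL and does no chunking,
    // which is why it should not replace the per-media fan-out loop.
    sendFinal: (ctx) => {
      const firstMedia = ctx.mediaUrls?.[0];
      if (firstMedia && adapter.sendMedia) {
        return adapter.sendMedia({ to: ctx.to, text: ctx.text, mediaUrl: firstMedia });
      }
      if (adapter.sendText) {
        return adapter.sendText({ to: ctx.to, text: ctx.text ?? "" });
      }
      throw new Error("adapter implements neither sendText nor sendMedia");
    },
  };
}
```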

Fixed in aabc291: sendPayload is now only set when normalized.contract === "v2" (native sendFinal). Inferred sendFinal from v1 sendText/sendMedia adapters no longer takes the sendPayload fast path, so channelData payloads continue through the normal chunking and per-media fan-out loop.

Already addressed in aabc291 and further refined in 377a635, which also preserves the sendPayload path for v1 adapters with native sendPayload (e.g. LINE).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aabc291c50

chatgpt-codex-connector · 2026-02-28T04:34:34Z

src/infra/outbound/deliver.ts

-        mediaUrl,
-      }),
+    sendPayload:
+      isNativeV2 && outbound.sendFinal


Preserve sendPayload path for legacy adapters

Restricting sendPayload to normalized.contract === "v2" drops channel-data delivery for existing v1 adapters that still implement sendPayload (for example extensions/line/src/channel.ts handles quick replies/templates via payload.channelData but does not declare outboundContract: "v2"). In that configuration, deliverOutboundPayloadsCore no longer calls the adapter’s sendPayload and falls back to sendText/sendMedia, which silently strips LINE-specific rich message metadata.

Good catch. Fixed in 377a635: createPluginHandler now also exposes sendPayload when the original v1 adapter implements it natively (e.g. LINE). The isNativeV2 guard still prevents inferred sendFinal from taking the fast path, but v1 adapters with their own sendPayload (like LINE's channelData handler) are passed through directly via hasNativeSendPayload.
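The resulting routing decision can be summarized in a short sketch. `isNativeV2` and `hasNativeSendPayload` are the guard names mentioned in this reply; `selectPayloadPath`, `OriginalAdapter`, and the path labels are hypothetical, and the exact conditions in `createPluginHandler` may differ.

```typescript
// Hypothetical sketch of which path a channelData payload takes.
interface OriginalAdapter {
  outboundContract?: "v1" | "v2";
  sendFinal?: (payload: unknown) => string;
  sendPayload?: (payload: unknown) => string;
}

type PayloadPath = "native-sendFinal" | "native-sendPayload" | "chunked-fan-out";

function selectPayloadPath(adapter: OriginalAdapter): PayloadPath {
  // Checks run against the original adapter, before any sendFinal is inferred.
  const isNativeV2 =
    adapter.outboundContract === "v2" && typeof adapter.sendFinal === "function";
  const hasNativeSendPayload = typeof adapter.sendPayload === "function";

  if (isNativeV2) return "native-sendFinal";
  if (hasNativeSendPayload) return "native-sendPayload"; // e.g. a LINE-style v1 adapter
  return "chunked-fan-out"; // plain sendText/sendMedia adapters keep the normal loop
}
```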

…tive sendPayload

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 377a635868

chatgpt-codex-connector · 2026-02-28T16:01:51Z

src/auto-reply/reply/reply-dispatcher.ts

+          queueId = await enqueueDelivery({
+            channel: deliveryQueueContext.channel,
+            to: deliveryQueueContext.to,
+            accountId: deliveryQueueContext.accountId,
+            payloads: [normalized],


Preserve dispatch kind when queuing dispatcher payloads

These rows enqueue every normalized payload without recording whether it was a tool, block, or final send, but crash recovery replays all queued rows through deliverOutboundPayloads (see src/gateway/server-message-lifecycle.ts). That means channels that intentionally suppress non-final sends in their live dispatcher (for example src/web/auto-reply/monitor/process-message.ts returns early when info.kind !== "final") can leak those suppressed payloads after a crash if the process dies between enqueueDelivery and ackDelivery; on restart, recovery will treat them as normal outbound messages and send them to users.

Fixed in 46581bc. enqueueDelivery now accepts and stores dispatchKind ("tool" / "block" / "final") in the outbox payload JSON. The reply dispatcher passes kind through at the call site.

During recoverPendingDeliveries, after loading pending entries, non-final entries (those with dispatchKind set and not "final") are marked failed_terminal with terminal_reason='non_final_recovery_skip' and excluded from recovery. This is safe because the turn recovery worker replays the entire turn, regenerating tool/block outputs — recovering them from the outbox would bypass channel-specific kind filtering (e.g. web channel suppresses non-final sends in the live path).

Entries without dispatchKind (legacy rows, or entries from deliverOutboundPayloads) are treated as final and recovered normally.
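The recovery-time filtering described above could be sketched like this. The `dispatchKind` values and the `non_final_recovery_skip` reason come from this reply; `partitionForRecovery` and the row shapes are hypothetical.

```typescript
// Hypothetical sketch of skipping non-final entries during recovery.
type DispatchKind = "tool" | "block" | "final";

interface PendingEntry {
  id: string;
  dispatchKind?: DispatchKind; // absent on legacy rows
}

interface SkippedEntry {
  id: string;
  status: "failed_terminal";
  terminalReason: "non_final_recovery_skip";
}

function partitionForRecovery(entries: PendingEntry[]): {
  recoverable: PendingEntry[];
  skipped: SkippedEntry[];
} {
  const recoverable: PendingEntry[] = [];
  const skipped: SkippedEntry[] = [];
  for (const entry of entries) {
    // Entries without dispatchKind are treated as final.
    const kind = entry.dispatchKind ?? "final";
    if (kind === "final") {
      recoverable.push(entry);
    } else {
      skipped.push({
        id: entry.id,
        status: "failed_terminal",
        terminalReason: "non_final_recovery_skip",
      });
    }
  }
  return { recoverable, skipped };
}
```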

…overy

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 46581bc267

chatgpt-codex-connector · 2026-02-28T16:29:50Z

src/channels/plugins/outbound/slack.ts


 export const slackOutbound: ChannelOutboundAdapter = {
  deliveryMode: "direct",
+  outboundContract: "v2",


Keep text/media adapters on v1 until sendFinal is native

Fresh evidence in this commit is that core adapters like Slack are now marked outboundContract: "v2" even though they only implement sendText/sendMedia. That makes createPluginHandler treat them as native v2 and route any channelData payload through sendPayload/sendFinal; for inferred sendFinal, compat.ts only forwards the first media URL, so multi-media and chunked deliveries can be truncated when channelData is present (the fast path is taken whenever effectivePayload.channelData exists). Keep these adapters on v1 until they provide a real sendFinal that preserves fan-out/chunk semantics.

chatgpt-codex-connector · 2026-02-28T16:29:50Z

src/infra/outbound/delivery-queue.ts

  } catch (err) {
-    const code =
-      err && typeof err === "object" && "code" in err
-        ? String((err as { code?: unknown }).code)
-        : null;
-    if (code !== "ENOENT") {
-      throw err;
-    }
-    // Already removed — no-op.
+    logVerbose(`delivery-queue: ackDelivery failed: ${String(err)}`);
  }


Propagate ackDelivery failures instead of swallowing them

ackDelivery now catches and suppresses every DB error, but recovery still counts the entry as recovered immediately afterward. If the SQLite update fails (for example SQLITE_BUSY during concurrent access), the row remains pending and will be replayed on later worker ticks, producing duplicate outbound sends while logs report success. Letting this error propagate (or returning an explicit failure) would allow recovery to treat it as a failed attempt instead of silently requeueing duplicates.
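One possible shape for the "explicit failure" option is to have recovery report send and ack failures as distinct outcomes. This is only a sketch of the suggestion, not what the PR ended up doing; `recoverEntry` and `RecoveryOutcome` are hypothetical, and the callbacks are synchronous for brevity.

```typescript
// Hypothetical sketch: surface ack failures instead of counting them as recovered.
type RecoveryOutcome = "recovered" | "send_failed" | "ack_failed";

function recoverEntry(send: () => void, ack: () => void): RecoveryOutcome {
  try {
    send();
  } catch {
    return "send_failed"; // safe to retry: nothing was delivered
  }
  try {
    ack();
  } catch {
    // The message went out but the row is still pending, so a later
    // worker tick may resend it. Reporting this lets the caller log and
    // count it accurately rather than claiming success.
    return "ack_failed";
  }
  return "recovered";
}
```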

…l and separate ackDelivery errors in recovery

…compat layer

openclaw-barnacle · 2026-02-28T17:23:18Z