feat(gateway): unified durable message lifecycle — SQLite turns+outbox, continuous workers, plugin compat layer by nohat · Pull Request #28941 · openclaw/openclaw

nohat · 2026-02-27T16:27:04Z

Summary

Background

OpenClaw currently implements reliability guarantees as a collection of subsystem-specific mechanisms (inbound dedupe caches, outbound delivery queue, channel-specific update offsets/watermarks, restart catch-up logic, idempotency keys in selected flows, and local retry/permanent-error classification rules). The test suite demonstrates that these guarantees are important and intentionally maintained in isolation, but issue/PR history shows repeated failures at subsystem boundaries, especially during restart/crash/reconnect windows.

Recurring user-visible failures are:

accepted user messages that never receive a reply after restart/crash/network interruption,
duplicate message processing or duplicate replies caused by retries/replays/reconnects,
stale queued deliveries replayed long after relevance,
inconsistent abort/supersession behavior where canceled work is later retried or delivered,
channel-specific catch-up gaps that lose messages sent during downtime.

Root architectural gap: there is no single durable lifecycle model for a turn spanning:

inbound acceptance/idempotency,
run execution state (including retry/abort/supersession),
reply materialization,
outbound delivery and delivery confirmation.

Without a unified durable state machine, reliability semantics are repeatedly encoded as local rules (dedupe, skipQueue, retry classifiers, startup heuristics, pending markers, watermarks). This increases code volume, causes semantic drift, and makes restart correctness depend on special-case recovery code instead of structural guarantees.

Solution approach

Implements a unified durable message lifecycle for single-node OpenClaw, replacing the current fragmented per-subsystem reliability mechanisms (inbound dedupe caches, outbound file queue, startup-only orphan replay, channel-specific retry rules) with one state machine and continuous workers.

What this replaces: Previous draft PR #27939 added SQLite journaling + startup orphan replay on top of the existing fragmented reliability model. That approach received review feedback identifying 9 issues. Rather than patch those individually, this PR implements the structural fix the feedback was pointing toward: one state machine, continuous workers, no startup special-case recovery path.

Storage layer (src/infra/message-lifecycle/):

message_turns table: one row per accepted inbound turn. States: accepted | running | delivery_pending | failed_retryable | delivered | aborted | failed_terminal. Unique index on dedupe_key (when set). The database is opened in WAL journal mode with synchronous=NORMAL and falls back to an in-memory database on open failure.
message_outbox table: one row per outbound payload, linked to turn_id. States: queued | failed_retryable | delivered | failed_terminal | expired. Replaces delivery-queue/*.json files.
Transaction helpers, schema bootstrap, lifecycle-scoped DB cache with exit cleanup.
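
To make the turn state model concrete, here is a minimal in-memory sketch of the "terminal states are final" rule. The state names come from the list above; `TurnRow`, `isTerminal`, and `finalizeTurnSketch` are hypothetical names for illustration and are not the PR's actual helpers, which operate on SQLite rows.

```typescript
// Turn states as listed for the message_turns table.
type TurnStatus =
  | "accepted"
  | "running"
  | "delivery_pending"
  | "failed_retryable"
  | "delivered"
  | "aborted"
  | "failed_terminal";

const TERMINAL_TURN_STATES: ReadonlySet<TurnStatus> = new Set<TurnStatus>([
  "delivered",
  "aborted",
  "failed_terminal",
]);

// Hypothetical in-memory stand-in for a message_turns row.
interface TurnRow {
  id: string;
  status: TurnStatus;
}

function isTerminal(status: TurnStatus): boolean {
  return TERMINAL_TURN_STATES.has(status);
}

// Mirrors the idea of a "WHERE status IN (non-terminal states)" guard:
// the update applies only if the row is not already terminal.
// Returns true when the row was changed.
function finalizeTurnSketch(row: TurnRow, outcome: TurnStatus): boolean {
  if (!isTerminal(outcome)) {
    throw new Error(`not a terminal state: ${outcome}`);
  }
  if (isTerminal(row.status)) {
    return false; // already final; late writers are ignored
  }
  row.status = outcome;
  return true;
}
```

With this guard, a late worker that tries to mark an already delivered turn as failed_terminal leaves the row unchanged.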

Delivery queue (src/infra/outbound/delivery-queue.ts): Migrated from file-based to message_outbox. Existing enqueueDelivery/ackDelivery/failDelivery/moveToFailed API preserved. getOutboxStatusForTurn added. Turn finalization triggered when all outbox rows for a turn reach terminal state. importLegacyFileQueue does one-time import of any delivery-queue/*.json artifacts (preserving lastAttemptAt for correct backoff).
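
The "finalize when all outbox rows are terminal" rule can be sketched as a small reducer. This is an illustration only: `deriveTurnOutcome` is a hypothetical name, the return shape of the real getOutboxStatusForTurn is not shown in this description, and two points are assumptions: that expired counts as a terminal outbox state, and that any non-delivered terminal row makes the turn failed_terminal.

```typescript
// Outbox states as listed for the message_outbox table.
type OutboxStatus =
  | "queued"
  | "failed_retryable"
  | "delivered"
  | "failed_terminal"
  | "expired";

type TurnOutcome = "delivery_pending" | "delivered" | "failed_terminal";

// Assumption: these three outbox states are terminal.
const TERMINAL_OUTBOX_STATES: ReadonlySet<OutboxStatus> = new Set<OutboxStatus>([
  "delivered",
  "failed_terminal",
  "expired",
]);

// Hypothetical reducer over the outbox rows of one turn.
// Returns null when the turn has no outbox rows at all ("no outbox evidence").
function deriveTurnOutcome(rows: readonly OutboxStatus[]): TurnOutcome | null {
  if (rows.length === 0) {
    return null;
  }
  if (rows.some((status) => !TERMINAL_OUTBOX_STATES.has(status))) {
    return "delivery_pending"; // at least one row can still be retried
  }
  // Assumption: the turn is delivered only if every row was delivered.
  return rows.every((status) => status === "delivered")
    ? "delivered"
    : "failed_terminal";
}
```

A null result corresponds to the case where the turn worker would re-dispatch the turn rather than finalize it.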

Inbound dispatch (src/auto-reply/dispatch.ts): dispatchInboundMessageInternal now:

Calls acceptTurn (records the turn and checks dedupe; the check currently uses the in-memory dedupe path, see Deferred below).
On dedupe skip: calls dispatcher.markComplete() + waitForIdle() before returning (fixes r2862217589).
Sets deliveryQueueContext on the dispatcher so final replies are durably persisted before send.
After dispatchReplyFromConfig completes: finalizes turn state based on outbox status (delivery_pending, delivered, or failed_terminal).
dispatchResumedTurn exported for use by the recovery worker — skips acceptTurn, sets resumeTurnId.

Reply dispatcher (src/auto-reply/reply/reply-dispatcher.ts): When deliveryQueueContext is set, enqueues each final reply to message_outbox before calling deliver, acks on success, records error on failure. Adds setDeliveryQueueContext / getDeliveryStats to the ReplyDispatcher interface.
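
The enqueue-before-send ordering described here is the usual write-ahead pattern. The sketch below uses a hypothetical in-memory `MemoryOutbox` in place of the message_outbox table, and `sendFinalReply` is an illustrative name, not the dispatcher's real method; only the ordering (persist, deliver, then ack or record the error) is taken from the description.

```typescript
type OutboxState = "queued" | "delivered" | "failed_retryable";

interface OutboxRow {
  id: number;
  turnId: string;
  payload: string;
  state: OutboxState;
  lastError?: string;
}

// Hypothetical in-memory outbox standing in for the message_outbox table.
class MemoryOutbox {
  readonly rows: OutboxRow[] = [];

  enqueue(turnId: string, payload: string): OutboxRow {
    const row: OutboxRow = {
      id: this.rows.length + 1,
      turnId,
      payload,
      state: "queued",
    };
    this.rows.push(row);
    return row;
  }

  ack(row: OutboxRow): void {
    row.state = "delivered";
  }

  fail(row: OutboxRow, error: string): void {
    row.state = "failed_retryable";
    row.lastError = error;
  }
}

// Persist first, then send. A crash between enqueue and ack leaves a
// "queued" row behind that a recovery worker can pick up later.
async function sendFinalReply(
  outbox: MemoryOutbox,
  turnId: string,
  payload: string,
  deliver: (payload: string) => Promise<void>,
): Promise<OutboxState> {
  const row = outbox.enqueue(turnId, payload);
  try {
    await deliver(payload);
    outbox.ack(row);
  } catch (err) {
    outbox.fail(row, err instanceof Error ? err.message : String(err));
  }
  return row.state;
}
```

Because the row exists before `deliver` is awaited, a failed or interrupted send is visible as state rather than silently lost.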

Continuous lifecycle workers (src/gateway/server-message-lifecycle.ts):

outbox-worker: polls recoverPendingDeliveries on interval (default 1s).
turn-worker: polls listRecoverableTurns, checks outbox state per turn, calls dispatchResumedTurn for turns with no outbox evidence. Marks turns terminal based on outbox outcomes. setDeliveryQueueContext is cleared on the recovery dispatcher (outbox rows are written by routeReply -> deliverOutboundPayloads in the deliver closure; clearing prevents duplicate outbox rows on the direct-delivery path).
Startup: import legacy queue -> start workers. No orphan replay loop, no min-age filter, no startup-specific reconciliation.
server.impl.ts: replaces the one-shot recoverPendingDeliveries call with startMessageLifecycleWorkers; calls lifecycleWorkers.stop() on shutdown.
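
As a rough illustration of the continuous-worker shape (poll on an interval, never overlap passes, stop on shutdown), here is a minimal sketch. `startPollingWorker` and its `tick` hook are hypothetical; the real startMessageLifecycleWorkers may be structured differently, and only the 1s default interval and the stop() call come from the description above.

```typescript
interface PollingWorker {
  // Exposed so a pass can be driven directly, without waiting for the timer.
  tick(): Promise<void>;
  stop(): void;
}

// Hypothetical helper: runs `pass` every `intervalMs` until stopped.
function startPollingWorker(
  intervalMs: number,
  pass: () => Promise<void>,
  onError: (err: unknown) => void = () => {},
): PollingWorker {
  let inFlight = false;
  let stopped = false;

  const tick = async (): Promise<void> => {
    if (stopped || inFlight) {
      return; // skip if shut down or a previous pass is still running
    }
    inFlight = true;
    try {
      await pass();
    } catch (err) {
      onError(err); // a failed pass must not kill the loop
    } finally {
      inFlight = false;
    }
  };

  const timer = setInterval(() => void tick(), intervalMs);

  return {
    tick,
    stop(): void {
      stopped = true;
      clearInterval(timer);
    },
  };
}
```

A caller would start one worker per concern (for example an outbox pass at 1000 ms) and call `stop()` from the shutdown path so the process can exit.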

Plugin outbound compat (src/channels/plugins/outbound/compat.ts, src/plugin-sdk/outbound-adapter.ts):

normalizeChannelOutboundAdapter: wraps v1 adapters (sendText+sendMedia or sendPayload) into a v2 sendFinal function, emits one-time runtime warning for v1 compat mode.
createCompatOutboundAdapter: SDK helper exported from plugin-sdk/index.ts for plugin authors.
resolveOutboundContractVersion: detects declared contract version.
All five built-in adapters declare outboundContract: "v2". direct-text-media.ts adds explicit sendFinal; others go through the compat wrapper.
buildChannelAccountSnapshot now surfaces outboundContract in channel status snapshots.
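
The compat wrapping can be pictured roughly as follows. This is a sketch under stated assumptions: the `FinalPayload` shape, the adapter interfaces, and `normalizeAdapterSketch` are hypothetical, and the warning text is invented. What it takes from the description is the behavior: pass through adapters that already provide sendFinal, wrap sendPayload or sendText+sendMedia, warn once, and return undefined when neither is available.

```typescript
// Assumed payload shape, for illustration only.
interface FinalPayload {
  text?: string;
  mediaUrl?: string;
}

interface AnyAdapter {
  outboundContract?: "v1" | "v2";
  sendText?: (text: string) => Promise<void>;
  sendMedia?: (mediaUrl: string) => Promise<void>;
  sendPayload?: (payload: FinalPayload) => Promise<void>;
  sendFinal?: (payload: FinalPayload) => Promise<void>;
}

interface V2Adapter {
  outboundContract: "v2";
  sendFinal: (payload: FinalPayload) => Promise<void>;
}

let warnedV1Compat = false; // module-level flag so the warning fires once

function normalizeAdapterSketch(
  adapter: AnyAdapter,
  warn: (message: string) => void = (message) => console.warn(message),
): V2Adapter | undefined {
  if (adapter.sendFinal) {
    return { outboundContract: "v2", sendFinal: adapter.sendFinal };
  }
  const { sendText, sendMedia, sendPayload } = adapter;
  let sendFinal: (payload: FinalPayload) => Promise<void>;
  if (sendPayload) {
    sendFinal = (payload) => sendPayload(payload);
  } else if (sendText && sendMedia) {
    sendFinal = async (payload) => {
      if (payload.text !== undefined) await sendText(payload.text);
      if (payload.mediaUrl !== undefined) await sendMedia(payload.mediaUrl);
    };
  } else {
    return undefined; // fully incompatible adapter
  }
  if (!warnedV1Compat) {
    warnedV1Compat = true;
    warn("outbound adapter is running in v1 compat mode");
  }
  return { outboundContract: "v2", sendFinal };
}
```

Callers then only ever deal with `sendFinal`, which keeps the durable delivery path independent of how an individual plugin sends.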

Config additions: messages.delivery.maxAgeMs (default 30 min) and messages.delivery.expireAction ("fail" | "deliver"). Config help and labels updated.
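
One way to read these two keys is as a staleness check applied before each delivery attempt. The sketch below is an interpretation, not the PR's code: `decideExpiry` is a hypothetical name, and the meaning assigned to expireAction ("fail" retires a stale entry, "deliver" sends it anyway) is an assumption based on the option names. Only the 30-minute default for maxAgeMs comes from the description.

```typescript
interface DeliveryConfig {
  maxAgeMs: number;
  expireAction: "fail" | "deliver";
}

// Default from the description: 30 minutes.
const DEFAULT_MAX_AGE_MS = 30 * 60 * 1000;

type ExpiryDecision = "attempt" | "expire";

// Hypothetical check run before a (re)delivery attempt.
function decideExpiry(
  queuedAtMs: number,
  nowMs: number,
  config: DeliveryConfig,
): ExpiryDecision {
  const tooOld = nowMs - queuedAtMs > config.maxAgeMs;
  if (!tooOld) {
    return "attempt";
  }
  // Assumed semantics: "fail" retires the stale entry as expired,
  // "deliver" ignores the age limit and sends anyway.
  return config.expireAction === "fail" ? "expire" : "attempt";
}
```

For example, with `maxAgeMs` at its default and `expireAction: "fail"`, an entry queued 31 minutes ago would be retired instead of replayed.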

Guarantees

Every accepted inbound turn is durably recorded and deduped.
Restart behavior is continuation of non-terminal work, not startup-only orphan replay.
Turn terminal states are explicit and final: delivered | aborted | failed_terminal.
Outbound retry/permanent/expiry classification is centralized.
Abort/supersession transitions suppress replay.
For providers without hard idempotency, duplicate risk is bounded best-effort (not strict exactly-once).

Change Type

Scope

Linked Issues / Prior Work

Replaces approach in fix(gateway): message delivery reliability with SQLite journal (inbound dedup + orphan recovery) #27939 (closed in favor of this structural fix)
Closes [Bug]: Telegram sendMessage fails during gateway restart — messages lost with no retry #22376, Message runs interrupted by network errors are not retried, causing silent message loss #9208, WhatsApp: Messages silently dropped during reconnection window #14827, [Feature]: Message catch-up on gateway restart for Telegram and Discord #26783 (lost replies after crash/restart)
Closes Telegram reply dispatcher silently swallows delivery errors #15772 (delivery errors surfaced as explicit state)
Closes [Bug]: Delivery Queue Retries Permanently-Failed Entries Indefinitely #23777, [Feature]: Add TTL/Expiry for Delivery Queue Messages #16555 (permanent-error classification + TTL termination)
Closes [Bug]: Telegram inbound message can be re-queued on model fallback/rate-limit, causing duplicate user turns and missing outbound delivery #26764 (one durable turn identity prevents duplicate execution)
Closes [Feature]: Feishu inbound dedup cache lost on SIGUSR1 restart, causing duplicate message processing #14431, Message deduplication needed to prevent duplicate replies #19226 (persistent dedup replaces volatile cache — partially, see Deferred below)
Partially addresses Slack web client retry policy continues to cause duplicate messages #22780, [Bug]: Slack Duplicate Reply Message #19373, Interrupt queue mode sends duplicate response when second message arrives during generation #19426 (non-idempotent provider duplicate suppression)

Deferred / Explicitly Out of Scope

Persistent dedup disabled in this PR (disablePersistentDedupe = true in turns.ts): The dedupe_key unique index exists and is correct (peer/thread-scoped), but insert uses NULL dedupe key so the unique index is never exercised. Cross-restart duplicate suppression continues to use the existing in-memory path. Rationale: per-channel MessageSid identity semantics are not fully normalized (Telegram message IDs are per-chat; callback/query IDs vs message IDs differ). Enabling persistent dedup requires per-channel audit of what MessageSid maps to. This is the correct call for the first merge — enabling it as a follow-up is straightforward (flip the flag + per-channel validation).
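
The effect of the flag can be illustrated with a small model. Everything named here is hypothetical (`InboundIdentity`, `buildDedupeKey`, `UniqueKeyIndex`, and the key layout); the sketch only shows why a peer/thread-scoped key matters when message IDs are per-chat, and why a NULL key never trips a unique index (SQLite treats NULL values in a unique index as distinct from one another).

```typescript
// Assumed identity fields, for illustration only.
interface InboundIdentity {
  channel: string;
  peerId: string;
  threadId?: string;
  messageSid: string;
}

// Matches the flag value described above.
const disablePersistentDedupe = true;

// Returns null while persistent dedup is disabled, otherwise a scoped key.
function buildDedupeKey(
  msg: InboundIdentity,
  disabled: boolean = disablePersistentDedupe,
): string | null {
  if (disabled) {
    return null;
  }
  return [msg.channel, msg.peerId, msg.threadId ?? "", msg.messageSid]
    .map((part) => encodeURIComponent(part))
    .join(":");
}

// Tiny model of a unique index that, like SQLite, allows many NULL keys.
class UniqueKeyIndex {
  private readonly keys = new Set<string>();

  // Returns false when the insert would violate uniqueness.
  insert(key: string | null): boolean {
    if (key === null) {
      return true; // NULLs never conflict
    }
    if (this.keys.has(key)) {
      return false;
    }
    this.keys.add(key);
    return true;
  }
}
```

With the flag on, every insert succeeds and duplicates are left to the in-memory path; with it off, the same message ID in two different chats still yields two distinct keys.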

Streaming/tool/block replies: Not tracked in message_outbox. Only final replies participate in durable delivery semantics.

Multi-node exactly-once: Single-process SQLite; no distributed consensus.

User-visible / Behavior Changes

None. Internal reliability improvement. Automatic migration from old file queue on first startup.

Security Impact

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? No
Data access scope changed? No — new message-lifecycle.db in existing state dir, same permissions (0o700)

Repro + Verification

Environment

OS: macOS 15
Runtime: Node 22+ / Bun
Integration/channel: works across all built-in channels

Test Scenarios

Start gateway, send message, kill -9 mid-reply, restart -> turn worker resumes and delivers.
Send two messages with same ID from same peer -> second is deduped (in-memory path still active).
Send message, crash after enqueue but before ack -> outbox worker retries and delivers.
Permanent error (blocked account) -> turn and outbox rows finalize as failed_terminal after max retries.
Plugin with only sendText/sendMedia — compat wrapper normalizes to sendFinal, one-time warning logged.
Downgrade: delete ~/.openclaw/message-lifecycle.db -> gateway degrades gracefully (in-memory fallback active).

E2E scripts (macOS + Telegram): https://gist.github.com/nohat/657942433bb4c4e2a5fed2e12d49940b

Evidence

Tests passing: pnpm check (0 warnings), pnpm build, pnpm test (2294 tests across lifecycle, outbound, auto-reply, channels, gateway)
Full Telegram E2E campaign passed (7/7 tests, after phase against codex/unified-lifecycle-main):
- Test 1: DB created on first startup; delivery-queue files migrated to message_outbox
- Test 2: Inbound dedup persists across restart — lifecycle record survives
- Test 3: Abort confirmed in chat, recorded as aborted in message_turns
- Test 4: Orphan recovery — kill -9 mid-reply, restart, turn-worker delivers
- Test 5: Aborted delivery not re-queued after restart
- Test 6: Pruning bounded DB — aged turns removed at restart
- Test 7: No stuck running turns at end of campaign

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No (new optional messages.delivery.* keys)
Migration needed? No — automatic on first startup via importLegacyFileQueue

Downgrade: delivery-queue/ entries are migrated and deleted. Delete message-lifecycle.db to reset.

Post-stabilization cleanup: after a validation window, the legacy orphan-recovery paths and startup-specific reconciliation code will be removed. This PR intentionally leaves them in place until the continuous workers are confirmed stable in production.

Failure Recovery

How to disable/revert quickly: delete ~/.openclaw/message-lifecycle.db; gateway falls back to in-memory dedup and best-effort delivery.
Known bad symptoms: getLifecycleDb() throwing on every turn (SQLite file corrupt or missing write permission); message-lifecycle: legacy queue import failed at startup.

Risks and Mitigations

SQLite write contention: WAL mode + single DatabaseSync singleton per process. All multi-step ops wrapped in BEGIN IMMEDIATE transactions.
Turn worker re-dispatching already-delivered turns: Turn worker checks getOutboxStatusForTurn before dispatching; delivered turns are finalized without re-dispatch. finalizeTurn uses WHERE status IN (non-terminal states) guard.
Plugin adapters without sendFinal: normalizeChannelOutboundAdapter returns undefined for adapters with neither sendText/sendMedia nor sendPayload; createPluginHandler returns null (same behavior as before for fully-incompatible adapters).
Node < 22.5 (no node:sqlite): getLifecycleDb falls back to in-memory DB with throttled warning — same degradation as the memory subsystem.

AI-assisted (Codex + Claude Code). Review and verification by @nohat.

AI-assisted
Author has verified E2E behavior (full 7-test Telegram campaign passed)


Design doc belongs in the PR description, not the Mintlify tree. E2E scripts require macOS + Telegram + personal state dir; moving to a gist rather than committing as repo test artifacts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…n restart Entries enqueued after gateway startup with no prior attempt are live deliveries in flight — picking them up in the outbox worker would cause every reply to be sent twice (regression introduced by the unified lifecycle PR). Fix: loadPendingDeliveries now accepts an optional startupCutoff timestamp (set to Date.now() before workers start). Entries with queued_at >= cutoff AND last_attempt_at IS NULL AND attempt_count = 0 are excluded from recovery passes. Crash survivors (queued before startup) and transient failures (already had an attempt) are always included. Also adds RecoverySummary.skippedStartupCutoff counter so gateway logs surface how many live-delivery entries are filtered per pass — providing observable validation that no duplicate sends are occurring.
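
The cutoff rule in this commit message can be written as a small filter. The sketch below is illustrative: `PendingEntry` and `selectForRecovery` are hypothetical names and the real loadPendingDeliveries works on database rows, but the predicate follows the three conditions stated above (queued at or after the cutoff, no last attempt, zero attempts).

```typescript
// Assumed row shape, for illustration only.
interface PendingEntry {
  id: string;
  queuedAt: number; // ms since epoch
  lastAttemptAt: number | null;
  attemptCount: number;
}

interface RecoverySelection {
  eligible: PendingEntry[];
  skippedStartupCutoff: number;
}

// Splits pending entries into those a recovery pass should handle and
// those that look like live, in-flight deliveries from the current process.
function selectForRecovery(
  entries: readonly PendingEntry[],
  startupCutoff?: number,
): RecoverySelection {
  const eligible: PendingEntry[] = [];
  let skippedStartupCutoff = 0;
  for (const entry of entries) {
    const isLiveDelivery =
      startupCutoff !== undefined &&
      entry.queuedAt >= startupCutoff &&
      entry.lastAttemptAt === null &&
      entry.attemptCount === 0;
    if (isLiveDelivery) {
      skippedStartupCutoff += 1; // owned by the direct delivery path
    } else {
      eligible.push(entry); // crash survivor or earlier transient failure
    }
  }
  return { eligible, skippedStartupCutoff };
}
```

Entries queued before the cutoff, and entries that already recorded an attempt, stay eligible regardless of when they were queued.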

nohat · 2026-03-01T00:20:18Z

Closing — superseded by the full lifecycle stack (#29997–#30012):

feat(outbound): prefer sendPayload for all payloads when adapter supports it #29997 (sendPayload adapter unification)
feat(outbox): migrate delivery queue from file-based to SQLite outbox #29998 (SQLite outbox migration)
feat(outbox): write-ahead outbox with recovery worker and delivery tracking #30009 (write-ahead outbox with recovery worker)
feat(lifecycle): inbound turn tracking with orphan recovery and abort coordination #30011 (inbound turn tracking with orphan recovery)
feat(lifecycle): persistent inbound dedup across gateway restarts #30012 (persistent inbound dedup)
feat(adapters): add sendPayload to batch-a (BlueBubbles, iMessage, Signal, Telegram, WhatsApp) #30141 through feat(adapters): add sendPayload to batch-d (Zalo, Zalouser, core outbound plugins) #30144 (adapter sendPayload batches a–d)

The unified durable lifecycle goals from this PR are now decomposed across the above PRs for incremental review and merge.

openclaw-barnacle bot added the docs, app: web-ui, gateway, and size: XL labels Feb 27, 2026

nohat added 2 commits February 27, 2026 08:37

feat(gateway): unified durable message lifecycle with SQLite turns+outbox (ee0de46)

fix(gateway): resolve double-enqueue, routedFinalCount, and test coverage issues (2523cd8)

nohat force-pushed the codex/unified-lifecycle-main branch from 1a78974 to 2523cd8 on February 27, 2026 16:47

openclaw-barnacle bot added the agents label Feb 27, 2026

test(gateway): add lifecycle E2E test scripts for unified turn+outbox model (da4745d)

openclaw-barnacle bot added the scripts label Feb 27, 2026

openclaw-barnacle bot removed the docs and scripts labels Feb 27, 2026

nohat closed this Mar 1, 2026

nohat mentioned this pull request Mar 2, 2026

Message Reliability: Durable SQLite Outbox, Recovery Worker, and Unified sendPayload #32063

Open

nohat wants to merge 5 commits into openclaw:main from nohat:codex/unified-lifecycle-main
