feat(gateway): unified durable message lifecycle — SQLite turns+outbox, continuous workers, plugin compat layer#28941
Closed
nohat wants to merge 5 commits into openclaw:main from
1a78974 to 2523cd8
Design doc belongs in the PR description, not the Mintlify tree. E2E scripts require macOS + Telegram + personal state dir; moving to a gist rather than committing as repo test artifacts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n restart

Entries enqueued after gateway startup with no prior attempt are live deliveries in flight. Picking them up in the outbox worker would cause every reply to be sent twice (a regression introduced by the unified lifecycle PR).

Fix: loadPendingDeliveries now accepts an optional startupCutoff timestamp (set to Date.now() before workers start). Entries with queued_at >= cutoff AND last_attempt_at IS NULL AND attempt_count = 0 are excluded from recovery passes. Crash survivors (queued before startup) and transient failures (already had an attempt) are always included.

Also adds a RecoverySummary.skippedStartupCutoff counter so gateway logs surface how many live-delivery entries are filtered per pass, providing observable validation that no duplicate sends are occurring.
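The exclusion rule in this commit message can be sketched as a small predicate. This is an illustration only, not the code in the PR: the `OutboxEntry` shape and the helper names `isRecoverable` and `partitionForRecovery` are hypothetical, and only the three conditions and the `skippedStartupCutoff` counter come from the message above.

```typescript
// Hypothetical row shape; fields mirror the columns named in the commit message.
interface OutboxEntry {
  id: string;
  queuedAt: number;             // queued_at, epoch milliseconds
  lastAttemptAt: number | null; // last_attempt_at
  attemptCount: number;         // attempt_count
}

// True when a recovery pass should pick the entry up.
function isRecoverable(entry: OutboxEntry, startupCutoff?: number): boolean {
  // Without a cutoff, every pending entry is eligible.
  if (startupCutoff === undefined) return true;
  // A live delivery was queued after startup and has never been attempted.
  const isLiveDelivery =
    entry.queuedAt >= startupCutoff &&
    entry.lastAttemptAt === null &&
    entry.attemptCount === 0;
  return !isLiveDelivery;
}

// Splits entries and counts how many were skipped because of the cutoff.
function partitionForRecovery(entries: OutboxEntry[], startupCutoff?: number) {
  const recoverable = entries.filter((e) => isRecoverable(e, startupCutoff));
  return {
    recoverable,
    skippedStartupCutoff: entries.length - recoverable.length,
  };
}
```

Under these assumptions, an entry queued before the cutoff (a crash survivor) and an entry with at least one recorded attempt (a transient failure) both pass the predicate, while a fresh entry queued after the cutoff is left to the live delivery path.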
This was referenced Feb 27, 2026
Closing — superseded by the full lifecycle stack (#29997–#30012):
The unified durable lifecycle goals from this PR are now decomposed across the above PRs for incremental review and merge.
Summary
Background
OpenClaw currently implements reliability guarantees as a collection of subsystem-specific mechanisms (inbound dedupe caches, outbound delivery queue, channel-specific update offsets/watermarks, restart catch-up logic, idempotency keys in selected flows, and local retry/permanent-error classification rules). The test suite demonstrates that these guarantees are important and intentionally maintained in isolation, but issue/PR history shows repeated failures at subsystem boundaries, especially during restart/crash/reconnect windows.
Recurring user-visible failures are:
Root architectural gap: there is no single durable lifecycle model for a turn spanning:
Without a unified durable state machine, reliability semantics are repeatedly encoded as local rules (`dedupe`, `skipQueue`, retry classifiers, startup heuristics, pending markers, watermarks). This increases code volume, causes semantic drift, and makes restart correctness depend on special-case recovery code instead of structural guarantees.

Solution approach
Implements a unified durable message lifecycle for single-node OpenClaw, replacing the current fragmented per-subsystem reliability mechanisms (inbound dedupe caches, outbound file queue, startup-only orphan replay, channel-specific retry rules) with one state machine and continuous workers.
What this replaces: Previous draft PR #27939 added SQLite journaling + startup orphan replay on top of the existing fragmented reliability model. That approach received review feedback identifying 9 issues. Rather than patch those individually, this PR implements the structural fix the feedback was pointing toward: one state machine, continuous workers, no startup special-case recovery path.
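To make the "one state machine" idea concrete, here is a minimal sketch built from the turn and outbox state names listed in the storage layer description below. Everything beyond the state names is an assumption: the terminal sets, the helper names `isTerminalTurn` and `turnStateFromOutbox`, and the rule for aggregating outbox rows into a turn state are illustrative and may differ from the PR's implementation.

```typescript
// State names come from the PR description; the rest is illustrative.
type TurnState =
  | "accepted" | "running" | "delivery_pending" | "failed_retryable"
  | "delivered" | "aborted" | "failed_terminal";
type OutboxState =
  | "queued" | "failed_retryable" | "delivered" | "failed_terminal" | "expired";

// Assumption: these are the terminal states on each side.
const TERMINAL_TURN_STATES = new Set<TurnState>([
  "delivered", "aborted", "failed_terminal",
]);
const TERMINAL_OUTBOX_STATES = new Set<OutboxState>([
  "delivered", "failed_terminal", "expired",
]);

function isTerminalTurn(state: TurnState): boolean {
  return TERMINAL_TURN_STATES.has(state);
}

// Hypothetical helper: derive the turn state implied by its outbox rows.
// Returns null when there is no outbox evidence, in which case a worker
// would have to re-dispatch the turn instead of finalizing it.
function turnStateFromOutbox(rows: readonly OutboxState[]): TurnState | null {
  if (rows.length === 0) return null;
  // Any row still in flight keeps the turn in delivery_pending.
  if (rows.some((s) => !TERMINAL_OUTBOX_STATES.has(s))) return "delivery_pending";
  // Assumption: the turn counts as delivered only if every row was delivered.
  return rows.every((s) => s === "delivered") ? "delivered" : "failed_terminal";
}
```

The point of the sketch is that turn finalization becomes a pure function of durable outbox rows, so the same rule can run on the live path and in a recovery worker after a restart.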
Storage layer (`src/infra/message-lifecycle/`):
- `message_turns` table: one row per accepted inbound turn. States: `accepted | running | delivery_pending | failed_retryable | delivered | aborted | failed_terminal`. Unique index on `dedupe_key` (when set). WAL + NORMAL synchronous. In-memory fallback on open failure.
- `message_outbox` table: one row per outbound payload, linked to `turn_id`. States: `queued | failed_retryable | delivered | failed_terminal | expired`. Replaces `delivery-queue/*.json` files.

Delivery queue (`src/infra/outbound/delivery-queue.ts`):
- Migrated from file-based to `message_outbox`. Existing `enqueueDelivery` / `ackDelivery` / `failDelivery` / `moveToFailed` API preserved.
- `getOutboxStatusForTurn` added. Turn finalization triggered when all outbox rows for a turn reach terminal state.
- `importLegacyFileQueue` does one-time import of any `delivery-queue/*.json` artifacts (preserving `lastAttemptAt` for correct backoff).

Inbound dispatch (`src/auto-reply/dispatch.ts`): `dispatchInboundMessageInternal` now:
- `acceptTurn` (records turn, checks dedupe; currently using in-memory dedup path, see below).
- `dispatcher.markComplete()` + `waitForIdle()` before returning (fixes r2862217589).
- `deliveryQueueContext` on the dispatcher so final replies are durably persisted before send.
- `dispatchReplyFromConfig` completes: finalizes turn state based on outbox status (`delivery_pending`, `delivered`, or `failed_terminal`).
- `dispatchResumedTurn` exported for use by the recovery worker: skips `acceptTurn`, sets `resumeTurnId`.

Reply dispatcher (`src/auto-reply/reply/reply-dispatcher.ts`): When `deliveryQueueContext` is set, enqueues each final reply to `message_outbox` before calling `deliver`, acks on success, records error on failure. Adds `setDeliveryQueueContext` / `getDeliveryStats` to the `ReplyDispatcher` interface.

Continuous lifecycle workers (`src/gateway/server-message-lifecycle.ts`):
- `outbox-worker`: polls `recoverPendingDeliveries` on interval (default 1s).
- `turn-worker`: polls `listRecoverableTurns`, checks outbox state per turn, calls `dispatchResumedTurn` for turns with no outbox evidence. Marks turns terminal based on outbox outcomes.
- `setDeliveryQueueContext` is cleared on the recovery dispatcher (outbox rows are written by `routeReply` -> `deliverOutboundPayloads` in the deliver closure; clearing prevents duplicate outbox rows on the direct-delivery path).
- `server.impl.ts`: replaces the one-shot `recoverPendingDeliveries` call with `startMessageLifecycleWorkers`; calls `lifecycleWorkers.stop()` on shutdown.

Plugin outbound compat (`src/channels/plugins/outbound/compat.ts`, `src/plugin-sdk/outbound-adapter.ts`):
- `normalizeChannelOutboundAdapter`: wraps v1 adapters (sendText+sendMedia or sendPayload) into a v2 `sendFinal` function, emits one-time runtime warning for v1 compat mode.
- `createCompatOutboundAdapter`: SDK helper exported from `plugin-sdk/index.ts` for plugin authors.
- `resolveOutboundContractVersion`: detects declared contract version.
- `outboundContract: "v2"`. `direct-text-media.ts` adds explicit `sendFinal`; others go through the compat wrapper.
- `buildChannelAccountSnapshot` now surfaces `outboundContract` in channel status snapshots.

Config additions: `messages.delivery.maxAgeMs` (default 30 min) and `messages.delivery.expireAction` (`"fail" | "deliver"`). Config help and labels updated.

Guarantees
`delivered | aborted | failed_terminal`.
Change Type
Scope
Linked Issues / Prior Work
Deferred / Explicitly Out of Scope
- Persistent dedup disabled in this PR (`disablePersistentDedupe = true` in `turns.ts`): The `dedupe_key` unique index exists and is correct (peer/thread-scoped), but insert uses `NULL` dedupe key so the unique index is never exercised. Cross-restart duplicate suppression continues to use the existing in-memory path. Rationale: per-channel `MessageSid` identity semantics are not fully normalized (Telegram message IDs are per-chat; callback/query IDs vs message IDs differ). Enabling persistent dedup requires per-channel audit of what `MessageSid` maps to. This is the correct call for the first merge; enabling it as a follow-up is straightforward (flip the flag + per-channel validation).
- Streaming/tool/block replies: Not tracked in `message_outbox`. Only final replies participate in durable delivery semantics.
- Multi-node exactly-once: Single-process SQLite; no distributed consensus.
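The reason a dedupe key has to be peer/thread-scoped can be shown with a small sketch. The `InboundIdentity` shape and the `buildDedupeKey` helper are hypothetical and are not taken from `turns.ts`; the only facts used from the description are that message IDs can be per-chat and that a `NULL` key is stored while persistent dedup is disabled.

```typescript
// Hypothetical identity fields for an inbound message.
interface InboundIdentity {
  channel: string;     // e.g. "telegram"
  peerId: string;      // chat or conversation identifier
  threadId?: string;   // optional thread or topic within the peer
  messageSid?: string; // channel-provided message identifier
}

// Returns null when persistent dedup is disabled or no message ID exists,
// which corresponds to storing a NULL key that never hits the unique index.
function buildDedupeKey(
  identity: InboundIdentity,
  persistentDedupeEnabled: boolean,
): string | null {
  if (!persistentDedupeEnabled || !identity.messageSid) return null;
  // Scope the message ID by channel, peer, and thread so that per-chat IDs
  // from different chats cannot collide. Parts are URI-encoded so the ":"
  // separator stays unambiguous.
  return [
    identity.channel,
    identity.peerId,
    identity.threadId ?? "",
    identity.messageSid,
  ].map((part) => encodeURIComponent(part)).join(":");
}
```

With this shape, message `42` in one chat and message `42` in another chat produce different keys, which is the property a per-channel audit of `MessageSid` would need to confirm before the flag is flipped.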
User-visible / Behavior Changes
None. Internal reliability improvement. Automatic migration from old file queue on first startup.
Security Impact
`message-lifecycle.db` in existing state dir, same permissions (0o700)
Repro + Verification
Environment
Test Scenarios
- `kill -9` mid-reply, restart -> turn worker resumes and delivers.
- `failed_terminal` after max retries.
- `sendText`/`sendMedia`: compat wrapper normalizes to `sendFinal`, one-time warning logged.
- `~/.openclaw/message-lifecycle.db` -> gateway degrades gracefully (in-memory fallback active).

E2E scripts (macOS + Telegram): https://gist.github.com/nohat/657942433bb4c4e2a5fed2e12d49940b
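The compat scenario in the list above (a v1 adapter being normalized to `sendFinal`, with a one-time warning) might look roughly like the sketch below. It is not the code in `compat.ts`: the interface shapes, the `wrapV1Adapter` name, the warning text, the choice to prefer `sendPayload` when both styles are present, and the choice to warn at wrap time are all assumptions made for illustration.

```typescript
// Hypothetical payload and adapter shapes.
interface FinalPayload { text?: string; mediaUrl?: string }
interface V1OutboundAdapter {
  sendText?: (text: string) => Promise<void>;
  sendMedia?: (mediaUrl: string, caption?: string) => Promise<void>;
  sendPayload?: (payload: FinalPayload) => Promise<void>;
}
interface V2OutboundAdapter {
  outboundContract: "v2";
  sendFinal: (payload: FinalPayload) => Promise<void>;
}

// Process-wide flag so the compat warning is emitted only once.
let compatWarningLogged = false;

function wrapV1Adapter(
  v1: V1OutboundAdapter,
  warn: (message: string) => void = console.warn,
): V2OutboundAdapter | undefined {
  const { sendText, sendMedia, sendPayload } = v1;
  let sendFinal: V2OutboundAdapter["sendFinal"];
  if (sendPayload) {
    // Assumption: sendPayload wins when an adapter offers both styles.
    sendFinal = (payload) => sendPayload(payload);
  } else if (sendText && sendMedia) {
    // Route media payloads to sendMedia (text as caption), the rest to sendText.
    sendFinal = (payload) =>
      payload.mediaUrl !== undefined
        ? sendMedia(payload.mediaUrl, payload.text)
        : sendText(payload.text ?? "");
  } else {
    // Neither style is available: the adapter cannot be wrapped.
    return undefined;
  }
  if (!compatWarningLogged) {
    compatWarningLogged = true;
    warn("outbound adapter is running in v1 compat mode");
  }
  return { outboundContract: "v2", sendFinal };
}
```

Keeping the wrapper at the adapter boundary means the delivery path only has to deal with a single `sendFinal` entry point, whether a plugin was written against v1 or v2.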
Evidence
- `pnpm check` (0 warnings), `pnpm build`, `pnpm test` (2294 tests across lifecycle, outbound, auto-reply, channels, gateway)
- `after` phase against `codex/unified-lifecycle-main`):
- `message_outbox`
- `aborted` in `message_turns`
- `kill -9` mid-reply, restart, turn-worker delivers
- `running` turns at end of campaign
Compatibility / Migration
- `messages.delivery.*` keys)
- `importLegacyFileQueue`
- Downgrade: `delivery-queue/` entries are migrated and deleted. Delete `message-lifecycle.db` to reset.

Post-stabilization cleanup: after a validation window, the legacy orphan-recovery paths and startup-specific reconciliation code will be removed. This PR intentionally leaves them in place until the continuous workers are confirmed stable in production.
Failure Recovery
- `~/.openclaw/message-lifecycle.db`; gateway falls back to in-memory dedup and best-effort delivery.
- `getLifecycleDb()` throwing on every turn (SQLite file corrupt or missing write permission); `message-lifecycle: legacy queue import failed` at startup.
Risks and Mitigations
- `DatabaseSync` singleton per process. All multi-step ops wrapped in `BEGIN IMMEDIATE` transactions.
- `getOutboxStatusForTurn` before dispatching; delivered turns are finalized without re-dispatch.
- `finalizeTurn` uses `WHERE status IN (non-terminal states)` guard.
- `normalizeChannelOutboundAdapter` returns `undefined` for adapters with neither sendText/sendMedia nor sendPayload; `createPluginHandler` returns null (same behavior as before for fully-incompatible adapters).
- `node:sqlite`): `getLifecycleDb` falls back to in-memory DB with throttled warning, same degradation as the memory subsystem.

AI-assisted (Codex + Claude Code). Review and verification by @nohat.