
feat(gateway): unified durable message lifecycle — SQLite turns+outbox, continuous workers, plugin compat layer#28941

Closed
nohat wants to merge 5 commits into openclaw:main from nohat:codex/unified-lifecycle-main


nohat commented Feb 27, 2026

Summary

Background

OpenClaw currently implements reliability guarantees as a collection of subsystem-specific mechanisms (inbound dedupe caches, outbound delivery queue, channel-specific update offsets/watermarks, restart catch-up logic, idempotency keys in selected flows, and local retry/permanent-error classification rules). The test suite demonstrates that these guarantees are important and intentionally maintained in isolation, but issue/PR history shows repeated failures at subsystem boundaries, especially during restart/crash/reconnect windows.

Recurring user-visible failures are:

  • accepted user messages that never receive a reply after restart/crash/network interruption,
  • duplicate message processing or duplicate replies caused by retries/replays/reconnects,
  • stale queued deliveries replayed long after relevance,
  • inconsistent abort/supersession behavior where canceled work is later retried or delivered,
  • channel-specific catch-up gaps that lose messages sent during downtime.

Root architectural gap: there is no single durable lifecycle model for a turn spanning:

  1. inbound acceptance/idempotency,
  2. run execution state (including retry/abort/supersession),
  3. reply materialization,
  4. outbound delivery and delivery confirmation.

Without a unified durable state machine, reliability semantics are repeatedly encoded as local rules (dedupe, skipQueue, retry classifiers, startup heuristics, pending markers, watermarks). This increases code volume, causes semantic drift, and makes restart correctness depend on special-case recovery code instead of structural guarantees.

Solution approach

Implements a unified durable message lifecycle for single-node OpenClaw, replacing the current fragmented per-subsystem reliability mechanisms (inbound dedupe caches, outbound file queue, startup-only orphan replay, channel-specific retry rules) with one state machine and continuous workers.

What this replaces: Previous draft PR #27939 added SQLite journaling + startup orphan replay on top of the existing fragmented reliability model. That approach received review feedback identifying 9 issues. Rather than patch those individually, this PR implements the structural fix the feedback was pointing toward: one state machine, continuous workers, no startup special-case recovery path.

Storage layer (src/infra/message-lifecycle/):

  • message_turns table: one row per accepted inbound turn. States: accepted | running | delivery_pending | failed_retryable | delivered | aborted | failed_terminal. Unique index on dedupe_key (when set). WAL + NORMAL synchronous. In-memory fallback on open failure.
  • message_outbox table: one row per outbound payload, linked to turn_id. States: queued | failed_retryable | delivered | failed_terminal | expired. Replaces delivery-queue/*.json files.
  • Transaction helpers, schema bootstrap, lifecycle-scoped DB cache with exit cleanup.
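The turn states above can be read as a small state machine with a one-way door into the terminal states. The following is a minimal sketch, not the PR's actual code: the state strings come from this description, while `TurnRow`, `isTerminalTurnStatus`, and `finalizeTurnInMemory` are hypothetical names, and the in-memory row stands in for a guarded `UPDATE ... WHERE status IN (non-terminal states)`.

```typescript
// State strings are taken from the PR description; every other name is hypothetical.
type TurnStatus =
  | "accepted"
  | "running"
  | "delivery_pending"
  | "failed_retryable"
  | "delivered"
  | "aborted"
  | "failed_terminal";

const TERMINAL_TURN_STATES: ReadonlySet<TurnStatus> = new Set<TurnStatus>([
  "delivered",
  "aborted",
  "failed_terminal",
]);

interface TurnRow {
  id: string;
  status: TurnStatus;
}

function isTerminalTurnStatus(status: TurnStatus): boolean {
  return TERMINAL_TURN_STATES.has(status);
}

// Models the guarded update: the row changes only if it is still non-terminal.
// Returns true when the row was updated, false when it was already final.
function finalizeTurnInMemory(row: TurnRow, next: TurnStatus): boolean {
  if (!isTerminalTurnStatus(next)) {
    throw new Error(`not a terminal state: ${next}`);
  }
  if (isTerminalTurnStatus(row.status)) {
    return false; // terminal states are final; a late writer cannot overwrite them
  }
  row.status = next;
  return true;
}
```

With this guard, a late `failed_terminal` write against an already `delivered` turn is a no-op, which is the property that lets a worker and the live dispatch path race safely.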

Delivery queue (src/infra/outbound/delivery-queue.ts): Migrated from file-based to message_outbox. Existing enqueueDelivery/ackDelivery/failDelivery/moveToFailed API preserved. getOutboxStatusForTurn added. Turn finalization triggered when all outbox rows for a turn reach terminal state. importLegacyFileQueue does one-time import of any delivery-queue/*.json artifacts (preserving lastAttemptAt for correct backoff).
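To make the finalization rule concrete, here is a rough sketch of how a turn outcome could be derived from its outbox rows. The mapping is an assumption for illustration: the description only says finalization happens once all rows are terminal, so the choice to treat any non-delivered terminal row (`failed_terminal` or `expired`) as a failed turn is mine, and `resolveTurnOutcome` is a hypothetical name.

```typescript
// Outbox state strings are from the PR description; the mapping below is assumed.
type OutboxStatus =
  | "queued"
  | "failed_retryable"
  | "delivered"
  | "failed_terminal"
  | "expired";

type TurnOutcome = "delivery_pending" | "delivered" | "failed_terminal";

// Returns null when the turn has no outbox rows at all ("no outbox evidence").
function resolveTurnOutcome(statuses: readonly OutboxStatus[]): TurnOutcome | null {
  if (statuses.length === 0) {
    return null;
  }
  const hasPending = statuses.some(
    (s) => s === "queued" || s === "failed_retryable",
  );
  if (hasPending) {
    return "delivery_pending"; // at least one row can still make progress
  }
  // All rows are terminal. Assumption: the turn counts as delivered only if
  // every payload was delivered; anything else finalizes as failed_terminal.
  return statuses.every((s) => s === "delivered") ? "delivered" : "failed_terminal";
}
```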

Inbound dispatch (src/auto-reply/dispatch.ts): dispatchInboundMessageInternal now:

  1. Calls acceptTurn (records turn, checks dedupe — currently using in-memory dedup path; see below).
  2. On dedupe skip: calls dispatcher.markComplete() + waitForIdle() before returning (fixes r2862217589).
  3. Sets deliveryQueueContext on the dispatcher so final replies are durably persisted before send.
  4. After dispatchReplyFromConfig completes: finalizes turn state based on outbox status (delivery_pending, delivered, or failed_terminal).
  5. dispatchResumedTurn exported for use by the recovery worker — skips acceptTurn, sets resumeTurnId.

Reply dispatcher (src/auto-reply/reply/reply-dispatcher.ts): When deliveryQueueContext is set, enqueues each final reply to message_outbox before calling deliver, acks on success, records error on failure. Adds setDeliveryQueueContext / getDeliveryStats to the ReplyDispatcher interface.
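The ordering described here (persist, then send, then ack or record the error) can be sketched as below. This is illustrative only: `InMemoryOutbox` and `deliverFinalReply` are hypothetical stand-ins for `message_outbox` and the dispatcher internals, and only a subset of the outbox states is modeled.

```typescript
// Hypothetical in-memory stand-in for the message_outbox table.
type OutboxRowStatus = "queued" | "failed_retryable" | "delivered";

interface OutboxRow {
  id: number;
  turnId: string;
  payload: string;
  status: OutboxRowStatus;
  lastError?: string;
}

class InMemoryOutbox {
  readonly rows: OutboxRow[] = [];

  enqueue(turnId: string, payload: string): OutboxRow {
    const row: OutboxRow = { id: this.rows.length + 1, turnId, payload, status: "queued" };
    this.rows.push(row);
    return row;
  }

  ack(row: OutboxRow): void {
    row.status = "delivered";
  }

  fail(row: OutboxRow, error: string): void {
    row.status = "failed_retryable";
    row.lastError = error;
  }
}

// Enqueue first so a crash between "enqueue" and "ack" leaves a row that a
// worker can retry; only then attempt the send.
async function deliverFinalReply(
  outbox: InMemoryOutbox,
  turnId: string,
  payload: string,
  deliver: (payload: string) => Promise<void>,
): Promise<OutboxRow> {
  const row = outbox.enqueue(turnId, payload);
  try {
    await deliver(payload);
    outbox.ack(row);
  } catch (err) {
    outbox.fail(row, err instanceof Error ? err.message : String(err));
  }
  return row;
}
```

Because the row exists before `deliver` runs, test scenario 3 (crash after enqueue but before ack) leaves a `queued` row behind rather than losing the reply.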

Continuous lifecycle workers (src/gateway/server-message-lifecycle.ts):

  • outbox-worker: polls recoverPendingDeliveries on interval (default 1s).
  • turn-worker: polls listRecoverableTurns, checks outbox state per turn, and calls dispatchResumedTurn for turns with no outbox evidence. Marks turns terminal based on outbox outcomes. The recovery dispatcher's deliveryQueueContext is cleared: on that path outbox rows are already written by routeReply -> deliverOutboundPayloads in the deliver closure, so leaving the context set would create duplicate outbox rows on the direct-delivery path.
  • Startup: import legacy queue -> start workers. No orphan replay loop, no min-age filter, no startup-specific reconciliation.
  • server.impl.ts: replaces the one-shot recoverPendingDeliveries call with startMessageLifecycleWorkers; calls lifecycleWorkers.stop() on shutdown.
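A continuous worker of this kind is essentially an interval loop with an overlap guard and a stop handle. The sketch below is an assumption about shape, not the contents of `server-message-lifecycle.ts`: `createPollingWorker` and its parameters are hypothetical, and only the 1s default interval comes from this description.

```typescript
// Hypothetical helper; `pass` would be something like a recovery pass.
function createPollingWorker(
  pass: () => Promise<void> | void,
  intervalMs: number = 1000,
  onError: (err: unknown) => void = () => {},
) {
  let timer: ReturnType<typeof setInterval> | undefined;
  let running = false;
  let stopped = false;

  // Runs one pass unless a previous pass is still in flight or the worker
  // has been stopped. Errors are reported and never break the loop.
  async function tick(): Promise<void> {
    if (running || stopped) return;
    running = true;
    try {
      await pass();
    } catch (err) {
      onError(err);
    } finally {
      running = false;
    }
  }

  return {
    tick,
    start(): void {
      stopped = false;
      if (timer === undefined) {
        timer = setInterval(() => {
          void tick();
        }, intervalMs);
      }
    },
    stop(): void {
      stopped = true;
      if (timer !== undefined) {
        clearInterval(timer);
        timer = undefined;
      }
    },
  };
}
```

The overlap guard matters when a pass takes longer than the interval: without it, two passes could pick up the same outbox row and send it twice.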

Plugin outbound compat (src/channels/plugins/outbound/compat.ts, src/plugin-sdk/outbound-adapter.ts):

  • normalizeChannelOutboundAdapter: wraps v1 adapters (sendText+sendMedia or sendPayload) into a v2 sendFinal function, emits one-time runtime warning for v1 compat mode.
  • createCompatOutboundAdapter: SDK helper exported from plugin-sdk/index.ts for plugin authors.
  • resolveOutboundContractVersion: detects declared contract version.
  • All five built-in adapters declare outboundContract: "v2". direct-text-media.ts adds explicit sendFinal; others go through the compat wrapper.
  • buildChannelAccountSnapshot now surfaces outboundContract in channel status snapshots.
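For plugin authors, the compat layer amounts to a small adapter-normalization step. The following sketch assumes simplified, hypothetical types (`FinalPayload`, `OutboundAdapterLike`) and a hypothetical `normalizeAdapterSketch` function; the real `normalizeChannelOutboundAdapter` signature and payload shape may differ, and the precedence of `sendPayload` over `sendText`/`sendMedia` is my assumption.

```typescript
// Simplified, hypothetical shapes for illustration only.
interface FinalPayload {
  text?: string;
  mediaUrl?: string;
}

type SendFinal = (payload: FinalPayload) => Promise<void>;

interface OutboundAdapterLike {
  outboundContract?: "v1" | "v2";
  sendFinal?: SendFinal;
  sendPayload?: (payload: FinalPayload) => Promise<void>;
  sendText?: (text: string) => Promise<void>;
  sendMedia?: (mediaUrl: string, caption?: string) => Promise<void>;
}

let warnedV1Compat = false;

// Returns a v2-style sendFinal, or undefined for adapters that expose
// neither sendPayload nor the sendText + sendMedia pair.
function normalizeAdapterSketch(
  adapter: OutboundAdapterLike,
  warn: (message: string) => void = console.warn,
): SendFinal | undefined {
  if (adapter.sendFinal) {
    return adapter.sendFinal; // already v2, pass through untouched
  }
  const { sendPayload, sendText, sendMedia } = adapter;
  let wrapped: SendFinal | undefined;
  if (sendPayload) {
    wrapped = (payload) => sendPayload(payload);
  } else if (sendText && sendMedia) {
    wrapped = async (payload) => {
      if (payload.mediaUrl !== undefined) {
        await sendMedia(payload.mediaUrl, payload.text);
      } else if (payload.text !== undefined) {
        await sendText(payload.text);
      }
    };
  }
  if (wrapped && !warnedV1Compat) {
    warnedV1Compat = true; // warn once per process, not once per adapter
    warn("outbound adapter is running in v1 compat mode");
  }
  return wrapped;
}
```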

Config additions: messages.delivery.maxAgeMs (default 30 min) and messages.delivery.expireAction ("fail" | "deliver"). Config help and labels updated.
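One possible reading of the two new keys, sketched as a pure decision function. The semantics of `expireAction` are not spelled out in this description, so treat this as an assumption: here `"fail"` marks an over-age entry as expired and `"deliver"` sends it anyway. `DeliveryConfigSketch` and `decideExpiry` are hypothetical names; only the 30-minute default comes from the description.

```typescript
// Assumed semantics; only the key names and the 30 min default are from the PR.
interface DeliveryConfigSketch {
  maxAgeMs?: number;
  expireAction: "fail" | "deliver";
}

const DEFAULT_MAX_AGE_MS = 30 * 60 * 1000;

type ExpiryDecision = "send" | "expire";

function decideExpiry(
  queuedAtMs: number,
  nowMs: number,
  config: DeliveryConfigSketch,
): ExpiryDecision {
  const maxAgeMs = config.maxAgeMs ?? DEFAULT_MAX_AGE_MS;
  const tooOld = nowMs - queuedAtMs > maxAgeMs;
  if (!tooOld) {
    return "send";
  }
  // Over-age entry: either drop it as expired or deliver it regardless.
  return config.expireAction === "deliver" ? "send" : "expire";
}
```

Under this reading, `"fail"` addresses the "stale queued deliveries replayed long after relevance" failure mode from the Background section, while `"deliver"` keeps the old at-all-costs behavior for deployments that prefer a late reply over none.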

Guarantees

  1. Every accepted inbound turn is durably recorded and deduped.
  2. Restart behavior is continuation of non-terminal work, not startup-only orphan replay.
  3. Turn terminal states are explicit and final: delivered | aborted | failed_terminal.
  4. Outbound retry/permanent/expiry classification is centralized.
  5. Abort/supersession transitions suppress replay.
  6. For providers without hard idempotency, duplicate risk is bounded best-effort (not strict exactly-once).

Change Type

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issues / Prior Work

Deferred / Explicitly Out of Scope

Persistent dedup disabled in this PR (disablePersistentDedupe = true in turns.ts): The dedupe_key unique index exists and is correct (peer/thread-scoped), but insert uses NULL dedupe key so the unique index is never exercised. Cross-restart duplicate suppression continues to use the existing in-memory path. Rationale: per-channel MessageSid identity semantics are not fully normalized (Telegram message IDs are per-chat; callback/query IDs vs message IDs differ). Enabling persistent dedup requires per-channel audit of what MessageSid maps to. This is the correct call for the first merge — enabling it as a follow-up is straightforward (flip the flag + per-channel validation).
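To illustrate why the key has to be peer/thread-scoped, here is a small sketch of a scoped dedupe key and an in-memory accept check. All names (`InboundIdentity`, `buildDedupeKey`, `InMemoryTurnStore`) and the key layout are hypothetical; the actual key format in `turns.ts` is not shown in this description.

```typescript
// Hypothetical identity fields; the real per-channel mapping still needs an audit.
interface InboundIdentity {
  channel: string;
  peerId: string;
  threadId?: string;
  messageSid: string;
}

// JSON-encoding the tuple avoids collisions that a plain delimiter join could
// produce when an ID happens to contain the delimiter.
function buildDedupeKey(m: InboundIdentity): string {
  return JSON.stringify([m.channel, m.peerId, m.threadId ?? null, m.messageSid]);
}

class InMemoryTurnStore {
  private readonly seen = new Set<string>();

  // Stand-in for a unique index on dedupe_key: the second insert is rejected.
  acceptTurn(m: InboundIdentity): "accepted" | "duplicate" {
    const key = buildDedupeKey(m);
    if (this.seen.has(key)) {
      return "duplicate";
    }
    this.seen.add(key);
    return "accepted";
  }
}
```

With an unscoped key, message 42 in one Telegram chat would suppress message 42 in another, which is the kind of false positive the per-channel audit is meant to rule out before the flag is flipped.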

Streaming/tool/block replies: Not tracked in message_outbox. Only final replies participate in durable delivery semantics.

Multi-node exactly-once: Single-process SQLite; no distributed consensus.

User-visible / Behavior Changes

None. Internal reliability improvement. Automatic migration from old file queue on first startup.

Security Impact

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No — new message-lifecycle.db in existing state dir, same permissions (0o700)

Repro + Verification

Environment

  • OS: macOS 15
  • Runtime: Node 22+ / Bun
  • Integration/channel: works across all built-in channels

Test Scenarios

  1. Start gateway, send message, kill -9 mid-reply, restart -> turn worker resumes and delivers.
  2. Send two messages with same ID from same peer -> second is deduped (in-memory path still active).
  3. Send message, crash after enqueue but before ack -> outbox worker retries and delivers.
  4. Permanent error (blocked account) -> turn and outbox rows finalize as failed_terminal after max retries.
  5. Plugin with only sendText/sendMedia — compat wrapper normalizes to sendFinal, one-time warning logged.
  6. Downgrade: delete ~/.openclaw/message-lifecycle.db -> gateway degrades gracefully (in-memory fallback active).

E2E scripts (macOS + Telegram): https://gist.github.com/nohat/657942433bb4c4e2a5fed2e12d49940b

Evidence

  • Tests passing: pnpm check (0 warnings), pnpm build, pnpm test (2294 tests across lifecycle, outbound, auto-reply, channels, gateway)
  • Full Telegram E2E campaign passed (7/7 tests, after phase against codex/unified-lifecycle-main):
    • Test 1: DB created on first startup; delivery-queue files migrated to message_outbox
    • Test 2: Inbound dedup persists across restart — lifecycle record survives
    • Test 3: Abort confirmed in chat, recorded as aborted in message_turns
    • Test 4: Orphan recovery — kill -9 mid-reply, restart, turn-worker delivers
    • Test 5: Aborted delivery not re-queued after restart
    • Test 6: Pruning bounded DB — aged turns removed at restart
    • Test 7: No stuck running turns at end of campaign

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No (new optional messages.delivery.* keys)
  • Migration needed? No — automatic on first startup via importLegacyFileQueue

Downgrade: delivery-queue/ entries are migrated and deleted. Delete message-lifecycle.db to reset.

Post-stabilization cleanup: after a validation window, the legacy orphan-recovery paths and startup-specific reconciliation code will be removed. This PR intentionally leaves them in place until the continuous workers are confirmed stable in production.

Failure Recovery

  • How to disable/revert quickly: delete ~/.openclaw/message-lifecycle.db; gateway falls back to in-memory dedup and best-effort delivery.
  • Known bad symptoms: getLifecycleDb() throwing on every turn (SQLite file corrupt or missing write permission); message-lifecycle: legacy queue import failed at startup.

Risks and Mitigations

  • SQLite write contention: WAL mode + single DatabaseSync singleton per process. All multi-step ops wrapped in BEGIN IMMEDIATE transactions.
  • Turn worker re-dispatching already-delivered turns: Turn worker checks getOutboxStatusForTurn before dispatching; delivered turns are finalized without re-dispatch. finalizeTurn uses WHERE status IN (non-terminal states) guard.
  • Plugin adapters without sendFinal: normalizeChannelOutboundAdapter returns undefined for adapters with neither sendText/sendMedia nor sendPayload; createPluginHandler returns null (same behavior as before for fully-incompatible adapters).
  • Node < 22.5 (no node:sqlite): getLifecycleDb falls back to in-memory DB with throttled warning — same degradation as the memory subsystem.

AI-assisted (Codex + Claude Code). Review and verification by @nohat.

  • AI-assisted
  • Author has verified E2E behavior (full 7-test Telegram campaign passed)

openclaw-barnacle bot added the docs, app: web-ui, gateway, and size: XL labels Feb 27, 2026
nohat force-pushed the codex/unified-lifecycle-main branch from 1a78974 to 2523cd8 on February 27, 2026 16:47
openclaw-barnacle bot added the agents label Feb 27, 2026
openclaw-barnacle bot added the scripts label Feb 27, 2026
Design doc belongs in the PR description, not the Mintlify tree.
E2E scripts require macOS + Telegram + personal state dir;
moving to a gist rather than committing as repo test artifacts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
openclaw-barnacle bot removed the docs and scripts labels Feb 27, 2026
…n restart

Entries enqueued after gateway startup with no prior attempt are live
deliveries in flight — picking them up in the outbox worker would cause
every reply to be sent twice (regression introduced by the unified lifecycle PR).

Fix: loadPendingDeliveries now accepts an optional startupCutoff timestamp
(set to Date.now() before workers start). Entries with queued_at >= cutoff
AND last_attempt_at IS NULL AND attempt_count = 0 are excluded from
recovery passes. Crash survivors (queued before startup) and transient
failures (already had an attempt) are always included.

Also adds RecoverySummary.skippedStartupCutoff counter so gateway logs
surface how many live-delivery entries are filtered per pass — providing
observable validation that no duplicate sends are occurring.
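
The cutoff rule in this commit message can be expressed as a single predicate. The sketch below follows the stated condition (queued_at >= cutoff AND last_attempt_at IS NULL AND attempt_count = 0), but the `PendingDelivery` shape and the `partitionByStartupCutoff` helper are hypothetical and not the actual `loadPendingDeliveries` code.

```typescript
// Hypothetical row shape mirroring the columns named in the commit message.
interface PendingDelivery {
  id: string;
  queuedAt: number;
  lastAttemptAt: number | null;
  attemptCount: number;
}

interface CutoffPartition {
  recoverable: PendingDelivery[];
  skippedStartupCutoff: number;
}

function partitionByStartupCutoff(
  entries: readonly PendingDelivery[],
  startupCutoff?: number,
): CutoffPartition {
  const recoverable: PendingDelivery[] = [];
  let skippedStartupCutoff = 0;
  for (const entry of entries) {
    // A "live" entry was enqueued after startup and has never been attempted,
    // so the normal delivery path still owns it.
    const isLive =
      startupCutoff !== undefined &&
      entry.queuedAt >= startupCutoff &&
      entry.lastAttemptAt === null &&
      entry.attemptCount === 0;
    if (isLive) {
      skippedStartupCutoff += 1;
    } else {
      recoverable.push(entry);
    }
  }
  return { recoverable, skippedStartupCutoff };
}
```

Crash survivors (queued before the cutoff) and transient failures (at least one attempt) both fall through to `recoverable`, matching the two cases the commit message says are always included.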

nohat commented Mar 1, 2026

Closing — superseded by the full lifecycle stack (#29997#30012):

The unified durable lifecycle goals from this PR are now decomposed across the above PRs for incremental review and merge.
