Skip to content

Message Reliability: Durable SQLite Outbox, Recovery Worker, and Unified sendPayload #32063

@nohat

Description

@nohat

Background

OpenClaw currently implements outbound delivery reliability as a collection of subsystem-specific mechanisms: a file-based delivery queue (delivery-queue/*.json), channel-specific retry rules, startup-only orphan replay, and local permanent-error classification. Issue and PR history shows repeated failures at subsystem boundaries, especially during restart, crash, and reconnect windows.

Recurring user-visible failures:

The file-based queue is not atomic, not queryable, and accumulates silently — there is no status tracking, no per-entry retry state, and no pruning. A crash mid-write produces a corrupt entry. The outbound path also splits between sendText and sendMedia as separate adapter methods, forcing the delivery layer to coordinate text chunking, media ordering, and channelData dispatch across two code paths — behavior is inconsistent across channels.

Root architectural gap: there is no single durable lifecycle for an outbound delivery spanning enqueue, send attempt, retry/backoff, permanent-error detection, and delivery confirmation. Without a unified durable state machine, reliability semantics are encoded as local rules (skipQueue, retry classifiers, startup heuristics, pending markers). This increases code volume, causes semantic drift, and makes restart correctness depend on special-case recovery code instead of structural guarantees.

Prior work

Goals

  • At-least-once delivery for all outbound messages across gateway restarts, crashes, and transient network failures
  • Bounded retry with exponential backoff (5s → 25s → 2m → 10m → 10m; max 5 attempts) and configurable message expiry
  • Permanent error detection — stop retrying unrecoverable failures (chat not found, user blocked, bot kicked) immediately
  • Unified sendPayload adapter method across all 23 channel adapters (built-in + extensions) — single entry point for text, media, and channelData
  • Automatic legacy migration from file-based queue on upgrade, with rollback safety
  • Crash-safe recovery worker that runs on startup with startupCutoff + markAttemptStarted guards to prevent double-delivery

Non-Goals

  • Exactly-once delivery — at-least-once only; idempotency_key column exists in the schema but is not yet populated
  • Cross-instance coordination — single-gateway assumption; no distributed locking or multi-node consensus
  • Delivery monitoring dashboard — no openclaw status --delivery command (future work)
  • Per-message delivery receipts exposed to end users
  • Persistent inbound dedupdedupe_key column exists but persistent dedup is deferred (requires per-channel MessageSid audit)
  • Config-resolved chunk overrides in adapters — chunk configuration is the delivery layer's responsibility, not individual adapters'

Technical Approach

Write-ahead outbox pattern — every outbound message is persisted to a SQLite message_outbox table before the first send attempt. If the process crashes mid-send, the entry survives and the recovery worker retries it on next startup. This converts "message lost on crash" into "message delivered after restart."

SQLite via Node's native node:sqlite DatabaseSync — synchronous API avoids race conditions between enqueue and send. WAL mode + busy_timeout=5000 give concurrent-read performance with crash durability. In-memory fallback if disk is unavailable (degrades gracefully to pre-change behavior — same reliability as the old file queue).

Status lifecyclequeueddelivered | failed_retryablefailed_terminal | expired. Terminal states have a status guard (WHERE status IN ('queued','failed_retryable')) preventing resurrection. Rows are pruned after 48 hours.

Startup safetystartupCutoff (recorded at process boot) prevents the recovery worker from double-delivering entries that the live send path is already handling. markAttemptStarted pushes next_attempt_at forward by 25s as defense-in-depth against slow channels (e.g., Signal's 10s RPC timeout).

Permanent error detection — regex patterns match known unrecoverable errors ("chat not found", "user blocked", "bot was kicked", etc.) and terminalize immediately rather than burning retries. Non-final dispatch rows (tool/block replies) are pruned on recovery since they have no standalone recovery path.

Unified sendPayload — each adapter implements a single sendPayload(ctx) method that handles text + media + channelData atomically. The adapter decides chunking (using its own chunker/textChunkLimit), media ordering, and channel-specific behavior. The delivery layer prefers sendPayload when available, falling back to sendText/sendMedia for adapters that don't declare it.

Turn lifecycle coordination — outbox rows are linked to turns via turn_id. The recovery worker uses outbox status as the source of truth for turn finalization: if all outbox rows for a turn are delivered, the turn is finalized as delivered; if any are terminal-failed with none queued, the turn is finalized as failed.

Technical Details

SQLite Outbox Storage (src/infra/message-lifecycle/db.ts)
  • Database file: <stateDir>/message-lifecycle.db
  • message_outbox table columns:
    • id (TEXT PK), turn_id, channel, account_id, target, dispatch_kind
    • payload (TEXT — JSON blob with channel, to, accountId, payloads[], threadId, replyToId, bestEffort, gifPlayback, silent, mirror)
    • queued_at, status, attempt_count, next_attempt_at, last_attempt_at
    • last_error, error_class (permanent | terminal), delivered_at, terminal_reason, completed_at
    • idempotency_key (reserved, not yet populated)
  • Indexes:
    • idx_message_outbox_idem — unique on idempotency_key where not null
    • idx_message_outbox_turn_status — on (turn_id, status) for turn lifecycle queries
    • idx_message_outbox_resume — on (status, next_attempt_at, queued_at) for recovery worker scans
  • PRAGMA: journal_mode=WAL, synchronous=NORMAL, busy_timeout=5000
  • In-memory fallback: sticky for process lifetime if disk open fails; isLifecycleDbInMemory() exposed for callers
  • Transaction helper: runLifecycleTransaction with BEGIN IMMEDIATE / COMMIT / ROLLBACK
Delivery Queue (src/infra/outbound/delivery-queue.ts)

Core functions:

  • enqueueDelivery — inserts row as queued, returns entry ID
  • ackDelivery — sets status delivered, triggers turn finalization via maybeFinalizeTurnDelivered
  • failDelivery — checks permanent error patterns, increments attempt_count, transitions to failed_retryable (with backoff) or failed_terminal; status guard prevents overwriting terminal rows
  • moveToFailed — terminal-only guard (AND status IN ('queued','failed_retryable')) prevents overwriting already-acked entries
  • markAttemptStarted — sets last_attempt_at and pushes next_attempt_at forward by 25s before send starts
  • loadPendingDeliveries — queries non-terminal entries where next_attempt_at <= now, filters dispatch_kind IS NULL OR dispatch_kind = 'final', respects startupCutoff
  • recoverPendingDeliveries — expires stale entries per maxAgeMs/expireAction, iterates pending with time budget (maxRecoveryMs, default 60s), marks orphaned non-final rows terminal
  • importLegacyFileQueue — one-time migration from delivery-queue/*.json; keeps files if DB is in-memory; preserves lastAttemptAt for correct backoff
  • pruneOutbox — deletes terminal rows older than 48h; guards against invalid ageMs
  • isPermanentDeliveryError — regex array: "no conversation reference found", "chat not found", "user not found", "bot was blocked", "bot was kicked", "chat_id is empty", "outbound not configured"
  • computeBackoffMs — lookup table: [5s, 25s, 120s, 600s]
  • Constants: MAX_RETRIES = 5, OUTBOX_PRUNE_AGE_MS = 48h, DEFAULT_OUTBOX_MAX_AGE_MS = 30min
sendPayload Adapter Interface (src/channels/plugins/types.adapters.ts)
  • New optional method on ChannelOutboundAdapter:
    sendPayload?: (ctx: ChannelOutboundPayloadContext) => Promise<OutboundDeliveryResult>
  • ChannelOutboundPayloadContext = ChannelOutboundContext & { payload: ReplyPayload }
  • Each adapter's sendPayload handles:
    • Empty payload guard: no text + no media → no-op return
    • Media dispatch: iterates mediaUrls (or single mediaUrl), caption on first only
    • Text chunking: uses adapter's own chunker/textChunkLimit for defense-in-depth
    • Empty chunk guard: handles edge case where chunker returns []
Delivery Path Integration (src/infra/outbound/deliver.ts)
  • Prefers sendPayload when adapter supports it (for all payloads, not just channelData)
  • Write-ahead: enqueues to outbox before send, acks/fails after completion (best-effort — queue write failure doesn't block delivery)
  • markAttemptStarted called before send to prevent recovery-vs-live-send race
Reply Dispatcher + Dispatch Integration

src/auto-reply/reply/reply-dispatcher.ts:

  • When deliveryQueueContext is set, enqueues each final reply before calling deliver
  • Chains ackDelivery/failDelivery to the delivery promise (best-effort)

src/auto-reply/dispatch.ts:

  • resolveDeliveryQueueContext extracts channel, target, account ID, thread/reply IDs
  • Interaction-scoped dispatchers (Slack slash, Discord native commands) skip outbox to avoid replaying to wrong destination
  • Turn finalization uses outbox status as truth source

src/gateway/server-message-lifecycle.ts:

  • Outbox worker: polls recoverPendingDeliveries on interval (default 1s)
  • Turn worker: polls listRecoverableTurns, checks outbox state, calls dispatchResumedTurn for turns with no outbox evidence
Adapters with sendPayload (23 total across 4 PRs)

Core outbound plugins (src/channels/plugins/outbound/):

  • direct-text-media.ts (shared factory for iMessage/Signal pattern)
  • discord.ts, slack.ts, whatsapp.ts

Extension adapters (extensions/*/src/channel.ts or outbound.ts):

PR Dependency Graph

Independent (merge in any order):
  #30144  batch-d: Zalo, Zalouser, core outbound plugins          ✅ MERGED
  #30141  batch-a: BlueBubbles, iMessage, Signal, Telegram, WhatsApp
  #30142  batch-b: Discord, Google Chat, Mattermost, MS Teams, Slack, Synology
  #30143  batch-c: Feishu, IRC, Matrix, Nextcloud-Talk, Nostr, Tlon, Twitch

Then sequentially:
  #29998  SQLite outbox migration
    └── #30009  Recovery worker + delivery tracking
          └── #29997  Prefer sendPayload for all payloads

Adapter batch PRs are independent of each other and of the infrastructure PRs. They add additive sendPayload methods with no behavioral change to existing delivery. The infrastructure PRs must land after the adapter PRs so that sendPayload is available across all channels before the delivery layer switches to prefer it.

Configuration

Key Type Default Description
messages.delivery.maxAgeMs number 1800000 (30 min) Maximum queued age before expiry check
messages.delivery.expireAction "fail" or "deliver" "deliver" What to do with entries older than maxAgeMs: "fail" marks them expired; "deliver" allows late delivery

Internal constants (not user-configurable):

Constant Value Description
MAX_RETRIES 5 Maximum delivery attempts before terminal failure
Backoff schedule 5s, 25s, 2m, 10m, 10m Exponential backoff per retry
OUTBOX_PRUNE_AGE_MS 48h Terminal row cleanup threshold
markAttemptStarted guard 25s Recovery exclusion window for in-flight sends

Migration and Upgrade

  • Automatic: on first startup after upgrade, importLegacyFileQueue reads existing *.json files from <stateDir>/delivery-queue/ and inserts them into the SQLite outbox. Files are deleted after successful import (unless DB fell back to in-memory). Preserves lastAttemptAt for correct backoff.
  • Rollback safe: if downgraded, old code ignores the .db file and continues using file-based queue. No data loss in either direction.
  • In-memory fallback: if the filesystem is read-only or disk is full, the outbox runs in-memory (same reliability as the old file queue). Logged at verbose level.
  • No config changes required: messages.delivery.* keys have sensible defaults.

Testing

~150 tests added across all 7 PRs:

  • Per-adapter *.sendpayload.test.ts files: text-only, single media, multi-media, empty payload, long-text chunking, empty chunk guard
  • src/infra/outbound/outbound.test.ts: enqueue/ack/fail lifecycle, backoff calculation, permanent error detection, recovery worker, legacy import, pruning, expiry, dispatch_kind filtering, startupCutoff, moveToFailed status guard
  • src/infra/outbound/deliver.test.ts: sendPayload routing preference, chunking integration

Related Issues

Closes:

Addresses (partially):

Prior work (closed):

Pull Requests

PR Title Status
#30144 feat(adapters): add sendPayload to batch-d (Zalo, Zalouser, core outbound plugins) ✅ Merged
#30141 feat(adapters): add sendPayload to batch-a (BlueBubbles, iMessage, Signal, Telegram, WhatsApp) Open
#30142 feat(adapters): add sendPayload to batch-b (Discord, Google Chat, Mattermost, MS Teams, Slack, Synology) Open
#30143 feat(adapters): add sendPayload to batch-c (Feishu, IRC, Matrix, Nextcloud-Talk, Nostr, Tlon, Twitch) Open
#29998 feat(outbox): migrate delivery queue from file-based to SQLite outbox Open
#30009 feat(outbox): write-ahead outbox with recovery worker and delivery tracking Open
#29997 feat(outbound): prefer sendPayload for all payloads when adapter supports it Open

Future Work

  • Populate idempotency_key for channels that support deduplication (requires per-channel MessageSid audit)
  • Persistent inbound dedup (flip disablePersistentDedupe flag + per-channel validation)
  • openclaw status --delivery monitoring command
  • Periodic recovery timer (not just startup) for long-running gateway sessions
  • Cross-instance coordination for multi-gateway deployments
  • Configurable maxRetries per channel (Feature Request: delivery.maxRetries config for delivery queue #30496)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions