Message Reliability: Durable SQLite Outbox, Recovery Worker, and Unified sendPayload

## Background

OpenClaw currently implements outbound delivery reliability as a collection of subsystem-specific mechanisms: a file-based delivery queue (`delivery-queue/*.json`), channel-specific retry rules, startup-only orphan replay, and local permanent-error classification. Issue and PR history shows repeated failures at subsystem boundaries, especially during restart, crash, and reconnect windows.

Recurring user-visible failures:

- **Messages lost after crash/restart** — if the gateway dies after generating a reply but before the file-queue write completes (or the send succeeds), the message is lost permanently (#22376, #9208, #14827, #26783).
- **Retries run forever** — failed entries have no TTL or permanent-error classification; they retry indefinitely, consuming resources and causing periodic hangs (#23777, #24353).
- **No message expiry** — stale queued deliveries from hours or days ago can be delivered unexpectedly when the queue is finally processed (#16555).
- **Recovery only at startup** — the queue processor only fires on gateway boot; a failed send during normal operation can wait hours for the next restart (#21722).
- **Duplicate replies after restart** — channels like Telegram and WhatsApp redeliver unacknowledged messages; after a gateway restart the in-memory seen-set is cleared and the bot replies twice (#26764, #14431).

The file-based queue is not atomic, not queryable, and accumulates silently — there is no status tracking, no per-entry retry state, and no pruning. A crash mid-write produces a corrupt entry. The outbound path also splits between `sendText` and `sendMedia` as separate adapter methods, forcing the delivery layer to coordinate text chunking, media ordering, and `channelData` dispatch across two code paths — behavior is inconsistent across channels.

Root architectural gap: there is no single durable lifecycle for an outbound delivery spanning enqueue, send attempt, retry/backoff, permanent-error detection, and delivery confirmation. Without a unified durable state machine, reliability semantics are encoded as local rules (`skipQueue`, retry classifiers, startup heuristics, pending markers). This increases code volume, causes semantic drift, and makes restart correctness depend on special-case recovery code instead of structural guarantees.

### Prior work

- #14725 proposed a persistent outbox queue (closed — implemented by this work).
- #27939 added SQLite journaling + startup orphan replay on top of the existing fragmented model. Review feedback identified 9 issues with that incremental approach.
- #28941 was the structural rewrite (unified durable lifecycle, continuous workers, plugin compat layer). It was closed in favor of the current incremental PR series, which delivers the same guarantees in smaller, independently reviewable pieces.

## Goals

- **At-least-once delivery** for all outbound messages across gateway restarts, crashes, and transient network failures
- **Bounded retry** with exponential backoff (5s → 25s → 2m → 10m → 10m; max 5 attempts) and configurable message expiry
- **Permanent error detection** — stop retrying unrecoverable failures (chat not found, user blocked, bot kicked) immediately
- **Unified `sendPayload`** adapter method across all 23 channel adapters (built-in + extensions) — single entry point for text, media, and channelData
- **Automatic legacy migration** from file-based queue on upgrade, with rollback safety
- **Crash-safe recovery worker** that runs on startup with `startupCutoff` + `markAttemptStarted` guards to prevent double-delivery

## Non-Goals

- **Exactly-once delivery** — at-least-once only; `idempotency_key` column exists in the schema but is not yet populated
- **Cross-instance coordination** — single-gateway assumption; no distributed locking or multi-node consensus
- **Delivery monitoring dashboard** — no `openclaw status --delivery` command (future work)
- **Per-message delivery receipts** exposed to end users
- **Persistent inbound dedup** — `dedupe_key` column exists but persistent dedup is deferred (requires per-channel `MessageSid` audit)
- **Config-resolved chunk overrides in adapters** — chunk configuration is the delivery layer's responsibility, not individual adapters'

## Technical Approach

**Write-ahead outbox pattern** — every outbound message is persisted to a SQLite `message_outbox` table before the first send attempt. If the process crashes mid-send, the entry survives and the recovery worker retries it on next startup. This converts "message lost on crash" into "message delivered after restart."

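The ordering described above (persist, then send, then ack or fail) can be sketched with an in-memory `Map` standing in for the `message_outbox` table. This is a minimal illustration, not the project's code: `SketchOutbox`, `SketchRow`, and `deliverWithOutbox` are hypothetical names, and the send callback is synchronous for brevity.

```typescript
// Hypothetical in-memory stand-in for the SQLite outbox table.
type SketchStatus = "queued" | "delivered" | "failed_retryable";

interface SketchRow {
  id: string;
  payload: string;
  status: SketchStatus;
  attemptCount: number;
}

class SketchOutbox {
  private rows = new Map<string, SketchRow>();
  private nextId = 1;

  // Step 1: persist the message before any send attempt.
  enqueue(payload: string): string {
    const id = `entry-${this.nextId}`;
    this.nextId += 1;
    this.rows.set(id, { id, payload, status: "queued", attemptCount: 0 });
    return id;
  }

  // Step 2a: record a successful send.
  ack(id: string): void {
    const row = this.rows.get(id);
    if (row) row.status = "delivered";
  }

  // Step 2b: record a failed attempt so a later pass can retry it.
  fail(id: string): void {
    const row = this.rows.get(id);
    if (row) {
      row.status = "failed_retryable";
      row.attemptCount += 1;
    }
  }

  statusOf(id: string): SketchStatus | undefined {
    return this.rows.get(id)?.status;
  }

  // Ids a recovery pass would still need to look at, in insertion order.
  pendingIds(): string[] {
    return Array.from(this.rows.values())
      .filter((row) => row.status !== "delivered")
      .map((row) => row.id);
  }
}

// Enqueue first, then send, then ack or fail. If the process died between
// enqueue and ack, the row would simply stay "queued" for recovery.
function deliverWithOutbox(
  outbox: SketchOutbox,
  payload: string,
  send: (payload: string) => void,
): string {
  const id = outbox.enqueue(payload);
  try {
    send(payload);
    outbox.ack(id);
  } catch {
    outbox.fail(id);
  }
  return id;
}
```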
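**SQLite via Node's native `node:sqlite` DatabaseSync** — the synchronous API avoids race conditions between enqueue and send. WAL mode lets readers proceed concurrently with the writer, and `busy_timeout=5000` makes a blocked connection wait up to 5 seconds for a lock instead of failing immediately. With `synchronous=NORMAL`, committed writes survive a process crash; a power loss or OS crash can roll back the most recent transactions but does not corrupt the database. If the disk is unavailable, the outbox falls back to an in-memory database, which degrades gracefully to pre-change behavior (same reliability as the old file queue).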

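**Status lifecycle** — entries start as `queued`. A send attempt moves an entry to `delivered`, `failed_retryable` (retried after backoff), or `failed_terminal`; stale entries can be marked `expired`. `delivered`, `failed_terminal`, and `expired` are terminal: updates carry a status guard (`WHERE status IN ('queued','failed_retryable')`) that prevents resurrection. Terminal rows are pruned after 48 hours.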

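The effect of the status guard can be illustrated in isolation. The sketch below is an assumption-laden model, not the actual SQL path: `applyTransition` is a hypothetical helper that mimics an `UPDATE ... WHERE status IN ('queued','failed_retryable')` by leaving terminal rows unchanged.

```typescript
type OutboxStatus =
  | "queued"
  | "failed_retryable"
  | "delivered"
  | "failed_terminal"
  | "expired";

// Only these states may still change; this mirrors the SQL guard
// `WHERE status IN ('queued','failed_retryable')`.
const NON_TERMINAL: ReadonlySet<OutboxStatus> = new Set<OutboxStatus>([
  "queued",
  "failed_retryable",
]);

// Returns the status a row ends up with: the requested one if the row was
// still open, otherwise the unchanged terminal status.
function applyTransition(current: OutboxStatus, next: OutboxStatus): OutboxStatus {
  return NON_TERMINAL.has(current) ? next : current;
}
```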
**Startup safety** — `startupCutoff` (recorded at process boot) prevents the recovery worker from double-delivering entries that the live send path is already handling. `markAttemptStarted` pushes `next_attempt_at` forward by 25s as defense-in-depth against slow channels (e.g., Signal's 10s RPC timeout).

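One possible reading of the two guards, sketched under stated assumptions: recovery only considers rows queued before `startupCutoff` (rows queued later belong to the live send path), and a row is skipped while its `next_attempt_at` lies in the future. The comparison rule, `PendingRow`, `markAttemptStartedSketch`, and `isRecoverable` are hypothetical; only the 25s window comes from the text.

```typescript
interface PendingRow {
  id: string;
  queuedAt: number; // ms since epoch
  nextAttemptAt: number; // ms since epoch
}

// 25s exclusion window from the proposal.
const ATTEMPT_GUARD_MS = 25000;

// Called just before a send starts: push the next retry time forward so a
// concurrent recovery scan does not pick the same row up mid-send.
function markAttemptStartedSketch(row: PendingRow, now: number): PendingRow {
  return { ...row, nextAttemptAt: now + ATTEMPT_GUARD_MS };
}

// Assumed rule: recovery only touches rows that predate this process
// (queuedAt < startupCutoff) and whose retry time has arrived.
function isRecoverable(row: PendingRow, now: number, startupCutoff: number): boolean {
  return row.queuedAt < startupCutoff && row.nextAttemptAt <= now;
}
```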
**Permanent error detection** — regex patterns match known unrecoverable errors ("chat not found", "user blocked", "bot was kicked", etc.) and terminalize immediately rather than burning retries. Non-final dispatch rows (tool/block replies) are pruned on recovery since they have no standalone recovery path.

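A rough sketch of this classification, using the pattern strings this proposal lists for `isPermanentDeliveryError`. The case-insensitive flag, the helper names `looksPermanent` and `classifyFailure`, and the assumption that `attemptCount` already includes the attempt that just failed are illustrative choices, not confirmed details.

```typescript
// Pattern strings come from this proposal; the `i` flag is an assumption.
const PERMANENT_ERROR_PATTERNS: readonly RegExp[] = [
  /no conversation reference found/i,
  /chat not found/i,
  /user not found/i,
  /bot was blocked/i,
  /bot was kicked/i,
  /chat_id is empty/i,
  /outbound not configured/i,
];

function looksPermanent(errorMessage: string): boolean {
  return PERMANENT_ERROR_PATTERNS.some((pattern) => pattern.test(errorMessage));
}

type FailureOutcome = "failed_terminal" | "failed_retryable";

// Permanent errors terminalize at once; everything else stays retryable
// until the attempt budget (5 in this proposal) is spent.
function classifyFailure(
  errorMessage: string,
  attemptCount: number,
  maxRetries: number = 5,
): FailureOutcome {
  if (looksPermanent(errorMessage)) return "failed_terminal";
  return attemptCount >= maxRetries ? "failed_terminal" : "failed_retryable";
}
```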
**Unified sendPayload** — each adapter implements a single `sendPayload(ctx)` method that handles text + media + channelData atomically. The adapter decides chunking (using its own `chunker`/`textChunkLimit`), media ordering, and channel-specific behavior. The delivery layer prefers `sendPayload` when available, falling back to `sendText`/`sendMedia` for adapters that don't declare it.

**Turn lifecycle coordination** — outbox rows are linked to turns via `turn_id`. The recovery worker uses outbox status as the source of truth for turn finalization: if all outbox rows for a turn are delivered, the turn is finalized as delivered; if any are terminal-failed with none queued, the turn is finalized as failed.

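The finalization rule can be written as a small pure function. This is a sketch only: `resolveTurnOutcome` is a hypothetical name, and it assumes that `expired` counts as a terminal failure, that `failed_retryable` rows count as still open alongside `queued`, and that a turn with no outbox rows stays pending.

```typescript
type OutboxStatus =
  | "queued"
  | "failed_retryable"
  | "delivered"
  | "failed_terminal"
  | "expired";

type TurnOutcome = "delivered" | "failed" | "pending";

function resolveTurnOutcome(statuses: readonly OutboxStatus[]): TurnOutcome {
  // No outbox evidence yet: nothing to finalize (assumption).
  if (statuses.length === 0) return "pending";
  // Every row delivered: the turn is delivered.
  if (statuses.every((status) => status === "delivered")) return "delivered";
  const stillOpen = statuses.some(
    (status) => status === "queued" || status === "failed_retryable",
  );
  const anyTerminalFailure = statuses.some(
    (status) => status === "failed_terminal" || status === "expired",
  );
  // A terminal failure only finalizes the turn once no row can still be sent.
  return anyTerminalFailure && !stillOpen ? "failed" : "pending";
}
```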
## Technical Details

<details>
<summary>SQLite Outbox Storage (<code>src/infra/message-lifecycle/db.ts</code>)</summary>

- Database file: `<stateDir>/message-lifecycle.db`
- `message_outbox` table columns:
  - `id` (TEXT PK), `turn_id`, `channel`, `account_id`, `target`, `dispatch_kind`
  - `payload` (TEXT — JSON blob with channel, to, accountId, payloads[], threadId, replyToId, bestEffort, gifPlayback, silent, mirror)
  - `queued_at`, `status`, `attempt_count`, `next_attempt_at`, `last_attempt_at`
  - `last_error`, `error_class` (permanent | terminal), `delivered_at`, `terminal_reason`, `completed_at`
  - `idempotency_key` (reserved, not yet populated)
- Indexes:
  - `idx_message_outbox_idem` — unique on `idempotency_key` where not null
  - `idx_message_outbox_turn_status` — on `(turn_id, status)` for turn lifecycle queries
  - `idx_message_outbox_resume` — on `(status, next_attempt_at, queued_at)` for recovery worker scans
- PRAGMA: `journal_mode=WAL`, `synchronous=NORMAL`, `busy_timeout=5000`
- In-memory fallback: sticky for process lifetime if disk open fails; `isLifecycleDbInMemory()` exposed for callers
- Transaction helper: `runLifecycleTransaction` with `BEGIN IMMEDIATE` / `COMMIT` / `ROLLBACK`

</details>

<details>
<summary>Delivery Queue (<code>src/infra/outbound/delivery-queue.ts</code>)</summary>

Core functions:
- `enqueueDelivery` — inserts row as `queued`, returns entry ID
- `ackDelivery` — sets status `delivered`, triggers turn finalization via `maybeFinalizeTurnDelivered`
- `failDelivery` — checks permanent error patterns, increments `attempt_count`, transitions to `failed_retryable` (with backoff) or `failed_terminal`; status guard prevents overwriting terminal rows
- `moveToFailed` — status guard (`AND status IN ('queued','failed_retryable')`) restricts the update to non-terminal rows, so already-acked entries are not overwritten
- `markAttemptStarted` — sets `last_attempt_at` and pushes `next_attempt_at` forward by 25s before send starts
- `loadPendingDeliveries` — queries non-terminal entries where `next_attempt_at <= now`, filters `dispatch_kind IS NULL OR dispatch_kind = 'final'`, respects `startupCutoff`
- `recoverPendingDeliveries` — expires stale entries per `maxAgeMs`/`expireAction`, iterates pending with time budget (`maxRecoveryMs`, default 60s), marks orphaned non-final rows terminal
- `importLegacyFileQueue` — one-time migration from `delivery-queue/*.json`; keeps files if DB is in-memory; preserves `lastAttemptAt` for correct backoff
- `pruneOutbox` — deletes terminal rows older than 48h; guards against invalid `ageMs`
- `isPermanentDeliveryError` — regex array: "no conversation reference found", "chat not found", "user not found", "bot was blocked", "bot was kicked", "chat_id is empty", "outbound not configured"
- `computeBackoffMs` — lookup table: [5s, 25s, 120s, 600s]
- Constants: `MAX_RETRIES = 5`, `OUTBOX_PRUNE_AGE_MS = 48h`, `DEFAULT_OUTBOX_MAX_AGE_MS = 30min`

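The backoff lookup can be sketched as a clamped table read. The table values come from the list above; the 1-based `attempt` index and the name `backoffForAttempt` are assumptions made for this example. Clamping to the last entry is what yields the 5s, 25s, 2m, 10m, 10m schedule from a four-entry table.

```typescript
// Backoff table from this proposal: 5s, 25s, 120s, 600s.
const BACKOFF_MS: readonly number[] = [5000, 25000, 120000, 600000];

// `attempt` is assumed to be 1-based. Attempts past the end of the table
// reuse the last entry; values below 1 are treated as the first attempt.
function backoffForAttempt(attempt: number): number {
  const clamped = Math.min(Math.max(attempt, 1), BACKOFF_MS.length);
  return BACKOFF_MS[clamped - 1];
}

// Next retry time for a row that just failed its `attempt`-th send.
function nextAttemptAt(now: number, attempt: number): number {
  return now + backoffForAttempt(attempt);
}
```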
</details>

<details>
<summary>sendPayload Adapter Interface (<code>src/channels/plugins/types.adapters.ts</code>)</summary>

- New optional method on `ChannelOutboundAdapter`:
  ```typescript
  sendPayload?: (ctx: ChannelOutboundPayloadContext) => Promise<OutboundDeliveryResult>
  ```
- `ChannelOutboundPayloadContext = ChannelOutboundContext & { payload: ReplyPayload }`
- Each adapter's `sendPayload` handles:
  - **Empty payload guard**: no text + no media → no-op return
  - **Media dispatch**: iterates `mediaUrls` (or single `mediaUrl`), caption on first only
  - **Text chunking**: uses adapter's own `chunker`/`textChunkLimit` for defense-in-depth
  - **Empty chunk guard**: handles edge case where chunker returns `[]`

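The four responsibilities listed above can be sketched as a pure planning step that decides which sends an adapter would issue. Everything here is illustrative: `SketchPayload` is a simplified stand-in for `ReplyPayload`, `chunkText` is a naive fixed-width chunker rather than any adapter's real `chunker`, and sending media-plus-text as a single caption with no separate text message is an assumption.

```typescript
// Hypothetical, simplified payload shape.
interface SketchPayload {
  text?: string;
  mediaUrl?: string;
  mediaUrls?: string[];
}

type SendStep =
  | { kind: "media"; url: string; caption?: string }
  | { kind: "text"; text: string };

// Naive fixed-width chunker standing in for an adapter's own chunker.
function chunkText(text: string, limit: number): string[] {
  const step = Math.max(1, limit);
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += step) {
    chunks.push(text.slice(i, i + step));
  }
  return chunks;
}

function planSendPayload(payload: SketchPayload, textChunkLimit: number): SendStep[] {
  const text = (payload.text ?? "").trim();
  const media: string[] =
    payload.mediaUrls && payload.mediaUrls.length > 0
      ? payload.mediaUrls
      : payload.mediaUrl
        ? [payload.mediaUrl]
        : [];

  // Empty payload guard: nothing to send.
  if (text === "" && media.length === 0) return [];

  // Media dispatch: the caption rides on the first item only.
  if (media.length > 0) {
    return media.map((url, index): SendStep =>
      index === 0 && text !== "" ? { kind: "media", url, caption: text } : { kind: "media", url },
    );
  }

  // Text chunking, with a guard for a chunker that returns [].
  const chunks = chunkText(text, textChunkLimit);
  const safeChunks = chunks.length > 0 ? chunks : [text];
  return safeChunks.map((chunk): SendStep => ({ kind: "text", text: chunk }));
}
```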
</details>

<details>
<summary>Delivery Path Integration (<code>src/infra/outbound/deliver.ts</code>)</summary>

- Prefers `sendPayload` when adapter supports it (for all payloads, not just `channelData`)
- Write-ahead: enqueues to outbox before send, acks/fails after completion (best-effort — queue write failure doesn't block delivery)
- `markAttemptStarted` called before send to prevent recovery-vs-live-send race

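The routing preference can be pictured as follows. This is a simplified sketch, not the contents of `deliver.ts`: the adapter shape is reduced to three hypothetical synchronous methods (the real interface is Promise-based), and the legacy branch ignores chunking and multi-media ordering.

```typescript
interface SketchResult {
  via: string;
}

interface SketchPayload {
  text?: string;
  mediaUrl?: string;
}

// Hypothetical, reduced adapter shape; `sendPayload` is optional.
interface SketchAdapter {
  sendText: (text: string) => SketchResult;
  sendMedia: (mediaUrl: string, caption?: string) => SketchResult;
  sendPayload?: (payload: SketchPayload) => SketchResult;
}

function deliverPayload(adapter: SketchAdapter, payload: SketchPayload): SketchResult[] {
  // Preferred path: one call handles text, media, and channel data.
  if (adapter.sendPayload) {
    return [adapter.sendPayload(payload)];
  }
  // Legacy fallback for adapters that do not declare sendPayload.
  if (payload.mediaUrl) {
    return [adapter.sendMedia(payload.mediaUrl, payload.text)];
  }
  return payload.text ? [adapter.sendText(payload.text)] : [];
}
```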
</details>

<details>
<summary>Reply Dispatcher + Dispatch Integration</summary>

`src/auto-reply/reply/reply-dispatcher.ts`:
- When `deliveryQueueContext` is set, enqueues each final reply before calling `deliver`
- Chains `ackDelivery`/`failDelivery` to the delivery promise (best-effort)

`src/auto-reply/dispatch.ts`:
- `resolveDeliveryQueueContext` extracts channel, target, account ID, thread/reply IDs
- Interaction-scoped dispatchers (Slack slash, Discord native commands) skip outbox to avoid replaying to wrong destination
- Turn finalization uses outbox status as truth source

`src/gateway/server-message-lifecycle.ts`:
- Outbox worker: polls `recoverPendingDeliveries` on interval (default 1s)
- Turn worker: polls `listRecoverableTurns`, checks outbox state, calls `dispatchResumedTurn` for turns with no outbox evidence

</details>

<details>
<summary>Adapters with sendPayload (23 total across 4 PRs)</summary>

**Core outbound plugins** (`src/channels/plugins/outbound/`):
- `direct-text-media.ts` (shared factory for iMessage/Signal pattern)
- `discord.ts`, `slack.ts`, `whatsapp.ts`

**Extension adapters** (`extensions/*/src/channel.ts` or `outbound.ts`):
- **Batch A** (#30141): BlueBubbles, iMessage, Signal, Telegram, WhatsApp
- **Batch B** (#30142): Discord, Google Chat, Mattermost, MS Teams, Slack, Synology Chat
- **Batch C** (#30143): Feishu, IRC, Matrix, Nextcloud Talk, Nostr, Tlon, Twitch
- **Batch D** (#30144 — merged): Zalo, Zalouser

</details>

## PR Dependency Graph

```
Independent (merge in any order):
 #30144 batch-d: Zalo, Zalouser, core outbound plugins ✅ MERGED
 #30141 batch-a: BlueBubbles, iMessage, Signal, Telegram, WhatsApp
 #30142 batch-b: Discord, Google Chat, Mattermost, MS Teams, Slack, Synology
 #30143 batch-c: Feishu, IRC, Matrix, Nextcloud-Talk, Nostr, Tlon, Twitch

Then sequentially:
 #29998 SQLite outbox migration
 └── #30009 Recovery worker + delivery tracking
     └── #29997 Prefer sendPayload for all payloads
```

Adapter batch PRs do not depend on each other or on the infrastructure PRs: they add `sendPayload` methods without changing existing delivery behavior. The infrastructure PRs must land after the adapter PRs so that `sendPayload` is available across all channels before the delivery layer switches to prefer it.

## Configuration

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `messages.delivery.maxAgeMs` | `number` | `1800000` (30 min) | Age after which a queued entry is considered stale and `expireAction` applies |
| `messages.delivery.expireAction` | `"fail"` or `"deliver"` | `"deliver"` | What to do with entries older than `maxAgeMs`: `"fail"` marks them expired; `"deliver"` allows late delivery |

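How the two keys interact can be sketched as a single decision function. The defaults mirror the table above; `DeliveryConfig`, `decideForEntry`, and the strict `>` comparison at the age boundary are assumptions made for illustration.

```typescript
type ExpireAction = "fail" | "deliver";

interface DeliveryConfig {
  maxAgeMs: number;
  expireAction: ExpireAction;
}

// Defaults from the configuration table: 30 minutes, late delivery allowed.
const DEFAULT_DELIVERY_CONFIG: DeliveryConfig = {
  maxAgeMs: 30 * 60 * 1000,
  expireAction: "deliver",
};

type StaleDecision = "send" | "expire";

// A stale entry is expired only when expireAction is "fail"; with "deliver"
// it is still sent late. Fresh entries are always sent.
function decideForEntry(
  queuedAt: number,
  now: number,
  config: DeliveryConfig = DEFAULT_DELIVERY_CONFIG,
): StaleDecision {
  const isStale = now - queuedAt > config.maxAgeMs;
  return isStale && config.expireAction === "fail" ? "expire" : "send";
}
```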
Internal constants (not user-configurable):

| Constant | Value | Description |
|----------|-------|-------------|
| `MAX_RETRIES` | 5 | Maximum delivery attempts before terminal failure |
| Backoff schedule | 5s, 25s, 2m, 10m, 10m | Exponential backoff per retry |
| `OUTBOX_PRUNE_AGE_MS` | 48h | Terminal row cleanup threshold |
| `markAttemptStarted` guard | 25s | Recovery exclusion window for in-flight sends |

## Migration and Upgrade

- **Automatic**: on first startup after upgrade, `importLegacyFileQueue` reads existing `*.json` files from `<stateDir>/delivery-queue/` and inserts them into the SQLite outbox. Files are deleted after successful import (unless DB fell back to in-memory). Preserves `lastAttemptAt` for correct backoff.
- **Rollback safe**: if downgraded, old code ignores the `.db` file and continues using file-based queue. No data loss in either direction.
- **In-memory fallback**: if the filesystem is read-only or disk is full, the outbox runs in-memory (same reliability as the old file queue). Logged at verbose level.
- **No config changes required**: `messages.delivery.*` keys have sensible defaults.

## Testing

~150 tests added across all 7 PRs:
- Per-adapter `*.sendpayload.test.ts` files: text-only, single media, multi-media, empty payload, long-text chunking, empty chunk guard
- `src/infra/outbound/outbound.test.ts`: enqueue/ack/fail lifecycle, backoff calculation, permanent error detection, recovery worker, legacy import, pruning, expiry, dispatch_kind filtering, startupCutoff, moveToFailed status guard
- `src/infra/outbound/deliver.test.ts`: sendPayload routing preference, chunking integration

## Related Issues

**Closes:**
- #14725 — Persistent outbox queue for outbound message reliability
- #16555 — Add TTL/expiry for delivery queue messages
- #21722 — Delivery queue processor only runs on recovery, causing multi-hour delays
- #22376 — Telegram sendMessage fails during gateway restart — messages lost with no retry
- #23777 — Delivery queue retries permanently-failed entries indefinitely
- #24353 — Missing delivery queue monitoring leads to periodic gateway hangs

**Addresses (partially):**
- #9208 — Message runs interrupted by network errors are not retried
- #14827 — WhatsApp messages silently dropped during reconnection window
- #26783 — Message catch-up on gateway restart for Telegram and Discord

**Prior work (closed):**
- #27939 — SQLite journal approach (closed in favor of incremental PR series)
- #28941 — Unified durable lifecycle (closed in favor of incremental PR series)

## Pull Requests

| PR | Title | Status |
|----|-------|--------|
| #30144 | feat(adapters): add sendPayload to batch-d (Zalo, Zalouser, core outbound plugins) | ✅ Merged |
| #30141 | feat(adapters): add sendPayload to batch-a (BlueBubbles, iMessage, Signal, Telegram, WhatsApp) | Open |
| #30142 | feat(adapters): add sendPayload to batch-b (Discord, Google Chat, Mattermost, MS Teams, Slack, Synology) | Open |
| #30143 | feat(adapters): add sendPayload to batch-c (Feishu, IRC, Matrix, Nextcloud-Talk, Nostr, Tlon, Twitch) | Open |
| #29998 | feat(outbox): migrate delivery queue from file-based to SQLite outbox | Open |
| #30009 | feat(outbox): write-ahead outbox with recovery worker and delivery tracking | Open |
| #29997 | feat(outbound): prefer sendPayload for all payloads when adapter supports it | Open |

## Future Work

- Populate `idempotency_key` for channels that support deduplication (requires per-channel MessageSid audit)
- Persistent inbound dedup (flip `disablePersistentDedupe` flag + per-channel validation)
- `openclaw status --delivery` monitoring command
- Periodic recovery timer (not just startup) for long-running gateway sessions
- Cross-instance coordination for multi-gateway deployments
- Configurable `maxRetries` per channel (#30496)
