-
-
Notifications
You must be signed in to change notification settings - Fork 52.7k
Description
Background
OpenClaw currently implements outbound delivery reliability as a collection of subsystem-specific mechanisms: a file-based delivery queue (delivery-queue/*.json), channel-specific retry rules, startup-only orphan replay, and local permanent-error classification. Issue and PR history shows repeated failures at subsystem boundaries, especially during restart, crash, and reconnect windows.
Recurring user-visible failures:
- Messages lost after crash/restart — if the gateway dies after generating a reply but before the file-queue write completes (or the send succeeds), the message is lost permanently ([Bug]: Telegram sendMessage fails during gateway restart — messages lost with no retry #22376, Message runs interrupted by network errors are not retried, causing silent message loss #9208, WhatsApp: Messages silently dropped during reconnection window #14827, [Feature]: Message catch-up on gateway restart for Telegram and Discord #26783).
- Retries run forever — failed entries have no TTL or permanent-error classification; they retry indefinitely, consuming resources and causing periodic hangs ([Bug]: Delivery Queue Retries Permanently-Failed Entries Indefinitely #23777, [Bug]: Missing delivery queue monitoring leads to periodic gateway hangs #24353).
- No message expiry — stale queued deliveries from hours or days ago can be delivered unexpectedly when the queue is finally processed ([Feature]: Add TTL/Expiry for Delivery Queue Messages #16555).
- Recovery only at startup — the queue processor only fires on gateway boot; a failed send during normal operation can wait hours for the next restart (Delivery Queue Processor Only Runs on Recovery, Causing Multi-Hour Delays #21722).
- Duplicate replies after restart — channels like Telegram and WhatsApp redeliver unacknowledged messages; after a gateway restart the in-memory seen-set is cleared and the bot replies twice ([Bug]: Telegram inbound message can be re-queued on model fallback/rate-limit, causing duplicate user turns and missing outbound delivery #26764, [Feature]: Feishu inbound dedup cache lost on SIGUSR1 restart, causing duplicate message processing #14431).
The file-based queue is not atomic, not queryable, and accumulates silently — there is no status tracking, no per-entry retry state, and no pruning. A crash mid-write produces a corrupt entry. The outbound path also splits between sendText and sendMedia as separate adapter methods, forcing the delivery layer to coordinate text chunking, media ordering, and channelData dispatch across two code paths — behavior is inconsistent across channels.
Root architectural gap: there is no single durable lifecycle for an outbound delivery spanning enqueue, send attempt, retry/backoff, permanent-error detection, and delivery confirmation. Without a unified durable state machine, reliability semantics are encoded as local rules (skipQueue, retry classifiers, startup heuristics, pending markers). This increases code volume, causes semantic drift, and makes restart correctness depend on special-case recovery code instead of structural guarantees.
Prior work
- [Feature]: Persistent outbox queue for outbound message reliability #14725 proposed a persistent outbox queue (closed — implemented by this work).
- fix(gateway): message delivery reliability with SQLite journal (inbound dedup + orphan recovery) #27939 added SQLite journaling + startup orphan replay on top of the existing fragmented model. Review feedback identified 9 issues with that incremental approach.
- feat(gateway): unified durable message lifecycle — SQLite turns+outbox, continuous workers, plugin compat layer #28941 was the structural rewrite (unified durable lifecycle, continuous workers, plugin compat layer). It was closed in favor of the current incremental PR series, which delivers the same guarantees in smaller, independently reviewable pieces.
Goals
- At-least-once delivery for all outbound messages across gateway restarts, crashes, and transient network failures
- Bounded retry with exponential backoff (5s → 25s → 2m → 10m → 10m; max 5 attempts) and configurable message expiry
- Permanent error detection — stop retrying unrecoverable failures (chat not found, user blocked, bot kicked) immediately
- Unified
sendPayloadadapter method across all 23 channel adapters (built-in + extensions) — single entry point for text, media, and channelData - Automatic legacy migration from file-based queue on upgrade, with rollback safety
- Crash-safe recovery worker that runs on startup with
startupCutoff+markAttemptStartedguards to prevent double-delivery
Non-Goals
- Exactly-once delivery — at-least-once only;
idempotency_keycolumn exists in the schema but is not yet populated - Cross-instance coordination — single-gateway assumption; no distributed locking or multi-node consensus
- Delivery monitoring dashboard — no
openclaw status --deliverycommand (future work) - Per-message delivery receipts exposed to end users
- Persistent inbound dedup —
dedupe_keycolumn exists but persistent dedup is deferred (requires per-channelMessageSidaudit) - Config-resolved chunk overrides in adapters — chunk configuration is the delivery layer's responsibility, not individual adapters'
Technical Approach
Write-ahead outbox pattern — every outbound message is persisted to a SQLite message_outbox table before the first send attempt. If the process crashes mid-send, the entry survives and the recovery worker retries it on next startup. This converts "message lost on crash" into "message delivered after restart."
SQLite via Node's native node:sqlite DatabaseSync — synchronous API avoids race conditions between enqueue and send. WAL mode + busy_timeout=5000 give concurrent-read performance with crash durability. In-memory fallback if disk is unavailable (degrades gracefully to pre-change behavior — same reliability as the old file queue).
Status lifecycle — queued → delivered | failed_retryable → failed_terminal | expired. Terminal states have a status guard (WHERE status IN ('queued','failed_retryable')) preventing resurrection. Rows are pruned after 48 hours.
Startup safety — startupCutoff (recorded at process boot) prevents the recovery worker from double-delivering entries that the live send path is already handling. markAttemptStarted pushes next_attempt_at forward by 25s as defense-in-depth against slow channels (e.g., Signal's 10s RPC timeout).
Permanent error detection — regex patterns match known unrecoverable errors ("chat not found", "user blocked", "bot was kicked", etc.) and terminalize immediately rather than burning retries. Non-final dispatch rows (tool/block replies) are pruned on recovery since they have no standalone recovery path.
Unified sendPayload — each adapter implements a single sendPayload(ctx) method that handles text + media + channelData atomically. The adapter decides chunking (using its own chunker/textChunkLimit), media ordering, and channel-specific behavior. The delivery layer prefers sendPayload when available, falling back to sendText/sendMedia for adapters that don't declare it.
Turn lifecycle coordination — outbox rows are linked to turns via turn_id. The recovery worker uses outbox status as the source of truth for turn finalization: if all outbox rows for a turn are delivered, the turn is finalized as delivered; if any are terminal-failed with none queued, the turn is finalized as failed.
Technical Details
SQLite Outbox Storage (src/infra/message-lifecycle/db.ts)
- Database file:
<stateDir>/message-lifecycle.db message_outboxtable columns:id(TEXT PK),turn_id,channel,account_id,target,dispatch_kindpayload(TEXT — JSON blob with channel, to, accountId, payloads[], threadId, replyToId, bestEffort, gifPlayback, silent, mirror)queued_at,status,attempt_count,next_attempt_at,last_attempt_atlast_error,error_class(permanent | terminal),delivered_at,terminal_reason,completed_atidempotency_key(reserved, not yet populated)
- Indexes:
idx_message_outbox_idem— unique onidempotency_keywhere not nullidx_message_outbox_turn_status— on(turn_id, status)for turn lifecycle queriesidx_message_outbox_resume— on(status, next_attempt_at, queued_at)for recovery worker scans
- PRAGMA:
journal_mode=WAL,synchronous=NORMAL,busy_timeout=5000 - In-memory fallback: sticky for process lifetime if disk open fails;
isLifecycleDbInMemory()exposed for callers - Transaction helper:
runLifecycleTransactionwithBEGIN IMMEDIATE/COMMIT/ROLLBACK
Delivery Queue (src/infra/outbound/delivery-queue.ts)
Core functions:
enqueueDelivery— inserts row asqueued, returns entry IDackDelivery— sets statusdelivered, triggers turn finalization viamaybeFinalizeTurnDeliveredfailDelivery— checks permanent error patterns, incrementsattempt_count, transitions tofailed_retryable(with backoff) orfailed_terminal; status guard prevents overwriting terminal rowsmoveToFailed— terminal-only guard (AND status IN ('queued','failed_retryable')) prevents overwriting already-acked entriesmarkAttemptStarted— setslast_attempt_atand pushesnext_attempt_atforward by 25s before send startsloadPendingDeliveries— queries non-terminal entries wherenext_attempt_at <= now, filtersdispatch_kind IS NULL OR dispatch_kind = 'final', respectsstartupCutoffrecoverPendingDeliveries— expires stale entries permaxAgeMs/expireAction, iterates pending with time budget (maxRecoveryMs, default 60s), marks orphaned non-final rows terminalimportLegacyFileQueue— one-time migration fromdelivery-queue/*.json; keeps files if DB is in-memory; preserveslastAttemptAtfor correct backoffpruneOutbox— deletes terminal rows older than 48h; guards against invalidageMsisPermanentDeliveryError— regex array: "no conversation reference found", "chat not found", "user not found", "bot was blocked", "bot was kicked", "chat_id is empty", "outbound not configured"computeBackoffMs— lookup table: [5s, 25s, 120s, 600s]- Constants:
MAX_RETRIES = 5,OUTBOX_PRUNE_AGE_MS = 48h,DEFAULT_OUTBOX_MAX_AGE_MS = 30min
sendPayload Adapter Interface (src/channels/plugins/types.adapters.ts)
- New optional method on
ChannelOutboundAdapter:sendPayload?: (ctx: ChannelOutboundPayloadContext) => Promise<OutboundDeliveryResult>
ChannelOutboundPayloadContext = ChannelOutboundContext & { payload: ReplyPayload }- Each adapter's
sendPayloadhandles:- Empty payload guard: no text + no media → no-op return
- Media dispatch: iterates
mediaUrls(or singlemediaUrl), caption on first only - Text chunking: uses adapter's own
chunker/textChunkLimitfor defense-in-depth - Empty chunk guard: handles edge case where chunker returns
[]
Delivery Path Integration (src/infra/outbound/deliver.ts)
- Prefers
sendPayloadwhen adapter supports it (for all payloads, not justchannelData) - Write-ahead: enqueues to outbox before send, acks/fails after completion (best-effort — queue write failure doesn't block delivery)
markAttemptStartedcalled before send to prevent recovery-vs-live-send race
Reply Dispatcher + Dispatch Integration
src/auto-reply/reply/reply-dispatcher.ts:
- When
deliveryQueueContextis set, enqueues each final reply before callingdeliver - Chains
ackDelivery/failDeliveryto the delivery promise (best-effort)
src/auto-reply/dispatch.ts:
resolveDeliveryQueueContextextracts channel, target, account ID, thread/reply IDs- Interaction-scoped dispatchers (Slack slash, Discord native commands) skip outbox to avoid replaying to wrong destination
- Turn finalization uses outbox status as truth source
src/gateway/server-message-lifecycle.ts:
- Outbox worker: polls
recoverPendingDeliverieson interval (default 1s) - Turn worker: polls
listRecoverableTurns, checks outbox state, callsdispatchResumedTurnfor turns with no outbox evidence
Adapters with sendPayload (23 total across 4 PRs)
Core outbound plugins (src/channels/plugins/outbound/):
direct-text-media.ts(shared factory for iMessage/Signal pattern)discord.ts,slack.ts,whatsapp.ts
Extension adapters (extensions/*/src/channel.ts or outbound.ts):
- Batch A (feat(adapters): add sendPayload to batch-a (BlueBubbles, iMessage, Signal, Telegram, WhatsApp) #30141): BlueBubbles, iMessage, Signal, Telegram, WhatsApp
- Batch B (feat(adapters): add sendPayload to batch-b (Discord, Google Chat, Mattermost, MS Teams, Slack, Synology) #30142): Discord, Google Chat, Mattermost, MS Teams, Slack, Synology Chat
- Batch C (feat(adapters): add sendPayload to batch-c (Feishu, IRC, Matrix, Nextcloud-Talk, Nostr, Tlon, Twitch) #30143): Feishu, IRC, Matrix, Nextcloud Talk, Nostr, Tlon, Twitch
- Batch D (feat(adapters): add sendPayload to batch-d (Zalo, Zalouser, core outbound plugins) #30144 — merged): Zalo, Zalouser
PR Dependency Graph
Independent (merge in any order):
#30144 batch-d: Zalo, Zalouser, core outbound plugins ✅ MERGED
#30141 batch-a: BlueBubbles, iMessage, Signal, Telegram, WhatsApp
#30142 batch-b: Discord, Google Chat, Mattermost, MS Teams, Slack, Synology
#30143 batch-c: Feishu, IRC, Matrix, Nextcloud-Talk, Nostr, Tlon, Twitch
Then sequentially:
#29998 SQLite outbox migration
└── #30009 Recovery worker + delivery tracking
└── #29997 Prefer sendPayload for all payloads
Adapter batch PRs are independent of each other and of the infrastructure PRs. They add additive sendPayload methods with no behavioral change to existing delivery. The infrastructure PRs must land after the adapter PRs so that sendPayload is available across all channels before the delivery layer switches to prefer it.
Configuration
| Key | Type | Default | Description |
|---|---|---|---|
messages.delivery.maxAgeMs |
number |
1800000 (30 min) |
Maximum queued age before expiry check |
messages.delivery.expireAction |
"fail" or "deliver" |
"deliver" |
What to do with entries older than maxAgeMs: "fail" marks them expired; "deliver" allows late delivery |
Internal constants (not user-configurable):
| Constant | Value | Description |
|---|---|---|
MAX_RETRIES |
5 | Maximum delivery attempts before terminal failure |
| Backoff schedule | 5s, 25s, 2m, 10m, 10m | Exponential backoff per retry |
OUTBOX_PRUNE_AGE_MS |
48h | Terminal row cleanup threshold |
markAttemptStarted guard |
25s | Recovery exclusion window for in-flight sends |
Migration and Upgrade
- Automatic: on first startup after upgrade,
importLegacyFileQueuereads existing*.jsonfiles from<stateDir>/delivery-queue/and inserts them into the SQLite outbox. Files are deleted after successful import (unless DB fell back to in-memory). PreserveslastAttemptAtfor correct backoff. - Rollback safe: if downgraded, old code ignores the
.dbfile and continues using file-based queue. No data loss in either direction. - In-memory fallback: if the filesystem is read-only or disk is full, the outbox runs in-memory (same reliability as the old file queue). Logged at verbose level.
- No config changes required:
messages.delivery.*keys have sensible defaults.
Testing
~150 tests added across all 7 PRs:
- Per-adapter
*.sendpayload.test.tsfiles: text-only, single media, multi-media, empty payload, long-text chunking, empty chunk guard src/infra/outbound/outbound.test.ts: enqueue/ack/fail lifecycle, backoff calculation, permanent error detection, recovery worker, legacy import, pruning, expiry, dispatch_kind filtering, startupCutoff, moveToFailed status guardsrc/infra/outbound/deliver.test.ts: sendPayload routing preference, chunking integration
Related Issues
Closes:
- [Feature]: Persistent outbox queue for outbound message reliability #14725 — Persistent outbox queue for outbound message reliability
- [Feature]: Add TTL/Expiry for Delivery Queue Messages #16555 — Add TTL/expiry for delivery queue messages
- Delivery Queue Processor Only Runs on Recovery, Causing Multi-Hour Delays #21722 — Delivery queue processor only runs on recovery, causing multi-hour delays
- [Bug]: Telegram sendMessage fails during gateway restart — messages lost with no retry #22376 — Telegram sendMessage fails during gateway restart — messages lost with no retry
- [Bug]: Delivery Queue Retries Permanently-Failed Entries Indefinitely #23777 — Delivery queue retries permanently-failed entries indefinitely
- [Bug]: Missing delivery queue monitoring leads to periodic gateway hangs #24353 — Missing delivery queue monitoring leads to periodic gateway hangs
Addresses (partially):
- Message runs interrupted by network errors are not retried, causing silent message loss #9208 — Message runs interrupted by network errors are not retried
- WhatsApp: Messages silently dropped during reconnection window #14827 — WhatsApp messages silently dropped during reconnection window
- [Feature]: Message catch-up on gateway restart for Telegram and Discord #26783 — Message catch-up on gateway restart for Telegram and Discord
Prior work (closed):
- fix(gateway): message delivery reliability with SQLite journal (inbound dedup + orphan recovery) #27939 — SQLite journal approach (closed in favor of incremental PR series)
- feat(gateway): unified durable message lifecycle — SQLite turns+outbox, continuous workers, plugin compat layer #28941 — Unified durable lifecycle (closed in favor of incremental PR series)
Pull Requests
| PR | Title | Status |
|---|---|---|
| #30144 | feat(adapters): add sendPayload to batch-d (Zalo, Zalouser, core outbound plugins) | ✅ Merged |
| #30141 | feat(adapters): add sendPayload to batch-a (BlueBubbles, iMessage, Signal, Telegram, WhatsApp) | Open |
| #30142 | feat(adapters): add sendPayload to batch-b (Discord, Google Chat, Mattermost, MS Teams, Slack, Synology) | Open |
| #30143 | feat(adapters): add sendPayload to batch-c (Feishu, IRC, Matrix, Nextcloud-Talk, Nostr, Tlon, Twitch) | Open |
| #29998 | feat(outbox): migrate delivery queue from file-based to SQLite outbox | Open |
| #30009 | feat(outbox): write-ahead outbox with recovery worker and delivery tracking | Open |
| #29997 | feat(outbound): prefer sendPayload for all payloads when adapter supports it | Open |
Future Work
- Populate
idempotency_keyfor channels that support deduplication (requires per-channel MessageSid audit) - Persistent inbound dedup (flip
disablePersistentDedupeflag + per-channel validation) openclaw status --deliverymonitoring command- Periodic recovery timer (not just startup) for long-running gateway sessions
- Cross-instance coordination for multi-gateway deployments
- Configurable
maxRetriesper channel (Feature Request: delivery.maxRetries config for delivery queue #30496)