feat(outbox): migrate delivery queue from file-based to SQLite outbox #29147

Closed
nohat wants to merge 1 commit into openclaw:main from nohat:lifecycle/sqlite-outbox

nohat commented Feb 27, 2026

Summary

  • Problem: File-based delivery queue is unbounded, non-queryable, and has no TTL — failed entries retry forever and successful deliveries leave no audit trail
  • Why it matters: Gateway crashes leave orphaned queue files with no recovery path; operators cannot inspect delivery state
  • What changed: Replaced file-based JSON queue with SQLite message_outbox table; added configurable TTL/expiry (messages.delivery.maxAgeMs, messages.delivery.expireAction); one-time migration imports existing file-queue entries; exponential backoff for retries (5s → 25s → 2m → 10m, max 5)
  • What did NOT change (scope boundary): No write-ahead pattern yet (enqueue-before-send comes in PR 2); no turn tracking; no plugin compat layer; no continuous worker loop
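
The retry schedule in the summary can be sketched as a lookup table. This is a minimal illustration, not the PR's code: the names `BACKOFF_MS`, `MAX_ATTEMPTS`, and `nextRetryDelayMs` are hypothetical, and reading "max 5" as five attempts in total (so four waits between them) is an assumption.

```typescript
// Delay schedule taken from the summary: 5s, 25s, 2m, 10m.
const BACKOFF_MS: readonly number[] = [5_000, 25_000, 120_000, 600_000];

// Assumption: "max 5" means five delivery attempts in total.
const MAX_ATTEMPTS = 5;

/**
 * Hypothetical helper: given how many attempts have already failed,
 * return the wait before the next attempt, or null when the entry
 * should be treated as terminally failed.
 */
function nextRetryDelayMs(failedAttempts: number): number | null {
  if (failedAttempts >= MAX_ATTEMPTS) return null; // retries exhausted
  if (failedAttempts <= 0) return 0; // nothing has failed yet, no wait needed
  const delay = BACKOFF_MS[failedAttempts - 1];
  return delay ?? null;
}
```

A table is used rather than a formula because the listed delays do not follow a single constant multiplier (25s to 2m is not exactly a factor of 5).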

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • New config keys: messages.delivery.maxAgeMs (default 30m), messages.delivery.expireAction ("fail" or "deliver")
  • Delivery queue data moves from delivery-queue/*.json files to message-lifecycle.db SQLite database on first startup (automatic, one-time migration)
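
As a rough sketch of how the two new keys might interact, the following models the expiry decision as a pure function. The type and function names are hypothetical, the default for `expireAction` is not stated in this PR and is therefore left required, and treating `"deliver"` as "attempt delivery even though the entry is past its maximum age" is an assumption.

```typescript
type ExpireAction = "fail" | "deliver";

// Hypothetical shape mirroring messages.delivery.* from the PR description.
interface DeliveryConfig {
  maxAgeMs: number; // messages.delivery.maxAgeMs
  expireAction: ExpireAction; // messages.delivery.expireAction
}

// Default stated in the PR: 30 minutes.
const DEFAULT_MAX_AGE_MS = 30 * 60 * 1000;

type ExpiryDecision = "attempt" | "mark_failed";

/**
 * Decide what to do with a queued entry of a given age.
 * Assumption: "deliver" means the entry is still attempted after expiry,
 * while "fail" marks it terminal without another attempt.
 */
function decideOnExpiry(
  enqueuedAtMs: number,
  nowMs: number,
  config: DeliveryConfig,
): ExpiryDecision {
  const expired = nowMs - enqueuedAtMs > config.maxAgeMs;
  if (!expired) return "attempt";
  return config.expireAction === "fail" ? "mark_failed" : "attempt";
}
```

For example, with `maxAgeMs` at its default and `expireAction: "fail"`, an entry enqueued 31 minutes ago would be marked failed instead of retried.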

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 22+ / Bun
  • Model/provider: N/A
  • Integration/channel (if any): All outbound channels
  • Relevant config (redacted): messages.delivery.maxAgeMs, messages.delivery.expireAction

Steps

  1. Start gateway with existing file-based queue entries
  2. Verify legacy entries are imported into SQLite and JSON files removed
  3. Trigger a delivery failure and verify exponential backoff retry
  4. Verify expired entries are marked terminal after maxAgeMs

Expected

  • Legacy queue files migrated to SQLite on first startup
  • Failed deliveries retry with backoff, then expire per config

Actual

  • Verified via test suite (58 outbound delivery tests)

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios: pnpm build, pnpm test (948 pass), pnpm check
  • Edge cases checked: Legacy import idempotency, in-memory SQLite fallback, permanent error detection, backoff calculation
  • What you did not verify: Live gateway startup with real file-queue data

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? Yes — new optional messages.delivery config keys (defaults preserve existing behavior)
  • Migration needed? Yes — automatic one-time import of file-queue entries to SQLite on startup
  • If yes, exact upgrade steps: None required — migration runs automatically on first startup

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert commit; file-queue directory is preserved until import succeeds
  • Files/config to restore: N/A
  • Known bad symptoms reviewers should watch for: node:sqlite unavailable on older Node versions (requires Node 22+); in-memory fallback means entries don't persist across restarts

Risks and Mitigations

  • Risk: node:sqlite not available on some Node versions
    • Mitigation: Graceful in-memory fallback with verbose logging; Node 22+ is already the runtime baseline

Part 1 of 3: SQLite outbox → #29148 (write-ahead outbox + worker) → #29149 (turn tracking)

E2E Test Results

Tested with live Telegram bot on lifecycle/sqlite-outbox branch:

  • Test 1 (Migration): Legacy delivery-queue/*.json files migrated to message_outbox SQLite table on first startup, JSON files removed
  • Test 2 (Normal Delivery): Message sent via Telegram, replied successfully; delivery goes direct (outbox write-ahead arrives in #29148, feat(outbox): add write-ahead outbox, recovery worker, and plugin compat layer)
  • Test 3 (Recovery on Restart): Seeded pending outbox entries recovered on gateway restart — entries at MAX_RETRIES marked failed_terminal, fresh entries attempted and properly failed with backoff

E2E test script: https://gist.github.com/nohat/d9d7b076c760b178cdfbf589242c0788

Replace unbounded file-based delivery queue with queryable SQLite
message_outbox table. Adds TTL/expiry for stale entries, delivery
outcome retention, and one-time legacy file queue import on startup.

Closes openclaw#23777, openclaw#16555, openclaw#29128

greptile-apps bot commented Feb 28, 2026

Greptile Summary

Migrated outbound delivery queue from file-based JSON to SQLite with proper retry/expiry semantics. The implementation is solid and production-ready.

Key Changes:

  • Replaced delivery-queue/*.json files with message_outbox table in SQLite message-lifecycle.db
  • Added configurable TTL (messages.delivery.maxAgeMs, default 30m) and expiry action (messages.delivery.expireAction: "fail" or "deliver")
  • One-time migration with importLegacyFileQueue() using INSERT OR IGNORE for idempotency
  • Exponential backoff preserved (5s → 25s → 2m → 10m, max 5 attempts total)
  • Proper WAL mode, indices for query performance, transaction support, and in-memory fallback

Test Coverage:

  • All 58 outbound delivery tests updated and passing
  • Tests cover enqueue/ack/fail flows, retry backoff, max retries, permanent errors, recovery, expiry, and migration

Architecture:

No issues found — implementation is clean, well-tested, and follows project conventions.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • Score reflects thorough implementation, comprehensive test coverage (all 58 tests updated and passing), proper error handling with in-memory fallback, backward-compatible one-time migration, and adherence to project conventions. The SQLite implementation uses appropriate indices, WAL mode, and transaction support. Exponential backoff and retry logic preserved correctly from original file-based implementation. No logical bugs, security issues, or breaking changes detected.
  • No files require special attention

Last reviewed commit: bc93966

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc9396655f

Comment on lines 671 to 672
await recoverPendingDeliveries({
deliver: deliverOutboundPayloads,

P2: Run outbox pruning in gateway maintenance loop

This startup flow imports legacy files and runs recovery, but it never invokes pruneOutbox; after this commit ackDelivery keeps successful sends as delivered rows instead of removing them. In production gateways with steady traffic, message_outbox will grow without bound, which will eventually slow recovery/startup queries and inflate message-lifecycle.db indefinitely.

nohat (Contributor, Author) replied:

pruneOutbox is called in src/gateway/server-message-lifecycle.ts:93, which is added in #29148 (the write-ahead outbox worker PR). This PR #29147 is the SQLite migration base — #29148 adds the worker that calls pruneOutbox in the maintenance loop.
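
To illustrate the kind of retention pass under discussion, here is a minimal in-memory sketch. It is not the `pruneOutbox` implementation from #29148: the row shape, the `completedAtMs` field, the `retentionMs` parameter, and the function name `pruneRows` are all hypothetical. Only the status names `pending`, `delivered`, and `failed_terminal` come from this thread.

```typescript
type OutboxStatus = "pending" | "delivered" | "failed_terminal";

// Hypothetical row shape; the real message_outbox columns are not shown in this PR.
interface OutboxRow {
  id: string;
  status: OutboxStatus;
  completedAtMs: number | null; // when the row reached a final status
}

/**
 * Return the rows that should be kept. Pending rows are never dropped;
 * finished rows are dropped once they are older than the retention window.
 */
function pruneRows(rows: OutboxRow[], nowMs: number, retentionMs: number): OutboxRow[] {
  return rows.filter((row) => {
    if (row.status === "pending") return true; // undelivered work must survive
    if (row.completedAtMs === null) return true; // no timestamp, keep to be safe
    return nowMs - row.completedAtMs <= retentionMs;
  });
}
```

Running a pass like this periodically keeps delivered rows around as an audit trail for a bounded window, which is the growth concern raised in the review comment.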

nohat commented Feb 28, 2026

Superseded by fresh PR stack — links to follow.

nohat commented Feb 28, 2026

New PR stack:

  1. #29953: feat(outbox): SQLite outbox with write-ahead delivery, recovery worker, and sendFinal-required adapters (→ main)
  2. #29956: feat(lifecycle): inbound turn tracking, orphan recovery, and abort coordination (→ main, merge after 1)
  3. #29957: feat(lifecycle): persistent inbound dedup across gateway restarts (→ main, merge after 2)

Labels

gateway (Gateway runtime), size: L