fix: defer DB init to gateway_start hook to prevent database lock race by 100yenadmin · Pull Request #288 · Martian-Engineering/lossless-claw

100yenadmin · 2026-04-06T08:38:03Z

Summary

Prevents "database is locked" errors during macOS launchd-managed gateway restarts by catching SQLite lock errors during plugin register() and deferring the DB open to the gateway_start hook.

Fixes #287

Problem

On macOS with launchd KeepAlive: true + ThrottleInterval: 1, gateway restarts can spawn two processes simultaneously. Both call register() and immediately open lcm.db, but only one can acquire the write lock. The other logs "Migration failed: database is locked" in an endless loop.

The gateway's stale-PID cleanup runs at port-bind time, which happens after plugin register(). So by the time the orphan is killed, LCM has already failed.

Solution: Eager-first, defer on lock

Try to open the DB eagerly in register() — preserving the original behavior for tests and normal startup. Only when the open fails with "database is locked" does it defer to gateway_start (which fires after port bind + stale PID cleanup).

register():
  try:
    database = createLcmDatabaseConnection(dbPath)  // works 99% of the time
  catch "database is locked":
    defer to gateway_start hook
  catch other:
    rethrow (fail fast)

This is a strict superset of the original behavior — identical when there's no lock contention.
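
The pseudocode above could be sketched in TypeScript roughly as follows. This is a minimal illustration, not the code from `src/plugin/index.ts`: the `Db` and `Hooks` types, the `openDb` parameter, and the `isLockError` helper are hypothetical stand-ins, and the real plugin API may differ.

```typescript
// Hypothetical stand-ins; the real plugin API and DB types are not shown in this PR text.
type Db = { path: string };
type Hooks = { on(event: "gateway_start", handler: () => void): void };

/** True only for SQLite lock contention, the one error that triggers deferral. */
function isLockError(err: unknown): boolean {
  return err instanceof Error && err.message.includes("database is locked");
}

function register(hooks: Hooks, openDb: () => Db): { getDb(): Db } {
  let db: Db | null = null;
  try {
    db = openDb(); // eager path: unchanged behavior when there is no contention
  } catch (err) {
    if (!isLockError(err)) throw err; // anything else fails fast
    // Lock contention: retry once the gateway has bound its port.
    hooks.on("gateway_start", () => {
      db = openDb();
    });
  }
  return {
    getDb(): Db {
      if (!db) throw new Error("LCM database not initialized yet");
      return db;
    },
  };
}
```

In this sketch a lock error leaves the plugin registered but without a handle until `gateway_start` fires; any other error still aborts `register()` immediately.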

Changes by file

`src/plugin/index.ts` (+87/-8)

Eager-first init with deferred fallback: wraps createLcmDatabaseConnection() in try/catch; only "database is locked" errors trigger deferral to gateway_start
gateway_stop handler: closes the DB connection via closeLcmConnection(), nulls database and deferredEngine, sets stopped flag
getDatabase(): state-aware guard — distinguishes "not yet initialized" (deferred path) from "closed after gateway_stop" for actionable error messages
getEngine(): validates DB is still open via getDatabase() before returning any engine (eager or deferred), preventing use-after-close
Lifecycle hooks (before_reset, session_end): await deferredReady before accessing engine
Tools and context engines: resolved lazily via getEngine() instead of capturing lcm directly
Command: passes () => getDatabase() instead of the raw handle
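
As a rough illustration of the state-aware guard described in the list above, the sketch below keeps a `stopped` flag next to the handle so that the two "no database" states produce different errors. The `createLifecycle` wrapper, the `Db`/`Engine` shapes, and the exact error strings are assumptions made for the example; only the names `getDatabase`, `getEngine`, and `stopped` come from the PR description.

```typescript
// Hypothetical shapes; names mirror the PR description, not the actual source.
type Db = { close(): void };
type Engine = { name: string };

function createLifecycle() {
  let database: Db | null = null;
  let engine: Engine | null = null;
  let stopped = false;

  function getDatabase(): Db {
    if (database) return database;
    // Distinguish the two "no handle" states so the error is actionable.
    throw new Error(
      stopped
        ? "LCM database closed after gateway_stop"
        : "LCM database not yet initialized (waiting for gateway_start)",
    );
  }

  return {
    init(db: Db, eng: Engine): void {
      database = db;
      engine = eng;
      stopped = false;
    },
    stop(): void {
      database?.close();
      database = null;
      engine = null; // drop the engine so nothing can use a closed handle
      stopped = true;
    },
    getDatabase,
    getEngine(): Engine {
      getDatabase(); // throws first if the DB is missing or closed
      if (!engine) throw new Error("LCM engine unavailable");
      return engine;
    },
  };
}
```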

`src/plugin/lcm-command.ts` (+6/-5)

createLcmCommand accepts db: DatabaseSync | (() => DatabaseSync) for lazy DB resolution (backward-compatible)
getDb() called only in branches that need it (status, doctor) — /lossless help never resolves the DB
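
A minimal sketch of the handle-or-resolver parameter, assuming a simplified command with only `status`, `doctor`, and `help` branches. `DatabaseLike`, `createCommand`, and the returned strings are hypothetical; the real `createLcmCommand` takes a `DatabaseSync` and does considerably more.

```typescript
// Hypothetical stand-in for DatabaseSync; only the shape matters here.
type DatabaseLike = { label: string };
type DbSource = DatabaseLike | (() => DatabaseLike);

/** Resolve either a raw handle or a lazy getter, with no type assertion needed. */
function resolveDb(db: DbSource): DatabaseLike {
  return typeof db === "function" ? db() : db;
}

function createCommand(db: DbSource): (subcommand: string) => string {
  return (subcommand) => {
    switch (subcommand) {
      case "status":
      case "doctor":
        // Resolve only in the branches that actually need the database.
        return `${subcommand}: using ${resolveDb(db).label}`;
      default:
        // help (and anything unknown) never touches the database.
        return "usage: /lossless [status|doctor|help]";
    }
  };
}
```

Because the `typeof` check narrows the union, existing callers that pass a plain handle keep working unchanged.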

`src/db/connection.ts` (+7/-1)

createLcmDatabaseConnection: separates new DatabaseSync() from configureConnection() (PRAGMAs); if PRAGMA setup fails, the raw handle is closed before rethrowing to prevent FD leaks
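
The construct-then-configure split might look roughly like this. The `Handle` interface and `openConfigured` function are hypothetical; the PR's actual code uses `new DatabaseSync()` and `configureConnection()`.

```typescript
// Hypothetical handle type standing in for DatabaseSync.
interface Handle {
  exec(sql: string): void;
  close(): void;
}

function openConfigured(construct: () => Handle, pragmas: string[]): Handle {
  const handle = construct(); // step 1: may throw, but nothing to clean up yet
  try {
    for (const pragma of pragmas) handle.exec(pragma); // step 2: configuration
    return handle;
  } catch (err) {
    handle.close(); // release the file descriptor before propagating the error
    throw err;
  }
}
```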

`test/lcm-command.test.ts` (+27)

Lazy DB function path: verifies help does not invoke the DB resolver; verifies status does invoke it

Review history

All review comments (13 threads across 4 Copilot reviews) have been addressed and resolved:

| Finding | Fix |
| --- | --- |
| `getDb()` resolved before subcommand parsing; help fails unnecessarily | Moved into `status`/`doctor` branches only |
| No tests for `db: () => DatabaseSync` path | Added 2 tests covering help (no-call) and status (calls) |
| `createLcmDatabaseConnection` leaks raw handle on PRAGMA failure | Split construction from configuration; close on failure |
| `gateway_stop` doesn't clear engine references → use-after-close | Null `deferredEngine`; guard `getEngine()` via `getDatabase()` |
| Error messages don't distinguish "not initialized" vs "closed" | Added `stopped` flag; `getDatabase()` returns state-aware messages |
| `getEngine()` returns eager `lcm` without checking stopped state | `getDatabase()` called before returning `lcm` |
| Misleading comment on rethrow path | Updated to reflect framework error handling |
| Earlier findings (sharedInit, FD leak, config staleness, type assertion) | Eliminated by v4 rewrite (eager-first approach) |

Test plan

CI passes (563 tests, 15 in lcm-command including 2 new)
Verified on macOS with 1.9GB database: openclaw gateway restart completes, no lock errors
Single gateway process after restart
LCM "Plugin loaded" banner appears in logs
Pre-existing test timeouts on large DB match main branch (not introduced by this PR)
All 13 review threads resolved

On macOS with launchd KeepAlive, gateway restarts can spawn two processes simultaneously. Both call register() and open lcm.db, causing "database is locked" errors that loop indefinitely.

Defer createLcmDatabaseConnection() and LcmContextEngine construction from register() to the gateway_start plugin hook, which fires after the HTTP server binds its port and stale PIDs are killed. Uses module-level shared state so deferred plugin reloads reuse the already-initialized connection.

Fixes Martian-Engineering#287

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR defers LCM SQLite database connection + migration work from plugin register() to the gateway_start hook to avoid macOS launchd restart races that can produce persistent “database is locked” startup failures.

Changes:

Introduces module-level shared initialization state (sharedInit) to coordinate deferred initialization and reuse an already-open DB connection across repeated register() calls.
Moves createLcmDatabaseConnection() and new LcmContextEngine(...) into a gateway_start handler, and gates lifecycle handlers on init.ready.
Updates context engine/tool/command registrations to lazily access the initialized lcm/DB via ensureInitialized().


100yenadmin · 2026-04-06T08:43:33Z

TL;DR @jalehman: this is a critical issue with the new OpenClaw update for users with a DB over 1-1.5GB, due to the time LCM takes to initialize. There are multiple possible solutions, but this was the quickest way to stop the dead-on-arrival gateway lock loop.

…taleness

Addresses Copilot review comments and adversarial audit findings:

1. Share only the DB handle at module scope; rebuild LcmContextEngine per-register() with fresh deps so hot-reloaded config takes effect.
2. Prevent unhandled promise rejection crash by attaching a no-op .catch() to the ready promise immediately after creation.
3. Close old DB connection when databasePath changes (prevents FD leak and stale locks, the exact problem this PR fixes).
4. Add gateway_stop handler to close DB cleanly on shutdown.
5. Fix half-initialized stuck state: if DB opens but engine fails in the else-if branch, properly set initError and reject the promise instead of silently swallowing.
6. Export __resetSharedInitForTests() for test isolation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

100yenadmin · 2026-04-06T08:51:34Z

Pushed a follow-up commit (151612e) addressing all three Copilot review findings plus additional issues from adversarial audit:

Review comments addressed:

Config staleness — Now sharing only the DB handle at module scope. LcmContextEngine is rebuilt per-register() with fresh deps, so hot-reloaded config (threshold, model, ignore patterns) always takes effect. Keyed cache invalidates on DB handle identity change.
FD leak on dbPath change — closeSharedDb() is called before replacing sharedInit when the database path changes. Uses closeLcmConnection() to properly close and untrack the handle.
Half-initialized stuck state — The else if branch now properly handles partial failures: if createLcmDatabaseConnection() succeeds but LcmContextEngine throws, the DB handle is closed, initError is set, and rejectReady() is called. No more permanently-pending promises.

Additional fixes from adversarial audit:

Unhandled promise rejection crash — Attached readyPromise.catch(() => {}) immediately after creation to prevent Node.js unhandledRejection if gateway_start init fails before any event handler has awaited the promise.
gateway_stop cleanup — Added gateway_stop handler that calls closeSharedDb() and nulls out the shared state. Prevents FD leaks on shutdown and ensures clean WAL checkpoint.
Test isolation — Exported __resetSharedInitForTests() so tests can cleanly reset module state between runs.
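
The no-op `.catch()` mentioned under "Unhandled promise rejection crash" can be illustrated with a small deferred-promise helper. This is a sketch under assumptions: `createDeferred` is a hypothetical name, and the PR's actual `readyPromise` wiring is not shown in this conversation.

```typescript
function createDeferred<T>() {
  let resolve!: (value: T) => void;
  let reject!: (reason: unknown) => void;
  const promise = new Promise<T>((res, rej) => {
    resolve = res;
    reject = rej;
  });
  // Mark the promise as handled so an early rejection does not trigger
  // Node's unhandledRejection. Callers that attach later still see the error,
  // because .catch() returns a new promise and leaves this one rejected.
  promise.catch(() => {});
  return { promise, resolve, reject };
}
```

The design point is that the no-op handler silences only the "nobody was listening" case; it does not swallow the failure for code that later awaits `promise`.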

Verified locally: gateway restarts cleanly, zero "database is locked" errors, zero "waiting for gateway_start" errors, single process.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.


Addresses second round of Copilot review:

1. Use closeLcmConnection(db) instead of db.close() in the eager-init failure path to keep the connection tracking maps consistent.
2. Change createLcmCommand to accept db as DatabaseSync | (() => DatabaseSync) so the deferred getter can be passed without a type assertion cast. Backward-compatible: existing callers passing a plain DatabaseSync still work via the typeof check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Major simplification addressing test failures and review concerns:

The previous approach (defer everything to gateway_start, share DB at module scope) broke tests that never fire gateway_start and introduced complexity around shared state, promise lifecycle, and config staleness.

New approach: try eager DB init immediately in register() (preserving original behavior for tests and normal startup). Only defer to gateway_start if the eager open fails with "database is locked", the specific error from the macOS launchd orphan-process race.

This eliminates:

- Module-level shared state (no more sharedDb, no test pollution)
- Promise lifecycle complexity (no unhandled rejection risk in normal path)
- Config staleness (engine built with fresh deps every register())
- The need for __resetSharedInitForTests()

Each register() call gets its own DB handle and engine, matching the original code's behavior. The only difference: lock errors are caught and retried via gateway_start instead of looping forever.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

100yenadmin · 2026-04-06T09:41:48Z

v4 (final) — Eager-first, defer on lock only

Pushed commit 6d8523d which significantly simplifies the approach after 3 rounds of review + adversarial audit.

What changed from v2/v3 → v4

Removed: All module-level shared state (sharedDb, sharedInit), promise lifecycle complexity, __resetSharedInitForTests(). Each register() call is now fully independent — no cross-call coordination.

Kept: The core fix. createLcmDatabaseConnection() is wrapped in a try/catch. If it throws "database is locked", we defer to gateway_start. Everything else works identically to the original code.

Why this is better

The lock race is between processes (two gateways), not between multiple register() calls in the same process. There was never a reason to share DB handles at module scope — that was over-engineering that caused test pollution, config staleness, and promise hazards.

Test results

CI: 561/561 pass ✅
Local (1.9GB DB): 559 pass, 2 timeout — same 2 tests timeout on main too (pre-existing, caused by running compaction on the real 1.9GB database). Not introduced by this PR.
Adversarial audit: 0 CRITICAL, 0 HIGH. Two MEDIUM (tool factories don't await deferredReady like event handlers do — but tools can't be called before gateway is up anyway).

All prior Copilot review comments resolved

~~Config staleness~~ — No shared engine; each register() builds its own with fresh deps
~~FD leak on dbPath change~~ — No shared handles to leak
~~Half-initialized stuck state~~ — No shared promise state; lock catch is simple try/catch
~~db.close() bypasses tracking~~ — gateway_stop uses closeLcmConnection()
~~as cast on createLcmCommand~~ — Changed to accept (() => DatabaseSync) callback

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.


…fter-close

- Move getDb() into status/doctor branches so /lossless help never resolves the database (review comment lcm-command.ts:733)
- Close raw DatabaseSync handle when PRAGMA setup fails in createLcmDatabaseConnection to prevent FD leaks (review comment index.ts:1586)
- Clear deferredEngine on gateway_stop and guard getEngine() against closed database to prevent use-after-close (review comment index.ts:1642)
- Add tests covering the db: () => DatabaseSync lazy path: help must not invoke the resolver, status must (review comment lcm-command.ts:720)

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.


getDatabase() now distinguishes "closed after gateway_stop" from "not yet initialized" with a stopped flag. getEngine() delegates to getDatabase() instead of duplicating the null check with its own misleading message.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.


- Call getDatabase() before returning eagerly-constructed lcm so post-gateway_stop calls fail fast instead of returning an engine backed by a closed DB handle
- Update rethrow comment to accurately describe error propagation (framework handles it, not the engine constructor)

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.


100yenadmin · 2026-04-06T14:24:24Z

Clean and ready for review @jalehman 🖤

evaluateLeafTrigger now accepts precomputedTokenCount so callers that already fetched the context token count (compactLeaf, compactFullSweep) can pass it through instead of re-querying.

On a 1.9GB SQLite database with 5+ concurrent agent sessions, every DB read acquires a shared lock and adds contention. The duplicate reads were dismissed as "~1ms" but on large databases under concurrent load, they contribute to the lock pressure that caused the gateway lockups fixed in PR Martian-Engineering#288.

The afterTurn path (via engine wrapper) still does one read since it doesn't pre-fetch; this is the correct behavior for that path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
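
The pass-through idea in the commit above might be sketched like this. Only the name `precomputedTokenCount` comes from the commit message; the signature, the `readTokenCount` callback, and the threshold comparison are assumptions made for illustration.

```typescript
// Hypothetical signature; only the optional pass-through parameter is from the commit message.
function evaluateLeafTrigger(
  readTokenCount: () => number, // stands in for the database query
  threshold: number,
  precomputedTokenCount?: number,
): boolean {
  // Query only when the caller has not already fetched the count.
  const tokens = precomputedTokenCount ?? readTokenCount();
  return tokens >= threshold;
}
```

Using `??` rather than `||` matters here: a precomputed count of `0` is a valid value and should not trigger a second read.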

100yenadmin · 2026-04-06T16:18:04Z

@jalehman Ready for merge — CI green, all review comments resolved.

100yenadmin · 2026-04-06T16:39:43Z

Part of the LCM Performance & Cache Optimization Sprint — see #297 for the full tracking issue linking all 5 PRs.

100yenadmin · 2026-04-06T17:54:02Z

Merge order: 1st — No dependencies. Merge first — standalone DB lock fix.

See #297 for the full sprint tracking issue with all 5 PRs.

Recommended merge sequence: #288 → #294 → #289 → #295 → #296

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.


When eager DB open hits a lock during gateway restart, share one deferred initialization promise across context-engine resolution, tools, commands, and lifecycle hooks so the first request waits for gateway_start instead of failing. Persist deferred retry failures so later callers see the real error, and add a patch changeset for the user-visible startup fix.

Regeneration-Prompt: |
  Follow up on PR 288's deferred SQLite startup path for lossless-claw. The lock-contention fallback must not move the failure from plugin load to the first request: context engine resolution, plugin tools, commands, and lifecycle hooks should all await the same deferred initialization when the initial open fails with "database is locked" during macOS launchd restarts. If the deferred retry also fails, retain and rethrow that real error instead of misleading callers with a perpetual "waiting for gateway_start" message. Keep the eager-success path intact, add focused regression coverage for deferred success and deferred failure, and include the missing patch changeset because this changes user-visible runtime behavior.
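
One way to picture the shared deferred initialization described in this commit: every consumer awaits the same promise, and a failed retry stays stored in that promise so later callers see the real error. `createDeferredInit`, `onGatewayStart`, and the `Engine` shape are hypothetical names for this sketch; `waitForEngine` appears in the follow-up PR description, but its real signature is not shown here.

```typescript
type Engine = { id: string };

function createDeferredInit(open: () => Engine) {
  let settle!: { resolve: (e: Engine) => void; reject: (err: unknown) => void };
  const ready = new Promise<Engine>((resolve, reject) => {
    settle = { resolve, reject };
  });
  ready.catch(() => {}); // avoid unhandledRejection if nobody is waiting yet

  return {
    /** Called from the gateway_start hook: retry the open once. */
    onGatewayStart(): void {
      try {
        settle.resolve(open());
      } catch (err) {
        settle.reject(err); // the rejection persists for every later awaiter
      }
    },
    /** Tools, commands, and lifecycle hooks all await this same promise. */
    waitForEngine: (): Promise<Engine> => ready,
  };
}
```

A request that arrives before `gateway_start` simply waits on `waitForEngine()` instead of failing, which is the behavior the commit message asks for.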

jalehman · 2026-04-06T20:24:47Z

Thank you!

## Problem

OpenClaw v2026.4.5+ calls plugin register() per-agent-context (main, subagents, cron lanes), not once at startup. Each call opens a new DB connection and runs migrations, causing "Migration failed: database is locked" storms on large databases. PR Martian-Engineering#288's deferred-init fix was merged but does not address this per-context re-registration.

## Solution

### Singleton DB + engine (critical fix)

Uses globalThis + Symbol.for() singleton (same pattern as startup-banner-log.ts) keyed on normalized dbPath. When register() is called again with the same DB path, it skips init entirely and wires handlers to the existing waitForEngine/waitForDatabase closures via wirePluginHandlers(). gateway_stop clears the singleton so a fresh init occurs on restart. The shared state stores only the closures (not mutable copies of database/lcm locals), avoiding stale-reference bugs.

### Fallback provider config (additive)

- Add fallbackProviders config field (env: LCM_FALLBACK_PROVIDERS, format: provider/model,provider/model) for explicit compaction summarization fallbacks
- Append to existing 5-level candidate chain with dedup
- Exponential backoff (500ms→8s) between candidate retries
- PROVIDER FALLBACK / ALL PROVIDERS EXHAUSTED messages on stderr
- Half-threshold early warning and CIRCUIT BREAKER OPEN/CLOSED messages with cooldown time
- Startup banner for configured fallback providers
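
The globalThis + Symbol.for() singleton named in this commit message could look something like the sketch below. The registry key string, the `Shared` shape, and the function names are all hypothetical, and the sketch assumes the caller has already normalized `dbPath`.

```typescript
// Hypothetical registry key and state shape; the PR names the pattern, not these identifiers.
const REGISTRY_KEY = Symbol.for("example.lcm.singletons");
type Shared = { waitForDatabase: () => unknown };

function registry(): Map<string, Shared> {
  // Symbol.for() returns the same symbol in every module copy, so the map
  // is shared even if this file is loaded more than once in a process.
  const g = globalThis as unknown as Record<symbol, Map<string, Shared> | undefined>;
  return (g[REGISTRY_KEY] ??= new Map<string, Shared>());
}

/** `normalizedDbPath` is assumed to be normalized by the caller. */
function getOrInit(normalizedDbPath: string, init: () => Shared): Shared {
  const existing = registry().get(normalizedDbPath);
  if (existing) return existing; // repeat register(): skip init entirely
  const created = init();
  registry().set(normalizedDbPath, created);
  return created;
}

function clearOnGatewayStop(normalizedDbPath: string): void {
  registry().delete(normalizedDbPath); // next register() performs a fresh init
}
```

Storing only closures in `Shared`, as the commit message describes, means a re-registered context always reaches the current handle through the closure rather than through a stale copy.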

100yenadmin · 2026-04-06T20:54:19Z

Hey @nicobailon — heads up, our last commit (singleton DB init + fallback providers) landed seconds after this was merged, so it didn't make it in.

While investigating the production logs post-merge, we found a second issue: OpenClaw v2026.4.5 calls register() per-agent-context (main, subagents, cron lanes), not once at startup. This means every subagent spawn opens a new DB connection and runs migrations — causing the same "Migration failed: database is locked" storms the deferred-init fix was meant to prevent, but from within the same process rather than across two processes.

Production logs showed 478 re-registrations in a single gateway session with repeated migration lock failures.

Follow-up PR: #302

Singleton DB + engine per dbPath (reuses existing connection on repeat register() calls)
Fallback provider config with exponential backoff and degradation logging
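
To make the fallback-provider item concrete, here is a small sketch of the two mechanical pieces: parsing the `provider/model,provider/model` list with dedup, and computing a capped exponential delay. The function names are hypothetical, and the doubling schedule is an assumption; the PR text states only "500ms→8s".

```typescript
/** Append comma-separated fallbacks to the existing chain, dropping duplicates. */
function buildCandidates(existing: string[], fallbackProviders: string): string[] {
  const parsed = fallbackProviders
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
  // A Set keeps first-insertion order, so existing candidates stay in front.
  return Array.from(new Set([...existing, ...parsed]));
}

/** Delay before retry `attempt` (0-based): assumed to double from 500ms up to an 8s cap. */
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

Under the doubling assumption, attempts 0 through 4 wait 500, 1000, 2000, 4000, and 8000 ms, and every later attempt stays at the 8000 ms cap.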


* fix: singleton DB init per dbPath + fallback provider config

## Problem

OpenClaw v2026.4.5+ calls plugin register() per-agent-context (main, subagents, cron lanes), not once at startup. Each call opens a new DB connection and runs migrations, causing "Migration failed: database is locked" storms on large databases. PR #288's deferred-init fix was merged but does not address this per-context re-registration.

## Solution

### Singleton DB + engine (critical fix)

Uses globalThis + Symbol.for() singleton (same pattern as startup-banner-log.ts) keyed on normalized dbPath. When register() is called again with the same DB path, it skips init entirely and wires handlers to the existing waitForEngine/waitForDatabase closures via wirePluginHandlers(). gateway_stop clears the singleton so a fresh init occurs on restart. The shared state stores only the closures (not mutable copies of database/lcm locals), avoiding stale-reference bugs.

### Fallback provider config (additive)

- Add fallbackProviders config field (env: LCM_FALLBACK_PROVIDERS, format: provider/model,provider/model) for explicit compaction summarization fallbacks
- Append to existing 5-level candidate chain with dedup
- Exponential backoff (500ms→8s) between candidate retries
- PROVIDER FALLBACK / ALL PROVIDERS EXHAUSTED messages on stderr
- Half-threshold early warning and CIRCUIT BREAKER OPEN/CLOSED messages with cooldown time
- Startup banner for configured fallback providers

* fix: handle terminal summarizer exhaustion fallback

Route terminal non-auth provider failures through the shared exhaustion handler so deterministic truncation actually runs, add regression coverage, and include a changeset for the runtime behavior fix.

Regeneration-Prompt: |
  Address the PR review finding in the multi-provider summarizer fallback path. The existing code added an ALL PROVIDERS EXHAUSTED log after the candidate loop, but the loop always returned, continued, or threw before that block could execute. Preserve existing auth-failure behavior because LcmProviderAuthError is used intentionally by compaction and the circuit breaker, but make terminal non-auth failures fall through to one shared exhaustion path that logs clearly and returns buildDeterministicFallbackSummary instead of an empty string. Add a focused regression test that exhausts all resolved non-auth candidates and proves both the terminal log and deterministic fallback behavior. Add a patch changeset because this changes runtime behavior and logging for plugin summarization fallback.

---------

Co-authored-by: Eva <eva@100yen.org>
Co-authored-by: Josh Lehman <josh@martian.engineering>
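
The exhaustion fix described in the second commit can be pictured as a candidate loop whose non-auth failures fall through to one shared terminal path. This is a sketch only: `summarizeWithFallback` and its parameters are hypothetical, and rethrowing auth errors is an assumption about what "preserve existing auth-failure behavior" means in the real code.

```typescript
function summarizeWithFallback(
  candidates: Array<() => string>,
  isAuthError: (err: unknown) => boolean,
  deterministicFallback: () => string,
  log: (message: string) => void,
): string {
  for (const attempt of candidates) {
    try {
      return attempt(); // first successful provider wins
    } catch (err) {
      if (isAuthError(err)) throw err; // assumption: auth failures propagate as before
      // Non-auth failure: move on to the next candidate.
    }
  }
  // Reached only when every candidate failed with a non-auth error.
  log("ALL PROVIDERS EXHAUSTED");
  return deterministicFallback();
}
```

The key property is that the code after the loop is reachable: no branch inside the loop returns an empty string or throws for a terminal non-auth failure.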

Copilot AI review requested due to automatic review settings April 6, 2026 08:38

Copilot started reviewing on behalf of 100yenadmin April 6, 2026 08:38