fix: defer DB init to gateway_start hook to prevent database lock race#288
Conversation
On macOS with launchd KeepAlive, gateway restarts can spawn two processes simultaneously. Both call register() and open lcm.db, causing "database is locked" errors that loop indefinitely. Defer createLcmDatabaseConnection() and LcmContextEngine construction from register() to the gateway_start plugin hook, which fires after the HTTP server binds its port and stale PIDs are killed. Uses module-level shared state so deferred plugin reloads reuse the already-initialized connection. Fixes Martian-Engineering#287 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR defers LCM SQLite database connection + migration work from plugin register() to the gateway_start hook to avoid macOS launchd restart races that can produce persistent “database is locked” startup failures.
Changes:
- Introduces module-level shared initialization state (
sharedInit) to coordinate deferred initialization and reuse an already-open DB connection across repeatedregister()calls. - Moves
createLcmDatabaseConnection()andnew LcmContextEngine(...)into agateway_starthandler, and gates lifecycle handlers oninit.ready. - Updates context engine/tool/command registrations to lazily access the initialized
lcm/DB viaensureInitialized().
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
TLDR @jalehman this is critical issue with the new OpenClaw update for users with DB over 1-1.5GB due to the time it takes LCM to initialize (multiple solutions to this but this was quickest to stop DOA gateway loop lock). |
…taleness Addresses Copilot review comments and adversarial audit findings: 1. Share only the DB handle at module scope; rebuild LcmContextEngine per-register() with fresh deps so hot-reloaded config takes effect. 2. Prevent unhandled promise rejection crash by attaching a no-op .catch() to the ready promise immediately after creation. 3. Close old DB connection when databasePath changes (prevents FD leak and stale locks — the exact problem this PR fixes). 4. Add gateway_stop handler to close DB cleanly on shutdown. 5. Fix half-initialized stuck state: if DB opens but engine fails in the else-if branch, properly set initError and reject the promise instead of silently swallowing. 6. Export __resetSharedInitForTests() for test isolation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Pushed a follow-up commit (151612e) addressing all three Copilot review findings plus additional issues from adversarial audit: Review comments addressed:
Additional fixes from adversarial audit:
Verified locally: gateway restarts cleanly, zero "database is locked" errors, zero "waiting for gateway_start" errors, single process. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Addresses second round of Copilot review: 1. Use closeLcmConnection(db) instead of db.close() in the eager-init failure path to keep the connection tracking maps consistent. 2. Change createLcmCommand to accept db as DatabaseSync | (() => DatabaseSync) so the deferred getter can be passed without a type assertion cast. Backward-compatible: existing callers passing a plain DatabaseSync still work via the typeof check. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major simplification addressing test failures and review concerns: The previous approach (defer everything to gateway_start, share DB at module scope) broke tests that never fire gateway_start and introduced complexity around shared state, promise lifecycle, and config staleness. New approach: try eager DB init immediately in register() (preserving original behavior for tests and normal startup). Only defer to gateway_start if the eager open fails with "database is locked" — the specific error from the macOS launchd orphan-process race. This eliminates: - Module-level shared state (no more sharedDb, no test pollution) - Promise lifecycle complexity (no unhandled rejection risk in normal path) - Config staleness (engine built with fresh deps every register()) - The need for __resetSharedInitForTests() Each register() call gets its own DB handle and engine, matching the original code's behavior. The only difference: lock errors are caught and retried via gateway_start instead of looping forever. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v4 (final) — Eager-first, defer on lock onlyPushed commit 6d8523d which significantly simplifies the approach after 3 rounds of review + adversarial audit. What changed from v2/v3 → v4Removed: All module-level shared state ( Kept: The core fix. Why this is betterThe lock race is between processes (two gateways), not between multiple Test results
All prior Copilot review comments resolved
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…fter-close - Move getDb() into status/doctor branches so /lossless help never resolves the database (review comment lcm-command.ts:733) - Close raw DatabaseSync handle when PRAGMA setup fails in createLcmDatabaseConnection to prevent FD leaks (review comment index.ts:1586) - Clear deferredEngine on gateway_stop and guard getEngine() against closed database to prevent use-after-close (review comment index.ts:1642) - Add tests covering the db: () => DatabaseSync lazy path: help must not invoke the resolver, status must (review comment lcm-command.ts:720)
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
getDatabase() now distinguishes "closed after gateway_stop" from "not yet initialized" with a stopped flag. getEngine() delegates to getDatabase() instead of duplicating the null check with its own misleading message.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Call getDatabase() before returning eagerly-constructed lcm so post-gateway_stop calls fail fast instead of returning an engine backed by a closed DB handle - Update rethrow comment to accurately describe error propagation (framework handles it, not the engine constructor)
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Clean and ready for review @jalehman 🖤 |
evaluateLeafTrigger now accepts precomputedTokenCount so callers that already fetched the context token count (compactLeaf, compactFullSweep) can pass it through instead of re-querying. On a 1.9GB SQLite database with 5+ concurrent agent sessions, every DB read acquires a shared lock and adds contention. The duplicate reads were dismissed as "~1ms" but on large databases under concurrent load, they contribute to the lock pressure that caused the gateway lockups fixed in PR Martian-Engineering#288. The afterTurn path (via engine wrapper) still does one read since it doesn't pre-fetch — this is the correct behavior for that path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@jalehman Ready for merge — CI green, all review comments resolved. |
|
Part of the LCM Performance & Cache Optimization Sprint — see #297 for the full tracking issue linking all 5 PRs. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
When eager DB open hits a lock during gateway restart, share one deferred initialization promise across context-engine resolution, tools, commands, and lifecycle hooks so the first request waits for gateway_start instead of failing. Persist deferred retry failures so later callers see the real error, and add a patch changeset for the user-visible startup fix. Regeneration-Prompt: | Follow up on PR 288's deferred SQLite startup path for lossless-claw. The lock-contention fallback must not move the failure from plugin load to the first request: context engine resolution, plugin tools, commands, and lifecycle hooks should all await the same deferred initialization when the initial open fails with "database is locked" during macOS launchd restarts. If the deferred retry also fails, retain and rethrow that real error instead of misleading callers with a perpetual "waiting for gateway_start" message. Keep the eager-success path intact, add focused regression coverage for deferred success and deferred failure, and include the missing patch changeset because this changes user-visible runtime behavior.
|
Thank you! |
## Problem OpenClaw v2026.4.5+ calls plugin register() per-agent-context (main, subagents, cron lanes) — not once at startup. Each call opens a new DB connection and runs migrations, causing "Migration failed: database is locked" storms on large databases. PR Martian-Engineering#288's deferred-init fix was merged but does not address this per-context re-registration. ## Solution ### Singleton DB + engine (critical fix) Uses globalThis + Symbol.for() singleton (same pattern as startup-banner-log.ts) keyed on normalized dbPath. When register() is called again with the same DB path, it skips init entirely and wires handlers to the existing waitForEngine/waitForDatabase closures via wirePluginHandlers(). gateway_stop clears the singleton so a fresh init occurs on restart. The shared state stores only the closures (not mutable copies of database/lcm locals), avoiding stale-reference bugs. ### Fallback provider config (additive) - Add fallbackProviders config field (env: LCM_FALLBACK_PROVIDERS, format: provider/model,provider/model) for explicit compaction summarization fallbacks - Append to existing 5-level candidate chain with dedup - Exponential backoff (500ms→8s) between candidate retries - PROVIDER FALLBACK / ALL PROVIDERS EXHAUSTED messages on stderr - Half-threshold early warning and CIRCUIT BREAKER OPEN/CLOSED messages with cooldown time - Startup banner for configured fallback providers
|
Hey @nicobailon — heads up, our last commit (singleton DB init + fallback providers) landed seconds after this was merged, so it didn't make it in. While investigating the production logs post-merge, we found a second issue: OpenClaw v2026.4.5 calls Production logs showed 478 re-registrations in a single gateway session with repeated migration lock failures. Follow-up PR: #302
|
## Problem OpenClaw v2026.4.5+ calls plugin register() per-agent-context (main, subagents, cron lanes) — not once at startup. Each call opens a new DB connection and runs migrations, causing "Migration failed: database is locked" storms on large databases. PR Martian-Engineering#288's deferred-init fix was merged but does not address this per-context re-registration. ## Solution ### Singleton DB + engine (critical fix) Uses globalThis + Symbol.for() singleton (same pattern as startup-banner-log.ts) keyed on normalized dbPath. When register() is called again with the same DB path, it skips init entirely and wires handlers to the existing waitForEngine/waitForDatabase closures via wirePluginHandlers(). gateway_stop clears the singleton so a fresh init occurs on restart. The shared state stores only the closures (not mutable copies of database/lcm locals), avoiding stale-reference bugs. ### Fallback provider config (additive) - Add fallbackProviders config field (env: LCM_FALLBACK_PROVIDERS, format: provider/model,provider/model) for explicit compaction summarization fallbacks - Append to existing 5-level candidate chain with dedup - Exponential backoff (500ms→8s) between candidate retries - PROVIDER FALLBACK / ALL PROVIDERS EXHAUSTED messages on stderr - Half-threshold early warning and CIRCUIT BREAKER OPEN/CLOSED messages with cooldown time - Startup banner for configured fallback providers
## Problem OpenClaw v2026.4.5+ calls plugin register() per-agent-context (main, subagents, cron lanes) — not once at startup. Each call opens a new DB connection and runs migrations, causing "Migration failed: database is locked" storms on large databases. PR Martian-Engineering#288's deferred-init fix was merged but does not address this per-context re-registration. ## Solution ### Singleton DB + engine (critical fix) Uses globalThis + Symbol.for() singleton (same pattern as startup-banner-log.ts) keyed on normalized dbPath. When register() is called again with the same DB path, it skips init entirely and wires handlers to the existing waitForEngine/waitForDatabase closures via wirePluginHandlers(). gateway_stop clears the singleton so a fresh init occurs on restart. The shared state stores only the closures (not mutable copies of database/lcm locals), avoiding stale-reference bugs. ### Fallback provider config (additive) - Add fallbackProviders config field (env: LCM_FALLBACK_PROVIDERS, format: provider/model,provider/model) for explicit compaction summarization fallbacks - Append to existing 5-level candidate chain with dedup - Exponential backoff (500ms→8s) between candidate retries - PROVIDER FALLBACK / ALL PROVIDERS EXHAUSTED messages on stderr - Half-threshold early warning and CIRCUIT BREAKER OPEN/CLOSED messages with cooldown time - Startup banner for configured fallback providers
* fix: singleton DB init per dbPath + fallback provider config ## Problem OpenClaw v2026.4.5+ calls plugin register() per-agent-context (main, subagents, cron lanes) — not once at startup. Each call opens a new DB connection and runs migrations, causing "Migration failed: database is locked" storms on large databases. PR #288's deferred-init fix was merged but does not address this per-context re-registration. ## Solution ### Singleton DB + engine (critical fix) Uses globalThis + Symbol.for() singleton (same pattern as startup-banner-log.ts) keyed on normalized dbPath. When register() is called again with the same DB path, it skips init entirely and wires handlers to the existing waitForEngine/waitForDatabase closures via wirePluginHandlers(). gateway_stop clears the singleton so a fresh init occurs on restart. The shared state stores only the closures (not mutable copies of database/lcm locals), avoiding stale-reference bugs. ### Fallback provider config (additive) - Add fallbackProviders config field (env: LCM_FALLBACK_PROVIDERS, format: provider/model,provider/model) for explicit compaction summarization fallbacks - Append to existing 5-level candidate chain with dedup - Exponential backoff (500ms→8s) between candidate retries - PROVIDER FALLBACK / ALL PROVIDERS EXHAUSTED messages on stderr - Half-threshold early warning and CIRCUIT BREAKER OPEN/CLOSED messages with cooldown time - Startup banner for configured fallback providers * fix: handle terminal summarizer exhaustion fallback Route terminal non-auth provider failures through the shared exhaustion handler so deterministic truncation actually runs, add regression coverage, and include a changeset for the runtime behavior fix. Regeneration-Prompt: | Address the PR review finding in the multi-provider summarizer fallback path. The existing code added an ALL PROVIDERS EXHAUSTED log after the candidate loop, but the loop always returned, continued, or threw before that block could execute. Preserve existing auth-failure behavior because LcmProviderAuthError is used intentionally by compaction and the circuit breaker, but make terminal non-auth failures fall through to one shared exhaustion path that logs clearly and returns buildDeterministicFallbackSummary instead of an empty string. Add a focused regression test that exhausts all resolved non-auth candidates and proves both the terminal log and deterministic fallback behavior. Add a patch changeset because this changes runtime behavior and logging for plugin summarization fallback. --------- Co-authored-by: Eva <eva@100yen.org> Co-authored-by: Josh Lehman <josh@martian.engineering>
Summary
Prevents "database is locked" errors during macOS launchd-managed gateway restarts by catching SQLite lock errors during plugin
register()and deferring the DB open to thegateway_starthook.Fixes #287
Problem
On macOS with launchd
KeepAlive: true+ThrottleInterval: 1, gateway restarts can spawn two processes simultaneously. Both callregister()and immediately openlcm.db, but only one can acquire the write lock. The other loops "Migration failed: database is locked" indefinitely.The gateway's stale-PID cleanup runs at port-bind time, which happens after plugin
register(). So by the time the orphan is killed, LCM has already failed.Solution: Eager-first, defer on lock
Try to open the DB eagerly in
register()— preserving the original behavior for tests and normal startup. Only when the open fails with "database is locked" does it defer togateway_start(which fires after port bind + stale PID cleanup).This is a strict superset of the original behavior — identical when there's no lock contention.
Changes by file
src/plugin/index.ts(+87/-8)createLcmDatabaseConnection()in try/catch; only "database is locked" errors trigger deferral togateway_startgateway_stophandler: closes the DB connection viacloseLcmConnection(), nullsdatabaseanddeferredEngine, setsstoppedflaggetDatabase(): state-aware guard — distinguishes "not yet initialized" (deferred path) from "closed after gateway_stop" for actionable error messagesgetEngine(): validates DB is still open viagetDatabase()before returning any engine (eager or deferred), preventing use-after-closebefore_reset,session_end): awaitdeferredReadybefore accessing enginegetEngine()instead of capturinglcmdirectly() => getDatabase()instead of the raw handlesrc/plugin/lcm-command.ts(+6/-5)createLcmCommandacceptsdb: DatabaseSync | (() => DatabaseSync)for lazy DB resolution (backward-compatible)getDb()called only in branches that need it (status,doctor) —/lossless helpnever resolves the DBsrc/db/connection.ts(+7/-1)createLcmDatabaseConnection: separatesnew DatabaseSync()fromconfigureConnection()(PRAGMAs); if PRAGMA setup fails, the raw handle is closed before rethrowing to prevent FD leakstest/lcm-command.test.ts(+27)helpdoes not invoke the DB resolver; verifiesstatusdoes invoke itReview history
All review comments (13 threads across 4 Copilot reviews) have been addressed and resolved:
getDb()resolved before subcommand parsing; help fails unnecessarilystatus/doctorbranches onlydb: () => DatabaseSyncpathcreateLcmDatabaseConnectionleaks raw handle on PRAGMA failuregateway_stopdoesn't clear engine references → use-after-closedeferredEngine; guardgetEngine()viagetDatabase()stoppedflag;getDatabase()returns state-aware messagesgetEngine()returns eagerlcmwithout checking stopped stategetDatabase()called before returninglcmTest plan
openclaw gateway restartcompletes, no lock errorsmainbranch (not introduced by this PR)