pglite-lock force-removes locks of LIVE processes after 5 minutes (no heartbeat) — concurrent serve processes share the single-writer data dir; WAL corruption observed #2058

Found during a deep audit of gbrain v0.42.26.0 after a real PGLite WAL corruption event.

## Mechanics (src/core/pglite-lock.ts)
- `STALE_THRESHOLD_MS = 5 min` (line ~22); the acquire loop force-removes a lock whose holder PID is **alive** once the lock is older than 5 minutes ("Still alive but probably stuck — force remove", lines ~82-85).
- `acquired_at` is written once at acquisition (lines ~102-106) and never refreshed — there is no heartbeat, so **every** healthy long-lived holder is classified stale after 5 minutes.
- `releaseLock` (lines ~140-148) `rmSync`s the shared lock dir without checking ownership, so a finishing short-lived CLI process deletes whichever holder's marker is present.
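
To make the first two points concrete, here is a minimal model of the staleness decision as described above. It is a sketch, not the actual gbrain source: `LockInfo`, `Decision`, and `decideAsDescribed` are hypothetical names, and only `STALE_THRESHOLD_MS` and the 5-minute value come from the module.

```typescript
// Hypothetical model of the acquire-loop decision described in this issue.
// Not copied from src/core/pglite-lock.ts; names other than
// STALE_THRESHOLD_MS are illustrative assumptions.
const STALE_THRESHOLD_MS = 5 * 60 * 1000;

interface LockInfo {
  pid: number;
  acquiredAt: number; // epoch ms, written once at acquisition and never refreshed
}

type Decision = "wait" | "force-remove";

function decideAsDescribed(lock: LockInfo, holderAlive: boolean, now: number): Decision {
  // A dead holder is always safe to clean up.
  if (!holderAlive) return "force-remove";
  // The problematic branch: a live holder is judged only by lock age.
  const age = now - lock.acquiredAt;
  return age > STALE_THRESHOLD_MS ? "force-remove" : "wait";
}

// A healthy long-lived `serve` that acquired the lock at t0.
const t0 = 0;
const serveLock: LockInfo = { pid: 4242, acquiredAt: t0 };

const atFourMinutes = decideAsDescribed(serveLock, true, t0 + 4 * 60_000);
const atSixMinutes = decideAsDescribed(serveLock, true, t0 + 6 * 60_000);

console.log(atFourMinutes, atSixMinutes); // prints "wait force-remove"
```

Because `acquiredAt` never moves, the live holder flips from "wait" to "force-remove" purely with the passage of time, regardless of how healthy it is.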

## Consequence
`gbrain serve` connects the engine at startup and holds it for the process lifetime (cli.ts: "serve doesn't disconnect"). With per-client stdio MCP registration, Claude Desktop and every Claude Code/Codex session each spawn their own serve. Any serve that starts more than 5 minutes after the current holder acquired the lock steals it, and the next late starter then steals it from that one in turn.

Observed live: **4 concurrent `gbrain serve` processes each holding 40+ read-write FDs inside the same `brain.pglite` data dir (two sharing the same WAL segment), with no `.gbrain-lock` dir present at all**. This is exactly the concurrent access that the module's own header says PGLite cannot survive (`Aborted()`).

We experienced a WAL/checkpoint corruption that required a full rename-aside + reimport, plus recurring "PGLite lock contention persisted after 3 attempts" failures in scheduled sync runs.

## Suggested fix
Mirror the design already present in `src/core/db-lock.ts` (`last_refreshed_at`, lines ~154-160):
1. Heartbeat: refresh a timestamp in the lock file every ~60s (unref'd interval) while the engine is connected.
2. Steal only when the holder PID is dead OR the heartbeat (not `acquired_at`) is stale.
3. `releaseLock` should verify the lock file's pid === process.pid before `rmSync`.
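
The three steps could look roughly like the sketch below. It is written under assumptions rather than taken from either lock module: the JSON layout, the `lock.json` file name, and the helpers `shouldSteal`, `startHeartbeat`, `isPidAlive`, `readLock`, and `writeLock` are all hypothetical; only `last_refreshed_at`, `acquired_at`, `releaseLock`, the ~60s cadence, and the 5-minute threshold come from this issue.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const HEARTBEAT_INTERVAL_MS = 60_000;     // ~60s refresh, as suggested in step 1
const STALE_THRESHOLD_MS = 5 * 60 * 1000; // same window, now applied to the heartbeat

interface LockRecord {
  pid: number;
  acquired_at: number;       // epoch ms, informational only
  last_refreshed_at: number; // epoch ms, bumped by the heartbeat
}

function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0 only tests for existence
    return true;
  } catch (err) {
    // EPERM: the process exists but is owned by another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

function writeLock(lockFile: string, rec: LockRecord): void {
  fs.writeFileSync(lockFile, JSON.stringify(rec));
}

function readLock(lockFile: string): LockRecord | null {
  try {
    return JSON.parse(fs.readFileSync(lockFile, "utf8")) as LockRecord;
  } catch {
    return null; // missing or unreadable lock file
  }
}

// Step 2: steal only if the holder is dead OR its heartbeat went silent.
function shouldSteal(lock: LockRecord, holderAlive: boolean, now: number): boolean {
  if (!holderAlive) return true;
  return now - lock.last_refreshed_at > STALE_THRESHOLD_MS;
}

// Step 1: refresh the timestamp while the engine is connected.
// unref() keeps the timer from holding the process open at exit.
function startHeartbeat(lockFile: string): NodeJS.Timeout {
  const timer = setInterval(() => {
    const rec = readLock(lockFile);
    if (rec && rec.pid === process.pid) {
      writeLock(lockFile, { ...rec, last_refreshed_at: Date.now() });
    }
  }, HEARTBEAT_INTERVAL_MS);
  timer.unref();
  return timer;
}

// Step 3: remove the lock dir only if this process owns the lock.
function releaseLock(lockDir: string, lockFile: string): boolean {
  const rec = readLock(lockFile);
  if (!rec || rec.pid !== process.pid) return false;
  fs.rmSync(lockDir, { recursive: true, force: true });
  return true;
}

// Demo against a throwaway directory.
const lockDir = fs.mkdtempSync(path.join(os.tmpdir(), "lock-sketch-"));
const lockFile = path.join(lockDir, "lock.json");
const now = Date.now();
const tenMinAgo = now - 10 * 60_000;

// Held for 10 minutes, heartbeat 30s ago: must not be stolen.
writeLock(lockFile, { pid: process.pid, acquired_at: tenMinAgo, last_refreshed_at: now - 30_000 });
const hb = startHeartbeat(lockFile);
const stealHealthy = shouldSteal(readLock(lockFile)!, isPidAlive(process.pid), now);

// Live PID but heartbeat silent for 6 minutes: eligible for stealing.
const stealSilent = shouldSteal(
  { pid: process.pid, acquired_at: tenMinAgo, last_refreshed_at: now - 6 * 60_000 }, true, now);

// A lock written by another PID must survive our release attempt.
writeLock(lockFile, { pid: process.pid + 1, acquired_at: now, last_refreshed_at: now });
const releasedForeign = releaseLock(lockDir, lockFile);
const dirSurvivedForeignRelease = fs.existsSync(lockDir);

writeLock(lockFile, { pid: process.pid, acquired_at: now, last_refreshed_at: now });
const releasedOwn = releaseLock(lockDir, lockFile);
clearInterval(hb);

console.log({ stealHealthy, stealSilent, releasedForeign, releasedOwn });
```

The read-then-write in the heartbeat and in `releaseLock` is not atomic, so a real implementation would likely need the same care `db-lock.ts` takes around concurrent updates; the sketch only illustrates which timestamp and which PID each decision should consult.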

Note the behavior change: a second stdio serve then fails fast with the actionable 30s timeout instead of silently sharing the data dir — correct for a single-writer database, and it makes the documented `serve --http` single-daemon mode the explicit multi-client answer.
