Found during a deep audit of gbrain v0.42.26.0 after a real PGLite WAL corruption event.
Mechanics (src/core/pglite-lock.ts)
STALE_THRESHOLD_MS = 5 min (line ~22); the acquire loop force-removes a lock whose holder PID is alive once the lock is older than 5 minutes ("Still alive but probably stuck — force remove", lines ~82-85).
acquired_at is written once at acquisition (lines ~102-106) and never refreshed — there is no heartbeat, so every healthy long-lived holder is classified stale after 5 minutes.
releaseLock (lines ~140-148) rmSyncs the shared lock dir without checking ownership, so a finishing short-lived CLI process deletes whichever holder's marker is present.
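The staleness rule described above can be modeled in a few lines. This is a minimal sketch, not the code in pglite-lock.ts: `LockInfo`, `isStaleByAcquiredAt`, and the strict `>` comparison are assumptions made for illustration; only `STALE_THRESHOLD_MS` and `acquired_at` come from the module.

```typescript
// Hypothetical model of the staleness check; not copied from pglite-lock.ts.
const STALE_THRESHOLD_MS = 5 * 60 * 1000;

interface LockInfo {
  pid: number;
  acquired_at: number; // epoch ms, written once at acquisition, never refreshed
}

// Age is measured from acquisition, so holder liveness plays no part
// once the threshold has passed.
function isStaleByAcquiredAt(lock: LockInfo, now: number): boolean {
  return now - lock.acquired_at > STALE_THRESHOLD_MS;
}

// A healthy long-lived holder that acquired the lock at t = 0.
const serveLock: LockInfo = { pid: 1234, acquired_at: 0 };

const staleAtFourMinutes = isStaleByAcquiredAt(serveLock, 4 * 60 * 1000); // false
const staleAtSixMinutes = isStaleByAcquiredAt(serveLock, 6 * 60 * 1000); // true
```

Because nothing updates `acquired_at`, the second result holds for every holder that lives longer than the threshold, however healthy it is.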
Consequence
gbrain serve connects the engine at startup and holds it for the process lifetime (cli.ts "serve doesn't disconnect"). With per-client stdio MCP registration (Claude Desktop + every Claude Code/Codex session spawns its own serve), serve #2 starting >5 min after serve #1 steals the lock, and so on. Observed live: 4 concurrent gbrain serve processes each holding 40+ read-write FDs inside the same brain.pglite data dir (two sharing the same WAL segment), with no .gbrain-lock dir present at all — exactly the concurrent access the module's own header says PGLite cannot survive (Aborted()). We experienced a WAL/checkpoint corruption that required a full rename-aside + reimport, plus recurring "PGLite lock contention persisted after 3 attempts" failures in scheduled sync runs.
Suggested fix
Mirror the design already present in src/core/db-lock.ts (last_refreshed_at, lines ~154-160):
- Heartbeat: refresh a timestamp in the lock file every ~60s (unref'd interval) while the engine is connected.
- Steal only when the holder PID is dead OR the heartbeat (not acquired_at) is stale.
- releaseLock should verify the lock file's pid === process.pid before rmSync.
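The three changes could look roughly like the sketch below. It is written under stated assumptions rather than taken from db-lock.ts or pglite-lock.ts: the JSON lock-file layout, the helper names (`isPidAlive`, `shouldSteal`, `readLock`, `startHeartbeat`), the `releaseLock` signature, and the 60s interval constant are all hypothetical, and the lock paths are passed in because the report does not name the file inside the lock dir.

```typescript
import * as fs from "node:fs";

// Assumed constants; only STALE_THRESHOLD_MS is named in the report.
const STALE_THRESHOLD_MS = 5 * 60 * 1000;
const HEARTBEAT_INTERVAL_MS = 60 * 1000;

// Assumed lock-file layout (JSON); last_refreshed_at mirrors db-lock.ts.
interface LockInfo {
  pid: number;
  acquired_at: number;
  last_refreshed_at: number;
}

// Signal 0 delivers nothing; it only tests whether the PID exists.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but is owned by another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Steal only when the holder is dead or its heartbeat (not acquired_at) is stale.
function shouldSteal(lock: LockInfo, now: number, holderAlive: boolean): boolean {
  return !holderAlive || now - lock.last_refreshed_at > STALE_THRESHOLD_MS;
}

function readLock(lockFile: string): LockInfo | null {
  try {
    return JSON.parse(fs.readFileSync(lockFile, "utf8")) as LockInfo;
  } catch {
    return null; // missing or unreadable lock file
  }
}

// Refresh last_refreshed_at while the engine is connected. unref() keeps
// the timer from holding the process open on its own.
function startHeartbeat(lockFile: string): ReturnType<typeof setInterval> {
  const timer = setInterval(() => {
    const lock = readLock(lockFile);
    if (lock !== null && lock.pid === process.pid) {
      lock.last_refreshed_at = Date.now();
      fs.writeFileSync(lockFile, JSON.stringify(lock));
    }
  }, HEARTBEAT_INTERVAL_MS);
  timer.unref();
  return timer;
}

// Remove the lock dir only when this process owns it; returns whether it did.
function releaseLock(lockDir: string, lockFile: string): boolean {
  const lock = readLock(lockFile);
  if (lock === null || lock.pid !== process.pid) return false;
  fs.rmSync(lockDir, { recursive: true, force: true });
  return true;
}
```

In this sketch the acquire loop would call `shouldSteal(lock, Date.now(), isPidAlive(lock.pid))` where it currently compares against the lock's age. The read-then-write in the heartbeat and the read-then-remove in `releaseLock` are not atomic, so a real implementation would still need to consider the window between the check and the filesystem operation.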
Note the behavior change: a second stdio serve then fails fast with the actionable 30s timeout instead of silently sharing the data dir — correct for a single-writer database, and it makes the documented serve --http single-daemon mode the explicit multi-client answer.