pglite-lock force-removes locks of LIVE processes after 5 minutes (no heartbeat) — concurrent serve processes share the single-writer data dir; WAL corruption observed #2058

@na-dev-12

Found during a deep audit of gbrain v0.42.26.0 after a real PGLite WAL corruption event.

Mechanics (src/core/pglite-lock.ts)

  • STALE_THRESHOLD_MS = 5 min (line ~22); the acquire loop force-removes a lock whose holder PID is alive once the lock is older than 5 minutes ("Still alive but probably stuck — force remove", lines ~82-85).
  • acquired_at is written once at acquisition (lines ~102-106) and never refreshed — there is no heartbeat, so every healthy long-lived holder is classified stale after 5 minutes.
  • releaseLock (lines ~140-148) calls rmSync on the shared lock dir without checking ownership, so a short-lived CLI process that finishes deletes the marker of whichever process currently holds the lock.
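
To make the first two points concrete, here is a minimal model of the decision as described above. It is not the actual pglite-lock.ts source: the names `LockInfo`, `currentDecision`, and `holderAlive` are hypothetical, and the dead-holder branch is an assumption. Only the 5-minute threshold and the "age measured from acquired_at" behavior come from the audit.

```typescript
// Hypothetical model of the staleness check described in this issue.
// Names are illustrative; only the 5-minute threshold is taken from the source.
const STALE_THRESHOLD_MS = 5 * 60 * 1000;

interface LockInfo {
  pid: number;
  acquiredAt: number; // epoch ms, written once at acquisition and never refreshed
}

type Decision = "wait" | "force-remove";

function currentDecision(lock: LockInfo, holderAlive: boolean, now: number): Decision {
  // Assumption: a lock whose holder is dead is removed immediately.
  if (!holderAlive) return "force-remove";
  // Age is measured from acquisition, so it only ever grows while the holder
  // lives. A healthy long-lived holder crosses the threshold eventually.
  return now - lock.acquiredAt > STALE_THRESHOLD_MS ? "force-remove" : "wait";
}

const t0 = 0;
const serveLock: LockInfo = { pid: 1234, acquiredAt: t0 };

// The same live holder, checked by a second process at two points in time.
const atFourMinutes = currentDecision(serveLock, true, t0 + 4 * 60 * 1000);
const atSixMinutes = currentDecision(serveLock, true, t0 + 6 * 60 * 1000);

console.log(atFourMinutes, atSixMinutes); // wait force-remove
```

Nothing about the holder changes between the two checks; only wall-clock time does. That is why a long-running serve is indistinguishable from a stuck process under this rule.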

Consequence

gbrain serve connects the engine at startup and holds it for the process lifetime (cli.ts: "serve doesn't disconnect"). With per-client stdio MCP registration, Claude Desktop and every Claude Code/Codex session each spawn their own serve, so a serve #2 that starts more than 5 minutes after serve #1 steals the lock, serve #3 steals it from #2, and so on.

Observed live: 4 concurrent gbrain serve processes, each holding 40+ read-write FDs inside the same brain.pglite data dir (two of them sharing the same WAL segment), with no .gbrain-lock dir present at all. That is exactly the concurrent access the module's own header says PGLite cannot survive (Aborted()). We experienced a WAL/checkpoint corruption that required a full rename-aside and reimport, plus recurring "PGLite lock contention persisted after 3 attempts" failures in scheduled sync runs.

Suggested fix

Mirror the design already present in src/core/db-lock.ts (last_refreshed_at, lines ~154-160):

  1. Heartbeat: refresh a timestamp in the lock file every ~60s (unref'd interval) while the engine is connected.
  2. Steal only when the holder PID is dead OR the heartbeat (not acquired_at) is stale.
  3. releaseLock should verify the lock file's pid === process.pid before rmSync.
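
The three steps above could be sketched roughly as follows. This is an illustration under assumptions, not a patch: the `lock.json` file name, the `LockFile` shape, and the helpers `isAlive`, `readLock`, `writeLock`, `canSteal`, and `startHeartbeat` are all hypothetical, and the real module's layout may differ. Only `last_refreshed_at`, the ~60s interval, the 5-minute threshold, and the `pid === process.pid` check come from the suggestion itself.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const HEARTBEAT_MS = 60_000;
const STALE_THRESHOLD_MS = 5 * 60_000;

// Hypothetical on-disk shape; the real lock file may carry other fields.
interface LockFile { pid: number; acquired_at: number; last_refreshed_at: number }

// Signal 0 performs an existence check only; EPERM means "exists, not ours".
function isAlive(pid: number): boolean {
  try { process.kill(pid, 0); return true; }
  catch (err) { return (err as NodeJS.ErrnoException).code === "EPERM"; }
}

function readLock(lockDir: string): LockFile | null {
  try { return JSON.parse(readFileSync(join(lockDir, "lock.json"), "utf8")) as LockFile; }
  catch { return null; }
}

function writeLock(lockDir: string, lock: LockFile): void {
  writeFileSync(join(lockDir, "lock.json"), JSON.stringify(lock));
}

// Step 2: steal only if the holder is dead OR its heartbeat (not acquired_at) is stale.
function canSteal(lock: LockFile, now: number = Date.now()): boolean {
  return !isAlive(lock.pid) || now - lock.last_refreshed_at > STALE_THRESHOLD_MS;
}

// Step 1: refresh the timestamp while connected; unref() keeps the timer
// from holding the process open on its own.
function startHeartbeat(lockDir: string): NodeJS.Timeout {
  const timer = setInterval(() => {
    const lock = readLock(lockDir);
    if (lock !== null && lock.pid === process.pid) {
      writeLock(lockDir, { ...lock, last_refreshed_at: Date.now() });
    }
  }, HEARTBEAT_MS);
  timer.unref();
  return timer;
}

// Step 3: remove the lock dir only when this process owns it.
function releaseLock(lockDir: string): boolean {
  const lock = readLock(lockDir);
  if (lock === null || lock.pid !== process.pid) return false;
  rmSync(lockDir, { recursive: true, force: true });
  return true;
}

// Demo in a throwaway temp dir.
const lockDir = join(mkdtempSync(join(tmpdir(), "lock-demo-")), ".gbrain-lock");
mkdirSync(lockDir);
const now = Date.now();
const old = now - 10 * 60_000; // acquired 10 minutes ago

const stealFresh = canSteal({ pid: process.pid, acquired_at: old, last_refreshed_at: now }, now);
const stealStale = canSteal({ pid: process.pid, acquired_at: old, last_refreshed_at: now - 6 * 60_000 }, now);

writeLock(lockDir, { pid: process.pid + 1, acquired_at: old, last_refreshed_at: now });
const releasedForeign = releaseLock(lockDir); // someone else's marker: left alone

writeLock(lockDir, { pid: process.pid, acquired_at: old, last_refreshed_at: now });
const heartbeat = startHeartbeat(lockDir);
clearInterval(heartbeat); // stop refreshing before releasing
const releasedOwn = releaseLock(lockDir);

console.log({ stealFresh, stealStale, releasedForeign, releasedOwn });
```

The read-then-remove in `releaseLock` is not atomic, so a real implementation would still have a small window between the ownership check and the `rmSync`; the sketch only shows the ownership rule, not a race-free protocol.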

Note the behavior change: a second stdio serve then fails fast with the actionable 30s timeout instead of silently sharing the data dir — correct for a single-writer database, and it makes the documented serve --http single-daemon mode the explicit multi-client answer.
