fix(pglite): fail closed on stale live lock heartbeat (#2058) #2154

Open
venturejakef wants to merge 1 commit into garrytan:master from venturejakef:codex/pglite-lock-live-pid-fail-closed

Closes #2058

Summary

v0.42.41.0 fixed the original age-only PGLite lock steal by adding a heartbeat and ownership token. One unsafe edge remains: a same-host process whose JS heartbeat is stale can still have the .gbrain-lock directory force-removed while its PID is alive.

For embedded PGLite, a stale JS timer is not proof the WASM Postgres process has closed its files. A paused/starved holder can still have the data directory open, so stealing from a live PID can recreate the single-writer/WAL corruption class this lock exists to prevent.

The fix

This keeps the current heartbeat and owner-token design, but changes stale-heartbeat handling to fail closed:

  • PID probe now returns alive | dead | unknown instead of collapsing all process.kill(pid, 0) errors to dead.
  • ESRCH is the only automatic dead-process reclaim path.
  • EPERM, unknown probe errors, and live PIDs are treated as unsafe to steal.
  • A stale heartbeat on a live/unprobeable PID now waits and eventually times out with the existing holder diagnostic instead of deleting the lock.
  • Dead PID cleanup and owner-checked heartbeat/release behavior remain intact.
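The three-way probe described in the list above could look roughly like the following. This is a minimal sketch, not the code in src/core/pglite-lock.ts: the names `probePid`, `mayReclaim`, and the injectable `kill` parameter are hypothetical, and mapping EPERM to `unknown` (rather than `alive`) is an assumption. The PR only states that EPERM is treated as unsafe to steal.

```typescript
// Three-way probe result, as named in the PR description.
type PidProbe = "alive" | "dead" | "unknown";

// Hypothetical seam so the probe can be exercised without real processes.
type KillFn = (pid: number, signal: number) => unknown;

function probePid(
  pid: number,
  kill: KillFn = (p, s) => process.kill(p, s),
): PidProbe {
  try {
    // Signal 0 delivers nothing; it only checks existence and permission.
    kill(pid, 0);
    return "alive";
  } catch (err) {
    const code = (err as { code?: string } | null)?.code;
    // ESRCH ("no such process") is the only result that proves the holder is gone.
    if (code === "ESRCH") return "dead";
    // Assumption: EPERM and any other error are reported as "unknown".
    return "unknown";
  }
}

// Fail closed: only a proven-dead holder may be reclaimed automatically.
function mayReclaim(probe: PidProbe): boolean {
  return probe === "dead";
}
```

Collapsing every error to `dead` is what made the earlier behavior unsafe: an EPERM from a process owned by another user would have looked identical to a missing process.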

Scope decisions

  • Only touches src/core/pglite-lock.ts and test/pglite-lock.test.ts.
  • No release files, no CHANGELOG, no docs/llms churn, no CLI surface.
  • No replacement of the merged #2058 implementation. This is a narrow follow-up on top of it.
  • No new public command or config. GBRAIN_PGLITE_LOCK_STEAL_GRACE_SECONDS remains a stale-heartbeat diagnostic threshold, but no longer authorizes stealing from a live PID.
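To illustrate the last point, here is a hedged sketch of how a grace threshold can feed diagnostics without ever authorizing a steal. The names `graceMsFromEnv`, `decide`, and `LockDecision` are hypothetical, the strict `>` comparison is an assumption, and the fallback is a caller-supplied parameter because this PR does not state the real default.

```typescript
type PidProbe = "alive" | "dead" | "unknown";

interface LockDecision {
  action: "reclaim" | "wait";
  // Diagnostic only: reported to the waiter, never used to authorize a steal.
  heartbeatStale: boolean;
}

// Hypothetical parser for GBRAIN_PGLITE_LOCK_STEAL_GRACE_SECONDS.
// Invalid or missing values fall back to the caller-supplied number of seconds.
function graceMsFromEnv(raw: string | undefined, fallbackSeconds: number): number {
  const seconds = Number(raw);
  const valid =
    raw !== undefined && raw.trim() !== "" && Number.isFinite(seconds) && seconds >= 0;
  return (valid ? seconds : fallbackSeconds) * 1000;
}

function decide(probe: PidProbe, heartbeatAgeMs: number, graceMs: number): LockDecision {
  const heartbeatStale = heartbeatAgeMs > graceMs;
  // The action depends on the PID probe alone; staleness never flips it to "reclaim".
  return { action: probe === "dead" ? "reclaim" : "wait", heartbeatStale };
}
```

Under this shape, lowering the grace value only changes when `heartbeatStale` becomes true; a live or unprobeable holder still yields `wait` until the caller's own timeout expires.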

Test Coverage

test/pglite-lock.test.ts now pins the remaining unsafe case:

  • live PID + fresh heartbeat is not stolen
  • live PID + stale heartbeat is not stolen
  • lowering GBRAIN_PGLITE_LOCK_STEAL_GRACE_SECONDS does not make a live stale-heartbeat holder stealable
  • unknown PID probe result is not stolen
  • dead PID is still cleaned up by the existing stale-dead-process test
  • owner-checked release still does not remove a replacement owner lock
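The owner-checked release case in the last bullet can be pictured with the sketch below. It is an assumption-laden illustration rather than the repository's implementation: the `owner.json` file name, the `LockOwner` shape, and the functions `writeLock` and `releaseIfOwner` are all hypothetical. Only the `.gbrain-lock` directory name comes from the PR text.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical on-disk layout: <lockDir>/owner.json holding { pid, token }.
interface LockOwner {
  pid: number;
  token: string;
}

function writeLock(lockDir: string, owner: LockOwner): void {
  mkdirSync(lockDir, { recursive: true });
  writeFileSync(join(lockDir, "owner.json"), JSON.stringify(owner));
}

// Remove the lock only if the token on disk is still ours.
function releaseIfOwner(lockDir: string, myToken: string): boolean {
  let current: LockOwner;
  try {
    current = JSON.parse(readFileSync(join(lockDir, "owner.json"), "utf8")) as LockOwner;
  } catch {
    return false; // lock missing or unreadable: nothing of ours to remove
  }
  if (current.token !== myToken) return false; // a replacement owner holds it now
  rmSync(lockDir, { recursive: true, force: true });
  return true;
}

// Usage: a former owner must not delete the replacement owner's lock.
const root = mkdtempSync(join(tmpdir(), "lock-demo-"));
const lockDir = join(root, ".gbrain-lock");
writeLock(lockDir, { pid: process.pid, token: "new-owner" });
const removedByOld = releaseIfOwner(lockDir, "old-owner");
const lockSurvived = existsSync(lockDir);
const removedByNew = releaseIfOwner(lockDir, "new-owner");
const lockGone = !existsSync(lockDir);
rmSync(root, { recursive: true, force: true });
```

A check-then-remove like this still leaves a small window between reading the token and deleting the directory, so it narrows the replacement-owner hazard rather than eliminating it.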

Verification Results

Targeted and type checks:

bun test test/pglite-lock.test.ts
# 13 pass / 0 fail

bun run typecheck
# tsc --noEmit

bun run verify caveat: in this Codex shell, the parent verify wrapper is SIGTERM'd at ~30s, which marks every parallel child as 143 even though no individual check failed. To avoid reporting a false failure as a repo failure, I ran every check listed by scripts/run-verify-parallel.sh --dry-list in smaller batches. All 30 constituent checks passed, including privacy, JSONB, source-id projection, test-isolation, WASM, admin build, resolver, doc-history, worker-lock-renewal-shape, and typecheck.

Test plan

  • bun test test/pglite-lock.test.ts
  • bun run typecheck
  • All 30 bun run verify constituent checks, run in batches because the wrapper is killed by this shell at ~30s
