fix(pglite): fail closed on stale live lock heartbeat (#2058)#2154
Open
venturejakef wants to merge 1 commit into
Open
fix(pglite): fail closed on stale live lock heartbeat (#2058)#2154venturejakef wants to merge 1 commit into
venturejakef wants to merge 1 commit into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #2058
Summary
v0.42.41.0fixed the original age-only PGLite lock steal by adding a heartbeat and ownership token. One unsafe edge remains: a same-host process whose JS heartbeat is stale can still have the.gbrain-lockdirectory force-removed while its PID is alive.For embedded PGLite, a stale JS timer is not proof the WASM Postgres process has closed its files. A paused/starved holder can still have the data directory open, so stealing from a live PID can recreate the single-writer/WAL corruption class this lock exists to prevent.
The fix
This keeps the current heartbeat and owner-token design, but changes stale-heartbeat handling to fail closed:
alive | dead | unknowninstead of collapsing allprocess.kill(pid, 0)errors to dead.ESRCHis the only automatic dead-process reclaim path.EPERM, unknown probe errors, and live PIDs are treated as unsafe to steal.Scope decisions
src/core/pglite-lock.tsandtest/pglite-lock.test.ts.#2058implementation. This is a narrow follow-up on top of it.GBRAIN_PGLITE_LOCK_STEAL_GRACE_SECONDSremains a stale-heartbeat diagnostic threshold, but no longer authorizes stealing from a live PID.Test Coverage
test/pglite-lock.test.tsnow pins the remaining unsafe case:GBRAIN_PGLITE_LOCK_STEAL_GRACE_SECONDSdoes not make a live stale-heartbeat holder stealableVerification Results
Targeted and type checks:
bun run verifycaveat: in this Codex shell, the parent verify wrapper is SIGTERM'd at ~30s, which marks every parallel child as143even though no individual check failed. To avoid reporting a false failure as a repo failure, I ran every check listed byscripts/run-verify-parallel.sh --dry-listin smaller batches. All 30 constituent checks passed, including privacy, JSONB, source-id projection, test-isolation, WASM, admin build, resolver, doc-history, worker-lock-renewal-shape, and typecheck.Test plan
bun test test/pglite-lock.test.tsbun run typecheckbun run verifyconstituent checks, run in batches because the wrapper is killed by this shell at ~30s