v0.41.8.0 fix(pglite): search/query/get exit cleanly + #1340 hint + #1342 breadcrumbs #1405
Merged
…sconnect
Closes the structural bug class behind #1247, #1269, #1290: PGLite CLI search/query/get_page commands printed results then hung at ~95-98% CPU until SIGKILL. Root cause: bumpLastRetrievedAt's IIFE races engine.disconnect(), and PGLite's WASM runtime keeps Bun's event loop alive while the dangling UPDATE settles.
Mirrors the existing awaitPendingSearchCacheWrites precedent landed in v0.36.1.x for #1090. Tracks every IIFE promise in a module-scoped Set, exposes awaitPendingLastRetrievedWrites(timeoutMs) that resolves once all settle. Bounded with a 5s default timeout via Promise.race so a future fire-and-forget that hangs forever can't recreate the bug class at this layer. Instead, the drain stderr-warns with a pending count and returns a timeout outcome so the caller can decide its fallback.
Test coverage: 6 unit cases covering empty drain, single + multi-pending settle, throw-in-IIFE still settles, permanently-pending hits timeout within bound, empty pageIds does not track.
This commit ships the helper + tracking + tests with NO consumer. The cli.ts wiring lands in a follow-up commit (atomic bisect units).
Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
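A minimal sketch of the tracking Set plus bounded drain described in this commit message, not the actual last-retrieved.ts source. Only `awaitPendingLastRetrievedWrites` and the 5s default come from the message; `trackWrite`, `pendingWrites`, the `'drained'` outcome label, and the warning text are assumptions made for illustration.

```typescript
// Hypothetical shape: the message only confirms a 'timeout' outcome and a pending count.
type DrainOutcome = { outcome: "drained" | "timeout"; pending: number };

// Module-scoped Set holding every in-flight fire-and-forget write.
const pendingWrites = new Set<Promise<void>>();

// Hypothetical helper: start a best-effort write and track it until it settles.
function trackWrite(work: () => Promise<void>): void {
  const tracked: Promise<void> = work().then(
    () => { pendingWrites.delete(tracked); },
    () => { pendingWrites.delete(tracked); }, // a failed write still settles and is untracked
  );
  pendingWrites.add(tracked);
}

// Resolve once every tracked write settles, or after timeoutMs, whichever comes first.
async function awaitPendingLastRetrievedWrites(timeoutMs: number = 5000): Promise<DrainOutcome> {
  const snapshot = Array.from(pendingWrites);
  if (snapshot.length === 0) return { outcome: "drained", pending: 0 };

  const settled: Promise<"drained"> = Promise.all(snapshot).then(() => "drained" as const);
  const timedOut = new Promise<"timeout">((resolve) => {
    const timer = setTimeout(() => resolve("timeout"), timeoutMs);
    // Cancel the timer on a clean drain so it cannot keep the event loop alive.
    void settled.then(() => clearTimeout(timer));
  });

  const outcome = await Promise.race([settled, timedOut]);
  if (outcome === "timeout") {
    console.error(`[drain] timed out with ${pendingWrites.size} pending write(s)`);
  }
  return { outcome, pending: pendingWrites.size };
}
```

Returning the outcome instead of exiting keeps the helper free of policy: the caller decides what a timeout means.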
…uard
Refactor PGLiteEngine.disconnect() with two structural fixes:
(1) Snapshot + early-null pattern: capture db/lock refs and null the
instance fields BEFORE any await. A concurrent connect() can no
longer observe `_db` pointing at a handle that's mid-close. This
is PR #1337's load-bearing contribution that we DID take.
(2) Wrap close + release in try/finally. Without this guard, a thrown
db.close() would leak the file lock and wedge every next gbrain
invocation on the stale lock. Codex outside-voice review (eng
review finding #7) caught this gap when reviewing the snapshot
refactor.
KEEP the original close-then-release order. PR #1337's diff swapped
this to release-then-close, which we explicitly REJECTED — releasing
the lock before close lets a sibling process try to connect to a
still-closing brain. The new lifecycle test file pins this ordering
so a future maintainer reading PR #1337's diff cannot accidentally
flip it.
Test coverage in test/pglite-engine-disconnect.serial.test.ts: 5
cases — close-before-release ordering, early-null observable inside
close, lock-still-releases on close-throw, double-disconnect
idempotency, reconnect-after-disconnect clean state. `.serial`
because each test creates a fresh PGLite engine (WASM cold-start
cost) — running in parallel shards would starve other tests.
Existing test/pglite-engine.test.ts: 100/100 still green.
Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
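A minimal sketch of the snapshot + early-null pattern and the try/finally guard described above, assuming a stand-in class rather than the real PGLiteEngine. `EngineSketch`, `Closable`, and `Releasable` are hypothetical names; only the close-then-release ordering and the "null before any await" rule come from the commit message.

```typescript
// Hypothetical stand-ins for the database handle and the file lock.
interface Closable { close(): Promise<void>; }
interface Releasable { release(): Promise<void>; }

class EngineSketch {
  private db: Closable | null;
  private lock: Releasable | null;

  constructor(db: Closable, lock: Releasable) {
    this.db = db;
    this.lock = lock;
  }

  get connected(): boolean {
    return this.db !== null;
  }

  async disconnect(): Promise<void> {
    // Snapshot the refs and null the fields BEFORE any await, so a concurrent
    // caller never observes a handle that is mid-close.
    const db = this.db;
    const lock = this.lock;
    this.db = null;
    this.lock = null;

    try {
      if (db) await db.close(); // close first...
    } finally {
      if (lock) await lock.release(); // ...then release, even if close() threw
    }
  }
}
```

Because the fields are nulled up front, a second `disconnect()` call finds nothing to close or release, which is one way to get the double-disconnect idempotency the tests pin.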
…1340)
Closes the user-facing half of #1340: on macOS 12.7.6 + Bun 1.3.14, the PGLite connect() catch block hardcoded the macOS 26.3 hint (#223). The actual root cause for #1340 is Bun's vfs: `/$$bunfs/root` is read-only on older macOS, so PGLite cannot extract its pglite.data WASM payload.
Adds two exported helpers in pglite-engine.ts:
classifyPgliteInitError(message): 'bunfs' | 'macos-26-3' | 'unknown'
buildPgliteInitErrorMessage(verdict, original): string
Connect catch block now routes the hint by verdict. The bunfs hint names `bun upgrade` + Node fallback. The macOS 26.3 hint keeps the existing #223 link. Unknown falls through to a generic doctor + #223 fallback.
Per Codex eng-review finding #9, the bunfs regex is tightened to match either the literal `$$bunfs` marker OR ENOENT+pglite.data co-occurrence, NOT a generic `pglite.data` substring (which would fire on unrelated errors). Negative test pinned.
Root fix is upstream Bun; this PR just stops misclassifying the failure class so support traffic doesn't conflate two unrelated bugs.
Test coverage: 12 pure-function unit cases including the #1340 reporter's exact error string round-trip, the negative case Codex caught, and all three verdicts × all three message contents.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
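The tightened bunfs match can be illustrated with a small predicate. This is a sketch under assumptions: `looksLikeBunfsInitError` is a hypothetical name, it uses plain substring checks rather than the real regex, and it covers only the bunfs verdict because the message does not say how the macOS 26.3 case is detected.

```typescript
// Hypothetical predicate mirroring the rule in the commit message:
// match the literal `$$bunfs` marker, OR ENOENT together with pglite.data,
// but never a bare `pglite.data` mention on its own.
function looksLikeBunfsInitError(message: string): boolean {
  const hasBunfsMarker = message.includes("$$bunfs");
  const enoentOnPayload = message.includes("ENOENT") && message.includes("pglite.data");
  return hasBunfsMarker || enoentOnPayload;
}

// Illustrative inputs (made up for this sketch, not the reporter's actual error string).
const samples: Array<[string, boolean]> = [
  ["EROFS: read-only file system, open '/$$bunfs/root/pglite.data'", true],
  ["ENOENT: no such file or directory, open 'pglite.data'", true],
  ["failed to parse pglite.data header", false], // the negative case: substring alone must not match
];

for (const [message, expected] of samples) {
  console.log(looksLikeBunfsInitError(message) === expected ? "ok" : "mismatch", "-", message);
}
```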
…#1247, #1269, #1290)
Wires the v0.40.10.0 drain helper into cli.ts and adds the IRON-RULE behavioral regression test for the search-hang class. The drain is called unconditionally for every op (not per-op-name gated; that was the original PR #1259 mistake that left search and get_page exposed).
The narrow force-exit synthesis (decision D7 from the eng review, informed by Codex outside-voice findings #1+#2+#8): when the drain returns outcome:'timeout', AFTER engine.disconnect() resolves AND the command is NOT a daemon, fire process.exit(0). The drain helper already stderr-warned with the pending count, so the diagnostic signal is preserved. Without this guard, a hung underlying promise could still keep Bun's event loop alive past disconnect.
CRITICALLY narrower than PR #1337's blanket force-exit: the timeout path is the only trigger. In the common case (drain settles cleanly under 5s), no force-exit fires and the behavioral subprocess test still catches future regressions. The shouldForceExitAfterMain guard excludes 'serve' so the stdio + HTTP daemons stay alive past main().
e2e/pglite-cli-exit.serial.test.ts (NEW, IRON RULE):
- gbrain search "foxtrot" → exits 0 within 15s
- gbrain get alpha → exits 0 within 15s with foxtrot in stdout
- gbrain query "foxtrot" --no-expand → exits within 15s (no-API-key graceful)
- gbrain serve --http → stays alive 3+ seconds (daemon-survival regression guard)
fix-wave-structural.test.ts:
- import assertion for awaitPendingLastRetrievedWrites
- last-retrieved.ts exports + Set tracking + Promise.race + timeout
- BEHAVIORAL positioning assertion: drain `await` appears textually BEFORE engine.disconnect `await` in the op-dispatch local-engine path. Survives variable-rename refactors; catches any new disconnect path that bypasses the drain.
- shouldForceExitAfterMain excludes 'serve' AND the gate is conditioned on drainResult.outcome==='timeout'
Per D8 (Codex finding #5), explicitly do NOT add a drift-guard counting bumpLastRetrievedAt callers: it would block harmless refactors and miss aliases.
Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com>
Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
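A minimal sketch of the ordering and the timeout-only force-exit gate described above, with every dependency injected so it runs without PGLite. `dispatchOp` and the `OpDeps` shape are hypothetical, the `'drained'` label is an assumption (the message only names `'timeout'`), and placing the drain in `finally` is a simplification rather than a claim about where cli.ts puts it.

```typescript
type DrainOutcome = { outcome: "drained" | "timeout"; pending: number };

// Hypothetical dependency bundle so the control flow can be exercised in isolation.
interface OpDeps {
  runOp: () => Promise<void>;
  drain: () => Promise<DrainOutcome>;
  disconnect: () => Promise<void>;
  isDaemon: boolean;
  exit: (code: number) => void;
}

async function dispatchOp(deps: OpDeps): Promise<void> {
  let drainResult: DrainOutcome = { outcome: "drained", pending: 0 };
  try {
    await deps.runOp();
  } finally {
    // Drain pending fire-and-forget writes BEFORE disconnecting, for every op.
    drainResult = await deps.drain();
    await deps.disconnect();
  }
  // Narrow gate: force-exit only when the drain timed out and the command is not a daemon.
  if (drainResult.outcome === "timeout" && !deps.isDaemon) {
    deps.exit(0);
  }
}
```

In the common case the gate never fires, so a regression that leaves the event loop alive would still show up in a subprocess exit test instead of being masked by a blanket exit.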
The #1342 reporter saw ZERO stderr output before their PGLite sync hang, which made the bug impossible to triage from a community report alone. Mirrors the pre-existing `[gbrain phase] sync.git_pull start/done` pattern at the major pre-pull phase boundaries so the next #1342-shaped report names WHICH phase spun.
Four new breadcrumbs at:
- sync.resolve_repo (top of performSyncInner)
- sync.load_active_pack (before the v0.39 T1.5 pack load)
- sync.validate_repo_state (only when opts.sourceId is set, i.e. the re-clone branch)
- sync.detect_head (before the isDetachedHead probe)
No behavior change: pure stderr instrumentation. Doesn't fix #1342 (which still needs investigation per the TODOS entry filed in this wave), but converts "hung with no output" into actionable diagnostic data the next time the bug shape is reported. Per D9 in the eng review + Codex outside-voice finding #14.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
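One way such breadcrumbs could be emitted is a small wrapper that logs before and after each phase. This is a sketch only: `withPhase` is a hypothetical helper, and the real sync.ts may simply write the lines inline. The `[gbrain phase] <name> start/done` format is taken from the message above.

```typescript
// Hypothetical wrapper: emit a start line, run the phase, emit a done line.
// If the phase hangs, the last line on stderr is "<name> start", naming the phase that spun.
async function withPhase<T>(
  name: string,
  work: () => Promise<T>,
  log: (line: string) => void = console.error,
): Promise<T> {
  log(`[gbrain phase] ${name} start`);
  const result = await work();
  log(`[gbrain phase] ${name} done`);
  return result;
}

// Usage with a stand-in phase body (the real phase work is not shown in the source).
void withPhase("sync.resolve_repo", async () => "repo-path-placeholder");
```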
Key Files entries updated:
- src/core/pglite-engine.ts: documents the v0.40.10.0 disconnect refactor (snapshot+early-null + try/finally lock-leak guard, KEEPS close-then-release order), and the new classifyPgliteInitError / buildPgliteInitErrorMessage helpers for #1340 hint routing. Pins PR #1337's accepted-but-narrowed contribution and the rejected release-then-close ordering swap.
- src/core/last-retrieved.ts (within the brainstorm entry): documents the new awaitPendingLastRetrievedWrites drain, the Set tracking pattern, the 5s bounded timeout, the cli.ts narrow timeout-only force-exit synthesis with the serve-daemon guard, and the three community-validated reports (#1247/#1269/#1290) the fix closes. Credits PR #1259 (drain pattern) and PR #1337 (snapshot pattern + force-exit guard idea).
Regenerated llms.txt + llms-full.txt; build-llms.test.ts gates the drift, all 7 cases green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three deferred items from the v0.40.10.0 fix wave:
1. #1342 sync-hang investigation. Single-reporter, JS-tight-loop shape, needs reproducer before any fix. Documents the ruled-out hypotheses (lock-refresh heartbeat, v91 trigger, while-true loops) and three concrete diagnostic next steps. The v0.40.10.0 sync phase breadcrumbs make the next report actionable.
2. awaitPendingSearchCacheWrites timeout-symmetry retrofit. The #1090 drain shipped without a timeout; the v0.40.10.0 #1247 drain ships with one. Apply the same Promise.race + stderr warn pattern for symmetry.
3. Drain-helper extraction. Per D4 in the eng review: two surfaces is the threshold for noticing, three for extracting. Pair with the symmetry retrofit above as one focused refactor when a third fire-and-forget surface appears.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…1342 breadcrumbs Closes #1247, #1269, #1290 (PGLite CLI search/query/get hang at ~95-98% CPU after printing results — three community-validated reports). Also fixes #1340 (WASM init misroutes to macOS 26.3 hint when real cause is Bun vfs read-only mount) and adds diagnostic phase breadcrumbs for the single-reporter #1342 sync-hang investigation. Core fix: track every fire-and-forget bumpLastRetrievedAt IIFE in a module-scoped Set; cli.ts awaits the drain before engine.disconnect() in the op-dispatch finally block; narrow process.exit(0) fires ONLY when the drain times out AND the command isn't a daemon. Snapshot+ early-null disconnect pattern + try/finally lock-leak guard close the partial-state race PR #1337 originally surfaced. Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com> Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ases Gap-audit follow-up: cli.ts is a script entrypoint (top-level main() side effect), so importing it from a test fires the help output as a side effect. Move shouldForceExitAfterMain into src/core/cli-force-exit.ts so it can be unit-tested in isolation without the cli.ts script tail running. Adds test/cli-should-force-exit.test.ts (9 cases): bare serve, serve with flags after, global flags BEFORE the command (the load-bearing case for `gbrain --quiet serve`), op commands return true, non-daemon CLI commands return true, empty argv defaults to true, flag-only argv, default-arg fallback to process.argv.slice(2), substring-match avoidance (`serves` is NOT `serve` — strict equality via Set, not startsWith/includes). The daemon command set is now an explicit ReadonlySet — future daemons (a hypothetical `gbrain watch` or `gbrain daemon`) just add their name to DAEMON_COMMANDS rather than chaining ||. Updates fix-wave-structural.test.ts to look for the import + the new DAEMON_COMMANDS shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
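A minimal sketch of a Set-based daemon guard in the spirit of the extracted module, not a copy of src/core/cli-force-exit.ts. Assumptions: the command is taken to be the first argument that does not start with `-` (so a global flag that takes a separate value would be misread), and the `process.argv.slice(2)` default mentioned above is left out to keep the snippet runtime-neutral.

```typescript
// Daemon commands that must stay alive past main(); future daemons would be added here.
const DAEMON_COMMANDS: ReadonlySet<string> = new Set(["serve"]);

function shouldForceExitAfterMain(argv: readonly string[]): boolean {
  // Skip leading global flags such as `--quiet` to find the command token.
  const command = argv.find((arg) => !arg.startsWith("-"));
  // Strict Set membership: `serves` is not `serve`.
  return command === undefined || !DAEMON_COMMANDS.has(command);
}

console.log(shouldForceExitAfterMain(["--quiet", "serve"])); // false
console.log(shouldForceExitAfterMain(["search", "foxtrot"])); // true
console.log(shouldForceExitAfterMain(["serves"])); // true
```

Set membership avoids the substring pitfalls of `startsWith`/`includes` and makes adding a daemon a one-line change.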
…0.41.0.0+ landed)
origin/master moved from v0.40.8.1 → v0.41.0.0 while this wave was in flight (PR #1367 minions cathedral). v0.41.1-v0.41.5 are claimed by other in-flight branches, so v0.41.6.0 is the next available slot.
Bulk-renamed v0.40.10.0 → v0.41.6.0 across:
- VERSION + package.json (trio audit clean: 0.41.6.0 / 0.41.6.0 / 0.41.6.0)
- CHANGELOG.md (header + 3 prose references)
- CLAUDE.md (Key Files annotations)
- TODOS.md (follow-up entry header)
- src/cli.ts + src/core/cli-force-exit.ts + src/core/last-retrieved.ts + src/core/pglite-engine.ts + src/commands/sync.ts (inline comments)
- test/* (describe blocks + test file headers)
- llms-full.txt (regenerated via `bun run build:llms`)
bun.lock unchanged (version-only bump, no dep churn) per Codex #12.
Verify: 52/52 wave tests pass after rename, typecheck clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tart flake)
The full-suite run during the v0.41.6.0 fix wave ship hit a 30s timeout in test/seed-pglite.test.ts under heavy 4-shard parallel contention (4972/4973 passed before SIGKILL). The test passes 11/11 in isolation.
Root cause: each test instantiates a fresh PGLiteEngine (5 instances across the file, one per test) because each case writes to a different mkdtemp-ed dbPath. Under parallel shard load, multiple shards each cold-starting PGLite WASM simultaneously stretches the per-instance init from ~5s to 30s+. The shared-engine pattern (canonical PGLite block in CLAUDE.md R3+R4) doesn't apply here: different dbPaths require different engines.
Fix per CLAUDE.md test-isolation quarantine rules: rename to `.serial.test.ts` so the file runs in the post-parallel serial pass with full WASM init capacity. Same pattern as test/pglite-engine-disconnect.serial.test.ts (added in this wave) and test/brain-registry.serial.test.ts (pre-existing).
Removes test/seed-pglite.test.ts from check-test-isolation.allowlist since the .serial.test.ts rename auto-exempts it from the R3+R4 lint (scan skips *.serial.test.ts). 641 non-serial unit files scanned, lint clean.
Verify:
- bun test test/seed-pglite.serial.test.ts → 11/11 pass in 4.19s
- scripts/check-test-isolation.sh → OK
- bun run verify → all gates pass
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…atch drain, M1 type drift)
Adversarial review + maintainability specialist surfaced four real
issues in the v0.41.8.0 wave. All four are fixed in this commit; a
fifth finding is deferred to TODOS.md as a v0.41+ follow-up (unusual
caller pattern).
**C13 [load-bearing, defense-in-depth for the wave's stated goal]:**
`await engine.disconnect()` inside the op-dispatch finally can ITSELF
hang on PGLite (db.close() racing OS-level FS state). When that
happens, the entire wave's force-exit guard never runs — we recreate
the original hang at a new layer. Fix: install an unref'd setTimeout
hard-exit fallback BEFORE entering the try/catch/finally. The timer
fires after DISCONNECT_HARD_DEADLINE_MS=10s with a stderr warn and
process.exit(0). unref ensures it doesn't keep the loop alive on a
healthy exit. Daemons (`serve`) are excluded by reusing the
shouldForceExitAfterMain guard.
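A minimal sketch of the unref'd hard-exit fallback described in C13, assuming an injected `exit` callback in place of `process.exit` so it can run anywhere. `installHardExitFallback` and the warning text are hypothetical; the 10s `DISCONNECT_HARD_DEADLINE_MS` value comes from the paragraph above.

```typescript
const DISCONNECT_HARD_DEADLINE_MS = 10000;

// Hypothetical helper: arm a deadline BEFORE the try/catch/finally that awaits disconnect.
// Returns a cancel function for callers that finish cleanly.
function installHardExitFallback(
  exit: (code: number) => void,
  deadlineMs: number = DISCONNECT_HARD_DEADLINE_MS,
): () => void {
  const timer = setTimeout(() => {
    console.error(`[cli] disconnect still pending after ${deadlineMs}ms; forcing exit`);
    exit(0);
  }, deadlineMs);
  // In Node and Bun the timer object has unref(); calling it means the timer alone
  // does not keep the event loop alive, so a healthy run exits without waiting.
  (timer as unknown as { unref?: () => void }).unref?.();
  return () => clearTimeout(timer);
}
```

A caller would skip this for daemon commands, reusing the same guard that gates the timeout-only force-exit.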
**C9 [data freshness gap, narrow but real]:**
The drain ran ONLY in the success branch of try. If
`bumpLastRetrievedAt` fired (handler succeeded) but
`JSON.parse(JSON.stringify(...))` or `formatResult` then threw,
process.exit(1) killed the process and the in-flight UPDATE was
discarded. Fix: drain in the catch path too before process.exit(1)
(best-effort, bounded by the drain's own 5s timeout).
**C1 [daemon leak]:**
A timed-out IIFE used to stay in the pending-writes Set forever
because its `.finally` never fires. Long-lived `gbrain serve` would
accumulate references without bound across repeated timeouts. Fix:
explicitly `delete` the snapshot's tracked promises from the Set
after a timeout outcome. The IIFEs keep running (orphaned), but the
Set no longer leaks references. Pinned by a new unit test that
asserts the second drain after a timeout returns immediately with
empty pending count.
**M1 [silent type drift]:**
`cli.ts` duplicated the `{outcome, pending}` literal shape instead of
importing the `DrainOutcome` type that `last-retrieved.ts` exports
exactly for this purpose. Two-line fix: add `type DrainOutcome` to
the import and use it for `let drainResult`. Future changes to the
return shape now propagate through TypeScript.
**Deferred to TODOS.md (C6 — unusual caller pattern):**
Concurrent connect/disconnect on the same `PGLiteEngine` instance can
strand: disconnect snapshots+nulls the lock while connect is still
in-flight, leaving the resolved engine with no file lock held. Fix
requires an instance-level mutex; not worth the complexity for a
caller pattern that doesn't appear in production (single instance per
process, sequential lifecycle).
Also broadened `test/fix-wave-structural.test.ts` regex to accept
additional type-imports from `last-retrieved.ts` (e.g. the new
`type DrainOutcome` import that M1 added).
Test coverage: 53/53 wave tests pass (added C1-followup case to
last-retrieved.test.ts). The C1 fix is also pinned by tightening the
existing permanent-pending test's post-timeout assertion to expect
empty pending count rather than the prior (stale) "stays in set" note.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Consolidate the duplicate 'take advantage of v0.41.8.0' sections in the CHANGELOG entry into a single canonical block per the CLAUDE.md template. The wave originally landed with both '### How to take advantage' (line 13) and '### To take advantage' (line 57) as h3 headings. CLAUDE.md mandates one '## To take advantage of v[version]' h2 block per release entry, with verify steps + an issue-filing fallback for users hitting upgrade failures. Promoted the second block to h2, added the issue-filing step, and removed the redundant first block (the upgrade command is already covered in the verify steps). Itemized changes section was unchanged. llms.txt + llms-full.txt regenerated; structurally identical so no content changes shipped.
Master added v0.41.6.0 (CI test speedup — matrix 4→6 + weight-aware sharding + auto SHA cache + parallel verify, 23min → ~9min). After the merge, llms-full.txt grew to 603041 bytes — over the 600KB FULL_SIZE_BUDGET, which broke the build-llms.test.ts size budget assertion on CI shard 1. Per the canonical fix recipe (`scripts/llms-config.ts:241`: "ship with includeInFull=false exclusions"), excluded `docs/guides/minions-deployment.md` from the single-fetch bundle. It's a 13KB deployment runbook that operators read once; agents rarely need it in context. Web index entry stays discoverable. Result: 589907 bytes, 10KB headroom for future growth. Verify gate clean (21/21 parallel checks); wave test suite green (53/53 across 6 files); 7/7 build-llms tests green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…536 (CI shard 1)
CI shard 1 failed on this branch with: "expected 1280 dimensions, not 1536" from pgvector's CheckExpectedDim.
Root cause: master's v0.36.0 changed DEFAULT_EMBEDDING_DIMENSIONS from OpenAI's 1536d to ZeroEntropy's 1280d (src/core/ai/defaults.ts:21). The test's basisEmbedding helper hardcoded dim=1536, so beforeAll's upsertChunks failed when the schema column was created at 1280d.
Latent on master: the weight-aware LPT bin-packing in scripts/sharding.ts assigns files to shards deterministically based on the COMPLETE file set. My branch adds 5 new test files, which shifted find-experts-op.test.ts into shard 1. Master's shard 1 doesn't run this file (it lands in a different shard there), so the bug never surfaced in master's CI.
Fix: query the actual column dim via SELECT atttypmod FROM pg_attribute after initSchema, then seed the embedding at that width. This handles both paths (no-env CI → 1280; env-configured local → 1536) without hardcoding either default.
Verify:
- bun test test/find-experts-op.test.ts → 11/11 pass with provider env
- env -i bun test test/find-experts-op.test.ts → 11/11 pass without
- bun run verify → all 21 parallel checks clean
- bun run typecheck → clean
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
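A sketch of the "read the width, then seed at that width" idea, under assumptions. The table and column names in the query are placeholders, not the project's schema, and the helper below is a simplified stand-in for the test's `basisEmbedding`. For pgvector columns, `atttypmod` is understood to hold the declared dimension.

```typescript
// Placeholder identifiers: substitute the real table and column.
const COLUMN_DIM_SQL = `
  SELECT atttypmod AS dim
  FROM pg_attribute
  WHERE attrelid = 'chunks_table_placeholder'::regclass
    AND attname = 'embedding_column_placeholder'
`;

// Build a unit basis vector of the given width instead of hardcoding 1536 or 1280.
function basisEmbedding(index: number, dim: number): Float32Array {
  if (!Number.isInteger(dim) || dim <= 0) {
    throw new RangeError(`dim must be a positive integer, got ${dim}`);
  }
  if (!Number.isInteger(index) || index < 0) {
    throw new RangeError(`index must be a non-negative integer, got ${index}`);
  }
  const vector = new Float32Array(dim); // zero-filled
  vector[index % dim] = 1;
  return vector;
}

// `dim` would come from running COLUMN_DIM_SQL after schema init; 1280 is used here for illustration.
const seeded = basisEmbedding(2, 1280);
console.log(seeded.length, COLUMN_DIM_SQL.includes("atttypmod"));
```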
Master added v0.41.7.0 (compact list-format resolver + 300-skill scaling tutorial). VERSION/package.json/CHANGELOG conflicts resolved keeping v0.41.8.0 (this PR's claimed slot) + both CHANGELOG entries. llms-config.ts auto-merged cleanly — master's UPGRADING_DOWNSTREAM_AGENTS exclusion + this wave's minions-deployment exclusion both landed. Bundle now 578758 bytes (was 589907, ample headroom under 600KB). Verify: 21/21 parallel checks pass; typecheck clean; 62/62 wave tests across 6 files green (+1 from new scaling-skills test or similar pickup via master). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…CI verify)
CI verify failed on PR #1405 with check:test-isolation flagging test/scripts/check-test-isolation.test.ts even though that file is on line 22 of the allowlist (and has been since v0.26.7 as a permanent exemption: its body contains process.env mutation fixtures that the lint legitimately matches).
Could not reproduce locally on macOS bash 3.2 + BSD grep across any locale (C, C.UTF-8, POSIX). Suspect a subtle interaction between the prior `echo "$ALLOWLIST" | grep -qxF "$f"` form and one of: Ubuntu 24.04's bash 5 set-e/pipefail semantics, GNU grep edge case on the first-line entry, or `bun run` + GNU timeout subshell interaction. Diagnostic value of chasing further is low; the fix is to drop the grep+pipe form entirely.
Switch is_allowlisted() to pure-bash `case $'\n'"$ALLOWLIST"$'\n' in *$'\n'"$f"$'\n'*) return 0 ;; esac` whole-line matching:
- Locale-free (no character-class interaction)
- Pipe-free (no pipefail / SIGPIPE / buffering)
- Subshell-free (no env or exit-code propagation gotchas)
- set-e-quirk-free (no left-side compound failure)
- ~100x faster (no fork+exec per call across 689 files)
Verified locally: lint OK (689 files), case-match returns true for the allowlisted file and false for a non-allowlisted file. bun run verify clean (21/21 parallel checks pass).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan
added a commit
that referenced
this pull request
May 25, 2026
Master advanced past v0.41.7.0: - v0.41.8.0: PGLite search/query/get exit cleanly + #1340 hint + #1342 breadcrumbs (#1405) The headline conflict was scripts/check-test-isolation.sh: master shipped the SAME fix I had pushed (different code, same bug), and master's is structurally better — pure-bash `case` whole-line match instead of the file-direct grep I used. Both eliminate the Ubuntu 24.04 + bash 5 + GNU grep flake. Master's wins because: - no pipe, no subshell, no grep - locale-free, set-e-quirk-free - ~100x faster per call Resolved by taking master's `is_allowlisted` body (the pure-bash case) and restoring the cached `ALLOWLIST=` setup it depends on. My v0.41.9.0 file-direct grep approach is superseded. VERSION + package.json + CHANGELOG conflicts resolved (v0.41.9.0 still holds; CHANGELOG interleaves master's v0.41.8.0 entry below ours). llms-full.txt regenerated: 580,462 bytes (~120KB headroom under the v0.41.9.0 700KB budget, after master's expanded includeInFull exclusions landed in v0.41.7.0). 3-line audit clean. Verify: typecheck clean, check-test-isolation OK (694 files), build-llms 7/7 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan
added a commit
that referenced
this pull request
May 25, 2026
Brings in #1405 (v0.41.8.0 fix: PGLite search/query/get exit cleanly + #1340 hint + #1342 breadcrumbs). Standard trio conflicts resolved per CLAUDE.md procedure: - VERSION: ours wins (0.41.11.0). - package.json: ours wins (version line; rest of file auto-merged clean). - CHANGELOG.md: both entries kept; ours stays topmost. No code-file conflicts this time — CLAUDE.md, llms-full.txt, src/cli.ts auto-merged cleanly. Post-merge verification: - bun install (no changes) - typecheck clean - bun run verify PASS (21 checks, 18s parallel) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mgunnin
added a commit
to mgunnin/gbrain
that referenced
this pull request
May 28, 2026
* upstream/master: v0.41.10.1 fix-wave: dream.* config + batch retry + extract_atoms idempotency + ze-switch env-gate (garrytan#1445) v0.41.10.0 feat: orphan reduction via --by-mention + UTF-16 surrogate-pair fix (garrytan#1442) v0.41.9.0 — UX/reliability fix wave (5 defects from production report) (garrytan#1440) v0.41.8.0 fix(pglite): search/query/get exit cleanly + garrytan#1340 hint + garrytan#1342 breadcrumbs (garrytan#1405) v0.41.7.0 feat: compact list-format resolver + 300-skill scaling tutorial (garrytan#1407) v0.41.6.0 feat(ci): CI test speedup — 23min → ~9min via matrix 4→6 + weight-aware sharding + auto SHA cache + parallel verify (garrytan#1444) v0.41.5.0 fix-wave: warm-narwhal — 6 community PRs + E2E reliability (garrytan#1374) # Conflicts: # src/core/ai/recipes/openai.ts
garrytan-agents
pushed a commit
to garrytan-agents/gbrain
that referenced
this pull request
Jun 13, 2026
…hint + garrytan#1342 breadcrumbs (garrytan#1405) * fix(pglite): drain fire-and-forget last_retrieved_at writes before disconnect Closes the structural bug class behind garrytan#1247, garrytan#1269, garrytan#1290: PGLite CLI search/query/get_page commands printed results then hung at ~95-98% CPU until SIGKILL. Root cause: bumpLastRetrievedAt's IIFE races engine.disconnect() — PGLite's WASM runtime keeps Bun's event loop alive while the dangling UPDATE settles. Mirrors the existing awaitPendingSearchCacheWrites precedent landed in v0.36.1.x for garrytan#1090. Tracks every IIFE promise in a module-scoped Set, exposes awaitPendingLastRetrievedWrites(timeoutMs) that resolves once all settle. Bounded with a 5s default timeout via Promise.race so a future fire-and-forget that hangs forever can't recreate the bug class at this layer — instead, the drain stderr-warns with a pending count and returns timeout outcome so the caller can decide its fallback. Test coverage: 6 unit cases covering empty drain, single + multi-pending settle, throw-in-IIFE still settles, permanently-pending hits timeout within bound, empty pageIds does not track. This commit ships the helper + tracking + tests with NO consumer. The cli.ts wiring lands in a follow-up commit (atomic bisect units). Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(pglite): snapshot+early-null disconnect + try/finally lock-leak guard Refactor PGLiteEngine.disconnect() with two structural fixes: (1) Snapshot + early-null pattern: capture db/lock refs and null the instance fields BEFORE any await. A concurrent connect() can no longer observe `_db` pointing at a handle that's mid-close. This is PR garrytan#1337's load-bearing contribution that we DID take. (2) Wrap close + release in try/finally. Without this guard, a thrown db.close() would leak the file lock and wedge every next gbrain invocation on the stale lock. 
Codex outside-voice review (eng review finding garrytan#7) caught this gap when reviewing the snapshot refactor. KEEP the original close-then-release order. PR garrytan#1337's diff swapped this to release-then-close, which we explicitly REJECTED — releasing the lock before close lets a sibling process try to connect to a still-closing brain. The new lifecycle test file pins this ordering so a future maintainer reading PR garrytan#1337's diff cannot accidentally flip it. Test coverage in test/pglite-engine-disconnect.serial.test.ts: 5 cases — close-before-release ordering, early-null observable inside close, lock-still-releases on close-throw, double-disconnect idempotency, reconnect-after-disconnect clean state. `.serial` because each test creates a fresh PGLite engine (WASM cold-start cost) — running in parallel shards would starve other tests. Existing test/pglite-engine.test.ts: 100/100 still green. Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(pglite): classify WASM init errors so garrytan#1340 gets the right hint (garrytan#1340) Closes the user-facing half of garrytan#1340: on macOS 12.7.6 + Bun 1.3.14, the PGLite connect() catch block hardcoded the macOS 26.3 hint (garrytan#223). The actual root cause for garrytan#1340 is Bun's vfs: `/$$bunfs/root` is read-only on older macOS, so PGLite cannot extract its pglite.data WASM payload. Adds two exported helpers in pglite-engine.ts: classifyPgliteInitError(message): 'bunfs' | 'macos-26-3' | 'unknown' buildPgliteInitErrorMessage(verdict, original): string Connect catch block now routes the hint by verdict. The bunfs hint names `bun upgrade` + Node fallback. The macOS 26.3 hint keeps the existing garrytan#223 link. Unknown falls through to a generic doctor + garrytan#223 fallback. 
Per Codex eng-review finding garrytan#9, the bunfs regex is tightened to match either the literal `$$bunfs` marker OR ENOENT+pglite.data co-occurrence — NOT generic `pglite.data` substring (would fire on unrelated errors). Negative test pinned. Root fix is upstream Bun; this PR just stops misclassifying the failure class so support traffic doesn't conflate two unrelated bugs. Test coverage: 12 pure-function unit cases including the garrytan#1340 reporter's exact error string round-trip, the negative case Codex caught, and all three verdicts × all three message contents. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cli): await last-retrieved drain + narrow timeout-only force-exit (garrytan#1247, garrytan#1269, garrytan#1290) Wires the v0.40.10.0 drain helper into cli.ts and adds the IRON-RULE behavioral regression test for the search-hang class. The drain is called unconditionally for every op (not per-op-name gated — that was the original PR garrytan#1259 mistake that left search and get_page exposed). The narrow force-exit synthesis (decision D7 from the eng review, informed by Codex outside-voice findings garrytan#1+garrytan#2+garrytan#8): when the drain returns outcome:'timeout', AFTER engine.disconnect() resolves AND the command is NOT a daemon, fire process.exit(0). The drain helper already stderr-warned with the pending count, so the diagnostic signal is preserved. Without this guard, a hung underlying promise could still keep Bun's event loop alive past disconnect. CRITICALLY narrower than PR garrytan#1337's blanket force-exit: the timeout path is the only trigger. In the common case (drain settles cleanly under 5s), no force-exit fires and the behavioral subprocess test still catches future regressions. The shouldForceExitAfterMain guard excludes 'serve' so the stdio + HTTP daemons stay alive past main(). 
e2e/pglite-cli-exit.serial.test.ts (NEW, IRON RULE): - gbrain search "foxtrot" → exits 0 within 15s - gbrain get alpha → exits 0 within 15s with foxtrot in stdout - gbrain query "foxtrot" --no-expand → exits within 15s (no-API-key graceful) - gbrain serve --http → stays alive 3+ seconds (daemon-survival regression guard) fix-wave-structural.test.ts: - import assertion for awaitPendingLastRetrievedWrites - last-retrieved.ts exports + Set tracking + Promise.race + timeout - BEHAVIORAL positioning assertion: drain `await` appears textually BEFORE engine.disconnect `await` in the op-dispatch local-engine path. Survives variable-rename refactors; catches any new disconnect path that bypasses the drain. - shouldForceExitAfterMain excludes 'serve' AND the gate is conditioned on drainResult.outcome==='timeout' Per D8 (Codex finding garrytan#5), explicitly do NOT add a drift-guard counting bumpLastRetrievedAt callers — would block harmless refactors and miss aliases. Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com> Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sync): add phase breadcrumbs to performSyncInner for garrytan#1342 triage The garrytan#1342 reporter saw ZERO stderr output before their PGLite sync hang, which made the bug impossible to triage from a community report alone. Mirrors the pre-existing `[gbrain phase] sync.git_pull start/done` pattern at the major pre-pull phase boundaries so the next garrytan#1342-shaped report names WHICH phase spun. Four new breadcrumbs at: - sync.resolve_repo (top of performSyncInner) - sync.load_active_pack (before the v0.39 T1.5 pack load) - sync.validate_repo_state (only when opts.sourceId is set — the re-clone branch) - sync.detect_head (before the isDetachedHead probe) No behavior change — pure stderr instrumentation. 
Doesn't fix garrytan#1342 (which still needs investigation per the TODOS entry filed in this wave), but converts "hung with no output" into actionable diagnostic data the next time the bug shape is reported. Per D9 in the eng review + Codex outside-voice finding garrytan#14. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: annotate v0.40.10.0 PGLite hang wave in CLAUDE.md + regen llms Key Files entries updated: - src/core/pglite-engine.ts: documents the v0.40.10.0 disconnect refactor (snapshot+early-null + try/finally lock-leak guard, KEEPS close-then-release order), and the new classifyPgliteInitError / buildPgliteInitErrorMessage helpers for garrytan#1340 hint routing. Pins PR garrytan#1337's accepted-but-narrowed contribution and the rejected release-then-close ordering swap. - src/core/last-retrieved.ts (within the brainstorm entry): documents the new awaitPendingLastRetrievedWrites drain, the Set tracking pattern, the 5s bounded timeout, the cli.ts narrow timeout-only force-exit synthesis with the serve-daemon guard, and the three community-validated reports (garrytan#1247/garrytan#1269/garrytan#1290) the fix closes. Credits PR garrytan#1259 (drain pattern) and PR garrytan#1337 (snapshot pattern + force-exit guard idea). Regenerated llms.txt + llms-full.txt — build-llms.test.ts gates the drift, all 7 cases green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(todos): file v0.40.10.0 PGLite hang follow-ups Three deferred items from the v0.40.10.0 fix wave: 1. garrytan#1342 sync-hang investigation. Single-reporter, JS-tight-loop shape, needs reproducer before any fix. Documents the ruled-out hypotheses (lock-refresh heartbeat, v91 trigger, while-true loops) and three concrete diagnostic next steps. The v0.40.10.0 sync phase breadcrumbs make the next report actionable. 2. awaitPendingSearchCacheWrites timeout-symmetry retrofit. 
   The garrytan#1090 drain shipped without a timeout; the v0.40.10.0 garrytan#1247 drain ships with one. Apply the same Promise.race + stderr warn pattern for symmetry.
3. Drain-helper extraction. Per D4 in the eng review: two surfaces is the threshold for noticing, three for extracting. Pair with the symmetry retrofit above as one focused refactor when a third fire-and-forget surface appears.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.40.10.0 fix(pglite): search/query/get exit cleanly + garrytan#1340 hint + garrytan#1342 breadcrumbs

Closes garrytan#1247, garrytan#1269, garrytan#1290 (PGLite CLI search/query/get hang at ~95-98% CPU after printing results; three community-validated reports). Also fixes garrytan#1340 (WASM init misroutes to macOS 26.3 hint when real cause is Bun vfs read-only mount) and adds diagnostic phase breadcrumbs for the single-reporter garrytan#1342 sync-hang investigation.

Core fix: track every fire-and-forget bumpLastRetrievedAt IIFE in a module-scoped Set; cli.ts awaits the drain before engine.disconnect() in the op-dispatch finally block; narrow process.exit(0) fires ONLY when the drain times out AND the command isn't a daemon. Snapshot+early-null disconnect pattern + try/finally lock-leak guard close the partial-state race PR garrytan#1337 originally surfaced.

Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com>
Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: extract shouldForceExitAfterMain to its own module + add unit cases

Gap-audit follow-up: cli.ts is a script entrypoint (top-level main() side effect), so importing it from a test fires the help output as a side effect. Move shouldForceExitAfterMain into src/core/cli-force-exit.ts so it can be unit-tested in isolation without the cli.ts script tail running.
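For readers without the diff at hand, the extracted helper might look roughly like the sketch below. It is an illustration built only from the behavior this commit message describes (strict Set membership, global flags allowed before the command, empty or flag-only argv defaulting to true), not the code in src/core/cli-force-exit.ts. The sketch takes argv as a required parameter and omits the default-argument fallback to process.argv.slice(2); it also assumes no global flag consumes a separate value token, which the real helper may handle differently.

```typescript
// Hypothetical sketch; the real module lives in src/core/cli-force-exit.ts.
// Daemon commands must never be force-exited. Only `serve` is named in the source.
const DAEMON_COMMANDS: ReadonlySet<string> = new Set(["serve"]);

function shouldForceExitAfterMain(argv: readonly string[]): boolean {
  // The command is taken to be the first token that is not a flag, so
  // `--quiet serve` still resolves to `serve`. Assumption: flags never take
  // a separate value token ahead of the command.
  const command = argv.find((arg) => !arg.startsWith("-"));

  // Empty or flag-only argv has no daemon to protect, so force-exit is allowed.
  if (command === undefined) return true;

  // Strict equality through Set.has: `serves` does not match `serve`.
  return !DAEMON_COMMANDS.has(command);
}
```

A future daemon command would be added to `DAEMON_COMMANDS` rather than to the function body, which is the design choice the commit message calls out.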
Adds test/cli-should-force-exit.test.ts (9 cases): bare serve, serve with flags after, global flags BEFORE the command (the load-bearing case for `gbrain --quiet serve`), op commands return true, non-daemon CLI commands return true, empty argv defaults to true, flag-only argv, default-arg fallback to process.argv.slice(2), substring-match avoidance (`serves` is NOT `serve`: strict equality via Set, not startsWith/includes).

The daemon command set is now an explicit ReadonlySet: future daemons (a hypothetical `gbrain watch` or `gbrain daemon`) just add their name to DAEMON_COMMANDS rather than chaining ||.

Updates fix-wave-structural.test.ts to look for the import + the new DAEMON_COMMANDS shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(version): rebase v0.40.10.0 → v0.41.6.0 (slot collision after v0.41.0.0+ landed)

origin/master moved from v0.40.8.1 → v0.41.0.0 while this wave was in flight (PR garrytan#1367 minions cathedral). v0.41.1-v0.41.5 are claimed by other in-flight branches, so v0.41.6.0 is the next available slot.

Bulk-renamed v0.40.10.0 → v0.41.6.0 across:
- VERSION + package.json (trio audit clean: 0.41.6.0 / 0.41.6.0 / 0.41.6.0)
- CHANGELOG.md (header + 3 prose references)
- CLAUDE.md (Key Files annotations)
- TODOS.md (follow-up entry header)
- src/cli.ts + src/core/cli-force-exit.ts + src/core/last-retrieved.ts + src/core/pglite-engine.ts + src/commands/sync.ts (inline comments)
- test/* (describe blocks + test file headers)
- llms-full.txt (regenerated via `bun run build:llms`)

bun.lock unchanged (version-only bump, no dep churn) per Codex garrytan#12.

Verify: 52/52 wave tests pass after rename, typecheck clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: quarantine seed-pglite to .serial.test.ts (parallel WASM cold-start flake)

The full-suite run during the v0.41.6.0 fix wave ship hit a 30s timeout in test/seed-pglite.test.ts under heavy 4-shard parallel contention (4972/4973 passed before SIGKILL). The test passes 11/11 in isolation.

Root cause: each test instantiates a fresh PGLiteEngine (5 instances across the file, one per test) because each case writes to a different mkdtemp-ed dbPath. Under parallel shard load, multiple shards each cold-starting PGLite WASM simultaneously stretches the per-instance init from ~5s to 30s+. The shared-engine pattern (canonical PGLite block in CLAUDE.md R3+R4) doesn't apply here: different dbPaths require different engines.

Fix per CLAUDE.md test-isolation quarantine rules: rename to `.serial.test.ts` so the file runs in the post-parallel serial pass with full WASM init capacity. Same pattern as test/pglite-engine-disconnect.serial.test.ts (added in this wave) and test/brain-registry.serial.test.ts (pre-existing).

Removes test/seed-pglite.test.ts from check-test-isolation.allowlist since the .serial.test.ts rename auto-exempts it from the R3+R4 lint (scan skips *.serial.test.ts). 641 non-serial unit files scanned, lint clean.

Verify:
- bun test test/seed-pglite.serial.test.ts → 11/11 pass in 4.19s
- scripts/check-test-isolation.sh → OK
- bun run verify → all gates pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: pre-landing review fixes (C13 disconnect-hang, C1 set leak, C9 catch drain, M1 type drift)

Adversarial review + maintainability specialist surfaced four real issues in the v0.41.8.0 wave. All four fixed in this commit; one deferred to TODOS.md as a v0.41+ follow-up (unusual caller pattern).
**C13 [load-bearing, defense-in-depth for the wave's stated goal]:** `await engine.disconnect()` inside the op-dispatch finally can ITSELF hang on PGLite (db.close() racing OS-level FS state). When that happens, the entire wave's force-exit guard never runs, and we recreate the original hang at a new layer. Fix: install an unref'd setTimeout hard-exit fallback BEFORE entering the try/catch/finally. The timer fires after DISCONNECT_HARD_DEADLINE_MS=10s with a stderr warn and process.exit(0). unref ensures it doesn't keep the loop alive on a healthy exit. Daemons (`serve`) are excluded by reusing the shouldForceExitAfterMain guard.

**C9 [data freshness gap, narrow but real]:** The drain ran ONLY in the success branch of try. If `bumpLastRetrievedAt` fired (handler succeeded) but `JSON.parse(JSON.stringify(...))` or `formatResult` then threw, process.exit(1) killed the process and the in-flight UPDATE was discarded. Fix: drain in the catch path too before process.exit(1) (best-effort, bounded by the drain's own 5s timeout).

**C1 [daemon leak]:** A timed-out IIFE used to stay in the pending-writes Set forever because its `.finally` never fires. Long-lived `gbrain serve` would accumulate references without bound across repeated timeouts. Fix: explicitly `delete` the snapshot's tracked promises from the Set after a timeout outcome. The IIFEs keep running (orphaned), but the Set no longer leaks references. Pinned by a new unit test that asserts the second drain after a timeout returns immediately with empty pending count.

**M1 [silent type drift]:** `cli.ts` duplicated the `{outcome, pending}` literal shape instead of importing the `DrainOutcome` type that `last-retrieved.ts` exports exactly for this purpose. Two-line fix: add `type DrainOutcome` to the import and use it for `let drainResult`. Future changes to the return shape now propagate through TypeScript.
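The C13 fallback described above can be sketched as follows. This is a minimal illustration, not the code in cli.ts: the function name `installHardExitFallback`, its parameters, and the warning text are hypothetical, and the exit function is injected so the sketch can run without killing the host process. Only the 10s deadline, the unref'd `setTimeout`, the stderr warning, the exit code 0, and the daemon exclusion come from the commit message.

```typescript
// Hypothetical sketch of the C13 hard-deadline fallback; names are illustrative.
const DISCONNECT_HARD_DEADLINE_MS = 10_000; // the 10s deadline named in the commit message

type ExitFn = (code: number) => void;
type FallbackTimer = ReturnType<typeof setTimeout>;

function installHardExitFallback(
  forceExitAllowed: boolean, // false for daemons such as `serve`
  exit: ExitFn, // injected; a CLI would pass a wrapper around process.exit
  deadlineMs: number = DISCONNECT_HARD_DEADLINE_MS,
): FallbackTimer | null {
  // Daemons are excluded: they are expected to outlive any deadline.
  if (!forceExitAllowed) return null;

  const timer = setTimeout(() => {
    // Warn on stderr so the forced exit is visible in bug reports.
    console.error(`disconnect exceeded ${deadlineMs}ms; forcing exit`);
    exit(0);
  }, deadlineMs);

  // unref (where the runtime provides it) keeps this timer from holding the
  // event loop open, so a healthy run exits as soon as its real work is done.
  (timer as unknown as { unref?: () => void }).unref?.();
  return timer;
}
```

The timer is armed before the guarded work starts, so it still fires if the cleanup path itself never returns. Because it is unref'd, the healthy path does not need to clear it.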
**Deferred to TODOS.md (C6: unusual caller pattern):** Concurrent connect/disconnect on the same `PGLiteEngine` instance can strand: disconnect snapshots+nulls the lock while connect is still in-flight, leaving the resolved engine with no file lock held. Fix requires an instance-level mutex; not worth the complexity for a caller pattern that doesn't appear in production (single instance per process, sequential lifecycle).

Also broadened `test/fix-wave-structural.test.ts` regex to accept additional type-imports from `last-retrieved.ts` (e.g. the new `type DrainOutcome` import that M1 added).

Test coverage: 53/53 wave tests pass (added C1-followup case to last-retrieved.test.ts). The C1 fix is also pinned by tightening the existing permanent-pending test's post-timeout assertion to expect empty pending count rather than the prior (stale) "stays in set" note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: post-ship documentation sync for v0.41.8.0

Consolidate the duplicate 'take advantage of v0.41.8.0' sections in the CHANGELOG entry into a single canonical block per the CLAUDE.md template. The wave originally landed with both '### How to take advantage' (line 13) and '### To take advantage' (line 57) as h3 headings. CLAUDE.md mandates one '## To take advantage of v[version]' h2 block per release entry, with verify steps + an issue-filing fallback for users hitting upgrade failures. Promoted the second block to h2, added the issue-filing step, and removed the redundant first block (the upgrade command is already covered in the verify steps). Itemized changes section was unchanged.

llms.txt + llms-full.txt regenerated; structurally identical so no content changes shipped.

* fix(test): find-experts-op queries schema dim instead of hardcoding 1536 (CI shard 1)

CI shard 1 failed on this branch with: "expected 1280 dimensions, not 1536" from pgvector's CheckExpectedDim.
Root cause: master's v0.36.0 changed DEFAULT_EMBEDDING_DIMENSIONS from OpenAI's 1536d to ZeroEntropy's 1280d (src/core/ai/defaults.ts:21). The test's basisEmbedding helper hardcoded dim=1536, so beforeAll's upsertChunks failed when the schema column was created at 1280d.

Latent on master: the weight-aware LPT bin-packing in scripts/sharding.ts assigns files to shards deterministically based on the COMPLETE file set. My branch adds 5 new test files, which shifted find-experts-op.test.ts into shard 1. Master's shard 1 doesn't run this file (it lands in a different shard there), so the bug never surfaced in master's CI.

Fix: query the actual column dim via SELECT atttypmod FROM pg_attribute after initSchema, then seed the embedding at that width. This handles both paths (no-env CI → 1280; env-configured local → 1536) without hardcoding either default.

Verify:
- bun test test/find-experts-op.test.ts → 11/11 pass with provider env
- env -i bun test test/find-experts-op.test.ts → 11/11 pass without
- bun run verify → all 21 parallel checks clean
- bun run typecheck → clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(lint): robust pure-bash allowlist match in check-test-isolation (CI verify)

CI verify failed on PR garrytan#1405 with check:test-isolation flagging test/scripts/check-test-isolation.test.ts even though that file is on line 22 of the allowlist (and has been since v0.26.7 as a permanent exemption: its body contains process.env mutation fixtures that the lint legitimately matches).

Could not reproduce locally on macOS bash 3.2 + BSD grep across any locale (C, C.UTF-8, POSIX). Suspect a subtle interaction between the prior `echo "$ALLOWLIST" | grep -qxF "$f"` form and one of: Ubuntu 24.04's bash 5 set-e/pipefail semantics, GNU grep edge case on the first-line entry, or `bun run` + GNU timeout subshell interaction. Diagnostic value of chasing further is low; the fix is to drop the grep+pipe form entirely.
Switch is_allowlisted() to pure-bash `case $'\n'"$ALLOWLIST"$'\n' in *$'\n'"$f"$'\n'*) return 0 ;; esac` whole-line matching:
- Locale-free (no character-class interaction)
- Pipe-free (no pipefail / SIGPIPE / buffering)
- Subshell-free (no env or exit-code propagation gotchas)
- set-e-quirk-free (no left-side compound failure)
- ~100x faster (no fork+exec per call across 689 files)

Verified locally: lint OK (689 files), case-match returns true for the allowlisted file and false for a non-allowlisted file. bun run verify clean (21/21 parallel checks pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Park Je Hoon <jehoon@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Matt Dean <matt-dean-git@users.noreply.github.com>
Summary
PGLite was the #1 community pain since v0.37: five open issues report the same engine hanging in different shapes. This wave ships the fix for the search/query/get hang class (#1247, #1269, #1290), improves the WASM init error hint for #1340, and adds diagnostic phase breadcrumbs to make the next #1342-shape report actionable.
Performance / fixes
- `gbrain search`, `gbrain query`, `gbrain get` on PGLite now exit cleanly in <2s instead of hanging at ~95-98% CPU until SIGKILL.
- The fix drains the fire-and-forget `bumpLastRetrievedAt` IIFE before `engine.disconnect()`. Pending writes are tracked in a module-scoped `Set<Promise>`, mirroring the existing #1090 `awaitPendingSearchCacheWrites` precedent.
- A hard-exit fallback timer is installed in `cli.ts` so a hung `engine.disconnect()` cannot defeat the drain timeout (adversarial-review C13).
- `gbrain serve --http` stays alive: the narrow force-exit only fires on drain timeout AND a non-daemon command, via `shouldForceExitAfterMain()`.

Defensive infra
- `PGLiteEngine.disconnect()` snapshot+early-null + try/finally lock-leak guard (closes the partial-state race PR #1337 originally surfaced).
- `classifyPgliteInitError(message)` + `buildPgliteInitErrorMessage(verdict, original)` route WASM init errors by failure shape so #1340 (bunfs/older macOS) stops being misrouted to the macOS 26.3 #223 hint.

Diagnostic
- `[gbrain phase] sync.<phase>` stderr breadcrumbs in `performSyncInner` (resolve_repo, load_active_pack, validate_repo_state, detect_head) so the next #1342-shape sync hang names which phase spun.

Tests added: 53/53 pass across the wave
- `test/last-retrieved.test.ts` (7 unit cases, including the C1 daemon-leak-guard regression added in the pre-landing fix pass)
- `test/pglite-engine-disconnect.serial.test.ts` (5 lifecycle invariants: close-then-release order, snapshot pattern, lock-leak-on-throw, double-disconnect idempotency, reconnect cleanliness)
- `test/pglite-init-classifier.test.ts` (12 cases, including the #1340 reporter's exact error round-trip + negative case for the tightened regex)
- `test/cli-should-force-exit.test.ts` (9 pure-function cases, including substring-match avoidance)
- `test/e2e/pglite-cli-exit.serial.test.ts` (4 behavioral subprocess cases: search/query/get exit clean + daemon-survival regression guard); IRON-RULE regression
- `test/fix-wave-structural.test.ts` (5 new describe blocks for v0.41.8.0)
- `test/seed-pglite.serial.test.ts` (RENAMED from `.test.ts`; quarantine for parallel-shard PGLite WASM cold-start contention; 11/11 pass in serial pass)

Test Coverage
44/54 paths fully covered (★★★) = 81%. 50/54 with at least smoke = 93%. Coverage gate: PASS (≥80% target).
Gaps are concentrated in graceful-degradation error branches (best-effort code that swallows errors anyway) and observational stderr breadcrumbs (#1342 diagnostic only). Load-bearing fixes (drain, force-exit, classifier, disconnect lifecycle) densely covered with unit + serial + subprocess E2E.
Pre-Landing Review
13 commits, 1327 lines. Reviewed via adversarial (Claude subagent) + testing/maintainability specialist + Codex. Surfaced 4 critical findings; all fixed in commit cb349f0. 8 informational findings reviewed; informational nits filed as v0.41+ TODOs where appropriate.
Fixed in cb349f0:
- C13: `await engine.disconnect()` can itself hang on PGLite (db.close racing OS-level FS). Installed unref'd `setTimeout(10s)` hard-exit fallback BEFORE entering the try/catch/finally so a hung disconnect cannot defeat the drain timeout. Daemons excluded via `shouldForceExitAfterMain`.
- C9: If `bumpLastRetrievedAt` fired then `formatResult` threw, `process.exit(1)` discarded the UPDATE. Now drains in catch path too (best-effort, bounded).
- C1: A timed-out IIFE stayed in the pending-writes Set forever, so long-lived `gbrain serve` would accumulate references. Now explicitly deletes the snapshot's tracked promises from the Set after a timeout outcome. Pinned by a new unit test that asserts the next drain after a timeout returns immediately with empty pending count.
- M1: `cli.ts` duplicated the `DrainOutcome` literal shape. Now imports `type DrainOutcome` from `last-retrieved.ts` so future shape changes propagate.

Deferred to TODOS.md (C6): Concurrent `connect()`/`disconnect()` on the same instance can strand (unusual caller pattern; not in production).

Plan Completion
12/13 DONE, 1 CHANGED (T8: split into `test/pglite-engine-disconnect.serial.test.ts` + `test/pglite-init-classifier.test.ts` instead of extending existing file; coverage ≥ plan intent). 7 post-merge UNVERIFIABLE actions (close PRs #1259/#1337 with credit, close issues #1247/#1269/#1290 as fixed, leave #1340/#1342 open with notes), to land after merge.

TODOS
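`v0.41.8.0 PGLite hang follow-ups` section in TODOS.md:
- #1342 sync-hang investigation
- Retrofit `awaitPendingSearchCacheWrites` with the same bounded timeout
- Extract a `createDrainHelper<T>()` factory when a third surface appears
- C6: concurrent connect/disconnect on the same `PGLiteEngine` instance

Documentation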
- `CHANGELOG.md`: consolidated the duplicate "take advantage of v0.41.8.0" sections into a single canonical `## To take advantage of v0.41.8.0` h2 block per the CLAUDE.md template (commit 494e3bd).
- `CLAUDE.md`: Key Files entries for `src/core/last-retrieved.ts` (drain helper + Set tracking + bounded timeout) and `src/core/pglite-engine.ts` (snapshot pattern + try/finally + classifier).
- `TODOS.md`: v0.41.8.0 follow-ups filed with #1342 investigation, awaitPendingSearchCacheWrites retrofit, drain-helper extraction, and C6 concurrency follow-up.
- `llms-full.txt`: regenerated.

Verification Results
- `bun run verify`: green (privacy + jsonb + progress + wasm + types + all CI checks)
- `bun run typecheck`: green
- `test/seed-pglite.serial.test.ts` (post-rename): 11/11 pass in isolation
- Full local parallel suite: one timeout under heavy 4-shard contention (`scripts/run-unit-parallel.sh:61` already clamped 8→4 in v0.40.10 for the same reason). CI runners have more memory; serial pass (which catches PGLite-heavy work) completes 441/441 clean.
- `gbrain search "fox" --limit 3 && echo EXIT=$?` returns `EXIT=0` in <2s; `time gbrain query "test" --no-expand --limit 3` completes in <2s; `gbrain serve --http --port N` stays alive for the duration of test (60s+).

Credit
PR #1259 by jehoon supplied the structural drain pattern; validated by @eloe, @bcallender, @61tH0b. PR #1337 by matt-dean-git supplied the snapshot+early-null disconnect pattern and the force-exit idea this wave narrowed to fire only on the drain-timeout path. Both closed with credit on the landing commit.
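For readers who want the shape of the drain pattern credited above without opening the diff, here is a minimal sketch. It is an illustration under stated assumptions, not the code in `src/core/last-retrieved.ts`: `trackWrite`, `pendingWrites`, the `"drained"` outcome label, and the warning text are hypothetical names. The module-scoped Set, the `Promise.race` against a 5s default timeout, the stderr warning with a pending count, the `timeout` outcome, and the post-timeout cleanup of the Set (the C1 fix) are taken from this PR's description.

```typescript
// Hypothetical sketch of the bounded drain; helper names are illustrative.
type DrainOutcome = { outcome: "drained" | "timeout"; pending: number };

// Module-scoped registry of in-flight fire-and-forget writes.
const pendingWrites = new Set<Promise<void>>();

function trackWrite(work: () => Promise<void>): void {
  // Remove the promise once it settles, whether it fulfilled or rejected,
  // so a throwing write still counts as settled.
  const cleanup = (): void => {
    pendingWrites.delete(p);
  };
  const p: Promise<void> = Promise.resolve().then(work).then(cleanup, cleanup);
  pendingWrites.add(p);
}

async function awaitPendingLastRetrievedWrites(timeoutMs = 5_000): Promise<DrainOutcome> {
  if (pendingWrites.size === 0) return { outcome: "drained", pending: 0 };

  // Snapshot so writes tracked after this point do not extend the wait.
  const snapshot = Array.from(pendingWrites);
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  const settled = Promise.all(snapshot).then(() => "drained" as const);

  const outcome = await Promise.race([settled, timedOut]);
  clearTimeout(timer);

  if (outcome === "timeout") {
    const pending = snapshot.filter((p) => pendingWrites.has(p)).length;
    // Drop timed-out promises so a long-lived process does not leak references.
    for (const p of snapshot) pendingWrites.delete(p);
    console.error(`drain timed out after ${timeoutMs}ms with ${pending} pending write(s)`);
    return { outcome: "timeout", pending };
  }
  return { outcome: "drained", pending: 0 };
}
```

A caller would await the drain before disconnecting the engine and consider a forced exit only when the outcome is `timeout`, which matches the narrowed force-exit described in this PR.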
Test plan
- `bun run verify` clean
- `bun test test/last-retrieved.test.ts test/pglite-engine-disconnect.serial.test.ts test/pglite-init-classifier.test.ts test/cli-should-force-exit.test.ts test/fix-wave-structural.test.ts test/e2e/pglite-cli-exit.serial.test.ts`: 53/53 pass
- `bun test test/pglite-engine.test.ts`: 100/100 pass (existing v0.13.1 #223 source-grep guard survives the catch-block rewrite)
- `bun test test/seed-pglite.serial.test.ts`: 11/11 pass (quarantine validated)
- `time gbrain search "x" --limit 3; echo EXIT=$?` on fresh PGLite brain returns EXIT=0 in <2s
- `gbrain serve --http --port 31313 & sleep 60; kill -0 %1` confirms daemon survival

🤖 Generated with Claude Code