v0.41.8.0 fix(pglite): search/query/get exit cleanly + #1340 hint + #1342 breadcrumbs (#1405)

Merged
garrytan merged 19 commits into master from garrytan/pglite-hang-fix-wave
May 25, 2026

Conversation

@garrytan

Owner

Summary

PGLite has been the #1 community pain point since v0.37: five open issues report the same engine hanging in different shapes. This wave ships the fix for the search/query/get hang class (#1247, #1269, #1290), improves the WASM init error hint for #1340, and adds diagnostic phase breadcrumbs to make the next #1342-shape report actionable.

Performance / fixes

  • gbrain search, gbrain query, gbrain get on PGLite now exit cleanly in <2s instead of hanging at ~95-98% CPU until SIGKILL.
  • Bounded 5s drain on the fire-and-forget bumpLastRetrievedAt IIFE before engine.disconnect(). Tracked in a module-scoped Set<Promise>, mirroring the existing #1090 awaitPendingSearchCacheWrites precedent.
  • Defense-in-depth: unref'd setTimeout(10s) hard-exit fallback in cli.ts so a hung engine.disconnect() cannot defeat the drain timeout (adversarial-review C13).
  • gbrain serve --http stays alive — narrow force-exit only fires on drain timeout AND non-daemon command via shouldForceExitAfterMain().
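
A minimal sketch of the tracking-plus-bounded-drain pattern described in the bullets above. It is not the code in last-retrieved.ts: the names `trackWrite` and `awaitPendingWrites` and the `"drained"` outcome label are assumptions made for illustration; only the `{outcome, pending}` shape, the `'timeout'` outcome, the Set tracking, and the 5s default come from this PR.

```typescript
// Hypothetical sketch; names are illustrative, not the real exports.
type DrainOutcome = { outcome: "drained" | "timeout"; pending: number };

// Module-scoped set of in-flight fire-and-forget writes.
const pendingWrites = new Set<Promise<void>>();

// Start a fire-and-forget write and track it until it settles.
function trackWrite(work: () => Promise<void>): void {
  const p: Promise<void> = (async () => {
    await work();
  })()
    .catch(() => {
      // Best-effort write: swallow the rejection so the drain never throws.
    })
    .finally(() => {
      pendingWrites.delete(p);
    });
  pendingWrites.add(p);
}

// Wait for every tracked write, but never longer than timeoutMs.
async function awaitPendingWrites(timeoutMs: number = 5000): Promise<DrainOutcome> {
  if (pendingWrites.size === 0) return { outcome: "drained", pending: 0 };

  const snapshot = Array.from(pendingWrites);
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  const settled = Promise.allSettled(snapshot).then(() => "settled" as const);

  const winner = await Promise.race([settled, timedOut]);
  clearTimeout(timer);
  if (winner === "settled") return { outcome: "drained", pending: 0 };

  // Timeout path: count what is still in flight, then drop the references
  // so a long-lived process does not accumulate them (the C1 concern).
  const stillPending = snapshot.filter((p) => pendingWrites.has(p)).length;
  snapshot.forEach((p) => pendingWrites.delete(p));
  console.error(`drain timed out with ${stillPending} pending write(s)`);
  return { outcome: "timeout", pending: stillPending };
}
```

The caller decides what to do with a `timeout` outcome; the helper itself only reports it.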

Defensive infra

Diagnostic

Tests added: 53/53 pass across the wave

  • test/last-retrieved.test.ts (7 unit cases — including the C1 daemon-leak-guard regression added in the pre-landing fix pass)
  • test/pglite-engine-disconnect.serial.test.ts (5 lifecycle invariants — close-then-release order, snapshot pattern, lock-leak-on-throw, double-disconnect idempotency, reconnect cleanliness)
  • test/pglite-init-classifier.test.ts (12 cases including the #1340 reporter's exact error round-trip + negative case for the tightened regex)
  • test/cli-should-force-exit.test.ts (9 pure-function cases including substring-match avoidance)
  • test/e2e/pglite-cli-exit.serial.test.ts (4 behavioral subprocess cases — search/query/get exit clean + daemon-survival regression guard) — IRON-RULE regression
  • test/fix-wave-structural.test.ts (5 new describe blocks for v0.41.8.0)
  • test/seed-pglite.serial.test.ts (RENAMED from .test.ts — quarantine for parallel-shard PGLite WASM cold-start contention; 11/11 pass in serial pass)

Test Coverage

44/54 paths fully covered (★★★) = 81%. 50/54 with at least smoke = 93%. Coverage gate: PASS (≥80% target).

last-retrieved.ts (drain helper)         ████████████░░  10/14 ★★★
  L1  empty-set fast-path                ★★★
  L2  snapshot + allSettled drain        ★★★
  L3  timeout race branch                ★★★
  L4  clearTimeout cleanup               ★★  (structural)
  L5  add-then-remove on settle          ★★★
  L6  swallow rejection in .catch        ★★★
  L7  empty-pageIds early return         ★★★
  L8  IIFE happy path                    ★★★
  L9  undefined-column fallthrough       ☆  (graceful degradation)
  L10 generic error stderr-warn          ★★
  L11 isTrackingEnabled cache hit        ★
  L12 disable-via-config path            ☆  (graceful degradation)
  L13 getConfig throw fallback           ☆  (graceful degradation)
  L14 test seams                         ★★★

cli-force-exit.ts                        ██████████████  8/8 ★★★
cli.ts (drain + force-exit wiring)       ██████████████  10/10 ★★★
pglite-engine.ts                         ████████████░░  16/18 ★★★
sync.ts (4 stderr breadcrumbs)           ░░░░░░░░░░░░░░  0/4 ☆ (observational)

Gaps are concentrated in graceful-degradation error branches (best-effort code that swallows errors anyway) and observational stderr breadcrumbs (#1342 diagnostic only). Load-bearing fixes (drain, force-exit, classifier, disconnect lifecycle) are densely covered with unit + serial + subprocess E2E.

Pre-Landing Review

13 commits, 1327 lines. Reviewed via adversarial (Claude subagent) + testing/maintainability specialist + Codex. Surfaced 4 critical findings; all fixed in commit cb349f0. 8 informational findings reviewed; informational nits filed as v0.41+ TODOs where appropriate.

Fixed in cb349f0:

  • C13 [load-bearing]: await engine.disconnect() can itself hang on PGLite (db.close racing OS-level FS). Installed unref'd setTimeout(10s) hard-exit fallback BEFORE entering the try/catch/finally so a hung disconnect cannot defeat the drain timeout. Daemons excluded via shouldForceExitAfterMain.
  • C9 [data freshness gap]: drain only ran in the try success branch. If bumpLastRetrievedAt fired then formatResult threw, process.exit(1) discarded the UPDATE. Now drains in catch path too (best-effort, bounded).
  • C1 [daemon leak]: a timed-out IIFE used to stay in the Set forever. gbrain serve would accumulate references. Now explicitly deletes the snapshot's tracked promises from the Set after a timeout outcome. Pinned by a new unit test that asserts the next drain after a timeout returns immediately with empty pending count.
  • M1 [silent type drift]: cli.ts duplicated the DrainOutcome literal shape. Now imports type DrainOutcome from last-retrieved.ts so future shape changes propagate.
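
The C13 fallback above can be sketched as follows. This is an illustration under stated assumptions, not the code in cli.ts: `installHardExitFallback` is a hypothetical name, and the exit function is injected so the sketch can run without terminating the process. The 10s deadline and the `unref` behavior are taken from the finding.

```typescript
// Hypothetical sketch of an unref'd hard-exit timer.
const DISCONNECT_HARD_DEADLINE_MS = 10000;

function installHardExitFallback(
  exit: (code: number) => void,
  deadlineMs: number = DISCONNECT_HARD_DEADLINE_MS,
): () => void {
  const timer = setTimeout(() => {
    console.error(`disconnect exceeded ${deadlineMs}ms; forcing exit`);
    exit(0);
  }, deadlineMs);
  // unref() keeps this timer from holding the event loop open on a healthy
  // exit. The cast covers runtimes whose setTimeout returns a plain number.
  (timer as unknown as { unref?: () => void }).unref?.();
  // Returns a cancel function for callers that finish in time.
  return () => clearTimeout(timer);
}
```

In a CLI entrypoint this would be installed before the try/catch/finally, with `process.exit` as the exit function, and skipped entirely for daemon commands.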

Deferred to TODOS.md (C6): Concurrent connect()/disconnect() on the same instance can strand (unusual caller pattern; not in production).

Plan Completion

12/13 DONE, 1 CHANGED (T8: split into test/pglite-engine-disconnect.serial.test.ts + test/pglite-init-classifier.test.ts instead of extending existing file — coverage ≥ plan intent). 7 post-merge UNVERIFIABLE actions (close PRs #1259/#1337 with credit, close issues #1247/#1269/#1290 as fixed, leave #1340/#1342 open with notes) — to land after merge.

TODOS

  • File v0.41.8.0 PGLite hang follow-ups section in TODOS.md:

Documentation

  • CHANGELOG.md — consolidated the duplicate "take advantage of v0.41.8.0" sections into a single canonical ## To take advantage of v0.41.8.0 h2 block per the CLAUDE.md template (commit 494e3bd).
  • CLAUDE.md — Key Files entries for src/core/last-retrieved.ts (drain helper + Set tracking + bounded timeout) and src/core/pglite-engine.ts (snapshot pattern + try/finally + classifier).
  • TODOS.md — v0.41.8.0 follow-ups filed with #1342 investigation, awaitPendingSearchCacheWrites retrofit, drain-helper extraction, and C6 concurrency follow-up.
  • llms-full.txt — regenerated.

Verification Results

  • bun run verify — green (privacy + jsonb + progress + wasm + types + all CI checks)
  • bun run typecheck — green
  • Wave-touched test suite: 53/53 pass in 26s
  • test/seed-pglite.serial.test.ts (post-rename): 11/11 pass in isolation
  • Full unit suite under parallel 4-shard contention: hits OOM on memory-constrained dev machine (pre-existing master flake class — scripts/run-unit-parallel.sh:61 already clamped 8→4 in v0.40.10 for the same reason). CI runners have more memory; serial pass (which catches PGLite-heavy work) completes 441/441 clean.
  • Manual smoke against a real PGLite brain: gbrain search "fox" --limit 3 && echo EXIT=$? returns EXIT=0 in <2s; time gbrain query "test" --no-expand --limit 3 completes in <2s; gbrain serve --http --port N stays alive for the duration of test (60s+).

Credit

PR #1259 by jehoon supplied the structural drain pattern; validated by @eloe, @bcallender, @61tH0b. PR #1337 by matt-dean-git supplied the snapshot+early-null disconnect pattern and the force-exit idea this wave narrowed to fire only on the drain-timeout path. Both closed with credit on the landing commit.

Test plan

  • bun run verify clean
  • bun test test/last-retrieved.test.ts test/pglite-engine-disconnect.serial.test.ts test/pglite-init-classifier.test.ts test/cli-should-force-exit.test.ts test/fix-wave-structural.test.ts test/e2e/pglite-cli-exit.serial.test.ts — 53/53 pass
  • bun test test/pglite-engine.test.ts — 100/100 pass (existing #223 source-grep guard survives the catch-block rewrite)
  • bun test test/seed-pglite.serial.test.ts — 11/11 pass (quarantine validated)
  • Manual: time gbrain search "x" --limit 3; echo EXIT=$? on fresh PGLite brain returns EXIT=0 in <2s
  • Manual: gbrain serve --http --port 31313 &; sleep 60; kill -0 %1 confirms daemon survival

🤖 Generated with Claude Code

garrytan and others added 14 commits May 24, 2026 11:50
…sconnect

Closes the structural bug class behind #1247, #1269, #1290: PGLite CLI
search/query/get_page commands printed results then hung at ~95-98% CPU
until SIGKILL. Root cause: bumpLastRetrievedAt's IIFE races
engine.disconnect() — PGLite's WASM runtime keeps Bun's event loop alive
while the dangling UPDATE settles.

Mirrors the existing awaitPendingSearchCacheWrites precedent landed in
v0.36.1.x for #1090. Tracks every IIFE promise in a module-scoped Set,
exposes awaitPendingLastRetrievedWrites(timeoutMs) that resolves once
all settle. Bounded with a 5s default timeout via Promise.race so a
future fire-and-forget that hangs forever can't recreate the bug class
at this layer — instead, the drain stderr-warns with a pending count
and returns timeout outcome so the caller can decide its fallback.

Test coverage: 6 unit cases covering empty drain, single + multi-pending
settle, throw-in-IIFE still settles, permanently-pending hits timeout
within bound, empty pageIds does not track.

This commit ships the helper + tracking + tests with NO consumer.
The cli.ts wiring lands in a follow-up commit (atomic bisect units).

Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uard

Refactor PGLiteEngine.disconnect() with two structural fixes:

(1) Snapshot + early-null pattern: capture db/lock refs and null the
    instance fields BEFORE any await. A concurrent connect() can no
    longer observe `_db` pointing at a handle that's mid-close. This
    is PR #1337's load-bearing contribution that we DID take.

(2) Wrap close + release in try/finally. Without this guard, a thrown
    db.close() would leak the file lock and wedge every next gbrain
    invocation on the stale lock. Codex outside-voice review (eng
    review finding #7) caught this gap when reviewing the snapshot
    refactor.

KEEP the original close-then-release order. PR #1337's diff swapped
this to release-then-close, which we explicitly REJECTED — releasing
the lock before close lets a sibling process try to connect to a
still-closing brain. The new lifecycle test file pins this ordering
so a future maintainer reading PR #1337's diff cannot accidentally
flip it.
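
A minimal sketch of the two structural fixes and the preserved ordering, assuming simplified `DbHandle` and `FileLock` interfaces that stand in for the real PGLite handle and file lock. `EngineSketch` is hypothetical and is not PGLiteEngine; whether the real disconnect rethrows a close error is not stated here, so the rethrow below is an assumption.

```typescript
// Hypothetical stand-ins for the real database handle and file lock.
interface DbHandle {
  close(): Promise<void>;
}
interface FileLock {
  release(): Promise<void>;
}

class EngineSketch {
  private db: DbHandle | null;
  private lock: FileLock | null;

  constructor(db: DbHandle, lock: FileLock) {
    this.db = db;
    this.lock = lock;
  }

  get isConnected(): boolean {
    return this.db !== null;
  }

  async disconnect(): Promise<void> {
    // (1) Snapshot + early-null: capture refs and clear the fields before
    // any await, so nothing observes a handle that is mid-close.
    const db = this.db;
    const lock = this.lock;
    this.db = null;
    this.lock = null;

    // (2) try/finally: close first, then release, and release even if
    // close() throws so the file lock is never leaked.
    try {
      if (db) await db.close();
    } finally {
      if (lock) await lock.release();
    }
  }
}
```

A second `disconnect()` call finds both fields already null and does nothing, which matches the double-disconnect idempotency case listed in the test coverage.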

Test coverage in test/pglite-engine-disconnect.serial.test.ts: 5
cases — close-before-release ordering, early-null observable inside
close, lock-still-releases on close-throw, double-disconnect
idempotency, reconnect-after-disconnect clean state. `.serial`
because each test creates a fresh PGLite engine (WASM cold-start
cost) — running in parallel shards would starve other tests.

Existing test/pglite-engine.test.ts: 100/100 still green.

Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…1340)

Closes the user-facing half of #1340: on macOS 12.7.6 + Bun 1.3.14, the
PGLite connect() catch block hardcoded the macOS 26.3 hint (#223). The
actual root cause for #1340 is Bun's vfs: `/$$bunfs/root` is read-only
on older macOS, so PGLite cannot extract its pglite.data WASM payload.

Adds two exported helpers in pglite-engine.ts:

  classifyPgliteInitError(message): 'bunfs' | 'macos-26-3' | 'unknown'
  buildPgliteInitErrorMessage(verdict, original): string

Connect catch block now routes the hint by verdict. The bunfs hint
names `bun upgrade` + Node fallback. The macOS 26.3 hint keeps the
existing #223 link. Unknown falls through to a generic doctor + #223
fallback.

Per Codex eng-review finding #9, the bunfs regex is tightened to match
either the literal `$$bunfs` marker OR ENOENT+pglite.data
co-occurrence — NOT generic `pglite.data` substring (would fire on
unrelated errors). Negative test pinned.
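
The tightened bunfs rule can be sketched as a pure function. This is a simplified illustration, not classifyPgliteInitError itself: it covers only the `bunfs` versus `unknown` decision, because the matching rule for the `macos-26-3` verdict is not described in this commit. The function name is hypothetical.

```typescript
// Hypothetical sketch: only the bunfs rule is modeled here.
type SketchVerdict = "bunfs" | "unknown";

function classifyBunfsSketch(message: string): SketchVerdict {
  // Rule 1: the literal $$bunfs marker anywhere in the message.
  const hasBunfsMarker = message.includes("$$bunfs");
  // Rule 2: ENOENT and pglite.data must co-occur. A bare pglite.data
  // substring is deliberately not enough.
  const hasEnoentOnPayload = /ENOENT/.test(message) && /pglite\.data/.test(message);
  return hasBunfsMarker || hasEnoentOnPayload ? "bunfs" : "unknown";
}
```

A message that mentions pglite.data without ENOENT (for example a hypothetical "failed to parse pglite.data header") falls through to `unknown`, which is the negative case the commit says is pinned.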

Root fix is upstream Bun; this PR just stops misclassifying the
failure class so support traffic doesn't conflate two unrelated bugs.

Test coverage: 12 pure-function unit cases including the #1340
reporter's exact error string round-trip, the negative case Codex
caught, and all three verdicts × all three message contents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#1247, #1269, #1290)

Wires the v0.40.10.0 drain helper into cli.ts and adds the IRON-RULE
behavioral regression test for the search-hang class. The drain is
called unconditionally for every op (not per-op-name gated — that was
the original PR #1259 mistake that left search and get_page exposed).

The narrow force-exit synthesis (decision D7 from the eng review,
informed by Codex outside-voice findings #1+#2+#8): when the drain
returns outcome:'timeout', AFTER engine.disconnect() resolves AND
the command is NOT a daemon, fire process.exit(0). The drain helper
already stderr-warned with the pending count, so the diagnostic
signal is preserved. Without this guard, a hung underlying promise
could still keep Bun's event loop alive past disconnect.

CRITICALLY narrower than PR #1337's blanket force-exit: the timeout
path is the only trigger. In the common case (drain settles cleanly
under 5s), no force-exit fires and the behavioral subprocess test
still catches future regressions. The shouldForceExitAfterMain guard
excludes 'serve' so the stdio + HTTP daemons stay alive past main().
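
The ordering and the narrow trigger described above can be sketched with injected dependencies. `dispatchOp` and the `OpDeps` shape are hypothetical and exist only for illustration; the `"drained"` outcome label is also an assumption. What comes from this commit is the sequence (drain, then disconnect, then a conditional exit) and the two conditions on the exit.

```typescript
// Hypothetical sketch of the op-dispatch exit path.
type DrainOutcome = { outcome: "drained" | "timeout"; pending: number };

interface OpDeps {
  runOp: () => Promise<void>;
  drain: () => Promise<DrainOutcome>;
  disconnect: () => Promise<void>;
  isDaemonCommand: boolean;
  exit: (code: number) => void;
}

async function dispatchOp(deps: OpDeps): Promise<void> {
  try {
    await deps.runOp();
  } finally {
    // Drain runs for every op, with no per-op-name gating.
    const drainResult = await deps.drain();
    // Disconnect only after the drain has settled or timed out.
    await deps.disconnect();
    // Narrow force-exit: timeout outcome AND a non-daemon command.
    if (drainResult.outcome === "timeout" && !deps.isDaemonCommand) {
      deps.exit(0);
    }
  }
}
```

In the common case the drain settles, no exit is forced, and the process has to end on its own, which is what keeps the behavioral subprocess test meaningful.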

e2e/pglite-cli-exit.serial.test.ts (NEW, IRON RULE):
  - gbrain search "foxtrot" → exits 0 within 15s
  - gbrain get alpha → exits 0 within 15s with foxtrot in stdout
  - gbrain query "foxtrot" --no-expand → exits within 15s (no-API-key
    graceful)
  - gbrain serve --http → stays alive 3+ seconds (daemon-survival
    regression guard)

fix-wave-structural.test.ts:
  - import assertion for awaitPendingLastRetrievedWrites
  - last-retrieved.ts exports + Set tracking + Promise.race + timeout
  - BEHAVIORAL positioning assertion: drain `await` appears textually
    BEFORE engine.disconnect `await` in the op-dispatch local-engine
    path. Survives variable-rename refactors; catches any new
    disconnect path that bypasses the drain.
  - shouldForceExitAfterMain excludes 'serve' AND the gate is
    conditioned on drainResult.outcome==='timeout'

Per D8 (Codex finding #5), explicitly do NOT add a drift-guard
counting bumpLastRetrievedAt callers — would block harmless
refactors and miss aliases.

Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com>
Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The #1342 reporter saw ZERO stderr output before their PGLite sync
hang, which made the bug impossible to triage from a community report
alone. Mirrors the pre-existing `[gbrain phase] sync.git_pull start/done`
pattern at the major pre-pull phase boundaries so the next #1342-shaped
report names WHICH phase spun.

Four new breadcrumbs at:
  - sync.resolve_repo (top of performSyncInner)
  - sync.load_active_pack (before the v0.39 T1.5 pack load)
  - sync.validate_repo_state (only when opts.sourceId is set —
    the re-clone branch)
  - sync.detect_head (before the isDetachedHead probe)

No behavior change — pure stderr instrumentation. Doesn't fix #1342
(which still needs investigation per the TODOS entry filed in this
wave), but converts "hung with no output" into actionable diagnostic
data the next time the bug shape is reported.

Per D9 in the eng review + Codex outside-voice finding #14.
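
A minimal sketch of a phase-breadcrumb helper in the `[gbrain phase] <name> start/done` format quoted above. The helper names `phaseBreadcrumb` and `withPhase` are hypothetical, and it is an assumption that the four new breadcrumbs emit both a start and a done line the way the existing sync.git_pull one does.

```typescript
// Hypothetical sketch; the real sync.ts instrumentation may differ.
type Writer = (line: string) => void;

function phaseBreadcrumb(
  phase: string,
  state: "start" | "done",
  write: Writer = (line) => console.error(line),
): void {
  write(`[gbrain phase] ${phase} ${state}`);
}

// Emits "start" before the work and "done" only after it completes, so a
// hang leaves a start line with no matching done line on stderr.
async function withPhase<T>(phase: string, fn: () => Promise<T>, write?: Writer): Promise<T> {
  phaseBreadcrumb(phase, "start", write);
  const result = await fn();
  phaseBreadcrumb(phase, "done", write);
  return result;
}
```

With this shape, the last unmatched start line in a bug report names the phase that spun.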

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Key Files entries updated:

- src/core/pglite-engine.ts: documents the v0.40.10.0 disconnect
  refactor (snapshot+early-null + try/finally lock-leak guard, KEEPS
  close-then-release order), and the new classifyPgliteInitError /
  buildPgliteInitErrorMessage helpers for #1340 hint routing. Pins
  PR #1337's accepted-but-narrowed contribution and the rejected
  release-then-close ordering swap.

- src/core/last-retrieved.ts (within the brainstorm entry): documents
  the new awaitPendingLastRetrievedWrites drain, the Set tracking
  pattern, the 5s bounded timeout, the cli.ts narrow timeout-only
  force-exit synthesis with the serve-daemon guard, and the three
  community-validated reports (#1247/#1269/#1290) the fix closes.
  Credits PR #1259 (drain pattern) and PR #1337 (snapshot pattern +
  force-exit guard idea).

Regenerated llms.txt + llms-full.txt — build-llms.test.ts gates the
drift, all 7 cases green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three deferred items from the v0.40.10.0 fix wave:

1. #1342 sync-hang investigation. Single-reporter, JS-tight-loop
   shape, needs reproducer before any fix. Documents the ruled-out
   hypotheses (lock-refresh heartbeat, v91 trigger, while-true loops)
   and three concrete diagnostic next steps. The v0.40.10.0 sync
   phase breadcrumbs make the next report actionable.

2. awaitPendingSearchCacheWrites timeout-symmetry retrofit. The #1090
   drain shipped without a timeout; the v0.40.10.0 #1247 drain ships
   with one. Apply the same Promise.race + stderr warn pattern for
   symmetry.

3. Drain-helper extraction. Per D4 in the eng review: two surfaces is
   the threshold for noticing, three for extracting. Pair with the
   symmetry retrofit above as one focused refactor when a third
   fire-and-forget surface appears.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…1342 breadcrumbs

Closes #1247, #1269, #1290 (PGLite CLI search/query/get hang at ~95-98%
CPU after printing results — three community-validated reports). Also
fixes #1340 (WASM init misroutes to macOS 26.3 hint when real cause is
Bun vfs read-only mount) and adds diagnostic phase breadcrumbs for the
single-reporter #1342 sync-hang investigation.

Core fix: track every fire-and-forget bumpLastRetrievedAt IIFE in a
module-scoped Set; cli.ts awaits the drain before engine.disconnect()
in the op-dispatch finally block; narrow process.exit(0) fires ONLY
when the drain times out AND the command isn't a daemon. Snapshot+
early-null disconnect pattern + try/finally lock-leak guard close the
partial-state race PR #1337 originally surfaced.

Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com>
Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ases

Gap-audit follow-up: cli.ts is a script entrypoint (top-level main()
side effect), so importing it from a test fires the help output as a
side effect. Move shouldForceExitAfterMain into src/core/cli-force-exit.ts
so it can be unit-tested in isolation without the cli.ts script tail
running.

Adds test/cli-should-force-exit.test.ts (9 cases): bare serve, serve
with flags after, global flags BEFORE the command (the load-bearing
case for `gbrain --quiet serve`), op commands return true, non-daemon
CLI commands return true, empty argv defaults to true, flag-only argv,
default-arg fallback to process.argv.slice(2), substring-match
avoidance (`serves` is NOT `serve` — strict equality via Set, not
startsWith/includes).

The daemon command set is now an explicit ReadonlySet — future
daemons (a hypothetical `gbrain watch` or `gbrain daemon`) just add
their name to DAEMON_COMMANDS rather than chaining ||.
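
The guard described above can be sketched as a pure function over argv. This is an illustration, not the code in src/core/cli-force-exit.ts: it assumes the command is the first argument that does not begin with `-`, and it takes argv explicitly rather than defaulting to process.argv.slice(2) as the real function is said to do.

```typescript
// Hypothetical sketch of the daemon guard.
const DAEMON_COMMANDS: ReadonlySet<string> = new Set(["serve"]);

function shouldForceExitAfterMain(argv: readonly string[]): boolean {
  // Skip global flags such as --quiet that precede the command.
  const command = argv.find((arg) => arg.charAt(0) !== "-");
  // No command at all (empty or flag-only argv) defaults to true.
  if (command === undefined) return true;
  // Strict Set membership: "serves" is not "serve".
  return !DAEMON_COMMANDS.has(command);
}
```

One limitation of this simplified version: a global flag that takes a separate value (a hypothetical `--port 3000 serve`) would be misread, since `3000` would be treated as the command.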

Updates fix-wave-structural.test.ts to look for the import + the
new DAEMON_COMMANDS shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…0.41.0.0+ landed)

origin/master moved from v0.40.8.1 → v0.41.0.0 while this wave was in
flight (PR #1367 minions cathedral). v0.41.1-v0.41.5 are claimed by
other in-flight branches, so v0.41.6.0 is the next available slot.

Bulk-renamed v0.40.10.0 → v0.41.6.0 across:
- VERSION + package.json (trio audit clean: 0.41.6.0 / 0.41.6.0 / 0.41.6.0)
- CHANGELOG.md (header + 3 prose references)
- CLAUDE.md (Key Files annotations)
- TODOS.md (follow-up entry header)
- src/cli.ts + src/core/cli-force-exit.ts + src/core/last-retrieved.ts
  + src/core/pglite-engine.ts + src/commands/sync.ts (inline comments)
- test/* (describe blocks + test file headers)
- llms-full.txt (regenerated via `bun run build:llms`)

bun.lock unchanged (version-only bump, no dep churn) per Codex #12.

Verify: 52/52 wave tests pass after rename, typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tart flake)

The full-suite run during the v0.41.6.0 fix wave ship hit a 30s timeout
in test/seed-pglite.test.ts under heavy 4-shard parallel contention
(4972/4973 passed before SIGKILL). The test passes 11/11 in isolation.

Root cause: each test instantiates a fresh PGLiteEngine (5 instances
across the file, one per test) because each case writes to a different
mkdtemp-ed dbPath. Under parallel shard load, multiple shards each
cold-starting PGLite WASM simultaneously stretches the per-instance
init from ~5s to 30s+. The shared-engine pattern (canonical PGLite
block in CLAUDE.md R3+R4) doesn't apply here — different dbPaths
require different engines.

Fix per CLAUDE.md test-isolation quarantine rules: rename to
`.serial.test.ts` so the file runs in the post-parallel serial pass
with full WASM init capacity. Same pattern as
test/pglite-engine-disconnect.serial.test.ts (added in this wave) and
test/brain-registry.serial.test.ts (pre-existing).

Removes test/seed-pglite.test.ts from check-test-isolation.allowlist
since the .serial.test.ts rename auto-exempts it from the R3+R4 lint
(scan skips *.serial.test.ts). 641 non-serial unit files scanned,
lint clean.

Verify:
- bun test test/seed-pglite.serial.test.ts → 11/11 pass in 4.19s
- scripts/check-test-isolation.sh → OK
- bun run verify → all gates pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…atch drain, M1 type drift)

Adversarial review + maintainability specialist surfaced four real
issues in the v0.41.8.0 wave. All four fixed in this commit; one
deferred to TODOS.md as a v0.41+ follow-up (unusual caller pattern).

**C13 [load-bearing, defense-in-depth for the wave's stated goal]:**
`await engine.disconnect()` inside the op-dispatch finally can ITSELF
hang on PGLite (db.close() racing OS-level FS state). When that
happens, the entire wave's force-exit guard never runs — we recreate
the original hang at a new layer. Fix: install an unref'd setTimeout
hard-exit fallback BEFORE entering the try/catch/finally. The timer
fires after DISCONNECT_HARD_DEADLINE_MS=10s with a stderr warn and
process.exit(0). unref ensures it doesn't keep the loop alive on a
healthy exit. Daemons (`serve`) are excluded by reusing the
shouldForceExitAfterMain guard.

**C9 [data freshness gap, narrow but real]:**
The drain ran ONLY in the success branch of try. If
`bumpLastRetrievedAt` fired (handler succeeded) but
`JSON.parse(JSON.stringify(...))` or `formatResult` then threw,
process.exit(1) killed the process and the in-flight UPDATE was
discarded. Fix: drain in the catch path too before process.exit(1)
(best-effort, bounded by the drain's own 5s timeout).

**C1 [daemon leak]:**
A timed-out IIFE used to stay in the pending-writes Set forever
because its `.finally` never fires. Long-lived `gbrain serve` would
accumulate references without bound across repeated timeouts. Fix:
explicitly `delete` the snapshot's tracked promises from the Set
after a timeout outcome. The IIFEs keep running (orphaned), but the
Set no longer leaks references. Pinned by a new unit test that
asserts the second drain after a timeout returns immediately with
empty pending count.

**M1 [silent type drift]:**
`cli.ts` duplicated the `{outcome, pending}` literal shape instead of
importing the `DrainOutcome` type that `last-retrieved.ts` exports
exactly for this purpose. Two-line fix: add `type DrainOutcome` to
the import and use it for `let drainResult`. Future changes to the
return shape now propagate through TypeScript.

**Deferred to TODOS.md (C6 — unusual caller pattern):**
Concurrent connect/disconnect on the same `PGLiteEngine` instance can
strand: disconnect snapshots+nulls the lock while connect is still
in-flight, leaving the resolved engine with no file lock held. Fix
requires an instance-level mutex; not worth the complexity for a
caller pattern that doesn't appear in production (single instance per
process, sequential lifecycle).

Also broadened `test/fix-wave-structural.test.ts` regex to accept
additional type-imports from `last-retrieved.ts` (e.g. the new
`type DrainOutcome` import that M1 added).

Test coverage: 53/53 wave tests pass (added C1-followup case to
last-retrieved.test.ts). The C1 fix is also pinned by tightening the
existing permanent-pending test's post-timeout assertion to expect
empty pending count rather than the prior (stale) "stays in set" note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Consolidate the duplicate 'take advantage of v0.41.8.0' sections in
the CHANGELOG entry into a single canonical block per the CLAUDE.md
template. The wave originally landed with both '### How to take
advantage' (line 13) and '### To take advantage' (line 57) as h3
headings. CLAUDE.md mandates one '## To take advantage of v[version]'
h2 block per release entry, with verify steps + an issue-filing
fallback for users hitting upgrade failures.

Promoted the second block to h2, added the issue-filing step, and
removed the redundant first block (the upgrade command is already
covered in the verify steps). Itemized changes section was unchanged.

llms.txt + llms-full.txt regenerated; structurally identical so no
content changes shipped.
garrytan and others added 5 commits May 25, 2026 12:58
Master added v0.41.6.0 (CI test speedup — matrix 4→6 + weight-aware
sharding + auto SHA cache + parallel verify, 23min → ~9min).

After the merge, llms-full.txt grew to 603041 bytes — over the
600KB FULL_SIZE_BUDGET, which broke the build-llms.test.ts size
budget assertion on CI shard 1. Per the canonical fix recipe
(`scripts/llms-config.ts:241`: "ship with includeInFull=false
exclusions"), excluded `docs/guides/minions-deployment.md` from
the single-fetch bundle. It's a 13KB deployment runbook that operators
read once; agents rarely need it in context. Web index entry stays
discoverable. Result: 589907 bytes, 10KB headroom for future growth.

Verify gate clean (21/21 parallel checks); wave test suite green
(53/53 across 6 files); 7/7 build-llms tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…536 (CI shard 1)

CI shard 1 failed on this branch with: "expected 1280 dimensions, not 1536"
from pgvector's CheckExpectedDim. Root cause: master's v0.36.0 changed
DEFAULT_EMBEDDING_DIMENSIONS from OpenAI's 1536d to ZeroEntropy's 1280d
(src/core/ai/defaults.ts:21). The test's basisEmbedding helper hardcoded
dim=1536, so beforeAll's upsertChunks failed when the schema column was
created at 1280d.

Latent on master: the weight-aware LPT bin-packing in
scripts/sharding.ts assigns files to shards deterministically based on
the COMPLETE file set. My branch adds 5 new test files, which shifted
find-experts-op.test.ts into shard 1. Master's shard 1 doesn't run this
file (it lands in a different shard there), so the bug never surfaced
in master's CI.
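
To illustrate why adding files can move an unrelated test into a different shard, here is a generic longest-processing-time (LPT) bin-packing sketch. It is not the code in scripts/sharding.ts; the `TestFile` shape, the weights, and the tie-breaking rules are assumptions.

```typescript
// Hypothetical sketch of weight-aware LPT sharding.
interface TestFile {
  path: string;
  weight: number;
}

function assignShards(files: readonly TestFile[], shardCount: number): string[][] {
  const shards: { load: number; paths: string[] }[] = [];
  for (let i = 0; i < shardCount; i++) shards.push({ load: 0, paths: [] });

  // Heaviest first; ties broken by path so the result is deterministic.
  const sorted = files.slice().sort((a, b) => {
    if (b.weight !== a.weight) return b.weight - a.weight;
    return a.path < b.path ? -1 : a.path > b.path ? 1 : 0;
  });

  for (const file of sorted) {
    // Place each file on the currently least-loaded shard
    // (lowest index wins a tie).
    let target = shards[0];
    for (const shard of shards) {
      if (shard.load < target.load) target = shard;
    }
    target.load += file.weight;
    target.paths.push(file.path);
  }
  return shards.map((shard) => shard.paths);
}
```

Because every placement depends on the loads left by all heavier files, adding one new file to the set can reshuffle where existing files land, even though each run on a fixed file set is deterministic.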

Fix: query the actual column dim via
SELECT atttypmod FROM pg_attribute after initSchema, then seed the
embedding at that width. This handles both paths (no-env CI → 1280;
env-configured local → 1536) without hardcoding either default.
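
A sketch of the fix, under assumptions: the query function is injected, the table and column names passed by the caller are hypothetical, and the one-hot shape of the seeded vector is a guess at what a "basis" embedding means here. It also assumes that for a pgvector column `atttypmod` holds the declared dimension.

```typescript
// Hypothetical sketch; not the helper in test/find-experts-op.test.ts.
type QueryFn = (sql: string, params: unknown[]) => Promise<Array<{ atttypmod: number }>>;

// Read the declared width of a vector column from the catalog.
async function getVectorColumnDim(query: QueryFn, table: string, column: string): Promise<number> {
  const rows = await query(
    "SELECT atttypmod FROM pg_attribute WHERE attrelid = $1::regclass AND attname = $2",
    [table, column],
  );
  if (rows.length === 0 || rows[0].atttypmod <= 0) {
    throw new Error(`no declared dimension found for ${table}.${column}`);
  }
  return rows[0].atttypmod;
}

// Build a one-hot vector at whatever width the schema actually uses.
function basisEmbedding(index: number, dim: number): Float32Array {
  const vector = new Float32Array(dim);
  vector[index % dim] = 1;
  return vector;
}
```

Seeding from the queried width means the test no longer encodes either default, so it follows the schema whether the column was created at 1280 or 1536 dimensions.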

Verify:
- bun test test/find-experts-op.test.ts → 11/11 pass with provider env
- env -i bun test test/find-experts-op.test.ts → 11/11 pass without
- bun run verify → all 21 parallel checks clean
- bun run typecheck → clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master added v0.41.7.0 (compact list-format resolver + 300-skill
scaling tutorial). VERSION/package.json/CHANGELOG conflicts resolved
keeping v0.41.8.0 (this PR's claimed slot) + both CHANGELOG entries.

llms-config.ts auto-merged cleanly — master's UPGRADING_DOWNSTREAM_AGENTS
exclusion + this wave's minions-deployment exclusion both landed.
Bundle now 578758 bytes (was 589907, ample headroom under 600KB).

Verify: 21/21 parallel checks pass; typecheck clean; 62/62 wave tests
across 6 files green (+1 from new scaling-skills test or similar
pickup via master).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…CI verify)

CI verify failed on PR #1405 with check:test-isolation flagging
test/scripts/check-test-isolation.test.ts even though that file is
on line 22 of the allowlist (and has been since v0.26.7 as a permanent
exemption — its body contains process.env mutation fixtures that the
lint legitimately matches).

Could not reproduce locally on macOS bash 3.2 + BSD grep across any
locale (C, C.UTF-8, POSIX). Suspect a subtle interaction between the
prior `echo "$ALLOWLIST" | grep -qxF "$f"` form and one of:
Ubuntu 24.04's bash 5 set-e/pipefail semantics, GNU grep edge case on
the first-line entry, or `bun run` + GNU timeout subshell interaction.
Diagnostic value of chasing further is low — the fix is to drop the
grep+pipe form entirely.

Switch is_allowlisted() to pure-bash `case $'\n'"$ALLOWLIST"$'\n' in
*$'\n'"$f"$'\n'*) return 0 ;; esac` whole-line matching:
- Locale-free (no character-class interaction)
- Pipe-free (no pipefail / SIGPIPE / buffering)
- Subshell-free (no env or exit-code propagation gotchas)
- set-e-quirk-free (no left-side compound failure)
- ~100x faster (no fork+exec per call across 689 files)

Verified locally: lint OK (689 files), case-match returns true for the
allowlisted file and false for a non-allowlisted file. bun run verify
clean (21/21 parallel checks pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit 27b0e14 into master May 25, 2026
15 checks passed
garrytan added a commit that referenced this pull request May 25, 2026
Master advanced past v0.41.7.0:
- v0.41.8.0: PGLite search/query/get exit cleanly + #1340 hint + #1342 breadcrumbs (#1405)

The headline conflict was scripts/check-test-isolation.sh: master shipped
the SAME fix I had pushed (different code, same bug), and master's is
structurally better — pure-bash `case` whole-line match instead of the
file-direct grep I used. Both eliminate the Ubuntu 24.04 + bash 5 +
GNU grep flake. Master's wins because:
  - no pipe, no subshell, no grep
  - locale-free, set-e-quirk-free
  - ~100x faster per call

Resolved by taking master's `is_allowlisted` body (the pure-bash case)
and restoring the cached `ALLOWLIST=` setup it depends on. My v0.41.9.0
file-direct grep approach is superseded.

VERSION + package.json + CHANGELOG conflicts resolved (v0.41.9.0 still
holds; CHANGELOG interleaves master's v0.41.8.0 entry below ours).

llms-full.txt regenerated: 580,462 bytes (~120KB headroom under the
v0.41.9.0 700KB budget, after master's expanded includeInFull exclusions
landed in v0.41.7.0).

3-line audit clean. Verify: typecheck clean, check-test-isolation OK
(694 files), build-llms 7/7 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request May 25, 2026
Brings in #1405 (v0.41.8.0 fix: PGLite search/query/get exit cleanly +
#1340 hint + #1342 breadcrumbs).

Standard trio conflicts resolved per CLAUDE.md procedure:
- VERSION:      ours wins (0.41.11.0).
- package.json: ours wins (version line; rest of file auto-merged clean).
- CHANGELOG.md: both entries kept; ours stays topmost.

No code-file conflicts this time — CLAUDE.md, llms-full.txt, src/cli.ts
auto-merged cleanly.

Post-merge verification:
- bun install (no changes)
- typecheck clean
- bun run verify PASS (21 checks, 18s parallel)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mgunnin added a commit to mgunnin/gbrain that referenced this pull request May 28, 2026
* upstream/master:
  v0.41.10.1 fix-wave: dream.* config + batch retry + extract_atoms idempotency + ze-switch env-gate (garrytan#1445)
  v0.41.10.0 feat: orphan reduction via --by-mention + UTF-16 surrogate-pair fix (garrytan#1442)
  v0.41.9.0 — UX/reliability fix wave (5 defects from production report) (garrytan#1440)
  v0.41.8.0 fix(pglite): search/query/get exit cleanly + garrytan#1340 hint + garrytan#1342 breadcrumbs (garrytan#1405)
  v0.41.7.0 feat: compact list-format resolver + 300-skill scaling tutorial (garrytan#1407)
  v0.41.6.0 feat(ci): CI test speedup — 23min → ~9min via matrix 4→6 + weight-aware sharding + auto SHA cache + parallel verify (garrytan#1444)
  v0.41.5.0 fix-wave: warm-narwhal — 6 community PRs + E2E reliability (garrytan#1374)

# Conflicts:
#	src/core/ai/recipes/openai.ts
garrytan-agents pushed a commit to garrytan-agents/gbrain that referenced this pull request Jun 13, 2026
…hint + garrytan#1342 breadcrumbs (garrytan#1405)

* fix(pglite): drain fire-and-forget last_retrieved_at writes before disconnect

Closes the structural bug class behind garrytan#1247, garrytan#1269, garrytan#1290: PGLite CLI
search/query/get_page commands printed results then hung at ~95-98% CPU
until SIGKILL. Root cause: bumpLastRetrievedAt's IIFE races
engine.disconnect() — PGLite's WASM runtime keeps Bun's event loop alive
while the dangling UPDATE settles.

Mirrors the existing awaitPendingSearchCacheWrites precedent landed in
v0.36.1.x for garrytan#1090. Tracks every IIFE promise in a module-scoped Set,
exposes awaitPendingLastRetrievedWrites(timeoutMs) that resolves once
all settle. Bounded with a 5s default timeout via Promise.race so a
future fire-and-forget that hangs forever can't recreate the bug class
at this layer — instead, the drain stderr-warns with a pending count
and returns timeout outcome so the caller can decide its fallback.
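
A minimal sketch of the tracked-Set drain described above, not the shipped code. Only `awaitPendingLastRetrievedWrites`, the 5s default, and the `timeout` outcome come from this commit message; `trackWrite`, `pendingWrites`, the `"drained"` outcome value, and the warning text are assumptions made for illustration.

```typescript
// Assumed shape: the commit only confirms an outcome of 'timeout' plus a pending count.
type DrainOutcome = { outcome: "drained" | "timeout"; pending: number };

// Module-scoped registry of in-flight fire-and-forget writes.
const pendingWrites = new Set<Promise<void>>();

// Hypothetical helper: register a write so a later drain can wait for it.
function trackWrite(work: () => Promise<void>): void {
  const p: Promise<void> = work()
    .catch(() => {}) // a failed write must still count as settled
    .finally(() => {
      pendingWrites.delete(p);
    });
  pendingWrites.add(p);
}

async function awaitPendingLastRetrievedWrites(timeoutMs = 5000): Promise<DrainOutcome> {
  const snapshot = Array.from(pendingWrites);
  if (snapshot.length === 0) return { outcome: "drained", pending: 0 };

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  const settled = Promise.all(snapshot).then(() => "drained" as const);

  // Whichever finishes first decides the outcome; the drain itself never rejects.
  const outcome = await Promise.race([settled, timedOut]);
  clearTimeout(timer);

  if (outcome === "timeout") {
    console.error(`[drain] timed out with ${pendingWrites.size} write(s) still pending`);
    return { outcome, pending: pendingWrites.size };
  }
  return { outcome, pending: 0 };
}
```

The caller awaits the drain before disconnecting and reads `outcome` to pick its fallback.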

Test coverage: 6 unit cases covering empty drain, single + multi-pending
settle, throw-in-IIFE still settles, permanently-pending hits timeout
within bound, empty pageIds does not track.

This commit ships the helper + tracking + tests with NO consumer.
The cli.ts wiring lands in a follow-up commit (atomic bisect units).

Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pglite): snapshot+early-null disconnect + try/finally lock-leak guard

Refactor PGLiteEngine.disconnect() with two structural fixes:

(1) Snapshot + early-null pattern: capture db/lock refs and null the
    instance fields BEFORE any await. A concurrent connect() can no
    longer observe `_db` pointing at a handle that's mid-close. This
    is PR garrytan#1337's load-bearing contribution that we DID take.

(2) Wrap close + release in try/finally. Without this guard, a thrown
    db.close() would leak the file lock and wedge every next gbrain
    invocation on the stale lock. Codex outside-voice review (eng
    review finding #7) caught this gap when reviewing the snapshot
    refactor.

KEEP the original close-then-release order. PR garrytan#1337's diff swapped
this to release-then-close, which we explicitly REJECTED — releasing
the lock before close lets a sibling process try to connect to a
still-closing brain. The new lifecycle test file pins this ordering
so a future maintainer reading PR garrytan#1337's diff cannot accidentally
flip it.
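
The two fixes and the pinned ordering can be sketched roughly as follows. `EngineSketch`, `Closable`, and `FileLock` are hypothetical stand-ins; the real `PGLiteEngine` fields and lock API are not shown in this commit message.

```typescript
// Hypothetical minimal interfaces for the database handle and the file lock.
interface Closable {
  close(): Promise<void>;
}
interface FileLock {
  release(): Promise<void>;
}

class EngineSketch {
  private db: Closable | null;
  private lock: FileLock | null;

  constructor(db: Closable, lock: FileLock) {
    this.db = db;
    this.lock = lock;
  }

  get connected(): boolean {
    return this.db !== null;
  }

  async disconnect(): Promise<void> {
    // (1) Snapshot the refs and null the fields BEFORE the first await,
    // so nothing can observe a handle that is mid-close.
    const db = this.db;
    const lock = this.lock;
    this.db = null;
    this.lock = null;

    // (2) Close first, then release. The finally block releases the lock
    // even when close() throws, so a failed close cannot leave a stale lock.
    try {
      if (db) await db.close();
    } finally {
      if (lock) await lock.release();
    }
  }
}
```

A second `disconnect()` call sees both fields already null and does nothing, which is one way to get the double-disconnect idempotency the tests describe.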

Test coverage in test/pglite-engine-disconnect.serial.test.ts: 5
cases — close-before-release ordering, early-null observable inside
close, lock-still-releases on close-throw, double-disconnect
idempotency, reconnect-after-disconnect clean state. `.serial`
because each test creates a fresh PGLite engine (WASM cold-start
cost) — running in parallel shards would starve other tests.

Existing test/pglite-engine.test.ts: 100/100 still green.

Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pglite): classify WASM init errors so garrytan#1340 gets the right hint (garrytan#1340)

Closes the user-facing half of garrytan#1340: on macOS 12.7.6 + Bun 1.3.14, the
PGLite connect() catch block hardcoded the macOS 26.3 hint (garrytan#223). The
actual root cause for garrytan#1340 is Bun's vfs: `/$$bunfs/root` is read-only
on older macOS, so PGLite cannot extract its pglite.data WASM payload.

Adds two exported helpers in pglite-engine.ts:

  classifyPgliteInitError(message): 'bunfs' | 'macos-26-3' | 'unknown'
  buildPgliteInitErrorMessage(verdict, original): string

Connect catch block now routes the hint by verdict. The bunfs hint
names `bun upgrade` + Node fallback. The macOS 26.3 hint keeps the
existing garrytan#223 link. Unknown falls through to a generic doctor + garrytan#223
fallback.

Per Codex eng-review finding #9, the bunfs regex is tightened to match
either the literal `$$bunfs` marker OR ENOENT+pglite.data
co-occurrence — NOT generic `pglite.data` substring (would fire on
unrelated errors). Negative test pinned.
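
As an illustration of the tightened match, here is a sketch of just the bunfs predicate. `looksLikeBunfsInitError` is a hypothetical name, and the shipped `classifyPgliteInitError` may express the same rule as a single regex; how it detects the `macos-26-3` verdict is not described here and is left out.

```typescript
// Hypothetical predicate covering only the 'bunfs' verdict.
function looksLikeBunfsInitError(message: string): boolean {
  // Rule 1: the literal Bun virtual-filesystem marker.
  const hasBunfsMarker = message.includes("$$bunfs");
  // Rule 2: ENOENT and pglite.data must BOTH appear; pglite.data alone is not enough.
  const enoentOnPayload = /ENOENT/.test(message) && /pglite\.data/.test(message);
  return hasBunfsMarker || enoentOnPayload;
}
```

Requiring the co-occurrence is what keeps an unrelated error that merely mentions `pglite.data` from being routed to the bunfs hint.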

Root fix is upstream Bun; this PR just stops misclassifying the
failure class so support traffic doesn't conflate two unrelated bugs.

Test coverage: 12 pure-function unit cases including the garrytan#1340
reporter's exact error string round-trip, the negative case Codex
caught, and all three verdicts × all three message contents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cli): await last-retrieved drain + narrow timeout-only force-exit (garrytan#1247, garrytan#1269, garrytan#1290)

Wires the v0.40.10.0 drain helper into cli.ts and adds the IRON-RULE
behavioral regression test for the search-hang class. The drain is
called unconditionally for every op (not per-op-name gated — that was
the original PR garrytan#1259 mistake that left search and get_page exposed).

The narrow force-exit synthesis (decision D7 from the eng review,
informed by Codex outside-voice findings #1+#2+#8): when the drain
returns outcome:'timeout', AFTER engine.disconnect() resolves AND
the command is NOT a daemon, fire process.exit(0). The drain helper
already stderr-warned with the pending count, so the diagnostic
signal is preserved. Without this guard, a hung underlying promise
could still keep Bun's event loop alive past disconnect.

CRITICALLY narrower than PR garrytan#1337's blanket force-exit: the timeout
path is the only trigger. In the common case (drain settles cleanly
under 5s), no force-exit fires and the behavioral subprocess test
still catches future regressions. The shouldForceExitAfterMain guard
excludes 'serve' so the stdio + HTTP daemons stay alive past main().

e2e/pglite-cli-exit.serial.test.ts (NEW, IRON RULE):
  - gbrain search "foxtrot" → exits 0 within 15s
  - gbrain get alpha → exits 0 within 15s with foxtrot in stdout
  - gbrain query "foxtrot" --no-expand → exits within 15s (no-API-key
    graceful)
  - gbrain serve --http → stays alive 3+ seconds (daemon-survival
    regression guard)

fix-wave-structural.test.ts:
  - import assertion for awaitPendingLastRetrievedWrites
  - last-retrieved.ts exports + Set tracking + Promise.race + timeout
  - BEHAVIORAL positioning assertion: drain `await` appears textually
    BEFORE engine.disconnect `await` in the op-dispatch local-engine
    path. Survives variable-rename refactors; catches any new
    disconnect path that bypasses the drain.
  - shouldForceExitAfterMain excludes 'serve' AND the gate is
    conditioned on drainResult.outcome==='timeout'

Per D8 (Codex finding #5), explicitly do NOT add a drift-guard
counting bumpLastRetrievedAt callers — would block harmless
refactors and miss aliases.

Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com>
Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sync): add phase breadcrumbs to performSyncInner for garrytan#1342 triage

The garrytan#1342 reporter saw ZERO stderr output before their PGLite sync
hang, which made the bug impossible to triage from a community report
alone. Mirrors the pre-existing `[gbrain phase] sync.git_pull start/done`
pattern at the major pre-pull phase boundaries so the next garrytan#1342-shaped
report names WHICH phase spun.

Four new breadcrumbs at:
  - sync.resolve_repo (top of performSyncInner)
  - sync.load_active_pack (before the v0.39 T1.5 pack load)
  - sync.validate_repo_state (only when opts.sourceId is set —
    the re-clone branch)
  - sync.detect_head (before the isDetachedHead probe)

No behavior change — pure stderr instrumentation. Doesn't fix garrytan#1342
(which still needs investigation per the TODOS entry filed in this
wave), but converts "hung with no output" into actionable diagnostic
data the next time the bug shape is reported.
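
A rough sketch of how a start/done breadcrumb pair narrows down a hang. The `[gbrain phase] <name> start/done` prefix is taken from the pattern quoted above; the exact line layout, the `withPhase` wrapper, and the injectable `write` parameter are assumptions for illustration.

```typescript
type PhaseState = "start" | "done";

// Assumed line layout: "[gbrain phase] <phase> <state>".
function formatPhase(phase: string, state: PhaseState): string {
  return `[gbrain phase] ${phase} ${state}`;
}

// Hypothetical wrapper: emit "start", run the phase, emit "done" on success.
// If the phase never returns, the last stderr line names the phase that spun.
async function withPhase<T>(
  phase: string,
  run: () => Promise<T>,
  write: (line: string) => void = (line) => console.error(line),
): Promise<T> {
  write(formatPhase(phase, "start"));
  const result = await run();
  write(formatPhase(phase, "done"));
  return result;
}
```

A report whose stderr ends on a `start` line with no matching `done` points at that phase.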

Per D9 in the eng review + Codex outside-voice finding #14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: annotate v0.40.10.0 PGLite hang wave in CLAUDE.md + regen llms

Key Files entries updated:

- src/core/pglite-engine.ts: documents the v0.40.10.0 disconnect
  refactor (snapshot+early-null + try/finally lock-leak guard, KEEPS
  close-then-release order), and the new classifyPgliteInitError /
  buildPgliteInitErrorMessage helpers for garrytan#1340 hint routing. Pins
  PR garrytan#1337's accepted-but-narrowed contribution and the rejected
  release-then-close ordering swap.

- src/core/last-retrieved.ts (within the brainstorm entry): documents
  the new awaitPendingLastRetrievedWrites drain, the Set tracking
  pattern, the 5s bounded timeout, the cli.ts narrow timeout-only
  force-exit synthesis with the serve-daemon guard, and the three
  community-validated reports (garrytan#1247/garrytan#1269/garrytan#1290) the fix closes.
  Credits PR garrytan#1259 (drain pattern) and PR garrytan#1337 (snapshot pattern +
  force-exit guard idea).

Regenerated llms.txt + llms-full.txt — build-llms.test.ts gates the
drift, all 7 cases green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(todos): file v0.40.10.0 PGLite hang follow-ups

Three deferred items from the v0.40.10.0 fix wave:

1. garrytan#1342 sync-hang investigation. Single-reporter, JS-tight-loop
   shape, needs reproducer before any fix. Documents the ruled-out
   hypotheses (lock-refresh heartbeat, v91 trigger, while-true loops)
   and three concrete diagnostic next steps. The v0.40.10.0 sync
   phase breadcrumbs make the next report actionable.

2. awaitPendingSearchCacheWrites timeout-symmetry retrofit. The garrytan#1090
   drain shipped without a timeout; the v0.40.10.0 garrytan#1247 drain ships
   with one. Apply the same Promise.race + stderr warn pattern for
   symmetry.

3. Drain-helper extraction. Per D4 in the eng review: two surfaces is
   the threshold for noticing, three for extracting. Pair with the
   symmetry retrofit above as one focused refactor when a third
   fire-and-forget surface appears.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.40.10.0 fix(pglite): search/query/get exit cleanly + garrytan#1340 hint + garrytan#1342 breadcrumbs

Closes garrytan#1247, garrytan#1269, garrytan#1290 (PGLite CLI search/query/get hang at ~95-98%
CPU after printing results — three community-validated reports). Also
fixes garrytan#1340 (WASM init misroutes to macOS 26.3 hint when real cause is
Bun vfs read-only mount) and adds diagnostic phase breadcrumbs for the
single-reporter garrytan#1342 sync-hang investigation.

Core fix: track every fire-and-forget bumpLastRetrievedAt IIFE in a
module-scoped Set; cli.ts awaits the drain before engine.disconnect()
in the op-dispatch finally block; narrow process.exit(0) fires ONLY
when the drain times out AND the command isn't a daemon. Snapshot+
early-null disconnect pattern + try/finally lock-leak guard close the
partial-state race PR garrytan#1337 originally surfaced.

Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com>
Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: extract shouldForceExitAfterMain to its own module + add unit cases

Gap-audit follow-up: cli.ts is a script entrypoint (top-level main()
side effect), so importing it from a test fires the help output as a
side effect. Move shouldForceExitAfterMain into src/core/cli-force-exit.ts
so it can be unit-tested in isolation without the cli.ts script tail
running.

Adds test/cli-should-force-exit.test.ts (9 cases): bare serve, serve
with flags after, global flags BEFORE the command (the load-bearing
case for `gbrain --quiet serve`), op commands return true, non-daemon
CLI commands return true, empty argv defaults to true, flag-only argv,
default-arg fallback to process.argv.slice(2), substring-match
avoidance (`serves` is NOT `serve` — strict equality via Set, not
startsWith/includes).

The daemon command set is now an explicit ReadonlySet — future
daemons (a hypothetical `gbrain watch` or `gbrain daemon`) just add
their name to DAEMON_COMMANDS rather than chaining ||.
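
The behaviour those cases pin down can be sketched as below. `shouldForceExitAfterMain`, `DAEMON_COMMANDS`, and the `process.argv.slice(2)` default come from this commit message; treating the first token that does not start with `-` as the command is an assumption about how the real module finds it.

```typescript
// Explicit daemon set: strict equality through Set.has, no startsWith/includes.
const DAEMON_COMMANDS: ReadonlySet<string> = new Set(["serve"]);

function shouldForceExitAfterMain(argv: string[] = process.argv.slice(2)): boolean {
  // Assumption: the command is the first non-flag token, so global flags
  // placed before it (e.g. `--quiet serve`) are skipped. A flag that takes
  // a separate value would need real argument parsing, which this sketch omits.
  const command = argv.find((arg) => !arg.startsWith("-"));
  // No command at all (empty or flag-only argv) defaults to force-exit allowed.
  return command === undefined || !DAEMON_COMMANDS.has(command);
}
```

With this shape, `["--quiet", "serve"]` returns false while `["serves"]` returns true.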

Updates fix-wave-structural.test.ts to look for the import + the
new DAEMON_COMMANDS shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(version): rebase v0.40.10.0 → v0.41.6.0 (slot collision after v0.41.0.0+ landed)

origin/master moved from v0.40.8.1 → v0.41.0.0 while this wave was in
flight (PR garrytan#1367 minions cathedral). v0.41.1-v0.41.5 are claimed by
other in-flight branches, so v0.41.6.0 is the next available slot.

Bulk-renamed v0.40.10.0 → v0.41.6.0 across:
- VERSION + package.json (trio audit clean: 0.41.6.0 / 0.41.6.0 / 0.41.6.0)
- CHANGELOG.md (header + 3 prose references)
- CLAUDE.md (Key Files annotations)
- TODOS.md (follow-up entry header)
- src/cli.ts + src/core/cli-force-exit.ts + src/core/last-retrieved.ts
  + src/core/pglite-engine.ts + src/commands/sync.ts (inline comments)
- test/* (describe blocks + test file headers)
- llms-full.txt (regenerated via `bun run build:llms`)

bun.lock unchanged (version-only bump, no dep churn) per Codex #12.

Verify: 52/52 wave tests pass after rename, typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: quarantine seed-pglite to .serial.test.ts (parallel WASM cold-start flake)

The full-suite run during the v0.41.6.0 fix wave ship hit a 30s timeout
in test/seed-pglite.test.ts under heavy 4-shard parallel contention
(4972/4973 passed before SIGKILL). The test passes 11/11 in isolation.

Root cause: each test instantiates a fresh PGLiteEngine (5 instances
across the file, one per test) because each case writes to a different
mkdtemp-ed dbPath. Under parallel shard load, multiple shards each
cold-starting PGLite WASM simultaneously stretches the per-instance
init from ~5s to 30s+. The shared-engine pattern (canonical PGLite
block in CLAUDE.md R3+R4) doesn't apply here — different dbPaths
require different engines.

Fix per CLAUDE.md test-isolation quarantine rules: rename to
`.serial.test.ts` so the file runs in the post-parallel serial pass
with full WASM init capacity. Same pattern as
test/pglite-engine-disconnect.serial.test.ts (added in this wave) and
test/brain-registry.serial.test.ts (pre-existing).

Removes test/seed-pglite.test.ts from check-test-isolation.allowlist
since the .serial.test.ts rename auto-exempts it from the R3+R4 lint
(scan skips *.serial.test.ts). 641 non-serial unit files scanned,
lint clean.

Verify:
- bun test test/seed-pglite.serial.test.ts → 11/11 pass in 4.19s
- scripts/check-test-isolation.sh → OK
- bun run verify → all gates pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: pre-landing review fixes (C13 disconnect-hang, C1 set leak, C9 catch drain, M1 type drift)

Adversarial review + maintainability specialist surfaced four real
issues in the v0.41.8.0 wave. All four fixed in this commit; one
deferred to TODOS.md as a v0.41+ follow-up (unusual caller pattern).

**C13 [load-bearing, defense-in-depth for the wave's stated goal]:**
`await engine.disconnect()` inside the op-dispatch finally can ITSELF
hang on PGLite (db.close() racing OS-level FS state). When that
happens, the entire wave's force-exit guard never runs — we recreate
the original hang at a new layer. Fix: install an unref'd setTimeout
hard-exit fallback BEFORE entering the try/catch/finally. The timer
fires after DISCONNECT_HARD_DEADLINE_MS=10s with a stderr warn and
process.exit(0). unref ensures it doesn't keep the loop alive on a
healthy exit. Daemons (`serve`) are excluded by reusing the
shouldForceExitAfterMain guard.
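
A sketch of the unref'd hard-exit fallback, assuming a Node/Bun runtime where `setTimeout` returns a timer object with `unref()`. `DISCONNECT_HARD_DEADLINE_MS` and the 10s value are from this commit; `installHardExitFallback`, its parameters, and the warning text are hypothetical.

```typescript
const DISCONNECT_HARD_DEADLINE_MS = 10000;

// Hypothetical helper: arm the fallback before the try/catch/finally and
// return a cancel function for the healthy path.
function installHardExitFallback(
  isDaemon: boolean,
  deadlineMs: number = DISCONNECT_HARD_DEADLINE_MS,
  exit: (code: number) => void = (code) => process.exit(code),
): () => void {
  // Daemons must outlive main(), so they never get a hard deadline.
  if (isDaemon) return () => {};

  const timer = setTimeout(() => {
    console.error(`[gbrain] disconnect exceeded ${deadlineMs}ms; forcing exit`);
    exit(0);
  }, deadlineMs);

  // unref: the timer alone does not keep the event loop alive, so a
  // process that finishes normally exits without waiting for the deadline.
  timer.unref();
  return () => clearTimeout(timer);
}
```

Injecting `exit` is only there to make the sketch testable; the commit describes calling `process.exit(0)` directly.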

**C9 [data freshness gap, narrow but real]:**
The drain ran ONLY in the success branch of try. If
`bumpLastRetrievedAt` fired (handler succeeded) but
`JSON.parse(JSON.stringify(...))` or `formatResult` then threw,
process.exit(1) killed the process and the in-flight UPDATE was
discarded. Fix: drain in the catch path too before process.exit(1)
(best-effort, bounded by the drain's own 5s timeout).

**C1 [daemon leak]:**
A timed-out IIFE used to stay in the pending-writes Set forever
because its `.finally` never fires. Long-lived `gbrain serve` would
accumulate references without bound across repeated timeouts. Fix:
explicitly `delete` the snapshot's tracked promises from the Set
after a timeout outcome. The IIFEs keep running (orphaned), but the
Set no longer leaks references. Pinned by a new unit test that
asserts the second drain after a timeout returns immediately with
empty pending count.
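
The pruning step might look something like this. `pruneTimedOut` is a hypothetical name, and the sketch assumes the drain kept the snapshot array it raced against the timeout.

```typescript
// Remove every promise from the timed-out snapshot out of the tracking Set.
// The underlying work keeps running (orphaned); only the references go away.
// Promises tracked AFTER the snapshot was taken are left alone.
function pruneTimedOut<T>(pending: Set<Promise<T>>, snapshot: readonly Promise<T>[]): number {
  let removed = 0;
  for (const p of snapshot) {
    if (pending.delete(p)) removed += 1;
  }
  return removed;
}
```

After pruning, a second drain sees an empty Set (or only newer writes) and returns immediately.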

**M1 [silent type drift]:**
`cli.ts` duplicated the `{outcome, pending}` literal shape instead of
importing the `DrainOutcome` type that `last-retrieved.ts` exports
exactly for this purpose. Two-line fix: add `type DrainOutcome` to
the import and use it for `let drainResult`. Future changes to the
return shape now propagate through TypeScript.

**Deferred to TODOS.md (C6 — unusual caller pattern):**
Concurrent connect/disconnect on the same `PGLiteEngine` instance can
strand: disconnect snapshots+nulls the lock while connect is still
in-flight, leaving the resolved engine with no file lock held. Fix
requires an instance-level mutex; not worth the complexity for a
caller pattern that doesn't appear in production (single instance per
process, sequential lifecycle).

Also broadened `test/fix-wave-structural.test.ts` regex to accept
additional type-imports from `last-retrieved.ts` (e.g. the new
`type DrainOutcome` import that M1 added).

Test coverage: 53/53 wave tests pass (added C1-followup case to
last-retrieved.test.ts). The C1 fix is also pinned by tightening the
existing permanent-pending test's post-timeout assertion to expect
empty pending count rather than the prior (stale) "stays in set" note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: post-ship documentation sync for v0.41.8.0

Consolidate the duplicate 'take advantage of v0.41.8.0' sections in
the CHANGELOG entry into a single canonical block per the CLAUDE.md
template. The wave originally landed with both '### How to take
advantage' (line 13) and '### To take advantage' (line 57) as h3
headings. CLAUDE.md mandates one '## To take advantage of v[version]'
h2 block per release entry, with verify steps + an issue-filing
fallback for users hitting upgrade failures.

Promoted the second block to h2, added the issue-filing step, and
removed the redundant first block (the upgrade command is already
covered in the verify steps). Itemized changes section was unchanged.

llms.txt + llms-full.txt regenerated; structurally identical so no
content changes shipped.

* fix(test): find-experts-op queries schema dim instead of hardcoding 1536 (CI shard 1)

CI shard 1 failed on this branch with: "expected 1280 dimensions, not 1536"
from pgvector's CheckExpectedDim. Root cause: master's v0.36.0 changed
DEFAULT_EMBEDDING_DIMENSIONS from OpenAI's 1536d to ZeroEntropy's 1280d
(src/core/ai/defaults.ts:21). The test's basisEmbedding helper hardcoded
dim=1536, so beforeAll's upsertChunks failed when the schema column was
created at 1280d.

Latent on master: the weight-aware LPT bin-packing in
scripts/sharding.ts assigns files to shards deterministically based on
the COMPLETE file set. My branch adds 5 new test files, which shifted
find-experts-op.test.ts into shard 1. Master's shard 1 doesn't run this
file (it lands in a different shard there), so the bug never surfaced
in master's CI.
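
To see why adding files can move an untouched file to another shard, here is a generic longest-processing-time (LPT) sketch. It is not the code in scripts/sharding.ts; the tie-breaking rules, file names, and weights are made up for illustration.

```typescript
interface WeightedFile {
  name: string;
  weight: number;
}

// Generic LPT: heaviest file first, each file goes to the currently lightest shard.
function assignShards(files: readonly WeightedFile[], shardCount: number): Map<string, number> {
  const loads: number[] = new Array<number>(shardCount).fill(0);
  const assignment = new Map<string, number>();
  const ordered = files.slice().sort((a, b) => {
    if (b.weight !== a.weight) return b.weight - a.weight;
    return a.name < b.name ? -1 : a.name > b.name ? 1 : 0; // deterministic tie-break
  });
  for (const file of ordered) {
    let lightest = 0;
    for (let i = 1; i < shardCount; i++) {
      if (loads[i] < loads[lightest]) lightest = i;
    }
    loads[lightest] += file.weight;
    assignment.set(file.name, lightest);
  }
  return assignment;
}

const base: WeightedFile[] = [
  { name: "a.test.ts", weight: 5 },
  { name: "b.test.ts", weight: 4 },
  { name: "c.test.ts", weight: 3 },
];
const before = assignShards(base, 2);
const after = assignShards([...base, { name: "d.test.ts", weight: 6 }], 2);
// c.test.ts is untouched, yet it lands in shard index 1 before and index 0 after.
console.log(before.get("c.test.ts"), after.get("c.test.ts"));
```

Because the packing depends on the whole file set, a branch that only adds tests can still change which shard runs an existing file.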

Fix: query the actual column dim via
SELECT atttypmod FROM pg_attribute after initSchema, then seed the
embedding at that width. This handles both paths (no-env CI → 1280;
env-configured local → 1536) without hardcoding either default.
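
A sketch of that lookup, relying on pgvector storing a column's declared dimension as its `atttypmod`. `QueryFn` is a hypothetical stand-in for whatever query method the test's engine exposes, the table and column are passed in because the commit does not name them, and the `basisEmbedding` signature is assumed.

```typescript
// Hypothetical query signature: parameterized SQL in, rows out.
type QueryFn = (sql: string, params: unknown[]) => Promise<Array<{ atttypmod: number }>>;

// Read the declared dimension of a pgvector column from the catalog.
async function embeddingColumnDim(query: QueryFn, table: string, column: string): Promise<number> {
  const rows = await query(
    "SELECT atttypmod FROM pg_attribute WHERE attrelid = $1::regclass AND attname = $2",
    [table, column],
  );
  const dim = rows[0]?.atttypmod;
  if (dim === undefined || dim <= 0) {
    throw new Error(`no dimensioned vector column ${table}.${column}`);
  }
  return dim;
}

// Unit vector along one axis, sized to whatever the schema actually uses.
function basisEmbedding(index: number, dim: number): Float32Array {
  const v = new Float32Array(dim);
  v[index % dim] = 1;
  return v;
}
```

The test then seeds embeddings with `basisEmbedding(i, dim)`, so neither 1280 nor 1536 is hardcoded.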

Verify:
- bun test test/find-experts-op.test.ts → 11/11 pass with provider env
- env -i bun test test/find-experts-op.test.ts → 11/11 pass without
- bun run verify → all 21 parallel checks clean
- bun run typecheck → clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(lint): robust pure-bash allowlist match in check-test-isolation (CI verify)

CI verify failed on PR garrytan#1405 with check:test-isolation flagging
test/scripts/check-test-isolation.test.ts even though that file is
on line 22 of the allowlist (and has been since v0.26.7 as a permanent
exemption — its body contains process.env mutation fixtures that the
lint legitimately matches).

Could not reproduce locally on macOS bash 3.2 + BSD grep across any
locale (C, C.UTF-8, POSIX). Suspect a subtle interaction between the
prior `echo "$ALLOWLIST" | grep -qxF "$f"` form and one of:
Ubuntu 24.04's bash 5 set-e/pipefail semantics, GNU grep edge case on
the first-line entry, or `bun run` + GNU timeout subshell interaction.
Diagnostic value of chasing further is low — the fix is to drop the
grep+pipe form entirely.

Switch is_allowlisted() to pure-bash `case $'\n'"$ALLOWLIST"$'\n' in
*$'\n'"$f"$'\n'*) return 0 ;; esac` whole-line matching:
- Locale-free (no character-class interaction)
- Pipe-free (no pipefail / SIGPIPE / buffering)
- Subshell-free (no env or exit-code propagation gotchas)
- set-e-quirk-free (no left-side compound failure)
- ~100x faster (no fork+exec per call across 689 files)

Verified locally: lint OK (689 files), case-match returns true for the
allowlisted file and false for a non-allowlisted file. bun run verify
clean (21/21 parallel checks pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Park Je Hoon <jehoon@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Matt Dean <matt-dean-git@users.noreply.github.com>