Skip to content

fix: stop PGLite CLI commands from hanging#1337

Closed
matt-dean-git wants to merge 1 commit into
garrytan:masterfrom
matt-dean-git:fix/pglite-cli-exit-autopilot
Closed

fix: stop PGLite CLI commands from hanging#1337
matt-dean-git wants to merge 1 commit into
garrytan:masterfrom
matt-dean-git:fix/pglite-cli-exit-autopilot

Conversation

@matt-dean-git

Copy link
Copy Markdown
Contributor

Summary

  • release the PGLite advisory/file lock before optional close during disconnect
  • let one-shot CLI commands skip the hanging PGLite close path and force-exit after successful completion
  • preserve daemon/server behavior by not force-exiting serve --http

Verification

  • bun run typecheck
  • fresh PGLite temp brain: gbrain list and gbrain config show both exit in <1s instead of hanging
  • fresh PGLite temp brain: gbrain serve --http --port <free> remains running and serves OAuth metadata
  • bun test test/cycle-pglite-lock-ordering.serial.test.ts test/admin-embed-spawn.serial.test.ts

Notes

  • Full local bun run test was attempted before narrowing; it still has unrelated environment-sensitive serial failures around embedding/provider setup. The regression surfaces touched here are covered by the focused checks above.

garrytan added a commit that referenced this pull request May 25, 2026
…1342 breadcrumbs (#1405)

* fix(pglite): drain fire-and-forget last_retrieved_at writes before disconnect

Closes the structural bug class behind #1247, #1269, #1290: PGLite CLI
search/query/get_page commands printed results then hung at ~95-98% CPU
until SIGKILL. Root cause: bumpLastRetrievedAt's IIFE races
engine.disconnect() — PGLite's WASM runtime keeps Bun's event loop alive
while the dangling UPDATE settles.

Mirrors the existing awaitPendingSearchCacheWrites precedent landed in
v0.36.1.x for #1090. Tracks every IIFE promise in a module-scoped Set,
exposes awaitPendingLastRetrievedWrites(timeoutMs) that resolves once
all settle. Bounded with a 5s default timeout via Promise.race so a
future fire-and-forget that hangs forever can't recreate the bug class
at this layer — instead, the drain stderr-warns with a pending count
and returns timeout outcome so the caller can decide its fallback.

Test coverage: 6 unit cases covering empty drain, single + multi-pending
settle, throw-in-IIFE still settles, permanently-pending hits timeout
within bound, empty pageIds does not track.

This commit ships the helper + tracking + tests with NO consumer.
The cli.ts wiring lands in a follow-up commit (atomic bisect units).

Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pglite): snapshot+early-null disconnect + try/finally lock-leak guard

Refactor PGLiteEngine.disconnect() with two structural fixes:

(1) Snapshot + early-null pattern: capture db/lock refs and null the
    instance fields BEFORE any await. A concurrent connect() can no
    longer observe `_db` pointing at a handle that's mid-close. This
    is PR #1337's load-bearing contribution that we DID take.

(2) Wrap close + release in try/finally. Without this guard, a thrown
    db.close() would leak the file lock and wedge every next gbrain
    invocation on the stale lock. Codex outside-voice review (eng
    review finding #7) caught this gap when reviewing the snapshot
    refactor.

KEEP the original close-then-release order. PR #1337's diff swapped
this to release-then-close, which we explicitly REJECTED — releasing
the lock before close lets a sibling process try to connect to a
still-closing brain. The new lifecycle test file pins this ordering
so a future maintainer reading PR #1337's diff cannot accidentally
flip it.

Test coverage in test/pglite-engine-disconnect.serial.test.ts: 5
cases — close-before-release ordering, early-null observable inside
close, lock-still-releases on close-throw, double-disconnect
idempotency, reconnect-after-disconnect clean state. `.serial`
because each test creates a fresh PGLite engine (WASM cold-start
cost) — running in parallel shards would starve other tests.

Existing test/pglite-engine.test.ts: 100/100 still green.

Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pglite): classify WASM init errors so #1340 gets the right hint (#1340)

Closes the user-facing half of #1340: on macOS 12.7.6 + Bun 1.3.14, the
PGLite connect() catch block hardcoded the macOS 26.3 hint (#223). The
actual root cause for #1340 is Bun's vfs: `/$$bunfs/root` is read-only
on older macOS, so PGLite cannot extract its pglite.data WASM payload.

Adds two exported helpers in pglite-engine.ts:

  classifyPgliteInitError(message): 'bunfs' | 'macos-26-3' | 'unknown'
  buildPgliteInitErrorMessage(verdict, original): string

Connect catch block now routes the hint by verdict. The bunfs hint
names `bun upgrade` + Node fallback. The macOS 26.3 hint keeps the
existing #223 link. Unknown falls through to a generic doctor + #223
fallback.

Per Codex eng-review finding #9, the bunfs regex is tightened to match
either the literal `$$bunfs` marker OR ENOENT+pglite.data
co-occurrence — NOT generic `pglite.data` substring (would fire on
unrelated errors). Negative test pinned.

Root fix is upstream Bun; this PR just stops misclassifying the
failure class so support traffic doesn't conflate two unrelated bugs.

Test coverage: 12 pure-function unit cases including the #1340
reporter's exact error string round-trip, the negative case Codex
caught, and all three verdicts × all three message contents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cli): await last-retrieved drain + narrow timeout-only force-exit (#1247, #1269, #1290)

Wires the v0.40.10.0 drain helper into cli.ts and adds the IRON-RULE
behavioral regression test for the search-hang class. The drain is
called unconditionally for every op (not per-op-name gated — that was
the original PR #1259 mistake that left search and get_page exposed).

The narrow force-exit synthesis (decision D7 from the eng review,
informed by Codex outside-voice findings #1+#2+#8): when the drain
returns outcome:'timeout', AFTER engine.disconnect() resolves AND
the command is NOT a daemon, fire process.exit(0). The drain helper
already stderr-warned with the pending count, so the diagnostic
signal is preserved. Without this guard, a hung underlying promise
could still keep Bun's event loop alive past disconnect.

CRITICALLY narrower than PR #1337's blanket force-exit: the timeout
path is the only trigger. In the common case (drain settles cleanly
under 5s), no force-exit fires and the behavioral subprocess test
still catches future regressions. The shouldForceExitAfterMain guard
excludes 'serve' so the stdio + HTTP daemons stay alive past main().

e2e/pglite-cli-exit.serial.test.ts (NEW, IRON RULE):
  - gbrain search "foxtrot" → exits 0 within 15s
  - gbrain get alpha → exits 0 within 15s with foxtrot in stdout
  - gbrain query "foxtrot" --no-expand → exits within 15s (no-API-key
    graceful)
  - gbrain serve --http → stays alive 3+ seconds (daemon-survival
    regression guard)

fix-wave-structural.test.ts:
  - import assertion for awaitPendingLastRetrievedWrites
  - last-retrieved.ts exports + Set tracking + Promise.race + timeout
  - BEHAVIORAL positioning assertion: drain `await` appears textually
    BEFORE engine.disconnect `await` in the op-dispatch local-engine
    path. Survives variable-rename refactors; catches any new
    disconnect path that bypasses the drain.
  - shouldForceExitAfterMain excludes 'serve' AND the gate is
    conditioned on drainResult.outcome==='timeout'

Per D8 (Codex finding #5), explicitly do NOT add a drift-guard
counting bumpLastRetrievedAt callers — would block harmless
refactors and miss aliases.

Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com>
Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sync): add phase breadcrumbs to performSyncInner for #1342 triage

The #1342 reporter saw ZERO stderr output before their PGLite sync
hang, which made the bug impossible to triage from a community report
alone. Mirrors the pre-existing `[gbrain phase] sync.git_pull start/done`
pattern at the major pre-pull phase boundaries so the next #1342-shaped
report names WHICH phase spun.

Four new breadcrumbs at:
  - sync.resolve_repo (top of performSyncInner)
  - sync.load_active_pack (before the v0.39 T1.5 pack load)
  - sync.validate_repo_state (only when opts.sourceId is set —
    the re-clone branch)
  - sync.detect_head (before the isDetachedHead probe)

No behavior change — pure stderr instrumentation. Doesn't fix #1342
(which still needs investigation per the TODOS entry filed in this
wave), but converts "hung with no output" into actionable diagnostic
data the next time the bug shape is reported.

Per D9 in the eng review + Codex outside-voice finding #14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: annotate v0.40.10.0 PGLite hang wave in CLAUDE.md + regen llms

Key Files entries updated:

- src/core/pglite-engine.ts: documents the v0.40.10.0 disconnect
  refactor (snapshot+early-null + try/finally lock-leak guard, KEEPS
  close-then-release order), and the new classifyPgliteInitError /
  buildPgliteInitErrorMessage helpers for #1340 hint routing. Pins
  PR #1337's accepted-but-narrowed contribution and the rejected
  release-then-close ordering swap.

- src/core/last-retrieved.ts (within the brainstorm entry): documents
  the new awaitPendingLastRetrievedWrites drain, the Set tracking
  pattern, the 5s bounded timeout, the cli.ts narrow timeout-only
  force-exit synthesis with the serve-daemon guard, and the three
  community-validated reports (#1247/#1269/#1290) the fix closes.
  Credits PR #1259 (drain pattern) and PR #1337 (snapshot pattern +
  force-exit guard idea).

Regenerated llms.txt + llms-full.txt — build-llms.test.ts gates the
drift, all 7 cases green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(todos): file v0.40.10.0 PGLite hang follow-ups

Three deferred items from the v0.40.10.0 fix wave:

1. #1342 sync-hang investigation. Single-reporter, JS-tight-loop
   shape, needs reproducer before any fix. Documents the ruled-out
   hypotheses (lock-refresh heartbeat, v91 trigger, while-true loops)
   and three concrete diagnostic next steps. The v0.40.10.0 sync
   phase breadcrumbs make the next report actionable.

2. awaitPendingSearchCacheWrites timeout-symmetry retrofit. The #1090
   drain shipped without a timeout; the v0.40.10.0 #1247 drain ships
   with one. Apply the same Promise.race + stderr warn pattern for
   symmetry.

3. Drain-helper extraction. Per D4 in the eng review: two surfaces is
   the threshold for noticing, three for extracting. Pair with the
   symmetry retrofit above as one focused refactor when a third
   fire-and-forget surface appears.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.40.10.0 fix(pglite): search/query/get exit cleanly + #1340 hint + #1342 breadcrumbs

Closes #1247, #1269, #1290 (PGLite CLI search/query/get hang at ~95-98%
CPU after printing results — three community-validated reports). Also
fixes #1340 (WASM init misroutes to macOS 26.3 hint when real cause is
Bun vfs read-only mount) and adds diagnostic phase breadcrumbs for the
single-reporter #1342 sync-hang investigation.

Core fix: track every fire-and-forget bumpLastRetrievedAt IIFE in a
module-scoped Set; cli.ts awaits the drain before engine.disconnect()
in the op-dispatch finally block; narrow process.exit(0) fires ONLY
when the drain times out AND the command isn't a daemon. Snapshot+
early-null disconnect pattern + try/finally lock-leak guard close the
partial-state race PR #1337 originally surfaced.

Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com>
Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: extract shouldForceExitAfterMain to its own module + add unit cases

Gap-audit follow-up: cli.ts is a script entrypoint (top-level main()
side effect), so importing it from a test fires the help output as a
side effect. Move shouldForceExitAfterMain into src/core/cli-force-exit.ts
so it can be unit-tested in isolation without the cli.ts script tail
running.

Adds test/cli-should-force-exit.test.ts (9 cases): bare serve, serve
with flags after, global flags BEFORE the command (the load-bearing
case for `gbrain --quiet serve`), op commands return true, non-daemon
CLI commands return true, empty argv defaults to true, flag-only argv,
default-arg fallback to process.argv.slice(2), substring-match
avoidance (`serves` is NOT `serve` — strict equality via Set, not
startsWith/includes).

The daemon command set is now an explicit ReadonlySet — future
daemons (a hypothetical `gbrain watch` or `gbrain daemon`) just add
their name to DAEMON_COMMANDS rather than chaining ||.

Updates fix-wave-structural.test.ts to look for the import + the
new DAEMON_COMMANDS shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(version): rebase v0.40.10.0 → v0.41.6.0 (slot collision after v0.41.0.0+ landed)

origin/master moved from v0.40.8.1 → v0.41.0.0 while this wave was in
flight (PR #1367 minions cathedral). v0.41.1-v0.41.5 are claimed by
other in-flight branches, so v0.41.6.0 is the next available slot.

Bulk-renamed v0.40.10.0 → v0.41.6.0 across:
- VERSION + package.json (trio audit clean: 0.41.6.0 / 0.41.6.0 / 0.41.6.0)
- CHANGELOG.md (header + 3 prose references)
- CLAUDE.md (Key Files annotations)
- TODOS.md (follow-up entry header)
- src/cli.ts + src/core/cli-force-exit.ts + src/core/last-retrieved.ts
  + src/core/pglite-engine.ts + src/commands/sync.ts (inline comments)
- test/* (describe blocks + test file headers)
- llms-full.txt (regenerated via `bun run build:llms`)

bun.lock unchanged (version-only bump, no dep churn) per Codex #12.

Verify: 52/52 wave tests pass after rename, typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: quarantine seed-pglite to .serial.test.ts (parallel WASM cold-start flake)

The full-suite run during the v0.41.6.0 fix wave ship hit a 30s timeout
in test/seed-pglite.test.ts under heavy 4-shard parallel contention
(4972/4973 passed before SIGKILL). The test passes 11/11 in isolation.

Root cause: each test instantiates a fresh PGLiteEngine (5 instances
across the file, one per test) because each case writes to a different
mkdtemp-ed dbPath. Under parallel shard load, multiple shards each
cold-starting PGLite WASM simultaneously stretches the per-instance
init from ~5s to 30s+. The shared-engine pattern (canonical PGLite
block in CLAUDE.md R3+R4) doesn't apply here — different dbPaths
require different engines.

Fix per CLAUDE.md test-isolation quarantine rules: rename to
`.serial.test.ts` so the file runs in the post-parallel serial pass
with full WASM init capacity. Same pattern as
test/pglite-engine-disconnect.serial.test.ts (added in this wave) and
test/brain-registry.serial.test.ts (pre-existing).

Removes test/seed-pglite.test.ts from check-test-isolation.allowlist
since the .serial.test.ts rename auto-exempts it from the R3+R4 lint
(scan skips *.serial.test.ts). 641 non-serial unit files scanned,
lint clean.

Verify:
- bun test test/seed-pglite.serial.test.ts → 11/11 pass in 4.19s
- scripts/check-test-isolation.sh → OK
- bun run verify → all gates pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: pre-landing review fixes (C13 disconnect-hang, C1 set leak, C9 catch drain, M1 type drift)

Adversarial review + maintainability specialist surfaced four real
issues in the v0.41.8.0 wave. All four fixed in this commit; one
deferred to TODOS.md as a v0.41+ follow-up (unusual caller pattern).

**C13 [load-bearing, defense-in-depth for the wave's stated goal]:**
`await engine.disconnect()` inside the op-dispatch finally can ITSELF
hang on PGLite (db.close() racing OS-level FS state). When that
happens, the entire wave's force-exit guard never runs — we recreate
the original hang at a new layer. Fix: install an unref'd setTimeout
hard-exit fallback BEFORE entering the try/catch/finally. The timer
fires after DISCONNECT_HARD_DEADLINE_MS=10s with a stderr warn and
process.exit(0). unref ensures it doesn't keep the loop alive on a
healthy exit. Daemons (`serve`) are excluded by reusing the
shouldForceExitAfterMain guard.

**C9 [data freshness gap, narrow but real]:**
The drain ran ONLY in the success branch of try. If
`bumpLastRetrievedAt` fired (handler succeeded) but
`JSON.parse(JSON.stringify(...))` or `formatResult` then threw,
process.exit(1) killed the process and the in-flight UPDATE was
discarded. Fix: drain in the catch path too before process.exit(1)
(best-effort, bounded by the drain's own 5s timeout).

**C1 [daemon leak]:**
A timed-out IIFE used to stay in the pending-writes Set forever
because its `.finally` never fires. Long-lived `gbrain serve` would
accumulate references without bound across repeated timeouts. Fix:
explicitly `delete` the snapshot's tracked promises from the Set
after a timeout outcome. The IIFEs keep running (orphaned), but the
Set no longer leaks references. Pinned by a new unit test that
asserts the second drain after a timeout returns immediately with
empty pending count.

**M1 [silent type drift]:**
`cli.ts` duplicated the `{outcome, pending}` literal shape instead of
importing the `DrainOutcome` type that `last-retrieved.ts` exports
exactly for this purpose. Two-line fix: add `type DrainOutcome` to
the import and use it for `let drainResult`. Future changes to the
return shape now propagate through TypeScript.

**Deferred to TODOS.md (C6 — unusual caller pattern):**
Concurrent connect/disconnect on the same `PGLiteEngine` instance can
strand: disconnect snapshots+nulls the lock while connect is still
in-flight, leaving the resolved engine with no file lock held. Fix
requires an instance-level mutex; not worth the complexity for a
caller pattern that doesn't appear in production (single instance per
process, sequential lifecycle).

Also broadened `test/fix-wave-structural.test.ts` regex to accept
additional type-imports from `last-retrieved.ts` (e.g. the new
`type DrainOutcome` import that M1 added).

Test coverage: 53/53 wave tests pass (added C1-followup case to
last-retrieved.test.ts). The C1 fix is also pinned by tightening the
existing permanent-pending test's post-timeout assertion to expect
empty pending count rather than the prior (stale) "stays in set" note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: post-ship documentation sync for v0.41.8.0

Consolidate the duplicate 'take advantage of v0.41.8.0' sections in
the CHANGELOG entry into a single canonical block per the CLAUDE.md
template. The wave originally landed with both '### How to take
advantage' (line 13) and '### To take advantage' (line 57) as h3
headings. CLAUDE.md mandates one '## To take advantage of v[version]'
h2 block per release entry, with verify steps + an issue-filing
fallback for users hitting upgrade failures.

Promoted the second block to h2, added the issue-filing step, and
removed the redundant first block (the upgrade command is already
covered in the verify steps). Itemized changes section was unchanged.

llms.txt + llms-full.txt regenerated; structurally identical so no
content changes shipped.

* fix(test): find-experts-op queries schema dim instead of hardcoding 1536 (CI shard 1)

CI shard 1 failed on this branch with: \"expected 1280 dimensions, not 1536\"
from pgvector's CheckExpectedDim. Root cause: master's v0.36.0 changed
DEFAULT_EMBEDDING_DIMENSIONS from OpenAI's 1536d to ZeroEntropy's 1280d
(src/core/ai/defaults.ts:21). The test's basisEmbedding helper hardcoded
dim=1536, so beforeAll's upsertChunks failed when the schema column was
created at 1280d.

Latent on master: the weight-aware LPT bin-packing in
scripts/sharding.ts assigns files to shards deterministically based on
the COMPLETE file set. My branch adds 5 new test files, which shifted
find-experts-op.test.ts into shard 1. Master's shard 1 doesn't run this
file (it lands in a different shard there), so the bug never surfaced
in master's CI.

Fix: query the actual column dim via
SELECT atttypmod FROM pg_attribute after initSchema, then seed the
embedding at that width. This handles both paths (no-env CI → 1280;
env-configured local → 1536) without hardcoding either default.

Verify:
- bun test test/find-experts-op.test.ts → 11/11 pass with provider env
- env -i bun test test/find-experts-op.test.ts → 11/11 pass without
- bun run verify → all 21 parallel checks clean
- bun run typecheck → clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(lint): robust pure-bash allowlist match in check-test-isolation (CI verify)

CI verify failed on PR #1405 with check:test-isolation flagging
test/scripts/check-test-isolation.test.ts even though that file is
on line 22 of the allowlist (and has been since v0.26.7 as a permanent
exemption — its body contains process.env mutation fixtures that the
lint legitimately matches).

Could not reproduce locally on macOS bash 3.2 + BSD grep across any
locale (C, C.UTF-8, POSIX). Suspect a subtle interaction between the
prior `echo "$ALLOWLIST" | grep -qxF "$f"` form and one of:
Ubuntu 24.04's bash 5 set-e/pipefail semantics, GNU grep edge case on
the first-line entry, or `bun run` + GNU timeout subshell interaction.
Diagnostic value of chasing further is low — the fix is to drop the
grep+pipe form entirely.

Switch is_allowlisted() to pure-bash `case $'\n'"$ALLOWLIST"$'\n' in
*$'\n'"$f"$'\n'*) return 0 ;; esac` whole-line matching:
- Locale-free (no character-class interaction)
- Pipe-free (no pipefail / SIGPIPE / buffering)
- Subshell-free (no env or exit-code propagation gotchas)
- set-e-quirk-free (no left-side compound failure)
- ~100x faster (no fork+exec per call across 689 files)

Verified locally: lint OK (689 files), case-match returns true for the
allowlisted file and false for a non-allowlisted file. bun run verify
clean (21/21 parallel checks pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Park Je Hoon <jehoon@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Matt Dean <matt-dean-git@users.noreply.github.com>
@garrytan

Copy link
Copy Markdown
Owner

Thanks for this, @matt-dean-git — closing as superseded. Your work shipped in v0.41.8.0. From the CHANGELOG:\n\n> Two community PRs proposed competing fixes. PR #1259 (jehoon, validated by @eloe, @bcallender, and @61tH0b) added the structural "wait for the background write to finish" pattern — the right approach, mirroring an existing fix we sh…\n\nReal appreciation for the contribution.

@garrytan garrytan closed this May 26, 2026
rayers added a commit to rayers/gbrain that referenced this pull request May 26, 2026
…ackend via pool.reserve()

`PostgresEngine.initSchema()` acquired and released advisory lock 42
against bare tag-template syntax on the pool. postgres.js pulls a fresh
physical connection from the pool on each call, so the acquire ran on
backend A and the unlock fired on backend B. Postgres returned `false`
silently from the unlock (lock not held by current session) and the
original session sat idle in the pool indefinitely, still holding lock
42.

Every subsequent `initSchema()` then blocked on `SELECT
pg_advisory_lock(42)` until `statement_timeout` killed the query.
Symptom: CLI cold-start hangs ~5 min on every CLI command that connects
to the engine. Diagnosed against a real 440K-page Postgres brain where
one backend had been holding lock 42 idle for 14h since the autopilot
worker's first `initSchema()` call. Killing the stale backend with
`pg_terminate_backend` dropped cold-start from 5+ min to 1.28s.

The pre-fix comment claimed "Lane A does the routing and keeps
pg_advisory_lock(42) on the SAME connection so the lock is correct" —
that claim was wrong. `connectionManager.ddl()` returns the pool, not
a reserved single connection.

Fix: wrap the lock acquire + release pair in `pool.reserve()` so both
queries pin to the same physical connection. Same pattern `executeRaw`
uses elsewhere in this file (line 776). The reserved connection is also
threaded through `applyForwardReferenceBootstrap` and the `SCHEMA_SQL`
replay so the bootstrap probes continue to run inside the lock scope
(preserves the Codex P1 fix from the v0.36 dreamy-thompson wave).
`runMigrations`, `verifySchema`, and `dropZombieIndexes` still take
`this` and may acquire their own pool connections — those are
independent operations and their internal connection management is
unchanged.

Relation to PR garrytan#1337: matt-dean-git filed the same bug class for PGLite
(file-based lock not releasing cleanly). This PR is the Postgres half.

Test: regression case at test/e2e/postgres-initschema-advisory-lock.test.ts
exercises two assertions that would fail pre-fix:

  1. After a single `initSchema()`, lock 42 is not held by any backend.
     Pre-fix the acquire-on-A / unlock-on-B split left lock 42 granted
     on session A and the assertion returned a non-empty pg_locks row.
  2. Two consecutive `initSchema()` calls each complete in <30s. Pre-fix
     the second call blocked on `SELECT pg_advisory_lock(42)` until the
     5-min statement_timeout fired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ChenyqThu pushed a commit to ChenyqThu/jarvis-knowledge-os-v2 that referenced this pull request May 26, 2026
Both upstream-PR-ready patches got new status check entries reflecting
the v0.41.14.0 sync state. No content/decision changes — just keeping
the docs honest about how the fork patches have evolved through the
intervening 20 minor releases.

- v018-pglite-wal-durability-fix.md
  - Adds 2026-05-26 status check entry.
  - Notes that upstream v0.41.8.0 refactored disconnect() into snapshot-
    and-early-null + try/finally (PR garrytan#1337 partial-state race fix), and
    the fork's pg_switch_wal() patch is now FOLDED INTO upstream's new
    structure (lives inside the new `try { if (db) { ... } }` block).
    Refreshed line-number pointers (pglite-engine.ts:283,
    brain-db.ts:164).
  - Upstream-PR opportunity is BETTER now — patch surface is smaller
    after the fold; upstream just needs the WAL flush before db.close().

- v034-google-recipe-max-batch-tokens.md
  - Adds "Last status check: 2026-05-26" line.
  - Un-stales the kos-compat-api /ingest example wording (that surface
    retired in §6.28); replaced with the active surfaces (cron gbrain
    invocations, gbrain serve --http MCP put_page response output,
    interactive query). Motivation unchanged — Google embedding still
    missing the field 20 minor versions later.

Both docs are now ready for a fork-operator manual PR file (`gh pr
create` from a personal upstream fork, cherry-picking the relevant
1-2 commit each, body = the markdown doc).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rayers added a commit to rayers/gbrain that referenced this pull request May 29, 2026
…ackend via pool.reserve()

`PostgresEngine.initSchema()` acquired and released advisory lock 42
against bare tag-template syntax on the pool. postgres.js pulls a fresh
physical connection from the pool on each call, so the acquire ran on
backend A and the unlock fired on backend B. Postgres returned `false`
silently from the unlock (lock not held by current session) and the
original session sat idle in the pool indefinitely, still holding lock
42.

Every subsequent `initSchema()` then blocked on `SELECT
pg_advisory_lock(42)` until `statement_timeout` killed the query.
Symptom: CLI cold-start hangs ~5 min on every CLI command that connects
to the engine. Diagnosed against a real 440K-page Postgres brain where
one backend had been holding lock 42 idle for 14h since the autopilot
worker's first `initSchema()` call. Killing the stale backend with
`pg_terminate_backend` dropped cold-start from 5+ min to 1.28s.

The pre-fix comment claimed "Lane A does the routing and keeps
pg_advisory_lock(42) on the SAME connection so the lock is correct" —
that claim was wrong. `connectionManager.ddl()` returns the pool, not
a reserved single connection.

Fix: wrap the lock acquire + release pair in `pool.reserve()` so both
queries pin to the same physical connection. Same pattern `executeRaw`
uses elsewhere in this file (line 776). The reserved connection is also
threaded through `applyForwardReferenceBootstrap` and the `SCHEMA_SQL`
replay so the bootstrap probes continue to run inside the lock scope
(preserves the Codex P1 fix from the v0.36 dreamy-thompson wave).
`runMigrations`, `verifySchema`, and `dropZombieIndexes` still take
`this` and may acquire their own pool connections — those are
independent operations and their internal connection management is
unchanged.

Relation to PR garrytan#1337: matt-dean-git filed the same bug class for PGLite
(file-based lock not releasing cleanly). This PR is the Postgres half.

Test: regression case at test/e2e/postgres-initschema-advisory-lock.test.ts
exercises two assertions that would fail pre-fix:

  1. After a single `initSchema()`, lock 42 is not held by any backend.
     Pre-fix the acquire-on-A / unlock-on-B split left lock 42 granted
     on session A and the assertion returned a non-empty pg_locks row.
  2. Two consecutive `initSchema()` calls each complete in <30s. Pre-fix
     the second call blocked on `SELECT pg_advisory_lock(42)` until the
     5-min statement_timeout fired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan-agents pushed a commit to garrytan-agents/gbrain that referenced this pull request Jun 13, 2026
…hint + garrytan#1342 breadcrumbs (garrytan#1405)

* fix(pglite): drain fire-and-forget last_retrieved_at writes before disconnect

Closes the structural bug class behind garrytan#1247, garrytan#1269, garrytan#1290: PGLite CLI
search/query/get_page commands printed results then hung at ~95-98% CPU
until SIGKILL. Root cause: bumpLastRetrievedAt's IIFE races
engine.disconnect() — PGLite's WASM runtime keeps Bun's event loop alive
while the dangling UPDATE settles.

Mirrors the existing awaitPendingSearchCacheWrites precedent landed in
v0.36.1.x for garrytan#1090. Tracks every IIFE promise in a module-scoped Set,
exposes awaitPendingLastRetrievedWrites(timeoutMs) that resolves once
all settle. Bounded with a 5s default timeout via Promise.race so a
future fire-and-forget that hangs forever can't recreate the bug class
at this layer — instead, the drain stderr-warns with a pending count
and returns timeout outcome so the caller can decide its fallback.

Test coverage: 6 unit cases covering empty drain, single + multi-pending
settle, throw-in-IIFE still settles, permanently-pending hits timeout
within bound, empty pageIds does not track.

This commit ships the helper + tracking + tests with NO consumer.
The cli.ts wiring lands in a follow-up commit (atomic bisect units).

Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pglite): snapshot+early-null disconnect + try/finally lock-leak guard

Refactor PGLiteEngine.disconnect() with two structural fixes:

(1) Snapshot + early-null pattern: capture db/lock refs and null the
    instance fields BEFORE any await. A concurrent connect() can no
    longer observe `_db` pointing at a handle that's mid-close. This
    is PR garrytan#1337's load-bearing contribution that we DID take.

(2) Wrap close + release in try/finally. Without this guard, a thrown
    db.close() would leak the file lock and wedge every next gbrain
    invocation on the stale lock. Codex outside-voice review (eng
    review finding garrytan#7) caught this gap when reviewing the snapshot
    refactor.

KEEP the original close-then-release order. PR garrytan#1337's diff swapped
this to release-then-close, which we explicitly REJECTED — releasing
the lock before close lets a sibling process try to connect to a
still-closing brain. The new lifecycle test file pins this ordering
so a future maintainer reading PR garrytan#1337's diff cannot accidentally
flip it.

Test coverage in test/pglite-engine-disconnect.serial.test.ts: 5
cases — close-before-release ordering, early-null observable inside
close, lock-still-releases on close-throw, double-disconnect
idempotency, reconnect-after-disconnect clean state. `.serial`
because each test creates a fresh PGLite engine (WASM cold-start
cost) — running in parallel shards would starve other tests.

Existing test/pglite-engine.test.ts: 100/100 still green.

Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pglite): classify WASM init errors so garrytan#1340 gets the right hint (garrytan#1340)

Closes the user-facing half of garrytan#1340: on macOS 12.7.6 + Bun 1.3.14, the
PGLite connect() catch block hardcoded the macOS 26.3 hint (garrytan#223). The
actual root cause for garrytan#1340 is Bun's vfs: `/$$bunfs/root` is read-only
on older macOS, so PGLite cannot extract its pglite.data WASM payload.

Adds two exported helpers in pglite-engine.ts:

  classifyPgliteInitError(message): 'bunfs' | 'macos-26-3' | 'unknown'
  buildPgliteInitErrorMessage(verdict, original): string

Connect catch block now routes the hint by verdict. The bunfs hint
names `bun upgrade` + Node fallback. The macOS 26.3 hint keeps the
existing garrytan#223 link. Unknown falls through to a generic doctor + garrytan#223
fallback.

Per Codex eng-review finding garrytan#9, the bunfs regex is tightened to match
either the literal `$$bunfs` marker OR ENOENT+pglite.data
co-occurrence — NOT generic `pglite.data` substring (would fire on
unrelated errors). Negative test pinned.

Root fix is upstream Bun; this PR just stops misclassifying the
failure class so support traffic doesn't conflate two unrelated bugs.

Test coverage: 12 pure-function unit cases including the garrytan#1340
reporter's exact error string round-trip, the negative case Codex
caught, and all three verdicts × all three message contents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cli): await last-retrieved drain + narrow timeout-only force-exit (garrytan#1247, garrytan#1269, garrytan#1290)

Wires the v0.40.10.0 drain helper into cli.ts and adds the IRON-RULE
behavioral regression test for the search-hang class. The drain is
called unconditionally for every op (not per-op-name gated — that was
the original PR garrytan#1259 mistake that left search and get_page exposed).

The narrow force-exit synthesis (decision D7 from the eng review,
informed by Codex outside-voice findings garrytan#1+garrytan#2+garrytan#8): when the drain
returns outcome:'timeout', AFTER engine.disconnect() resolves AND
the command is NOT a daemon, fire process.exit(0). The drain helper
already stderr-warned with the pending count, so the diagnostic
signal is preserved. Without this guard, a hung underlying promise
could still keep Bun's event loop alive past disconnect.

CRITICALLY narrower than PR garrytan#1337's blanket force-exit: the timeout
path is the only trigger. In the common case (drain settles cleanly
under 5s), no force-exit fires and the behavioral subprocess test
still catches future regressions. The shouldForceExitAfterMain guard
excludes 'serve' so the stdio + HTTP daemons stay alive past main().

e2e/pglite-cli-exit.serial.test.ts (NEW, IRON RULE):
  - gbrain search "foxtrot" → exits 0 within 15s
  - gbrain get alpha → exits 0 within 15s with foxtrot in stdout
  - gbrain query "foxtrot" --no-expand → exits within 15s (no-API-key
    graceful)
  - gbrain serve --http → stays alive 3+ seconds (daemon-survival
    regression guard)

fix-wave-structural.test.ts:
  - import assertion for awaitPendingLastRetrievedWrites
  - last-retrieved.ts exports + Set tracking + Promise.race + timeout
  - BEHAVIORAL positioning assertion: drain `await` appears textually
    BEFORE engine.disconnect `await` in the op-dispatch local-engine
    path. Survives variable-rename refactors; catches any new
    disconnect path that bypasses the drain.
  - shouldForceExitAfterMain excludes 'serve' AND the gate is
    conditioned on drainResult.outcome==='timeout'

Per D8 (Codex finding garrytan#5), explicitly do NOT add a drift-guard
counting bumpLastRetrievedAt callers — would block harmless
refactors and miss aliases.

Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com>
Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sync): add phase breadcrumbs to performSyncInner for garrytan#1342 triage

The garrytan#1342 reporter saw ZERO stderr output before their PGLite sync
hang, which made the bug impossible to triage from a community report
alone. Mirrors the pre-existing `[gbrain phase] sync.git_pull start/done`
pattern at the major pre-pull phase boundaries so the next garrytan#1342-shaped
report names WHICH phase spun.

Four new breadcrumbs at:
  - sync.resolve_repo (top of performSyncInner)
  - sync.load_active_pack (before the v0.39 T1.5 pack load)
  - sync.validate_repo_state (only when opts.sourceId is set —
    the re-clone branch)
  - sync.detect_head (before the isDetachedHead probe)

No behavior change — pure stderr instrumentation. Doesn't fix garrytan#1342
(which still needs investigation per the TODOS entry filed in this
wave), but converts "hung with no output" into actionable diagnostic
data the next time the bug shape is reported.

Per D9 in the eng review + Codex outside-voice finding garrytan#14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: annotate v0.40.10.0 PGLite hang wave in CLAUDE.md + regen llms

Key Files entries updated:

- src/core/pglite-engine.ts: documents the v0.40.10.0 disconnect
  refactor (snapshot+early-null + try/finally lock-leak guard, KEEPS
  close-then-release order), and the new classifyPgliteInitError /
  buildPgliteInitErrorMessage helpers for garrytan#1340 hint routing. Pins
  PR garrytan#1337's accepted-but-narrowed contribution and the rejected
  release-then-close ordering swap.

- src/core/last-retrieved.ts (within the brainstorm entry): documents
  the new awaitPendingLastRetrievedWrites drain, the Set tracking
  pattern, the 5s bounded timeout, the cli.ts narrow timeout-only
  force-exit synthesis with the serve-daemon guard, and the three
  community-validated reports (garrytan#1247/garrytan#1269/garrytan#1290) the fix closes.
  Credits PR garrytan#1259 (drain pattern) and PR garrytan#1337 (snapshot pattern +
  force-exit guard idea).

Regenerated llms.txt + llms-full.txt — build-llms.test.ts gates the
drift, all 7 cases green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(todos): file v0.40.10.0 PGLite hang follow-ups

Three deferred items from the v0.40.10.0 fix wave:

1. garrytan#1342 sync-hang investigation. Single-reporter, JS-tight-loop
   shape, needs reproducer before any fix. Documents the ruled-out
   hypotheses (lock-refresh heartbeat, v91 trigger, while-true loops)
   and three concrete diagnostic next steps. The v0.40.10.0 sync
   phase breadcrumbs make the next report actionable.

2. awaitPendingSearchCacheWrites timeout-symmetry retrofit. The garrytan#1090
   drain shipped without a timeout; the v0.40.10.0 garrytan#1247 drain ships
   with one. Apply the same Promise.race + stderr warn pattern for
   symmetry.

3. Drain-helper extraction. Per D4 in the eng review: two surfaces is
   the threshold for noticing, three for extracting. Pair with the
   symmetry retrofit above as one focused refactor when a third
   fire-and-forget surface appears.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.40.10.0 fix(pglite): search/query/get exit cleanly + garrytan#1340 hint + garrytan#1342 breadcrumbs

Closes garrytan#1247, garrytan#1269, garrytan#1290 (PGLite CLI search/query/get hang at ~95-98%
CPU after printing results — three community-validated reports). Also
fixes garrytan#1340 (WASM init misroutes to macOS 26.3 hint when real cause is
Bun vfs read-only mount) and adds diagnostic phase breadcrumbs for the
single-reporter garrytan#1342 sync-hang investigation.

Core fix: track every fire-and-forget bumpLastRetrievedAt IIFE in a
module-scoped Set; cli.ts awaits the drain before engine.disconnect()
in the op-dispatch finally block; narrow process.exit(0) fires ONLY
when the drain times out AND the command isn't a daemon. Snapshot+
early-null disconnect pattern + try/finally lock-leak guard close the
partial-state race PR garrytan#1337 originally surfaced.

Co-Authored-By: Park Je Hoon <jehoon@users.noreply.github.com>
Co-Authored-By: Matt Dean <matt-dean-git@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: extract shouldForceExitAfterMain to its own module + add unit cases

Gap-audit follow-up: cli.ts is a script entrypoint (top-level main()
side effect), so importing it from a test fires the help output as a
side effect. Move shouldForceExitAfterMain into src/core/cli-force-exit.ts
so it can be unit-tested in isolation without the cli.ts script tail
running.

Adds test/cli-should-force-exit.test.ts (9 cases): bare serve, serve
with flags after, global flags BEFORE the command (the load-bearing
case for `gbrain --quiet serve`), op commands return true, non-daemon
CLI commands return true, empty argv defaults to true, flag-only argv,
default-arg fallback to process.argv.slice(2), substring-match
avoidance (`serves` is NOT `serve` — strict equality via Set, not
startsWith/includes).

The daemon command set is now an explicit ReadonlySet — future
daemons (a hypothetical `gbrain watch` or `gbrain daemon`) just add
their name to DAEMON_COMMANDS rather than chaining ||.

Updates fix-wave-structural.test.ts to look for the import + the
new DAEMON_COMMANDS shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(version): rebase v0.40.10.0 → v0.41.6.0 (slot collision after v0.41.0.0+ landed)

origin/master moved from v0.40.8.1 → v0.41.0.0 while this wave was in
flight (PR garrytan#1367 minions cathedral). v0.41.1-v0.41.5 are claimed by
other in-flight branches, so v0.41.6.0 is the next available slot.

Bulk-renamed v0.40.10.0 → v0.41.6.0 across:
- VERSION + package.json (trio audit clean: 0.41.6.0 / 0.41.6.0 / 0.41.6.0)
- CHANGELOG.md (header + 3 prose references)
- CLAUDE.md (Key Files annotations)
- TODOS.md (follow-up entry header)
- src/cli.ts + src/core/cli-force-exit.ts + src/core/last-retrieved.ts
  + src/core/pglite-engine.ts + src/commands/sync.ts (inline comments)
- test/* (describe blocks + test file headers)
- llms-full.txt (regenerated via `bun run build:llms`)

bun.lock unchanged (version-only bump, no dep churn) per Codex garrytan#12.

Verify: 52/52 wave tests pass after rename, typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: quarantine seed-pglite to .serial.test.ts (parallel WASM cold-start flake)

The full-suite run during the v0.41.6.0 fix wave ship hit a 30s timeout
in test/seed-pglite.test.ts under heavy 4-shard parallel contention
(4972/4973 passed before SIGKILL). The test passes 11/11 in isolation.

Root cause: each test instantiates a fresh PGLiteEngine (5 instances
across the file, one per test) because each case writes to a different
mkdtemp-ed dbPath. Under parallel shard load, multiple shards each
cold-starting PGLite WASM simultaneously stretches the per-instance
init from ~5s to 30s+. The shared-engine pattern (canonical PGLite
block in CLAUDE.md R3+R4) doesn't apply here — different dbPaths
require different engines.

Fix per CLAUDE.md test-isolation quarantine rules: rename to
`.serial.test.ts` so the file runs in the post-parallel serial pass
with full WASM init capacity. Same pattern as
test/pglite-engine-disconnect.serial.test.ts (added in this wave) and
test/brain-registry.serial.test.ts (pre-existing).

Removes test/seed-pglite.test.ts from check-test-isolation.allowlist
since the .serial.test.ts rename auto-exempts it from the R3+R4 lint
(scan skips *.serial.test.ts). 641 non-serial unit files scanned,
lint clean.

Verify:
- bun test test/seed-pglite.serial.test.ts → 11/11 pass in 4.19s
- scripts/check-test-isolation.sh → OK
- bun run verify → all gates pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: pre-landing review fixes (C13 disconnect-hang, C1 set leak, C9 catch drain, M1 type drift)

Adversarial review + maintainability specialist surfaced four real
issues in the v0.41.8.0 wave. All four fixed in this commit; one
deferred to TODOS.md as a v0.41+ follow-up (unusual caller pattern).

**C13 [load-bearing, defense-in-depth for the wave's stated goal]:**
`await engine.disconnect()` inside the op-dispatch finally can ITSELF
hang on PGLite (db.close() racing OS-level FS state). When that
happens, the entire wave's force-exit guard never runs — we recreate
the original hang at a new layer. Fix: install an unref'd setTimeout
hard-exit fallback BEFORE entering the try/catch/finally. The timer
fires after DISCONNECT_HARD_DEADLINE_MS=10s with a stderr warn and
process.exit(0). unref ensures it doesn't keep the loop alive on a
healthy exit. Daemons (`serve`) are excluded by reusing the
shouldForceExitAfterMain guard.

**C9 [data freshness gap, narrow but real]:**
The drain ran ONLY in the success branch of try. If
`bumpLastRetrievedAt` fired (handler succeeded) but
`JSON.parse(JSON.stringify(...))` or `formatResult` then threw,
process.exit(1) killed the process and the in-flight UPDATE was
discarded. Fix: drain in the catch path too before process.exit(1)
(best-effort, bounded by the drain's own 5s timeout).

**C1 [daemon leak]:**
A timed-out IIFE used to stay in the pending-writes Set forever
because its `.finally` never fires. Long-lived `gbrain serve` would
accumulate references without bound across repeated timeouts. Fix:
explicitly `delete` the snapshot's tracked promises from the Set
after a timeout outcome. The IIFEs keep running (orphaned), but the
Set no longer leaks references. Pinned by a new unit test that
asserts the second drain after a timeout returns immediately with
empty pending count.

**M1 [silent type drift]:**
`cli.ts` duplicated the `{outcome, pending}` literal shape instead of
importing the `DrainOutcome` type that `last-retrieved.ts` exports
exactly for this purpose. Two-line fix: add `type DrainOutcome` to
the import and use it for `let drainResult`. Future changes to the
return shape now propagate through TypeScript.

**Deferred to TODOS.md (C6 — unusual caller pattern):**
Concurrent connect/disconnect on the same `PGLiteEngine` instance can
strand: disconnect snapshots+nulls the lock while connect is still
in-flight, leaving the resolved engine with no file lock held. Fix
requires an instance-level mutex; not worth the complexity for a
caller pattern that doesn't appear in production (single instance per
process, sequential lifecycle).

Also broadened `test/fix-wave-structural.test.ts` regex to accept
additional type-imports from `last-retrieved.ts` (e.g. the new
`type DrainOutcome` import that M1 added).

Test coverage: 53/53 wave tests pass (added C1-followup case to
last-retrieved.test.ts). The C1 fix is also pinned by tightening the
existing permanent-pending test's post-timeout assertion to expect
empty pending count rather than the prior (stale) "stays in set" note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: post-ship documentation sync for v0.41.8.0

Consolidate the duplicate 'take advantage of v0.41.8.0' sections in
the CHANGELOG entry into a single canonical block per the CLAUDE.md
template. The wave originally landed with both '### How to take
advantage' (line 13) and '### To take advantage' (line 57) as h3
headings. CLAUDE.md mandates one '## To take advantage of v[version]'
h2 block per release entry, with verify steps + an issue-filing
fallback for users hitting upgrade failures.

Promoted the second block to h2, added the issue-filing step, and
removed the redundant first block (the upgrade command is already
covered in the verify steps). Itemized changes section was unchanged.

llms.txt + llms-full.txt regenerated; structurally identical so no
content changes shipped.

* fix(test): find-experts-op queries schema dim instead of hardcoding 1536 (CI shard 1)

CI shard 1 failed on this branch with: \"expected 1280 dimensions, not 1536\"
from pgvector's CheckExpectedDim. Root cause: master's v0.36.0 changed
DEFAULT_EMBEDDING_DIMENSIONS from OpenAI's 1536d to ZeroEntropy's 1280d
(src/core/ai/defaults.ts:21). The test's basisEmbedding helper hardcoded
dim=1536, so beforeAll's upsertChunks failed when the schema column was
created at 1280d.

Latent on master: the weight-aware LPT bin-packing in
scripts/sharding.ts assigns files to shards deterministically based on
the COMPLETE file set. My branch adds 5 new test files, which shifted
find-experts-op.test.ts into shard 1. Master's shard 1 doesn't run this
file (it lands in a different shard there), so the bug never surfaced
in master's CI.

Fix: query the actual column dim via
SELECT atttypmod FROM pg_attribute after initSchema, then seed the
embedding at that width. This handles both paths (no-env CI → 1280;
env-configured local → 1536) without hardcoding either default.

Verify:
- bun test test/find-experts-op.test.ts → 11/11 pass with provider env
- env -i bun test test/find-experts-op.test.ts → 11/11 pass without
- bun run verify → all 21 parallel checks clean
- bun run typecheck → clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(lint): robust pure-bash allowlist match in check-test-isolation (CI verify)

CI verify failed on PR garrytan#1405 with check:test-isolation flagging
test/scripts/check-test-isolation.test.ts even though that file is
on line 22 of the allowlist (and has been since v0.26.7 as a permanent
exemption — its body contains process.env mutation fixtures that the
lint legitimately matches).

Could not reproduce locally on macOS bash 3.2 + BSD grep across any
locale (C, C.UTF-8, POSIX). Suspect a subtle interaction between the
prior `echo "$ALLOWLIST" | grep -qxF "$f"` form and one of:
Ubuntu 24.04's bash 5 set-e/pipefail semantics, GNU grep edge case on
the first-line entry, or `bun run` + GNU timeout subshell interaction.
Diagnostic value of chasing further is low — the fix is to drop the
grep+pipe form entirely.

Switch is_allowlisted() to pure-bash `case $'\n'"$ALLOWLIST"$'\n' in
*$'\n'"$f"$'\n'*) return 0 ;; esac` whole-line matching:
- Locale-free (no character-class interaction)
- Pipe-free (no pipefail / SIGPIPE / buffering)
- Subshell-free (no env or exit-code propagation gotchas)
- set-e-quirk-free (no left-side compound failure)
- ~100x faster (no fork+exec per call across 689 files)

Verified locally: lint OK (689 files), case-match returns true for the
allowlisted file and false for a non-allowlisted file. bun run verify
clean (21/21 parallel checks pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Park Je Hoon <jehoon@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Matt Dean <matt-dean-git@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants