v0.42.20.0 fix: reliability wave — PGLite capture lock-pin + Postgres reconnect race + search embed-hang (#1762 #1745 #1775) #1810

Merged
garrytan merged 10 commits into master from garrytan/pglite-lock-fix-wave on Jun 3, 2026
@garrytan garrytan commented Jun 3, 2026

v0.42.20.0 — reliability fix wave

Closes a cluster of teardown/reliability bugs around how GBrain shuts down a command and talks to AI providers.

Bugs fixed

  • #1762: CLI_ONLY commands (e.g. gbrain capture) hang after completing on multi-chunk pages, pinning the PGLite single-writer lock.
  • #1745: "connect() has not been called" still reproduces on 0.41.28; the concurrent minion-worker path was not covered by the #1570 fix.
  • #1775: search/query render no output and engine.disconnect() hits the 10s force-exit on 0.42.8.0 (Postgres brain); 0.22.8 works.

Unifying changes (make this bug class structural)

  • src/core/background-work.ts (NEW): a registry of the four fire-and-forget DB-write sinks (last-retrieved, facts, search-cache, eval-capture). drainAllBackgroundWorkForCliExit is called on every CLI exit path before engine.disconnect(). A future 5th sink auto-participates.
  • Gateway-wide AI-HTTP timeout: chat / expand / OCR / embed (per sub-batch) / multimodal now carry a default wall-clock deadline (GBRAIN_AI_{CHAT,EMBED,MULTIMODAL}_TIMEOUT_MS), composed with caller signals via AbortSignal.any. Covers the native-Anthropic path and bounds SDK retries.
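
The registry idea in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the actual contents of src/core/background-work.ts: the names `registerSink`, `drainOrder`, and `drainAll`, the `Drainer` shape, and the per-sink deadline are all assumptions made for the example.

```typescript
// Hypothetical sketch of a background-work registry. Names, shapes, and the
// per-sink deadline are assumptions, not the real background-work.ts API.
type Drainer = {
  order: number;                 // lower values drain first
  drain: () => Promise<void>;    // resolves once in-flight writes have settled
  abort?: () => Promise<void>;   // optional: cancel work that cannot finish
};

const sinks = new Map<string, Drainer>();

function registerSink(sinkName: string, drainer: Drainer): void {
  sinks.set(sinkName, drainer); // re-registering a name replaces the old drainer
}

function drainOrder(): string[] {
  return [...sinks.entries()]
    .sort(([, a], [, b]) => a.order - b.order)
    .map(([sinkName]) => sinkName);
}

// Drain every sink in order, giving each at most `perSinkMs`. A sink that
// misses its deadline is aborted, and the abort is awaited, so nothing is
// still writing when the caller closes the database.
async function drainAll(perSinkMs = 5_000): Promise<string[]> {
  const timedOut: string[] = [];
  for (const sinkName of drainOrder()) {
    const sink = sinks.get(sinkName)!;
    let timer: ReturnType<typeof setTimeout> | undefined;
    const deadline = new Promise<"timeout">((resolve) => {
      timer = setTimeout(() => resolve("timeout"), perSinkMs);
    });
    // Drain failures are swallowed here: draining is best-effort at exit.
    const settled = sink.drain().then(
      () => "done" as const,
      () => "done" as const,
    );
    const result = await Promise.race([settled, deadline]);
    clearTimeout(timer);
    if (result === "timeout") {
      timedOut.push(sinkName);
      await sink.abort?.();
    }
  }
  return timedOut;
}
```

Because sinks register themselves by name, an exit path only needs to call the single drain function; a newly added sink participates without any change at the call sites.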

Commits (bisectable)

  1. fix(core) background-work registry + drain-before-disconnect (#1762: CLI_ONLY commands (e.g. gbrain capture) hang after completing on multi-chunk pages, pinning the PGLite single-writer lock)
  2. fix(ai) gateway-wide AI-HTTP timeout (#1762; #1775: search/query render no output + engine.disconnect() 10s force-exit on 0.42.8.0 (Postgres brain); 0.22.8 works)
  3. fix(postgres) module-mode reconnect preserves the shared singleton (#1745: "connect() has not been called" still reproduces on 0.41.28; concurrent minion-worker path not covered by #1570 fix)
  4. fix(search) bounded query-embed deadline + keyword fallback (#1775)
  5. test+chore tests + v0.42.20.0

Disconnect-site audit

The only CLI-exit disconnect sites that enqueue local background work are the two finally blocks (op-dispatch + handleCliOnly), both now draining. search diagnose embeds but enqueues nothing and is 60s-bounded; serve is excluded; thin-client is remote.
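
The drain-before-disconnect ordering in those two finally blocks can be illustrated with a small sketch. `Engine` and `runCommand` are hypothetical stand-ins for the real CLI types; the point is only the ordering and the use of an exit code instead of an early process.exit.

```typescript
// Hypothetical shapes: Engine and runCommand stand in for the real CLI code.
interface Engine {
  disconnect(): Promise<void>;
}

async function runCommand(
  engine: Engine,
  body: () => Promise<void>,
  drain: () => Promise<void>,
): Promise<number> {
  let exitCode = 0;
  try {
    await body();
  } catch {
    // Record the failure instead of calling process.exit(1) here, so the
    // finally block below still runs on the error path.
    exitCode = 1;
  } finally {
    await drain();             // settle background writes first
    await engine.disconnect(); // only then close the database handle
  }
  return exitCode;
}
```

A caller would assign the returned value to process.exitCode and return, which lets the runtime exit on its own once nothing is pending.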

Testing

  • Full unit suite green (9,500 pass / 0 fail). bun run verify 29/29.
  • New: test/core/background-work.test.ts, test/search/query-embed-deadline.test.ts, test/eval-capture-drain.test.ts, test/e2e/postgres-reconnect-singleton.test.ts (real Postgres, #1745, verified green against pgvector:pg16), a gbrain capture exit-clean case in test/e2e/pglite-cli-exit.serial.test.ts, updated test/fix-wave-structural.test.ts.
  • Pre-landing adversarial review (codex) on the implemented diff; 3 findings fixed (op-dispatch force-exit honors exit code; module reconnect fail-loud; query-embed budget floored so a healthy embed isn't starved by slow expansion).

Reviews

Plan went through /plan-eng-review (twice) + 3 codex passes; the final code through a codex adversarial pre-landing review.

Deferred (follow-ups, in TODOS.md)

  • Convert runSync's internal process.exit sites to exitCode + return (graceful drain on sync error exits; today they avoid the hang by skipping disconnect).
  • Decouple the op-dispatch force-exit timer to wrap disconnect() only + fix its message.
  • Gateway idle-timeout (vs absolute) for streaming chat.

Incorporates and hardens #1763 (by @ElliotDrel, who diagnosed the corrected root cause); the residual hung-Haiku hole is closed by the facts shutdown() abort belt. Will close #1763 as superseded.

🤖 Generated with Claude Code

garrytan and others added 7 commits June 3, 2026 01:29
…ound-work registry (#1762)

New src/core/background-work.ts registry (Map<name,drainer>, ordered drain,
awaited abort). facts-queue (order 0, abort=shutdown), last-retrieved (1), and
eval-capture (3, now self-tracked) register as sinks. Both CLI exit paths
(op-dispatch finally + handleCliOnly finally) drain the registry before
engine.disconnect() so a PGLite db.close() can't race in-flight work into the
re-pump busy-loop that pinned the single-writer lock. Op-dispatch error path
converts process.exit(1) to exitCode+return so the finally still drains.

…#1762/#1775)

withDefaultTimeout composes a per-touchpoint default deadline (chat 300s,
embed/multimodal 60s) with any caller signal via AbortSignal.any. Applied at the
SDK call layer (chat generateText, expand generateObject, OCR, per-sub-batch
embed) — covers native-anthropic + retries — plus per-request multimodal fetch.
embedQuery forwards abortSignal. Env: GBRAIN_AI_{CHAT,EMBED,MULTIMODAL}_TIMEOUT_MS.
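
A rough sketch of how such a default deadline might compose with a caller signal is shown below. The function name, the default values, and the environment variable names come from the commit message; the signature, the `env` parameter, and the parsing rule for overrides are assumptions made for illustration.

```typescript
// Hypothetical sketch; the real withDefaultTimeout signature may differ.
const DEFAULT_TIMEOUT_MS = {
  chat: 300_000,
  embed: 60_000,
  multimodal: 60_000,
} as const;

type Touchpoint = keyof typeof DEFAULT_TIMEOUT_MS;
type Env = Record<string, string | undefined>;

// Assumption: an override must be a positive finite number, otherwise the
// built-in default applies.
function resolveTimeoutMs(kind: Touchpoint, env: Env): number {
  const raw = env[`GBRAIN_AI_${kind.toUpperCase()}_TIMEOUT_MS`];
  const parsed = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_TIMEOUT_MS[kind];
}

// The returned signal aborts when either the caller's signal aborts or the
// wall-clock deadline elapses, whichever happens first.
function withDefaultTimeout(
  kind: Touchpoint,
  env: Env,
  callerSignal?: AbortSignal,
): AbortSignal {
  const deadline = AbortSignal.timeout(resolveTimeoutMs(kind, env));
  return callerSignal ? AbortSignal.any([callerSignal, deadline]) : deadline;
}
```

Passing the composed signal at the SDK call layer, rather than only on the raw fetch, is what lets a single deadline bound the retries as well.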

…1745)

reconnect() branches on connection style. Module-singleton engines re-establish
idempotently via db.connect() (no-op when alive) + refresh the ConnectionManager
read pool, never db.disconnect() — so a transient blip no longer nulls the shared
sql out from under concurrent ops (which threw 'connect() has not been called').
Fail-loud on real connect failure. Instance pools keep teardown+recreate.
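
The branch on connection style might look something like this. All of the types here (`SharedDb`, `Pool`, `Connection`) are hypothetical models for the example, not the actual postgres-engine.ts internals.

```typescript
// Hypothetical model of the two connection styles described above.
interface SharedDb {
  connect(): Promise<void>;    // assumed idempotent: a no-op when already alive
  disconnect(): Promise<void>; // never called from reconnect() in module mode
}
interface Pool {
  end(): Promise<void>;
}

type Connection =
  | { style: "module"; db: SharedDb; refreshReadPool: () => Promise<void> }
  | { style: "instance"; pool: Pool; createPool: () => Promise<Pool> };

async function reconnect(conn: Connection): Promise<void> {
  if (conn.style === "module") {
    // Re-establish in place. The shared handle is never torn down, so
    // concurrent operations do not see it nulled out mid-flight. A real
    // connect failure propagates to the caller (fail-loud).
    await conn.db.connect();
    await conn.refreshReadPool();
    return;
  }
  // Instance-owned pools are private to this engine, so teardown and
  // recreate is safe. A failure while ending the old pool is ignored.
  await conn.pool.end().catch(() => {});
  conn.pool = await conn.createPool();
}
```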

…ord (#1775)

search/query default to cheap-hybrid (embeds the query); a stalled provider made
the embed never settle, so the keyword fallback never engaged and the command
force-exited with no output. One shared QueryEmbedDeadline (6s, floored 2s per
embed) covers both the cache-lookup and inner embeds via embedQueryBounded
(abortSignal + Promise.race) → existing keyword fallback engages. Also registers
the search-cache background-work drainer (now bounded). Env: GBRAIN_QUERY_EMBED_TIMEOUT_MS.
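
A minimal sketch of the shared, floored deadline and the bounded embed follows. The names QueryEmbedDeadline and embedQueryBounded and the 6s / 2s figures come from the commit message; the constructor, the injectable clock, and the choice to return null on timeout or provider error are assumptions for illustration.

```typescript
// Hypothetical sketch of a shared query-embed budget with a per-embed floor.
class QueryEmbedDeadline {
  private readonly totalMs: number;
  private readonly floorMs: number;
  private readonly now: () => number;
  private readonly startedAt: number;

  constructor(totalMs = 6_000, floorMs = 2_000, now: () => number = () => Date.now()) {
    this.totalMs = totalMs;
    this.floorMs = floorMs;
    this.now = now;
    this.startedAt = now();
  }

  // Budget left for the next embed. The floor keeps a healthy embed from
  // being starved when earlier steps already used most of the budget.
  remainingMs(): number {
    const left = this.totalMs - (this.now() - this.startedAt);
    return Math.max(left, this.floorMs);
  }
}

type EmbedFn = (query: string, signal: AbortSignal) => Promise<number[]>;

// Resolves to the embedding, or to null when the budget runs out or the
// provider fails. A null result is the caller's cue to use keyword search.
async function embedQueryBounded(
  embed: EmbedFn,
  query: string,
  deadline: QueryEmbedDeadline,
): Promise<number[] | null> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<null>((resolve) => {
    timer = setTimeout(() => {
      controller.abort(); // ask the provider call to stop
      resolve(null);      // and stop waiting even if it ignores the signal
    }, deadline.remainingMs());
  });
  try {
    return await Promise.race([embed(query, controller.signal), timeout]);
  } catch {
    return null;
  } finally {
    clearTimeout(timer);
  }
}
```

Racing against the timer in addition to passing the abort signal matters because a stalled provider may never react to the abort; the race guarantees the call settles either way.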

New: background-work registry unit, query-embed deadline unit, eval-capture
drain unit, postgres reconnect E2E (#1745), gbrain capture exit-clean case in
the PGLite serial test. Updated fix-wave-structural assertions to the registry
shape. VERSION/package.json/CHANGELOG -> 0.42.11.0; TODOS retrofit marked done.

Incorporates + hardens PR #1763 (drain-before-disconnect + embed fetch timeout);
the residual hung-Haiku hole is closed by the facts shutdown() abort belt.

Co-Authored-By: ElliotDrel <noreply@github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Rename the reliability-wave release version per request. Trio
(VERSION / package.json / CHANGELOG) reconciled; in-code version-tag
comments and test fixtures updated; llms regenerated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@garrytan garrytan changed the title from "v0.42.11.0 fix: reliability wave — PGLite capture lock-pin + Postgres reconnect race + search embed-hang (#1762 #1745 #1775)" to "v0.42.20.0 fix: reliability wave — PGLite capture lock-pin + Postgres reconnect race + search embed-hang (#1762 #1745 #1775)" on Jun 3, 2026
garrytan and others added 3 commits June 3, 2026 07:51
Resolve conflicts:
- VERSION / package.json: keep wave version 0.42.20.0 (> master 0.42.17.0).
- CHANGELOG.md: keep both entries, 0.42.20.0 on top of master's stack.
- CLAUDE.md: take master's slimmed structure (per-file index moved to
  docs/architecture/KEY_FILES.md); port the background-work.ts entry +
  v0.42.20.0 companion behavior into KEY_FILES.md as current-state prose.
- src/core/postgres-engine.ts: merge the reconnect() JSDoc — body already
  auto-merged (my #1745 module-singleton branch + master's #1685 GAP B
  pool-recovery audit coexist on master's ctx?:{error} signature).
- llms-full.txt / llms.txt: regenerated via build:llms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve conflicts:
- VERSION / package.json: keep wave version 0.42.20.0 (> master 0.42.18.0).
- CHANGELOG.md: keep both entries, 0.42.20.0 on top of master's stack
  (0.42.20.0 > 0.42.18.0 > 0.42.17.0 > ...).
- src/cli.ts: merge the handleCliOnly finally — keep master's
  syncWatchdog?.dispose() (#1633) AND the drain-before-disconnect +
  force-exit block (#1762); both now run on clean exit.
- llms regenerated via build:llms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve conflicts:
- VERSION / package.json: keep wave version 0.42.20.0 (> master 0.42.19.0).
- CHANGELOG.md: keep both entries, 0.42.20.0 on top of master's stack
  (0.42.20.0 > 0.42.19.0 > 0.42.18.0 > ...).
- master's #1809 skillopt fix auto-merged into source; llms regenerated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit ec5fed2 into master Jun 3, 2026
21 checks passed
garrytan added a commit that referenced this pull request Jun 3, 2026
…2.21.0)

Resolves conflicts from master's v0.42.19.0–0.42.20.0 reliability wave:
- src/core/postgres-engine.ts reconnect(): took master's style-aware reconnect
  (#1745/#1810 — module-singleton path recovers idempotently via db.connect()
  WITHOUT tearing down the shared pool; instance path rebuilds + pool-recovery
  audit). Kept this branch's _ownsModuleSingleton ownership token in
  connect()/disconnect(). Dropped this branch's _reconnectPromise in favor of
  master's _reconnecting guard (master's module-never-teardown obviates it);
  updated the singleton-ownership test + retry.ts/batchRetry comments to match.
- src/cli.ts: took master's drain-before-disconnect block (#1762) on the
  CLI_ONLY owner-disconnect (this resolves the F5 facts-queue-drain TODO for the
  fall-through path); kept the #1471 owner-disconnect-last invariant note.
- VERSION/package.json/CHANGELOG/TODOS → 0.42.21.0, both sides' entries kept.
- llms bundles regenerated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  v0.42.23.0 feat(jobs): --nice scheduling-priority flag for jobs work/supervisor (garrytan#1815) (garrytan#1820)
  v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (garrytan#1801) (garrytan#1824)
  v0.42.21.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (garrytan#1404/garrytan#1471/garrytan#1619) (garrytan#1805)
  v0.42.20.0 fix: reliability wave — PGLite capture lock-pin + Postgres reconnect race + search embed-hang (garrytan#1762 garrytan#1745 garrytan#1775) (garrytan#1810)
  v0.42.19.0 fix(skillopt): close the last gap in the AI SDK v6 tool-loop fix (write-capture mapper + regression test) (garrytan#1809)
  v0.42.18.0 fix: sync orphan-pileup watchdog (garrytan#1633) + links-lag µs stamp (garrytan#1768) (garrytan#1807)
  v0.42.17.0 fix(sync): resumable incremental sync — killed mid-import no longer loses progress (garrytan#1794) (garrytan#1808)
  v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (garrytan#1685) (garrytan#1802)
  v0.42.15.0 fix: decouple CLI primary output from process.stdout.isTTY (garrytan#1784) (garrytan#1806)
  v0.42.14.0 fix(zero-config): code-* readiness signal + init embedding-key validation + lock self-heal (garrytan#1780) (garrytan#1804)
  v0.42.13.0 fix(search): archive/ content findable by default, demoted not hard-excluded (garrytan#1777) (garrytan#1797)
  v0.42.12.0 feat: self-upgrading gbrain — invocation-riding update check + opt-in auto-upgrade (garrytan#1798)
  v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts (garrytan#1759)
  v0.42.10.0 feat(extract): opt-in global-basename wikilink resolution (closes garrytan#972) (garrytan#1388)