v0.42.20.0 fix: reliability wave - PGLite capture lock-pin + Postgres reconnect race + search embed-hang (#1762 #1745 #1775) #1810
Merged
…ound-work registry (#1762) New src/core/background-work.ts registry (Map<name,drainer>, ordered drain, awaited abort). facts-queue (order 0, abort=shutdown), last-retrieved (1), and eval-capture (3, now self-tracked) register as sinks. Both CLI exit paths (op-dispatch finally + handleCliOnly finally) drain the registry before engine.disconnect() so a PGLite db.close() can't race in-flight work into the re-pump busy-loop that pinned the single-writer lock. Op-dispatch error path converts process.exit(1) to exitCode+return so the finally still drains.
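The registry shape this commit describes (a Map of named drainers, drained in order before disconnect, with an awaited abort) can be sketched roughly as follows. This is a minimal illustration, not the contents of src/core/background-work.ts: the `Drainer` interface, `registerBackgroundWork`, `drainAll`, and `exitCleanly` are hypothetical names chosen for the example.

```typescript
// Hypothetical sketch of a background-work registry; not the real module.
interface Drainer {
  order: number;                 // lower values drain first
  drain: () => Promise<void>;    // resolves once in-flight work has settled
  abort?: () => Promise<void>;   // optional hard stop used when drain fails
}

const registry = new Map<string, Drainer>();

function registerBackgroundWork(name: string, drainer: Drainer): void {
  registry.set(name, drainer); // re-registering a name replaces the old entry
}

// Drain every sink in ascending order. A failing sink is aborted and recorded,
// but it never prevents later sinks from draining.
async function drainAll(): Promise<string[]> {
  const failed: string[] = [];
  const ordered = [...registry.entries()].sort((a, b) => a[1].order - b[1].order);
  for (const [name, d] of ordered) {
    try {
      await d.drain();
    } catch {
      failed.push(name);
      if (d.abort) await d.abort().catch(() => {});
    }
  }
  return failed;
}

// Drain first, then disconnect, so closing the database cannot race pending writes.
async function exitCleanly(disconnect: () => Promise<void>): Promise<string[]> {
  try {
    return await drainAll();
  } finally {
    await disconnect();
  }
}
```

The same try/finally shape is why the error path sets an exit code and returns instead of calling process.exit(1): an immediate exit would skip the finally block, and with it the drain.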
#1762/#1775) withDefaultTimeout composes a per-touchpoint default deadline (chat 300s, embed/multimodal 60s) with any caller signal via AbortSignal.any. Applied at the SDK call layer (chat generateText, expand generateObject, OCR, per-sub-batch embed) — covers native-anthropic + retries — plus per-request multimodal fetch. embedQuery forwards abortSignal. Env: GBRAIN_AI_{CHAT,EMBED,MULTIMODAL}_TIMEOUT_MS.
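A minimal sketch of composing a default deadline with a caller signal, as this commit describes. `AbortSignal.timeout` and `AbortSignal.any` are standard platform APIs; the function signature and the `timeoutFromEnv` helper are assumptions made for illustration and may not match the real withDefaultTimeout.

```typescript
// Hypothetical sketch: combine a default deadline with an optional caller signal.
function withDefaultTimeout(defaultMs: number, callerSignal?: AbortSignal): AbortSignal {
  const deadline = AbortSignal.timeout(defaultMs);
  // AbortSignal.any aborts as soon as either input aborts.
  return callerSignal ? AbortSignal.any([deadline, callerSignal]) : deadline;
}

// Illustrative helper: read an override such as GBRAIN_AI_EMBED_TIMEOUT_MS,
// falling back to the default when the value is unset or not a positive number.
function timeoutFromEnv(raw: string | undefined, fallbackMs: number): number {
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallbackMs;
}

// Possible usage at a call site (the URL and variable names are placeholders):
//   const ms = timeoutFromEnv(envValue, 60_000);
//   await fetch(url, { signal: withDefaultTimeout(ms, callerSignal) });
```

Because the composed signal fires on whichever input aborts first, a caller can still cancel early while the default deadline guards against a provider that never responds.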
…1745) reconnect() branches on connection style. Module-singleton engines re-establish idempotently via db.connect() (no-op when alive) + refresh the ConnectionManager read pool, never db.disconnect() — so a transient blip no longer nulls the shared sql out from under concurrent ops (which threw 'connect() has not been called'). Fail-loud on real connect failure. Instance pools keep teardown+recreate.
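The branching described in this commit can be sketched as below. It is an illustration under stated assumptions, not the engine's code: the `Connection` union, `SharedDb`, `Pool`, and `createPool` are hypothetical, and `connect()` is assumed to be idempotent as the commit message says.

```typescript
// Hypothetical sketch of style-aware reconnect; names do not mirror the real engine.
interface SharedDb {
  connect(): Promise<void>; // assumed idempotent: a no-op when already connected
}

interface Pool {
  end(): Promise<void>;
}

type Connection =
  | { style: "module-singleton"; db: SharedDb }
  | { style: "instance"; pool: Pool; createPool: () => Promise<Pool> };

async function reconnect(conn: Connection): Promise<void> {
  if (conn.style === "module-singleton") {
    // Never disconnect here: other in-flight operations share this handle.
    // A real connect failure propagates to the caller (fail loud).
    await conn.db.connect();
    return;
  }
  // An instance pool is privately owned, so teardown and recreate is safe.
  await conn.pool.end().catch(() => {});
  conn.pool = await conn.createPool();
}
```

The design point is ownership: a handle shared across concurrent operations can only be re-established in place, while a privately owned pool can be replaced wholesale.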
…ord (#1775) search/query default to cheap-hybrid (embeds the query); a stalled provider made the embed never settle, so the keyword fallback never engaged and the command force-exited with no output. One shared QueryEmbedDeadline (6s, floored 2s per embed) covers both the cache-lookup and inner embeds via embedQueryBounded (abortSignal + Promise.race) → existing keyword fallback engages. Also registers the search-cache background-work drainer (now bounded). Env: GBRAIN_QUERY_EMBED_TIMEOUT_MS.
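A rough sketch of the bounded embed described here: one shared budget, a per-embed floor, and an abort signal combined with Promise.race so the caller stops waiting even if the provider ignores the signal. The class and function bodies are assumptions for illustration, and "floored 2s" is read as a per-embed minimum; the real QueryEmbedDeadline and embedQueryBounded may differ.

```typescript
// Hypothetical sketch of a shared, floored deadline; not the real implementation.
class QueryEmbedDeadline {
  private readonly expiresAt: number;
  private readonly floorMs: number;

  constructor(totalMs: number = 6_000, floorMs: number = 2_000) {
    this.expiresAt = Date.now() + totalMs;
    this.floorMs = floorMs;
  }

  // Budget for the next embed: what is left of the total, but never below the floor.
  nextBudgetMs(): number {
    return Math.max(this.floorMs, this.expiresAt - Date.now());
  }
}

type Embed = (query: string, signal: AbortSignal) => Promise<number[]>;

// Resolves to the embedding, or to null on timeout or error so the caller
// can fall back to keyword search.
async function embedQueryBounded(
  embed: Embed,
  query: string,
  deadline: QueryEmbedDeadline,
): Promise<number[] | null> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<null>((resolve) => {
    timer = setTimeout(() => {
      controller.abort(); // ask the provider call to stop
      resolve(null);      // and stop waiting even if it ignores the signal
    }, deadline.nextBudgetMs());
  });
  try {
    return await Promise.race([embed(query, controller.signal).catch(() => null), timedOut]);
  } finally {
    clearTimeout(timer);
  }
}
```

A caller would create one deadline per search, pass it to both the cache-lookup embed and the inner embed, and run the keyword path whenever the result is null.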
New: background-work registry unit, query-embed deadline unit, eval-capture drain unit, postgres reconnect E2E (#1745), gbrain capture exit-clean case in the PGLite serial test. Updated fix-wave-structural assertions to the registry shape. VERSION/package.json/CHANGELOG -> 0.42.11.0; TODOS retrofit marked done. Incorporates + hardens PR #1763 (drain-before-disconnect + embed fetch timeout); the residual hung-Haiku hole is closed by the facts shutdown() abort belt. Co-Authored-By: ElliotDrel <noreply@github.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… in CLAUDE.md (regen llms)
Rename the reliability-wave release version per request. Trio (VERSION / package.json / CHANGELOG) reconciled; in-code version-tag comments and test fixtures updated; llms regenerated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve conflicts: - VERSION / package.json: keep wave version 0.42.20.0 (> master 0.42.17.0). - CHANGELOG.md: keep both entries, 0.42.20.0 on top of master's stack. - CLAUDE.md: take master's slimmed structure (per-file index moved to docs/architecture/KEY_FILES.md); port the background-work.ts entry + v0.42.20.0 companion behavior into KEY_FILES.md as current-state prose. - src/core/postgres-engine.ts: merge the reconnect() JSDoc — body already auto-merged (my #1745 module-singleton branch + master's #1685 GAP B pool-recovery audit coexist on master's ctx?:{error} signature). - llms-full.txt / llms.txt: regenerated via build:llms. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve conflicts: - VERSION / package.json: keep wave version 0.42.20.0 (> master 0.42.18.0). - CHANGELOG.md: keep both entries, 0.42.20.0 on top of master's stack (0.42.20.0 > 0.42.18.0 > 0.42.17.0 > ...). - src/cli.ts: merge the handleCliOnly finally — keep master's syncWatchdog?.dispose() (#1633) AND the drain-before-disconnect + force-exit block (#1762); both now run on clean exit. - llms regenerated via build:llms. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve conflicts: - VERSION / package.json: keep wave version 0.42.20.0 (> master 0.42.19.0). - CHANGELOG.md: keep both entries, 0.42.20.0 on top of master's stack (0.42.20.0 > 0.42.19.0 > 0.42.18.0 > ...). - master's #1809 skillopt fix auto-merged into source; llms regenerated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request Jun 3, 2026
…2.21.0) Resolves conflicts from master's v0.42.19.0–0.42.20.0 reliability wave: - src/core/postgres-engine.ts reconnect(): took master's style-aware reconnect (#1745/#1810 — module-singleton path recovers idempotently via db.connect() WITHOUT tearing down the shared pool; instance path rebuilds + pool-recovery audit). Kept this branch's _ownsModuleSingleton ownership token in connect()/disconnect(). Dropped this branch's _reconnectPromise in favor of master's _reconnecting guard (master's module-never-teardown obviates it); updated the singleton-ownership test + retry.ts/batchRetry comments to match. - src/cli.ts: took master's drain-before-disconnect block (#1762) on the CLI_ONLY owner-disconnect (this resolves the F5 facts-queue-drain TODO for the fall-through path); kept the #1471 owner-disconnect-last invariant note. - VERSION/package.json/CHANGELOG/TODOS → 0.42.21.0, both sides' entries kept. - llms bundles regenerated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  - v0.42.23.0 feat(jobs): --nice scheduling-priority flag for jobs work/supervisor (garrytan#1815) (garrytan#1820)
  - v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense - alive-but-wedged worker self-heals (garrytan#1801) (garrytan#1824)
  - v0.42.21.0 fix(postgres): module-singleton ownership - canonical landing for the dream-cycle "connect() has not been called" class (garrytan#1404/garrytan#1471/garrytan#1619) (garrytan#1805)
  - v0.42.20.0 fix: reliability wave - PGLite capture lock-pin + Postgres reconnect race + search embed-hang (garrytan#1762 garrytan#1745 garrytan#1775) (garrytan#1810)
  - v0.42.19.0 fix(skillopt): close the last gap in the AI SDK v6 tool-loop fix (write-capture mapper + regression test) (garrytan#1809)
  - v0.42.18.0 fix: sync orphan-pileup watchdog (garrytan#1633) + links-lag µs stamp (garrytan#1768) (garrytan#1807)
  - v0.42.17.0 fix(sync): resumable incremental sync - killed mid-import no longer loses progress (garrytan#1794) (garrytan#1808)
  - v0.42.16.0 feat(doctor): brain health as a solved problem - cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (garrytan#1685) (garrytan#1802)
  - v0.42.15.0 fix: decouple CLI primary output from process.stdout.isTTY (garrytan#1784) (garrytan#1806)
  - v0.42.14.0 fix(zero-config): code-* readiness signal + init embedding-key validation + lock self-heal (garrytan#1780) (garrytan#1804)
  - v0.42.13.0 fix(search): archive/ content findable by default, demoted not hard-excluded (garrytan#1777) (garrytan#1797)
  - v0.42.12.0 feat: self-upgrading gbrain - invocation-riding update check + opt-in auto-upgrade (garrytan#1798)
  - v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts (garrytan#1759)
  - v0.42.10.0 feat(extract): opt-in global-basename wikilink resolution (closes garrytan#972) (garrytan#1388)
v0.42.20.0 — reliability fix wave
Closes a cluster of teardown/reliability bugs around how GBrain shuts down a command and talks to AI providers.
Bugs fixed
- #1762: gbrain capture hangs on PGLite, pinning the single-writer lock. put_page fires a fire-and-forget facts:absorb job after printing the receipt; on a multi-chunk page that job is still running when handleCliOnly disconnects, and db.close() racing the in-flight job spins into a 100%-CPU loop that holds the lock. Now every fire-and-forget sink is drained before disconnect.
- gbrain dream fails mid-cycle on Postgres ("connect() has not been called"). A blip made reconnect() tear down the shared module singleton out from under concurrent cycle/minion ops. Module-mode reconnect now re-establishes idempotently without a teardown.
- search/query print no output + 10s force-exit (regression from 0.22.8). Cheap-hybrid embeds the query; a stalled provider made the embed never settle, so keyword fallback never engaged. The query embed is now time-bounded, so the command falls back to keyword search.
Unifying changes (make this bug class structural)
- src/core/background-work.ts (NEW): a registry of the four fire-and-forget DB-write sinks (last-retrieved, facts, search-cache, eval-capture). drainAllBackgroundWorkForCliExit is called on every CLI exit path before engine.disconnect(). A future 5th sink auto-participates.
- Gateway-wide AI-HTTP timeout (GBRAIN_AI_{CHAT,EMBED,MULTIMODAL}_TIMEOUT_MS), composed with caller signals via AbortSignal.any. Covers the native-Anthropic path and bounds SDK retries.
Commits (bisectable)
- fix(core) background-work registry + drain-before-disconnect (#1762)
- fix(ai) gateway-wide AI-HTTP timeout (#1762/#1775)
- fix(postgres) module-mode reconnect preserves the shared singleton (#1745)
- fix(search) bounded query-embed deadline + keyword fallback (#1775)
- test + chore: tests + v0.42.20.0
Disconnect-site audit
The only CLI-exit disconnect sites that enqueue local background work are the two finally blocks (op-dispatch + handleCliOnly), both now draining.
search diagnose embeds but enqueues nothing and is 60s-bounded; serve is excluded; thin-client is remote.
Testing
- bun run verify: 29/29.
- test/core/background-work.test.ts, test/search/query-embed-deadline.test.ts, test/eval-capture-drain.test.ts, test/e2e/postgres-reconnect-singleton.test.ts (real Postgres, #1745, verified green against pgvector:pg16), a gbrain capture exit-clean case in test/e2e/pglite-cli-exit.serial.test.ts, updated test/fix-wave-structural.test.ts.
Reviews
The plan went through /plan-eng-review (twice) + 3 codex passes; the final code went through a codex adversarial pre-landing review.
Deferred (follow-ups, in TODOS.md)
- Convert runSync's internal process.exit sites to exitCode + return (graceful drain on sync error exits; today they avoid the hang by skipping disconnect).
- … disconnect() only + fix its message.
Incorporates and hardens #1763 (by @ElliotDrel, who diagnosed the corrected root cause); the residual hung-Haiku hole is closed by the facts shutdown() abort belt. Will close #1763 as superseded.
🤖 Generated with Claude Code