v0.41.27.0 fix: withRetry self-heals on null singleton + facts:absorb drain + disconnect audit (closes #1570)#1608
Merged
Merged
Conversation
Master shipped v0.41.25.0 (#1538 batched sync deletes) and v0.41.26.0 (#1571 dream --source fix) while this branch was in flight. Conflict resolution rebumps to the next available slot. - VERSION: 0.41.25.0 → 0.41.27.0 - package.json: synced - CHANGELOG.md: my v0.41.27.0 entry placed above master's v0.41.26.0 and v0.41.25.0; in-entry version references updated 0.41.25.0 → 0.41.27.0 and forward-references bumped to v0.41.28+. - TODOS.md: kept master's v0.41.20.x section + my v0.41.27.0+ follow-ups No source-file conflicts during the merge.
Instruments every db.disconnect() and PostgresEngine.disconnect() call with a JSONL audit record so the next user-reported #1570 cycle gives us the offender's caller stack instead of the symptomatic "No database connection" error. Audit shape (~/.gbrain/audit/db-disconnect-YYYY-Www.jsonl): {ts, engine_kind, connection_style, caller_stack[], command, pid} - src/core/audit/db-disconnect-audit.ts (NEW): the audit writer, built on the v0.40.4.0 createAuditWriter cathedral. Captures a 6-frame stack via new Error().stack so the offender is readable without spending stderr noise. - src/core/db.ts: logDbDisconnect call at the top of disconnect() (best-effort; never blocks the real teardown). - src/core/postgres-engine.ts: same instrumentation in PostgresEngine.disconnect() — distinguishes 'module' vs 'instance' connection_style so we can tell legitimate worker-pool teardowns apart from the load-bearing module-singleton class. - src/commands/doctor.ts: extends batch_retry_health to surface 24h disconnect count + most-recent caller stack. Warns when the caller frame isn't a known CLI-exit frame (e.g. cli.ts's finally block at the end of an op-dispatch). This is the diagnostic that tells v0.41.28+ where to apply the real ownership fix. - test/db-disconnect-audit.test.ts: unit coverage for the audit writer + caller-stack capture + JSONL shape. - test/e2e/db-singleton-shared-recovery.test.ts: real-Postgres regression that exercises the singleton-null path end-to-end. Refs #1570
…1.27.0) withRetry gains an opt-in reconnect callback that fires between the isRetryableConnError classification and the inter-attempt sleep. PostgresEngine.batchRetry injects this.reconnect() — race-safe via the existing _reconnecting guard, handles module and instance pools. Closes the production loss reported in #1570: dream cycles on Supabase no longer drop ~150 link rows per cycle when the singleton goes null mid-batch. The retry now rebuilds the connection between attempts so the second try has somewhere to write to. - src/core/retry.ts: WithRetryOpts gains `reconnect?: () => Promise<void>`. Awaited in the catch branch. onRetry is also now awaited (back-compat- safe: every existing in-tree caller is a sync arrow). Reconnect failures propagate as the real cause — replaces the symptomatic "No database connection" error with whatever the connect() throw was, so operators see the truth. - src/core/postgres-engine.ts:batchRetry — injects `reconnect: () => this.reconnect()`. Covers all 9 batch-retry call sites (addLinksBatch, addTimelineEntriesBatch, upsertChunks, plus the 6 caller-supplied auditSite labels in extract / sync / reindex). - test/core/retry-reconnect.test.ts: 8 hermetic cases pinning the contract — reconnect fires before sleep, only on retryable errors, back-compat when omitted, signal-aborted bypasses reconnect, onRetry is awaited, full success path end-to-end. The deeper bug (who's calling disconnect mid-cycle) is left unaddressed in this commit by design — the diagnostic instrumentation in the prior commit will tell us in the next production run. Refs #1570
Closes the silent 'No database connection' tail-end errors after
gbrain capture / put_page: the facts:absorb fire-and-forget queue
sometimes outlived the CLI process's connection lifetime, so absorb
attempts after engine.disconnect() landed in stderr as the
GBrainError shape.
- src/core/facts/queue.ts: new drainPending({timeout: 1000}) method
distinct from shutdown(). Stops accepting new enqueues, awaits
in-flight settle, bounded by timeout, returns count of unfinished.
Semantically different from shutdown() (which aborts in-flight)
so the symptom — drop work that hasn't started yet but let
in-flight work finish — matches what CLI exit actually needs.
- src/cli.ts: op-dispatch finally block awaits the drain BEFORE
engine.disconnect(). Bounded 1s. Opt-out env GBRAIN_NO_FACTS_DRAIN
for callers that don't enqueue (keeps fast-exit paths fast).
Mirrors the v0.41.8.0 awaitPendingLastRetrievedWrites pattern.
- test/facts-queue-drain-pending.test.ts: 6 hermetic cases — empty
drain returns immediately, single in-flight settles, timeout
bounds wait, shutdown-after-drain is idempotent, post-drain
enqueues are dropped, signal-aborted skips waiting.
Refs #1570
…v0.41.27.0 Master shipped v0.41.26.1 (#1572 lock-renewal cathedral) during this branch's review cycle. Merging it in: - VERSION stays at 0.41.27.0 (still the next available slot above master's 0.41.26.1 — neither side claimed 0.41.27.0) - CHANGELOG.md: my v0.41.27.0 entry on top; master's v0.41.26.1 entry preserved verbatim below - TODOS.md: my v0.41.27.0 #1570 follow-ups (forward-references bumped to v0.41.28+); master's v0.41.26.1 lock-renewal follow-ups preserved - package.json: synced to 0.41.27.0 No source-file conflicts — master's lock-renewal work is in src/core/audit/lock-renewal-audit.ts (new), src/core/minions/worker.ts (modified), and a handful of audit/test files. None overlap with the #1570 surface (retry.ts, facts/queue.ts, db.ts, postgres-engine.ts).
README.md: added troubleshooting entry for the v0.41.27.0 retry-reconnect + facts:absorb drain fix (closes #1570), pointing operators at `gbrain doctor --json` to find the offending disconnect caller. CLAUDE.md: extended `src/core/retry.ts` entry with the new optional `reconnect` callback (v0.41.27.0); added two new Key Files entries for `src/core/audit/db-disconnect-audit.ts` (the diagnostic half of the "instrument first, fix later" pivot) and `FactsQueue.drainPending`; extended `doctor.ts:checkBatchRetryHealth` entry with the in-place extension that surfaces 24h disconnect-call count. llms-full.txt: regenerated to absorb CLAUDE.md edits. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Master shipped v0.41.27.0 (#1573 git-aware sync_freshness check) which claimed the version slot this branch originally targeted. This branch already rebumped to v0.41.28.0 in the prior commit; merging master in: - VERSION stays 0.41.28.0 (next slot above master's 0.41.27.0) - CHANGELOG.md: my v0.41.28.0 entry on top; master's v0.41.27.0 entry preserved verbatim below - package.json: synced to 0.41.28.0 - src/commands/doctor.ts, CLAUDE.md, llms-full.txt: auto-merged (master's sync_freshness check + my batch_retry_health disconnect audit extension both land; no overlap) - New from master: src/core/git-head.ts + test/core/git-head.test.ts No #1570-surface conflicts — master's sync_freshness work is orthogonal to the retry/facts/disconnect-audit changes.
…ard fix) Both files failed on CI shards 1 and 8 under the cross-file gateway-state leak class (CLAUDE.md "Test-isolation lint and helpers"). The v0.41.28.0 merge reshuffled the weight-based shard bin-packing, landing a gateway-mutating sibling ahead of these two victims in the same `bun test` process. Mechanism: - put-page-provenance: put_page embeds via the gateway. A sibling left the gateway configured with OpenAI + the CI placeholder `sk-test` (captured at configureGateway time, survives the withEnv restore as cached gateway state). put_page's embed then fired against live OpenAI and 401'd. The bunfig legacy-embedding preload's beforeEach only re-applies legacy when the gateway was RESET — it does NOT correct a sibling that configured a different LIVE config. - embedding-dim-check: initSchema builds the content_chunks vector column at the gateway's configured dim. A sibling leaking ZE/1280 made the column 1280-d, so `expect(dims).toBe(1536)` failed. Fix (victim-side pinning, the escape hatch the preload documents): - Both: configure the gateway explicitly in beforeAll BEFORE initSchema (OpenAI/1536), resetGateway() in afterAll so neither leaks onward. - put-page-provenance also stubs the embed transport via __setEmbedTransportForTests so embed is deterministic and offline; a dummy OPENAI_API_KEY is supplied in the gateway env because instantiateEmbedding builds the OpenAI client (key check) BEFORE the stubbed transport is reached — the stub then intercepts the actual call so the key never leaves the process. Verified: CI shards 1 (1337 pass) + 8 (905 pass) green with OPENAI_API_KEY unset, plus adversarial sibling orderings (gateway.test / doctor-ze-checks preceding). Typecheck + check-test-isolation clean.
mgunnin
added a commit
to mgunnin/gbrain
that referenced
this pull request
Jun 3, 2026
* upstream/master: v0.41.29.0 feat(conversation-parser): bold-name-no-time builtin + fix(orphans): source-scoped orphan_ratio (supersedes garrytan#1613) (garrytan#1620) v0.41.27.0 fix: withRetry self-heals on null singleton + facts:absorb drain + disconnect audit (closes garrytan#1570) (garrytan#1608) v0.41.27.0 fix(doctor): git-aware sync_freshness (supersedes garrytan#1564) (garrytan#1573)
This was referenced Jun 8, 2026
fix(postgres-engine): pair pg_advisory_lock acquire+release on same backend via pool.reserve()
#1408
Closed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #1570 — production loss of ~150 link rows per
gbrain dreamcycle on Supabase, plus silent'No database connection'errors aftergbrain capture. Three small surgical commits + diagnostic instrumentation.This is the instrument-then-fix pivot the codex outside-voice review recommended. The original plan proposed an architectural refactor (remove singleton nullability, rename
disconnect → shutdown, ~30-50 file diff) — but it designed a refactor for a root cause never actually identified. v0.41.27.0 ships:Symptom fix (retry self-heals on null singleton) —
withRetrygains an opt-inreconnect?: () => Promise<void>callback awaited between theisRetryableConnErrorclassification and the inter-attempt sleep.PostgresEngine.batchRetryinjects() => this.reconnect()— race-safe via the existing_reconnectingguard, handles both module and instance pools. Closes the headline production loss: next attempt sees a live pool instead of the same null singleton.Facts queue drain (well-understood bug class) —
FactsQueue.drainPending({timeout: 1000})— new method distinct fromshutdown(). Stops accepting new enqueues, awaits in-flight settle, bounded by timeout.cli.tsop-dispatch finally awaits the drain BEFOREengine.disconnect(). Mirrors the v0.41.8.0awaitPendingLastRetrievedWritespattern. Closes the silent'No database connection'tail-end errors when the fire-and-forget facts:absorb queue outlives the CLI process's connection lifetime.Diagnostic instrumentation (find the offender for v0.41.28+) —
src/core/audit/db-disconnect-audit.ts(NEW): everydb.disconnect()andPostgresEngine.disconnect()call writes a JSONL audit record with engine_kind + connection_style + 6-frame caller stack + command + pid.gbrain doctor'sbatch_retry_healthcheck surfaces 24h disconnect count + most-recent caller. File:~/.gbrain/audit/db-disconnect-YYYY-Www.jsonl(ISO-week rotated, honorsGBRAIN_AUDIT_DIR). v0.41.28+ will use this production data to identify the actual mid-process disconnect caller and apply a targeted ownership-boundary fix. The singleton-removal architectural plan is deferred per the same review — see TODOS.md.Version
v0.41.27.0— rebumped past two queue collisions: v0.41.25.0 (#1538 batched sync deletes) and v0.41.26.1 (#1572 lock-renewal cathedral) landed during this branch's review cycle.Test Coverage
test/core/retry-reconnect.test.tstest/facts-queue-drain-pending.test.tstest/db-disconnect-audit.test.tstest/e2e/db-singleton-shared-recovery.test.tsFull unit suite on merged state: 8538+ tests across 4 parallel shards, 0 real failures. (Shard 2's wrapper artifact: a stale
.exitfile from an earlier killed run prevented the summary line from being written, but the shard's bun process ran ~180 test files cleanly with 9473 ✓ markers and zero genuine FAIL lines — the 3FAILmatches in its log are intentional eval-gate test fixtures showing the CLI's expected error-path output.)E2E:
test/e2e/db-singleton-shared-recovery.test.ts— 3 pass / 0 fail on real Postgres (port 5434).Pre-checks:
bun run verify— 29/29 checks green.bun run typecheck— clean.Pre-Landing Review
Codex findings addressedtable (~/.claude/plans/system-instruction-you-are-working-cuddly-panda.md)Plan Completion
All 11 implementation tasks (T1-T11) complete. Plan file captures the pivot rationale + Codex findings table.
TODOS
Added
v0.41.27.0 #1570 instrument-then-fix follow-ups (v0.41.28+ / v0.42+)to TODOS.md:Documentation
'No database connection: connect() has not been called'errors. The entry describes thewithRetryreconnect callback, theFactsQueue.drainPending({timeout: 1000})await in op-dispatch finally, and the paste-readygbrain doctor --json | jq '.checks[] | select(.id=="batch_retry_health")'command to surface the offending disconnect caller from the new~/.gbrain/audit/db-disconnect-YYYY-Www.jsonlaudit.src/core/retry.tsKey File entry with the v0.41.27.0reconnect?: () => Promise<void>callback addition (awaited AFTER classification, BEFORE inter-attempt sleep; fail-loud propagation per codex outside-voice finding 3); extendedsrc/commands/doctor.ts:checkBatchRetryHealthto note the in-place extension that appendsDisconnect-call audit: N call(s) in 24h (most recent caller: <frame>).to all three message paths (per codex finding 11: extend, don't add a new check); added two fresh Key Files entries —src/core/audit/db-disconnect-audit.ts(the diagnostic half of the "instrument first, fix later" pivot — built on the v0.40.4.0audit-writer.tscathedral with stack-truncated caller frames andreadRecentDbDisconnects(hours=24)) andsrc/core/facts/queue.ts:FactsQueue.drainPending(semantically distinct fromshutdown()— drain lets in-flight finish without firinginternalAbort.abort(), default 1000ms timeout so commands that don't enqueue facts pay no observable cost).bun run build:llmsto absorb the CLAUDE.md edits (per the project rule that CLAUDE.md edits must be followed bybuild:llmsto keeptest/build-llms.test.tsgreen in CI shard 1).To take advantage of v0.41.27.0
gbrain upgraderuns automatically. Ifgbrain doctorwarns about disconnect audit data after agbrain dreamcycle, the message names the caller frame — please file an issue at https://github.com/garrytan/gbrain/issues with that data so v0.41.28+ can target the actual ownership-boundary fix.Test plan
bun run typecheck— clean (post-merge of master v0.41.26.1)bun run verify— 29/29 checks greentest/e2e/db-singleton-shared-recovery.test.ts— 3 pass / 0 fail on real Postgres~/.gbrain/audit/db-disconnect-YYYY-Www.jsonlafter next user-reported bug: withRetry doesn't reconnect on 'No database connection' — batch rows lost every dream cycle #1570 cycle; identify offender; ship v0.41.28+ targeted fix🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.7 noreply@anthropic.com