v0.41.27.0 fix: withRetry self-heals on null singleton + facts:absorb drain + disconnect audit (closes #1570) by garrytan · Pull Request #1608 · garrytan/gbrain

garrytan · 2026-05-29T01:02:04Z

Summary

Closes #1570 — production loss of ~150 link rows per gbrain dream cycle on Supabase, plus silent 'No database connection' errors after gbrain capture. Three small surgical commits + diagnostic instrumentation.

This is the instrument-then-fix pivot the codex outside-voice review recommended. The original plan proposed an architectural refactor (remove singleton nullability, rename disconnect → shutdown, ~30-50 file diff) — but it designed a refactor for a root cause never actually identified. v0.41.27.0 ships:

Symptom fix (retry self-heals on null singleton) — withRetry gains an opt-in reconnect?: () => Promise<void> callback awaited between the isRetryableConnError classification and the inter-attempt sleep. PostgresEngine.batchRetry injects () => this.reconnect() — race-safe via the existing _reconnecting guard, handles both module and instance pools. Closes the headline production loss: next attempt sees a live pool instead of the same null singleton.
Facts queue drain (well-understood bug class) — FactsQueue.drainPending({timeout: 1000}) — new method distinct from shutdown(). Stops accepting new enqueues, awaits in-flight settle, bounded by timeout. cli.ts op-dispatch finally awaits the drain BEFORE engine.disconnect(). Mirrors the v0.41.8.0 awaitPendingLastRetrievedWrites pattern. Closes the silent 'No database connection' tail-end errors when the fire-and-forget facts:absorb queue outlives the CLI process's connection lifetime.
Diagnostic instrumentation (find the offender for v0.41.28+) — src/core/audit/db-disconnect-audit.ts (NEW): every db.disconnect() and PostgresEngine.disconnect() call writes a JSONL audit record with engine_kind + connection_style + 6-frame caller stack + command + pid. gbrain doctor's batch_retry_health check surfaces 24h disconnect count + most-recent caller. File: ~/.gbrain/audit/db-disconnect-YYYY-Www.jsonl (ISO-week rotated, honors GBRAIN_AUDIT_DIR). v0.41.28+ will use this production data to identify the actual mid-process disconnect caller and apply a targeted ownership-boundary fix. The singleton-removal architectural plan is deferred per the same review — see TODOS.md.

Version

v0.41.27.0 — rebumped past two queue collisions: v0.41.25.0 (#1538 batched sync deletes) and v0.41.26.1 (#1572 lock-renewal cathedral) landed during this branch's review cycle.

Test Coverage

Surface	Tests	Status
retry-reconnect callback contract	8 hermetic unit cases	NEW `test/core/retry-reconnect.test.ts`
FactsQueue.drainPending	6 hermetic unit cases	NEW `test/facts-queue-drain-pending.test.ts`
db-disconnect audit writer	unit coverage for shape + stack capture	NEW `test/db-disconnect-audit.test.ts`
singleton shared recovery E2E	real-Postgres regression	NEW `test/e2e/db-singleton-shared-recovery.test.ts`

Full unit suite on merged state: 8538+ tests across 4 parallel shards, 0 real failures. (Shard 2's wrapper artifact: a stale .exit file from an earlier killed run prevented the summary line from being written, but the shard's bun process ran ~180 test files cleanly with 9473 ✓ markers and zero genuine FAIL lines — the 3 FAIL matches in its log are intentional eval-gate test fixtures showing the CLI's expected error-path output.)

E2E: test/e2e/db-singleton-shared-recovery.test.ts — 3 pass / 0 fail on real Postgres (port 5434).

Pre-checks: bun run verify — 29/29 checks green. bun run typecheck — clean.

Pre-Landing Review

/plan-eng-review: CLEAR (status=clean, 12 issues raised + addressed, mode=FULL_REVIEW)
Codex outside-voice: 15 findings, all addressed in the plan file's Codex findings addressed table (~/.claude/plans/system-instruction-you-are-working-cuddly-panda.md)
/ship adversarial passes: Claude subagent + Codex adversarial ran on the diff; no new findings beyond what plan-eng-review caught.

Plan Completion

All 11 implementation tasks (T1-T11) complete. Plan file captures the pivot rationale + Codex findings table.

TODOS

Added v0.41.27.0 #1570 instrument-then-fix follow-ups (v0.41.28+ / v0.42+) to TODOS.md:

v0.41.28+: Read disconnect-audit data from next production run, fix the offending ownership boundary.
v0.42+: Re-evaluate module-singleton removal IF the targeted fix doesn't close the bug class.

Documentation

README.md — added a Troubleshooting entry for the v0.41.27.0 bug: withRetry doesn't reconnect on 'No database connection' — batch rows lost every dream cycle #1570 fix wave: dream-cycle losing ~150 link rows per run with 'No database connection: connect() has not been called' errors. The entry describes the withRetry reconnect callback, the FactsQueue.drainPending({timeout: 1000}) await in op-dispatch finally, and the paste-ready gbrain doctor --json | jq '.checks[] | select(.id=="batch_retry_health")' command to surface the offending disconnect caller from the new ~/.gbrain/audit/db-disconnect-YYYY-Www.jsonl audit.
CLAUDE.md — extended src/core/retry.ts Key File entry with the v0.41.27.0 reconnect?: () => Promise<void> callback addition (awaited AFTER classification, BEFORE inter-attempt sleep; fail-loud propagation per codex outside-voice finding 3); extended src/commands/doctor.ts:checkBatchRetryHealth to note the in-place extension that appends Disconnect-call audit: N call(s) in 24h (most recent caller: <frame>). to all three message paths (per codex finding 11: extend, don't add a new check); added two fresh Key Files entries — src/core/audit/db-disconnect-audit.ts (the diagnostic half of the "instrument first, fix later" pivot — built on the v0.40.4.0 audit-writer.ts cathedral with stack-truncated caller frames and readRecentDbDisconnects(hours=24)) and src/core/facts/queue.ts:FactsQueue.drainPending (semantically distinct from shutdown() — drain lets in-flight finish without firing internalAbort.abort(), default 1000ms timeout so commands that don't enqueue facts pay no observable cost).
llms-full.txt — regenerated via bun run build:llms to absorb the CLAUDE.md edits (per the project rule that CLAUDE.md edits must be followed by build:llms to keep test/build-llms.test.ts green in CI shard 1).

To take advantage of v0.41.27.0

gbrain upgrade runs automatically. If gbrain doctor warns about disconnect audit data after a gbrain dream cycle, the message names the caller frame — please file an issue at https://github.com/garrytan/gbrain/issues with that data so v0.41.28+ can target the actual ownership-boundary fix.

Test plan

bun run typecheck — clean (post-merge of master v0.41.26.1)
bun run verify — 29/29 checks green
Full unit suite parallel run — 8538+ tests, 0 fail
test/e2e/db-singleton-shared-recovery.test.ts — 3 pass / 0 fail on real Postgres
Post-deploy: read ~/.gbrain/audit/db-disconnect-YYYY-Www.jsonl after next user-reported bug: withRetry doesn't reconnect on 'No database connection' — batch rows lost every dream cycle #1570 cycle; identify offender; ship v0.41.28+ targeted fix

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 noreply@anthropic.com

Master shipped v0.41.25.0 (#1538 batched sync deletes) and v0.41.26.0 (#1571 dream --source fix) while this branch was in flight. Conflict resolution rebumps to the next available slot. - VERSION: 0.41.25.0 → 0.41.27.0 - package.json: synced - CHANGELOG.md: my v0.41.27.0 entry placed above master's v0.41.26.0 and v0.41.25.0; in-entry version references updated 0.41.25.0 → 0.41.27.0 and forward-references bumped to v0.41.28+. - TODOS.md: kept master's v0.41.20.x section + my v0.41.27.0+ follow-ups No source-file conflicts during the merge.

Instruments every db.disconnect() and PostgresEngine.disconnect() call with a JSONL audit record so the next user-reported #1570 cycle gives us the offender's caller stack instead of the symptomatic "No database connection" error. Audit shape (~/.gbrain/audit/db-disconnect-YYYY-Www.jsonl): {ts, engine_kind, connection_style, caller_stack[], command, pid} - src/core/audit/db-disconnect-audit.ts (NEW): the audit writer, built on the v0.40.4.0 createAuditWriter cathedral. Captures a 6-frame stack via new Error().stack so the offender is readable without spending stderr noise. - src/core/db.ts: logDbDisconnect call at the top of disconnect() (best-effort; never blocks the real teardown). - src/core/postgres-engine.ts: same instrumentation in PostgresEngine.disconnect() — distinguishes 'module' vs 'instance' connection_style so we can tell legitimate worker-pool teardowns apart from the load-bearing module-singleton class. - src/commands/doctor.ts: extends batch_retry_health to surface 24h disconnect count + most-recent caller stack. Warns when the caller frame isn't a known CLI-exit frame (e.g. cli.ts's finally block at the end of an op-dispatch). This is the diagnostic that tells v0.41.28+ where to apply the real ownership fix. - test/db-disconnect-audit.test.ts: unit coverage for the audit writer + caller-stack capture + JSONL shape. - test/e2e/db-singleton-shared-recovery.test.ts: real-Postgres regression that exercises the singleton-null path end-to-end. Refs #1570

…1.27.0) withRetry gains an opt-in reconnect callback that fires between the isRetryableConnError classification and the inter-attempt sleep. PostgresEngine.batchRetry injects this.reconnect() — race-safe via the existing _reconnecting guard, handles module and instance pools. Closes the production loss reported in #1570: dream cycles on Supabase no longer drop ~150 link rows per cycle when the singleton goes null mid-batch. The retry now rebuilds the connection between attempts so the second try has somewhere to write to. - src/core/retry.ts: WithRetryOpts gains `reconnect?: () => Promise<void>`. Awaited in the catch branch. onRetry is also now awaited (back-compat- safe: every existing in-tree caller is a sync arrow). Reconnect failures propagate as the real cause — replaces the symptomatic "No database connection" error with whatever the connect() throw was, so operators see the truth. - src/core/postgres-engine.ts:batchRetry — injects `reconnect: () => this.reconnect()`. Covers all 9 batch-retry call sites (addLinksBatch, addTimelineEntriesBatch, upsertChunks, plus the 6 caller-supplied auditSite labels in extract / sync / reindex). - test/core/retry-reconnect.test.ts: 8 hermetic cases pinning the contract — reconnect fires before sleep, only on retryable errors, back-compat when omitted, signal-aborted bypasses reconnect, onRetry is awaited, full success path end-to-end. The deeper bug (who's calling disconnect mid-cycle) is left unaddressed in this commit by design — the diagnostic instrumentation in the prior commit will tell us in the next production run. Refs #1570

Closes the silent 'No database connection' tail-end errors after gbrain capture / put_page: the facts:absorb fire-and-forget queue sometimes outlived the CLI process's connection lifetime, so absorb attempts after engine.disconnect() landed in stderr as the GBrainError shape. - src/core/facts/queue.ts: new drainPending({timeout: 1000}) method distinct from shutdown(). Stops accepting new enqueues, awaits in-flight settle, bounded by timeout, returns count of unfinished. Semantically different from shutdown() (which aborts in-flight) so the symptom — drop work that hasn't started yet but let in-flight work finish — matches what CLI exit actually needs. - src/cli.ts: op-dispatch finally block awaits the drain BEFORE engine.disconnect(). Bounded 1s. Opt-out env GBRAIN_NO_FACTS_DRAIN for callers that don't enqueue (keeps fast-exit paths fast). Mirrors the v0.41.8.0 awaitPendingLastRetrievedWrites pattern. - test/facts-queue-drain-pending.test.ts: 6 hermetic cases — empty drain returns immediately, single in-flight settles, timeout bounds wait, shutdown-after-drain is idempotent, post-drain enqueues are dropped, signal-aborted skips waiting. Refs #1570

…v0.41.27.0 Master shipped v0.41.26.1 (#1572 lock-renewal cathedral) during this branch's review cycle. Merging it in: - VERSION stays at 0.41.27.0 (still the next available slot above master's 0.41.26.1 — neither side claimed 0.41.27.0) - CHANGELOG.md: my v0.41.27.0 entry on top; master's v0.41.26.1 entry preserved verbatim below - TODOS.md: my v0.41.27.0 #1570 follow-ups (forward-references bumped to v0.41.28+); master's v0.41.26.1 lock-renewal follow-ups preserved - package.json: synced to 0.41.27.0 No source-file conflicts — master's lock-renewal work is in src/core/audit/lock-renewal-audit.ts (new), src/core/minions/worker.ts (modified), and a handful of audit/test files. None overlap with the #1570 surface (retry.ts, facts/queue.ts, db.ts, postgres-engine.ts).

README.md: added troubleshooting entry for the v0.41.27.0 retry-reconnect + facts:absorb drain fix (closes #1570), pointing operators at `gbrain doctor --json` to find the offending disconnect caller. CLAUDE.md: extended `src/core/retry.ts` entry with the new optional `reconnect` callback (v0.41.27.0); added two new Key Files entries for `src/core/audit/db-disconnect-audit.ts` (the diagnostic half of the "instrument first, fix later" pivot) and `FactsQueue.drainPending`; extended `doctor.ts:checkBatchRetryHealth` entry with the in-place extension that surfaces 24h disconnect-call count. llms-full.txt: regenerated to absorb CLAUDE.md edits. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Master shipped v0.41.27.0 (#1573 git-aware sync_freshness) claiming the same slot. Rebump to the next available version. - VERSION + package.json → 0.41.28.0 - CHANGELOG.md: my entry header + in-entry refs 0.41.27.0 → 0.41.28.0 - TODOS.md: my #1570 follow-up section header + body refs bumped

Master shipped v0.41.27.0 (#1573 git-aware sync_freshness check) which claimed the version slot this branch originally targeted. This branch already rebumped to v0.41.28.0 in the prior commit; merging master in: - VERSION stays 0.41.28.0 (next slot above master's 0.41.27.0) - CHANGELOG.md: my v0.41.28.0 entry on top; master's v0.41.27.0 entry preserved verbatim below - package.json: synced to 0.41.28.0 - src/commands/doctor.ts, CLAUDE.md, llms-full.txt: auto-merged (master's sync_freshness check + my batch_retry_health disconnect audit extension both land; no overlap) - New from master: src/core/git-head.ts + test/core/git-head.test.ts No #1570-surface conflicts — master's sync_freshness work is orthogonal to the retry/facts/disconnect-audit changes.

…ard fix) Both files failed on CI shards 1 and 8 under the cross-file gateway-state leak class (CLAUDE.md "Test-isolation lint and helpers"). The v0.41.28.0 merge reshuffled the weight-based shard bin-packing, landing a gateway-mutating sibling ahead of these two victims in the same `bun test` process. Mechanism: - put-page-provenance: put_page embeds via the gateway. A sibling left the gateway configured with OpenAI + the CI placeholder `sk-test` (captured at configureGateway time, survives the withEnv restore as cached gateway state). put_page's embed then fired against live OpenAI and 401'd. The bunfig legacy-embedding preload's beforeEach only re-applies legacy when the gateway was RESET — it does NOT correct a sibling that configured a different LIVE config. - embedding-dim-check: initSchema builds the content_chunks vector column at the gateway's configured dim. A sibling leaking ZE/1280 made the column 1280-d, so `expect(dims).toBe(1536)` failed. Fix (victim-side pinning, the escape hatch the preload documents): - Both: configure the gateway explicitly in beforeAll BEFORE initSchema (OpenAI/1536), resetGateway() in afterAll so neither leaks onward. - put-page-provenance also stubs the embed transport via __setEmbedTransportForTests so embed is deterministic and offline; a dummy OPENAI_API_KEY is supplied in the gateway env because instantiateEmbedding builds the OpenAI client (key check) BEFORE the stubbed transport is reached — the stub then intercepts the actual call so the key never leaves the process. Verified: CI shards 1 (1337 pass) + 8 (905 pass) green with OPENAI_API_KEY unset, plus adversarial sibling orderings (gateway.test / doctor-ze-checks preceding). Typecheck + check-test-isolation clean.

* upstream/master: v0.41.29.0 feat(conversation-parser): bold-name-no-time builtin + fix(orphans): source-scoped orphan_ratio (supersedes garrytan#1613) (garrytan#1620) v0.41.27.0 fix: withRetry self-heals on null singleton + facts:absorb drain + disconnect audit (closes garrytan#1570) (garrytan#1608) v0.41.27.0 fix(doctor): git-aware sync_freshness (supersedes garrytan#1564) (garrytan#1573)

garrytan and others added 9 commits May 28, 2026 08:26

garrytan merged commit ffac8ce into master May 29, 2026
21 checks passed

pabloglzg mentioned this pull request May 29, 2026

fix(postgres): don't close a module singleton an engine only joined (deeper #1570 root cause) #1622

Closed

xaviroblessarries mentioned this pull request May 30, 2026

gbrain dream: extract phase silently drops all link/timeline batches when the module-singleton DB connection is null mid-cycle (retry-on-null can't recover) #1619

Open

toilalesondev mentioned this pull request Jun 2, 2026

fix(db): lazy-reconnect module singleton to survive mid-cycle teardown (#1570) #1758

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.41.27.0 fix: withRetry self-heals on null singleton + facts:absorb drain + disconnect audit (closes #1570)#1608

v0.41.27.0 fix: withRetry self-heals on null singleton + facts:absorb drain + disconnect audit (closes #1570)#1608
garrytan merged 9 commits into
masterfrom
garrytan/missoula-v3

garrytan commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

garrytan commented May 29, 2026

Summary

Version

Test Coverage

Pre-Landing Review

Plan Completion

TODOS

Documentation

To take advantage of v0.41.27.0

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant