Skip to content

v0.41.27.0 fix: withRetry self-heals on null singleton + facts:absorb drain + disconnect audit (closes #1570)#1608

Merged
garrytan merged 9 commits into
masterfrom
garrytan/missoula-v3
May 29, 2026
Merged

v0.41.27.0 fix: withRetry self-heals on null singleton + facts:absorb drain + disconnect audit (closes #1570)#1608
garrytan merged 9 commits into
masterfrom
garrytan/missoula-v3

Conversation

@garrytan

Copy link
Copy Markdown
Owner

Summary

Closes #1570 — production loss of ~150 link rows per gbrain dream cycle on Supabase, plus silent 'No database connection' errors after gbrain capture. Three small surgical commits + diagnostic instrumentation.

This is the instrument-then-fix pivot the codex outside-voice review recommended. The original plan proposed an architectural refactor (remove singleton nullability, rename disconnect → shutdown, ~30-50 file diff) — but it designed a refactor for a root cause never actually identified. v0.41.27.0 ships:

  1. Symptom fix (retry self-heals on null singleton)withRetry gains an opt-in reconnect?: () => Promise<void> callback awaited between the isRetryableConnError classification and the inter-attempt sleep. PostgresEngine.batchRetry injects () => this.reconnect() — race-safe via the existing _reconnecting guard, handles both module and instance pools. Closes the headline production loss: next attempt sees a live pool instead of the same null singleton.

  2. Facts queue drain (well-understood bug class)FactsQueue.drainPending({timeout: 1000}) — new method distinct from shutdown(). Stops accepting new enqueues, awaits in-flight settle, bounded by timeout. cli.ts op-dispatch finally awaits the drain BEFORE engine.disconnect(). Mirrors the v0.41.8.0 awaitPendingLastRetrievedWrites pattern. Closes the silent 'No database connection' tail-end errors when the fire-and-forget facts:absorb queue outlives the CLI process's connection lifetime.

  3. Diagnostic instrumentation (find the offender for v0.41.28+)src/core/audit/db-disconnect-audit.ts (NEW): every db.disconnect() and PostgresEngine.disconnect() call writes a JSONL audit record with engine_kind + connection_style + 6-frame caller stack + command + pid. gbrain doctor's batch_retry_health check surfaces 24h disconnect count + most-recent caller. File: ~/.gbrain/audit/db-disconnect-YYYY-Www.jsonl (ISO-week rotated, honors GBRAIN_AUDIT_DIR). v0.41.28+ will use this production data to identify the actual mid-process disconnect caller and apply a targeted ownership-boundary fix. The singleton-removal architectural plan is deferred per the same review — see TODOS.md.

Version

v0.41.27.0 — rebumped past two queue collisions: v0.41.25.0 (#1538 batched sync deletes) and v0.41.26.1 (#1572 lock-renewal cathedral) landed during this branch's review cycle.

Test Coverage

Surface Tests Status
retry-reconnect callback contract 8 hermetic unit cases NEW test/core/retry-reconnect.test.ts
FactsQueue.drainPending 6 hermetic unit cases NEW test/facts-queue-drain-pending.test.ts
db-disconnect audit writer unit coverage for shape + stack capture NEW test/db-disconnect-audit.test.ts
singleton shared recovery E2E real-Postgres regression NEW test/e2e/db-singleton-shared-recovery.test.ts

Full unit suite on merged state: 8538+ tests across 4 parallel shards, 0 real failures. (Shard 2's wrapper artifact: a stale .exit file from an earlier killed run prevented the summary line from being written, but the shard's bun process ran ~180 test files cleanly with 9473 ✓ markers and zero genuine FAIL lines — the 3 FAIL matches in its log are intentional eval-gate test fixtures showing the CLI's expected error-path output.)

E2E: test/e2e/db-singleton-shared-recovery.test.ts — 3 pass / 0 fail on real Postgres (port 5434).

Pre-checks: bun run verify — 29/29 checks green. bun run typecheck — clean.

Pre-Landing Review

  • /plan-eng-review: CLEAR (status=clean, 12 issues raised + addressed, mode=FULL_REVIEW)
  • Codex outside-voice: 15 findings, all addressed in the plan file's Codex findings addressed table (~/.claude/plans/system-instruction-you-are-working-cuddly-panda.md)
  • /ship adversarial passes: Claude subagent + Codex adversarial ran on the diff; no new findings beyond what plan-eng-review caught.

Plan Completion

All 11 implementation tasks (T1-T11) complete. Plan file captures the pivot rationale + Codex findings table.

TODOS

Added v0.41.27.0 #1570 instrument-then-fix follow-ups (v0.41.28+ / v0.42+) to TODOS.md:

  • v0.41.28+: Read disconnect-audit data from next production run, fix the offending ownership boundary.
  • v0.42+: Re-evaluate module-singleton removal IF the targeted fix doesn't close the bug class.

Documentation

  • README.md — added a Troubleshooting entry for the v0.41.27.0 bug: withRetry doesn't reconnect on 'No database connection' — batch rows lost every dream cycle #1570 fix wave: dream-cycle losing ~150 link rows per run with 'No database connection: connect() has not been called' errors. The entry describes the withRetry reconnect callback, the FactsQueue.drainPending({timeout: 1000}) await in op-dispatch finally, and the paste-ready gbrain doctor --json | jq '.checks[] | select(.id=="batch_retry_health")' command to surface the offending disconnect caller from the new ~/.gbrain/audit/db-disconnect-YYYY-Www.jsonl audit.
  • CLAUDE.md — extended src/core/retry.ts Key File entry with the v0.41.27.0 reconnect?: () => Promise<void> callback addition (awaited AFTER classification, BEFORE inter-attempt sleep; fail-loud propagation per codex outside-voice finding 3); extended src/commands/doctor.ts:checkBatchRetryHealth to note the in-place extension that appends Disconnect-call audit: N call(s) in 24h (most recent caller: <frame>). to all three message paths (per codex finding 11: extend, don't add a new check); added two fresh Key Files entries — src/core/audit/db-disconnect-audit.ts (the diagnostic half of the "instrument first, fix later" pivot — built on the v0.40.4.0 audit-writer.ts cathedral with stack-truncated caller frames and readRecentDbDisconnects(hours=24)) and src/core/facts/queue.ts:FactsQueue.drainPending (semantically distinct from shutdown() — drain lets in-flight finish without firing internalAbort.abort(), default 1000ms timeout so commands that don't enqueue facts pay no observable cost).
  • llms-full.txt — regenerated via bun run build:llms to absorb the CLAUDE.md edits (per the project rule that CLAUDE.md edits must be followed by build:llms to keep test/build-llms.test.ts green in CI shard 1).

To take advantage of v0.41.27.0

gbrain upgrade runs automatically. If gbrain doctor warns about disconnect audit data after a gbrain dream cycle, the message names the caller frame — please file an issue at https://github.com/garrytan/gbrain/issues with that data so v0.41.28+ can target the actual ownership-boundary fix.

Test plan

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 noreply@anthropic.com

garrytan and others added 9 commits May 28, 2026 08:26
Master shipped v0.41.25.0 (#1538 batched sync deletes) and v0.41.26.0
(#1571 dream --source fix) while this branch was in flight. Conflict
resolution rebumps to the next available slot.

- VERSION: 0.41.25.0 → 0.41.27.0
- package.json: synced
- CHANGELOG.md: my v0.41.27.0 entry placed above master's v0.41.26.0
  and v0.41.25.0; in-entry version references updated 0.41.25.0 →
  0.41.27.0 and forward-references bumped to v0.41.28+.
- TODOS.md: kept master's v0.41.20.x section + my v0.41.27.0+ follow-ups

No source-file conflicts during the merge.
Instruments every db.disconnect() and PostgresEngine.disconnect() call
with a JSONL audit record so the next user-reported #1570 cycle gives
us the offender's caller stack instead of the symptomatic
"No database connection" error.

Audit shape (~/.gbrain/audit/db-disconnect-YYYY-Www.jsonl):
  {ts, engine_kind, connection_style, caller_stack[], command, pid}

- src/core/audit/db-disconnect-audit.ts (NEW): the audit writer,
  built on the v0.40.4.0 createAuditWriter cathedral. Captures a
  6-frame stack via new Error().stack so the offender is readable
  without spending stderr noise.
- src/core/db.ts: logDbDisconnect call at the top of disconnect()
  (best-effort; never blocks the real teardown).
- src/core/postgres-engine.ts: same instrumentation in
  PostgresEngine.disconnect() — distinguishes 'module' vs 'instance'
  connection_style so we can tell legitimate worker-pool teardowns
  apart from the load-bearing module-singleton class.
- src/commands/doctor.ts: extends batch_retry_health to surface
  24h disconnect count + most-recent caller stack. Warns when the
  caller frame isn't a known CLI-exit frame (e.g. cli.ts's finally
  block at the end of an op-dispatch). This is the diagnostic that
  tells v0.41.28+ where to apply the real ownership fix.
- test/db-disconnect-audit.test.ts: unit coverage for the audit
  writer + caller-stack capture + JSONL shape.
- test/e2e/db-singleton-shared-recovery.test.ts: real-Postgres
  regression that exercises the singleton-null path end-to-end.

Refs #1570
…1.27.0)

withRetry gains an opt-in reconnect callback that fires between the
isRetryableConnError classification and the inter-attempt sleep.
PostgresEngine.batchRetry injects this.reconnect() — race-safe via
the existing _reconnecting guard, handles module and instance pools.

Closes the production loss reported in #1570: dream cycles on Supabase
no longer drop ~150 link rows per cycle when the singleton goes null
mid-batch. The retry now rebuilds the connection between attempts so
the second try has somewhere to write to.

- src/core/retry.ts: WithRetryOpts gains `reconnect?: () => Promise<void>`.
  Awaited in the catch branch. onRetry is also now awaited (back-compat-
  safe: every existing in-tree caller is a sync arrow). Reconnect
  failures propagate as the real cause — replaces the symptomatic
  "No database connection" error with whatever the connect() throw
  was, so operators see the truth.
- src/core/postgres-engine.ts:batchRetry — injects
  `reconnect: () => this.reconnect()`. Covers all 9 batch-retry call
  sites (addLinksBatch, addTimelineEntriesBatch, upsertChunks, plus
  the 6 caller-supplied auditSite labels in extract / sync / reindex).
- test/core/retry-reconnect.test.ts: 8 hermetic cases pinning the
  contract — reconnect fires before sleep, only on retryable errors,
  back-compat when omitted, signal-aborted bypasses reconnect,
  onRetry is awaited, full success path end-to-end.

The deeper bug (who's calling disconnect mid-cycle) is left
unaddressed in this commit by design — the diagnostic instrumentation
in the prior commit will tell us in the next production run.

Refs #1570
Closes the silent 'No database connection' tail-end errors after
gbrain capture / put_page: the facts:absorb fire-and-forget queue
sometimes outlived the CLI process's connection lifetime, so absorb
attempts after engine.disconnect() landed in stderr as the
GBrainError shape.

- src/core/facts/queue.ts: new drainPending({timeout: 1000}) method
  distinct from shutdown(). Stops accepting new enqueues, awaits
  in-flight settle, bounded by timeout, returns count of unfinished.
  Semantically different from shutdown() (which aborts in-flight)
  so the symptom — drop work that hasn't started yet but let
  in-flight work finish — matches what CLI exit actually needs.
- src/cli.ts: op-dispatch finally block awaits the drain BEFORE
  engine.disconnect(). Bounded 1s. Opt-out env GBRAIN_NO_FACTS_DRAIN
  for callers that don't enqueue (keeps fast-exit paths fast).
  Mirrors the v0.41.8.0 awaitPendingLastRetrievedWrites pattern.
- test/facts-queue-drain-pending.test.ts: 6 hermetic cases — empty
  drain returns immediately, single in-flight settles, timeout
  bounds wait, shutdown-after-drain is idempotent, post-drain
  enqueues are dropped, signal-aborted skips waiting.

Refs #1570
…v0.41.27.0

Master shipped v0.41.26.1 (#1572 lock-renewal cathedral) during this
branch's review cycle. Merging it in:

- VERSION stays at 0.41.27.0 (still the next available slot above
  master's 0.41.26.1 — neither side claimed 0.41.27.0)
- CHANGELOG.md: my v0.41.27.0 entry on top; master's v0.41.26.1 entry
  preserved verbatim below
- TODOS.md: my v0.41.27.0 #1570 follow-ups (forward-references bumped
  to v0.41.28+); master's v0.41.26.1 lock-renewal follow-ups preserved
- package.json: synced to 0.41.27.0

No source-file conflicts — master's lock-renewal work is in
src/core/audit/lock-renewal-audit.ts (new), src/core/minions/worker.ts
(modified), and a handful of audit/test files. None overlap with the
#1570 surface (retry.ts, facts/queue.ts, db.ts, postgres-engine.ts).
README.md: added troubleshooting entry for the v0.41.27.0 retry-reconnect
+ facts:absorb drain fix (closes #1570), pointing operators at
`gbrain doctor --json` to find the offending disconnect caller.

CLAUDE.md: extended `src/core/retry.ts` entry with the new optional
`reconnect` callback (v0.41.27.0); added two new Key Files entries for
`src/core/audit/db-disconnect-audit.ts` (the diagnostic half of the
"instrument first, fix later" pivot) and `FactsQueue.drainPending`;
extended `doctor.ts:checkBatchRetryHealth` entry with the in-place
extension that surfaces 24h disconnect-call count.

llms-full.txt: regenerated to absorb CLAUDE.md edits.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Master shipped v0.41.27.0 (#1573 git-aware sync_freshness) claiming the
same slot. Rebump to the next available version.

- VERSION + package.json → 0.41.28.0
- CHANGELOG.md: my entry header + in-entry refs 0.41.27.0 → 0.41.28.0
- TODOS.md: my #1570 follow-up section header + body refs bumped
Master shipped v0.41.27.0 (#1573 git-aware sync_freshness check) which
claimed the version slot this branch originally targeted. This branch
already rebumped to v0.41.28.0 in the prior commit; merging master in:

- VERSION stays 0.41.28.0 (next slot above master's 0.41.27.0)
- CHANGELOG.md: my v0.41.28.0 entry on top; master's v0.41.27.0 entry
  preserved verbatim below
- package.json: synced to 0.41.28.0
- src/commands/doctor.ts, CLAUDE.md, llms-full.txt: auto-merged
  (master's sync_freshness check + my batch_retry_health disconnect
  audit extension both land; no overlap)
- New from master: src/core/git-head.ts + test/core/git-head.test.ts

No #1570-surface conflicts — master's sync_freshness work is orthogonal
to the retry/facts/disconnect-audit changes.
…ard fix)

Both files failed on CI shards 1 and 8 under the cross-file gateway-state
leak class (CLAUDE.md "Test-isolation lint and helpers"). The v0.41.28.0
merge reshuffled the weight-based shard bin-packing, landing a
gateway-mutating sibling ahead of these two victims in the same `bun test`
process.

Mechanism:
- put-page-provenance: put_page embeds via the gateway. A sibling left
  the gateway configured with OpenAI + the CI placeholder `sk-test`
  (captured at configureGateway time, survives the withEnv restore as
  cached gateway state). put_page's embed then fired against live OpenAI
  and 401'd. The bunfig legacy-embedding preload's beforeEach only
  re-applies legacy when the gateway was RESET — it does NOT correct a
  sibling that configured a different LIVE config.
- embedding-dim-check: initSchema builds the content_chunks vector column
  at the gateway's configured dim. A sibling leaking ZE/1280 made the
  column 1280-d, so `expect(dims).toBe(1536)` failed.

Fix (victim-side pinning, the escape hatch the preload documents):
- Both: configure the gateway explicitly in beforeAll BEFORE initSchema
  (OpenAI/1536), resetGateway() in afterAll so neither leaks onward.
- put-page-provenance also stubs the embed transport via
  __setEmbedTransportForTests so embed is deterministic and offline; a
  dummy OPENAI_API_KEY is supplied in the gateway env because
  instantiateEmbedding builds the OpenAI client (key check) BEFORE the
  stubbed transport is reached — the stub then intercepts the actual
  call so the key never leaves the process.

Verified: CI shards 1 (1337 pass) + 8 (905 pass) green with
OPENAI_API_KEY unset, plus adversarial sibling orderings (gateway.test /
doctor-ze-checks preceding). Typecheck + check-test-isolation clean.
@garrytan garrytan merged commit ffac8ce into master May 29, 2026
21 checks passed
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  v0.41.29.0 feat(conversation-parser): bold-name-no-time builtin + fix(orphans): source-scoped orphan_ratio (supersedes garrytan#1613) (garrytan#1620)
  v0.41.27.0 fix: withRetry self-heals on null singleton + facts:absorb drain + disconnect audit (closes garrytan#1570) (garrytan#1608)
  v0.41.27.0 fix(doctor): git-aware sync_freshness (supersedes garrytan#1564) (garrytan#1573)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: withRetry doesn't reconnect on 'No database connection' — batch rows lost every dream cycle

1 participant