Skip to content

v0.42.21.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (#1404/#1471/#1619)#1805

Merged
garrytan merged 5 commits into
masterfrom
garrytan/fix-connect-singleton-clobber
Jun 3, 2026
Merged

v0.42.21.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (#1404/#1471/#1619)#1805
garrytan merged 5 commits into
masterfrom
garrytan/fix-connect-singleton-clobber

Conversation

@garrytan

@garrytan garrytan commented Jun 3, 2026

Copy link
Copy Markdown
Owner

What this fixes

gbrain dream on Postgres (local or Supabase) failed every DB-dependent phase with No database connection: connect() has not been called. lint/backlinks (filesystem-only) worked; sync/synthesize/embed/orphans blew up; extract reported "created 0" while silently dropping every row. Each phase worked in isolation (separate process) — only the full cycle broke, and it left a stuck gbrain_cycle_locks row.

This is the canonical landing for a long-standing class with ~15 competing community PRs and zero merged. Closes #1404, #1471, #1619 (symptom tracked in #1570/#1535).

Root cause (confirmed)

The dream cycle opens one long-lived "owner" connection (the module-level sql singleton in src/core/db.ts). Short-lived "borrower" probe engines (lint config-lift, doctor) open without poolSize, so they share that singleton — but their disconnect() cascaded into db.disconnect() and nulled the singleton the owner was still using. The module sql is only ever nulled by db.disconnect() (postgres.js auto-reconnects its own internal pool and never touches our reference), so the failure was always a borrower-disconnect — never the "idle pooler drop" hypothesized in #1570/#1619. PostgresEngine.disconnect() called db.disconnect() for any _connectionStyle === 'module' engine with no check for whether it actually created the singleton.

Prior partial fixes addressed symptoms: v0.41.27.0's retry-with-reconnect rescued batch phases (extract) but not sync/synthesize (they don't use the retry path); v0.42.5.0 stopped the lint phase from being a borrower but left the structural hole open for every other helper.

The fix

  • Ownership token. db.connect() now returns booleantrue iff THIS call created the singleton. The decision is atomic: there's no await between the if (sql) null-check and the synchronous sql = postgres(...) assignment, so two concurrent connects can't both claim creation. PostgresEngine stores it as _ownsModuleSingleton and only calls db.disconnect() when it owns the connection; borrowers clear their own marker and leave the shared pool alone.
  • Concurrency hardening (same blast radius): db.disconnect() snapshots + nulls sql before awaiting end() (a concurrent connect can't join a closing pool); PostgresEngine.reconnect() uses a shared in-flight _reconnectPromise (replaces the _reconnecting boolean that returned the 2nd caller immediately against a half-rebuilt pool).
  • cli.ts invariant tripwire on the dream owner-disconnect documenting that the owner must outlive its borrowers.

Tests

  • New test/postgres-engine-singleton-ownership.test.ts — source-level guardrails for the ownership contract (atomic token, guarded teardown, no TOCTOU pre-sample).
  • Expanded DB-gated matrix in test/e2e/postgres-engine-disconnect-idempotency.test.ts — owner/borrower, creation-not-role, symmetric CLI-exit (no hang regression), owner-reconnect-with-live-borrower.
  • Module-style getter asymmetry in test/postgres-engine-getter-selfheal.test.ts.
  • bug: withRetry doesn't reconnect on 'No database connection' — batch rows lost every dream cycle #1570 shared-recovery regression updated to assert the fixed contract (it had pinned the bug).

Verification: bun run verify 29/29; blast-radius + e2e + migrate suites 254/254 against real Postgres. Full unit matrix runs in CI (the local box couldn't complete it under load).

Review

Ran /plan-eng-review + /review with both Codex and Claude adversarial passes. Both verified the core fix correct. They converged on one concern — per-engine ownership state about a shared resource can desync under concurrent module connect/reconnect — which is not reachable in current gbrain (sequential cycle phases, nested borrowers, instance-engine worker pool, background writes reuse the owner). Captured as a TODOS.md follow-up (the refcount/generation hardening to do before any concurrent-module-connect path is introduced), alongside two other pre-existing items (facts-queue drain on dream/fall-through paths; stale ConnectionManager after reconnect).

Credit

The ownership-flag approach that shipped: @nullhex-io (#1651) and @joelwp (#1667). Root-cause walkthroughs: @BrendanGahan, @xaviroblessarries, and the other #1471 reporters.

🤖 Generated with Claude Code

garrytan and others added 3 commits June 2, 2026 22:48
…nger nulls the cycle's connection (#1404/#1471/#1619)

gbrain dream on Postgres failed every DB phase with "No database connection:
connect() has not been called": a short-lived borrower probe engine (lint/doctor
config-lift, no poolSize) called db.disconnect() in its own disconnect(), nulling
the shared module singleton the long-lived cycle owner was still using. The module
`sql` is only ever nulled by db.disconnect() (postgres.js auto-reconnects its own
pool), so the failure was always a borrower-disconnect, never an idle-pooler drop.

Fix: db.connect() returns whether THIS call created the singleton (atomic — no
await between the null-check and the sql=postgres() assignment), PostgresEngine
stores it as _ownsModuleSingleton, and disconnect() only calls db.disconnect()
when it owns the connection. Borrowers no-op. Hardening: db.disconnect() snapshots
+nulls sql before awaiting end(); reconnect() shares one in-flight _reconnectPromise.

Tests: new postgres-engine-singleton-ownership.test.ts; expanded DB-gated e2e
matrix (owner/borrower, creation-not-role, symmetric CLI-exit, owner-reconnect-
with-live-borrower); module-style getter asymmetry; #1570 shared-recovery
regression updated to assert the fixed contract.

Co-Authored-By: nullhex-io <noreply@github.com>
Co-Authored-By: joelwp <noreply@github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…2.21.0)

Resolves conflicts from master's v0.42.11.0–0.42.16.0 wave:
- VERSION/package.json/CHANGELOG → 0.42.21.0 (claimed slot past sibling versions)
- src/core/postgres-engine.ts reconnect() + src/core/retry.ts: merged this branch's
  shared `_reconnectPromise` concurrency fix with master's #1685 pool-recovery audit
  (ctx?.error classification + logPoolRecovery), keeping both.
- CLAUDE.md: took master's reference-map refactor; documented the module-singleton
  ownership behavior as current-state prose in docs/architecture/KEY_FILES.md instead.
- TODOS.md/CHANGELOG.md: kept both this branch's and master's new sections.
- llms bundles regenerated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@garrytan garrytan changed the title v0.42.15.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (#1404/#1471/#1619) v0.42.21.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (#1404/#1471/#1619) Jun 3, 2026
…2.21.0)

Resolves conflicts from master's v0.42.17.0–0.42.18.0 sync wave:
- VERSION/package.json/CHANGELOG → 0.42.21.0 (kept my entry on top + master's new entries)
- src/cli.ts finally block: kept BOTH master's `syncWatchdog?.dispose()` (#1633) and
  this branch's #1471 owner-disconnect tripwire comment.
- llms bundles regenerated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…2.21.0)

Resolves conflicts from master's v0.42.19.0–0.42.20.0 reliability wave:
- src/core/postgres-engine.ts reconnect(): took master's style-aware reconnect
  (#1745/#1810 — module-singleton path recovers idempotently via db.connect()
  WITHOUT tearing down the shared pool; instance path rebuilds + pool-recovery
  audit). Kept this branch's _ownsModuleSingleton ownership token in
  connect()/disconnect(). Dropped this branch's _reconnectPromise in favor of
  master's _reconnecting guard (master's module-never-teardown obviates it);
  updated the singleton-ownership test + retry.ts/batchRetry comments to match.
- src/cli.ts: took master's drain-before-disconnect block (#1762) on the
  CLI_ONLY owner-disconnect (this resolves the F5 facts-queue-drain TODO for the
  fall-through path); kept the #1471 owner-disconnect-last invariant note.
- VERSION/package.json/CHANGELOG/TODOS → 0.42.21.0, both sides' entries kept.
- llms bundles regenerated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit f3ade6c into master Jun 3, 2026
21 checks passed
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  v0.42.23.0 feat(jobs): --nice scheduling-priority flag for jobs work/supervisor (garrytan#1815) (garrytan#1820)
  v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (garrytan#1801) (garrytan#1824)
  v0.42.21.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (garrytan#1404/garrytan#1471/garrytan#1619) (garrytan#1805)
  v0.42.20.0 fix: reliability wave — PGLite capture lock-pin + Postgres reconnect race + search embed-hang (garrytan#1762 garrytan#1745 garrytan#1775) (garrytan#1810)
  v0.42.19.0 fix(skillopt): close the last gap in the AI SDK v6 tool-loop fix (write-capture mapper + regression test) (garrytan#1809)
  v0.42.18.0 fix: sync orphan-pileup watchdog (garrytan#1633) + links-lag µs stamp (garrytan#1768) (garrytan#1807)
  v0.42.17.0 fix(sync): resumable incremental sync — killed mid-import no longer loses progress (garrytan#1794) (garrytan#1808)
  v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (garrytan#1685) (garrytan#1802)
  v0.42.15.0 fix: decouple CLI primary output from process.stdout.isTTY (garrytan#1784) (garrytan#1806)
  v0.42.14.0 fix(zero-config): code-* readiness signal + init embedding-key validation + lock self-heal (garrytan#1780) (garrytan#1804)
  v0.42.13.0 fix(search): archive/ content findable by default, demoted not hard-excluded (garrytan#1777) (garrytan#1797)
  v0.42.12.0 feat: self-upgrading gbrain — invocation-riding update check + opt-in auto-upgrade (garrytan#1798)
  v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts (garrytan#1759)
  v0.42.10.0 feat(extract): opt-in global-basename wikilink resolution (closes garrytan#972) (garrytan#1388)
This was referenced Jun 8, 2026
This was referenced Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant