Skip to content

fix(db): refcount module singleton to fix mid-cycle "connect() has not been called" (#1570)#1754

Closed
umutkeltek wants to merge 1 commit into
garrytan:masterfrom
umutkeltek:umut/fix-1570-singleton-refcount
Closed

fix(db): refcount module singleton to fix mid-cycle "connect() has not been called" (#1570)#1754
umutkeltek wants to merge 1 commit into
garrytan:masterfrom
umutkeltek:umut/fix-1570-singleton-refcount

Conversation

@umutkeltek

Copy link
Copy Markdown

Summary

Fixes the #1570 "singleton-null" class: during gbrain dream the sync and synthesize phases throw No database connection: connect() has not been called mid-cycle, even though the engine connected successfully at startup. Root cause is a shared-ownership bug on the db.ts module singleton; fix is to reference-count its owners.

This is the production caller the v0.41.25 disconnect instrumentation (src/core/audit/db-disconnect-audit.ts, "only this audit tells us WHO is actually calling disconnect mid-process") was added to find.

Reproduction

gbrain dream against a PgBouncer-pooled Postgres (Supabase-style transaction pooler). The cycle completes "partial":

✗ sync        sync phase failed
    [InternalError/UNKNOWN] No database connection: connect() has not been called.
✗ synthesize  synthesize phase failed
    [InternalError/SYNTH_PHASE_FAIL] No database connection: connect() has not been called.
✓ extract     ... (only survives because the batch-retry layer reconnects mid-phase)

Root cause (traced with live stacks)

The module-level sql singleton in src/core/db.ts is shared by every module-style engine in the process. gbrain dream connects more than one engine (the main cycle engine plus a transient probe/migration engine). When the transient engine calls disconnect(), PostgresEngine.disconnect() (module style) calls db.disconnect(), which runs sql.end() and nulls the shared singleton — while the long-lived cycle engine is still using it.

Instrumented db.disconnect() with a stack — the offending call fires before cycle.sync:

Error: db.disconnect
    at disconnect (src/core/db.ts)            // sql.end() + sql = null
    at disconnect (src/core/postgres-engine.ts)  // PostgresEngine.disconnect, module style
    at processTicksAndRejections (native)

…and then every singleton consumer in sync/synthesize throws:

at getConnection (src/core/db.ts)
at executeRaw (src/core/postgres-engine.ts)
at resolveSourceForDir (src/core/cycle.ts)     // sync phase
at runPhaseSync (src/core/cycle.ts)
at getConnection (src/core/db.ts)
at getConfig (src/core/postgres-engine.ts)
at loadSynthConfig (src/core/cycle/synthesize.ts)   // synthesize phase

extract only survives because the v0.41.25 batch-retry layer detects the dead connection and reconnects mid-phase (the "~150 lost rows" symptom). sync/synthesize have no such retry and hard-fail.

Fix

Reference-count the module singleton in src/core/db.ts:

  • connect() increments the count (and sets it to 1 on a fresh open; resets to 0 on connect failure).
  • disconnect() decrements, and only runs sql.end() + nulls the singleton when the last owner releases it.

A transient engine's disconnect() now decrements without tearing down the connection the cycle engine still depends on. ~17 lines, no behavior change for the single-owner path.

Verification

After the fix, the same gbrain dream run:

✓ sync        +1377 added, ~86 modified, -0 deleted
- synthesize  dream.synthesize.session_corpus_dir is unset   (skipped on config, no longer a crash)
✓ extract / extract_facts / recompute_emotional_weight / embed
"connect() has not been called" errors: 0

Tests

  • New test/db-singleton-refcount.test.ts (mocks postgres, no live DB): asserts connect → connect → disconnect keeps the pool alive, and the final disconnect tears it down exactly once.
  • Existing connection-resilience + connection-manager suites: 51 pass, 0 fail (no regression).

Notes

  • The v0.41.25 batch-retry reconnect (symptom mitigation) remains valuable for genuine connection drops; this patches the actual ownership boundary so the disconnect doesn't happen in the first place.
  • An alternative would be to give transient/probe engines poolSize (instance mode) so they never touch the singleton, but refcounting is the smaller, more general fix and keeps the module-singleton contract intact for all existing callers.

…ll it mid-cycle (garrytan#1570)

The module-level `sql` singleton in src/core/db.ts is shared by every module-style engine in the process. A transient engine (e.g. a probe/migration engine spun up during `gbrain dream`) calling disconnect() ran sql.end() and nulled the shared singleton while a long-lived owner (the dream cycle engine) was still using it. The sync + synthesize phases then threw "No database connection: connect() has not been called" mid-cycle. The v0.41.25 batch-retry layer masked it for extract by reconnecting (hence the ~150-lost-rows symptom), but sync/synthesize have no such retry and hard-fail.

Fix: reference-count the singleton's owners. connect() increments (sets 1 on a fresh open); disconnect() decrements and only ends/nulls the pool when the LAST owner releases it.

Adds test/db-singleton-refcount.test.ts (mocks postgres; asserts connect→connect→disconnect keeps the pool alive, final disconnect tears it down once).
rayers added a commit to rayers/gbrain that referenced this pull request Jun 2, 2026
… null the shared pool (garrytan#1570)

Adapts upstream PR garrytan#1754. The db.ts module singleton 'sql' is shared by every
module-style engine in the process (the autopilot/dream cycle connects this way).
A transient module-style engine's disconnect() called sql.end() and nulled the
singleton while a long-lived owner was still mid-cycle, so the next phase threw
'No database connection: connect() has not been called' and the worker crash-looped.

Refcount owners: connect() increments, disconnect() decrements, pool tears down
only when the LAST owner releases it. Confirmed via db-disconnect-audit as the
real autopilot crash-loop root cause (instance-pool reconnect() fix in this branch
was a separate, complementary hardening that did not touch this module-style path).

Also updates two tests for behavior this branch legitimately changed:
- connection-resilience guard: reconnect() is now build-then-swap (rebuild via
  connect(), tear down prior pool via .end()), not disconnect()-first.
- db-singleton-shared-recovery CASE 1: refcount now PREVENTS the stray-disconnect
  from nulling the shared singleton, so engineA survives without needing reconnect.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@spiky02plateau

Copy link
Copy Markdown

Independent corroboration from a separate deployment (gbrain 0.42.1.0, Supabase transaction pooler), plus a couple of data points that might help scope this.

Same class, pinned caller. Using the same db-disconnect-audit.ts instrumentation, I traced our mid-cycle No database connection to a specific transient borrower: resolveLintContentSanity() in src/commands/lint.ts. On 0.42.1.0 it does createEngine({...}) without a poolSize, so the probe engine is module-style (_connectionStyle==='module', _sql===null); its finally { await engine.disconnect() } then falls through PostgresEngine.disconnect()db.disconnect()sql.end() and nulls the singleton the live cycle engine is still on. It fires ~6ms after lock acquisition (the lint phase is first), so the crash surfaces later (for us at conversation_facts_backfill) but the nuller is lint.

Triage tip for the audit log: classify caller_stack — a frame containing cli.ts = the legitimate end-of-command teardown; a bare disconnect(...) → processTicksAndRejections with no cli.ts/reconnect frame = the mid-process borrower you're hunting. That cleanly separates the two same-path callers.

Relationship to the already-merged lint fix (#1678): master's resolveLintContentSanity(sharedEngine?) now reuses the cycle engine instead of creating+disconnecting its own, which removes that borrower. This PR's refcount is still complementary, not redundant — it hardens against any module-style borrower (e.g. the fire-and-forget engine.disconnect().catch() in the v0.32.2 migration finalizer is worth auditing as another candidate), so a future borrower can't re-introduce the same null.

One scope boundary worth noting: refcounting prevents borrower-induced nulls, but it doesn't cover the original #1570 symptom — a genuine PgBouncer/pooler reap that drops the backend server-side, where a later query finds the pool dead. That residual case needs a reconnect-on-use (e.g. getConnection() rebuilding from the last-known-good config when sql is null) or the batch-retry reconnect in #1593. In practice the full fix looks like #1678 (remove the lint borrower) + this refcount (harden borrowers) + reconnect-on-use (cover real reaps).

Happy to share the audit excerpts if useful.

@rayers

rayers commented Jun 3, 2026

Copy link
Copy Markdown

Adapted this onto our fork's integration branch and can confirm from production it's the real fix for a gbrain autopilot crash loop — the refcount approach here is the right layer.

Context: long-running autopilot daemon, module-singleton engine (it connects via connectEngine() with no pool size, so _sql is null and it uses the db.ts shared sql). It was crash-looping every ~5 min: a full cycle's early phases ran, then a late phase threw No database connection: connect() has not been called, executeJob crashed unhandled, the worker exited code=1, respawned, repeated. The embed phase was never reached, so the embed backlog never drained.

The v0.41.27.0 db-disconnect-audit pinned the caller (exactly what that instrumentation was added to find): the nulling disconnects came from db.ts:237 ← postgres-engine.ts:223 — a transient module-style engine's disconnect() calling sql.end() and nulling the process-global singleton while the cycle engine was still mid-run. Audit entries were all connection_style: module.

After applying the refcount (this PR):

  • No database connection occurrences since the fix went live: 0
  • worker exited code=1: 0
  • embed backlog: 110 → 0, drained on the first healthy cycle and holding
  • cycle now completes end-to-end (score=57 elapsed=6s next=150s)

For anyone comparing the cluster: we'd separately hardened the instance-pool reconnect() (build-then-swap so a failed rebuild can't leave _sql null) — real, but it does NOT touch the module-singleton path autopilot uses, so it didn't fix the loop. The refcount here is what did. It composes cleanly with the retry/reconnect work in #1593/#1615 (prevention + recovery are complementary); #1758 (lazy-reconnect) and #1667 (ownership-guard) target the same layer, but this was the smallest surgical change (+85/-3) and dropped in clean.

One note for adopters whose branch also carries the v0.41.25.0 db-singleton-shared-recovery e2e: CASE 1 asserts the pre-refcount symptom (engine B's disconnect nulls the shared singleton so engine A throws). With refcount, A survives B's disconnect without needing reconnect — so that assertion flips from "bug reproduced" to "bug prevented." Expected, and the better contract.

Thanks for this — clean fix.

@garrytan

garrytan commented Jun 8, 2026

Copy link
Copy Markdown
Owner

Thanks for this! The Postgres module-singleton / "connect() has not been called" ownership bug was fixed canonically in #1805 (atomic owner-only-disconnect), with follow-ups in #1608 and #1572 — all merged to master. Closing as already-resolved. If you still reproduce after upgrading, please reopen, and thank you for the contribution.

@garrytan garrytan closed this Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants