fix(db): refcount module singleton to fix mid-cycle "connect() has not been called" (#1570)#1754
Conversation
…ll it mid-cycle (garrytan#1570) The module-level `sql` singleton in src/core/db.ts is shared by every module-style engine in the process. A transient engine (e.g. a probe/migration engine spun up during `gbrain dream`) calling disconnect() ran sql.end() and nulled the shared singleton while a long-lived owner (the dream cycle engine) was still using it. The sync + synthesize phases then threw "No database connection: connect() has not been called" mid-cycle. The v0.41.25 batch-retry layer masked it for extract by reconnecting (hence the ~150-lost-rows symptom), but sync/synthesize have no such retry and hard-fail. Fix: reference-count the singleton's owners. connect() increments (sets 1 on a fresh open); disconnect() decrements and only ends/nulls the pool when the LAST owner releases it. Adds test/db-singleton-refcount.test.ts (mocks postgres; asserts connect→connect→disconnect keeps the pool alive, final disconnect tears it down once).
… null the shared pool (garrytan#1570) Adapts upstream PR garrytan#1754. The db.ts module singleton 'sql' is shared by every module-style engine in the process (the autopilot/dream cycle connects this way). A transient module-style engine's disconnect() called sql.end() and nulled the singleton while a long-lived owner was still mid-cycle, so the next phase threw 'No database connection: connect() has not been called' and the worker crash-looped. Refcount owners: connect() increments, disconnect() decrements, pool tears down only when the LAST owner releases it. Confirmed via db-disconnect-audit as the real autopilot crash-loop root cause (instance-pool reconnect() fix in this branch was a separate, complementary hardening that did not touch this module-style path). Also updates two tests for behavior this branch legitimately changed: - connection-resilience guard: reconnect() is now build-then-swap (rebuild via connect(), tear down prior pool via .end()), not disconnect()-first. - db-singleton-shared-recovery CASE 1: refcount now PREVENTS the stray-disconnect from nulling the shared singleton, so engineA survives without needing reconnect. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Independent corroboration from a separate deployment (gbrain 0.42.1.0, Supabase transaction pooler), plus a couple of data points that might help scope this. Same class, pinned caller. Using the same Triage tip for the audit log: classify Relationship to the already-merged lint fix (#1678): master's One scope boundary worth noting: refcounting prevents borrower-induced nulls, but it doesn't cover the original #1570 symptom — a genuine PgBouncer/pooler reap that drops the backend server-side, where a later query finds the pool dead. That residual case needs a reconnect-on-use (e.g. Happy to share the audit excerpts if useful. |
|
Adapted this onto our fork's integration branch and can confirm from production it's the real fix for a Context: long-running autopilot daemon, module-singleton engine (it connects via The v0.41.27.0 After applying the refcount (this PR):
For anyone comparing the cluster: we'd separately hardened the instance-pool One note for adopters whose branch also carries the v0.41.25.0 Thanks for this — clean fix. |
|
Thanks for this! The Postgres module-singleton / "connect() has not been called" ownership bug was fixed canonically in #1805 (atomic owner-only-disconnect), with follow-ups in #1608 and #1572 — all merged to master. Closing as already-resolved. If you still reproduce after upgrading, please reopen, and thank you for the contribution. |
Summary
Fixes the
#1570"singleton-null" class: duringgbrain dreamthesyncandsynthesizephases throwNo database connection: connect() has not been calledmid-cycle, even though the engine connected successfully at startup. Root cause is a shared-ownership bug on thedb.tsmodule singleton; fix is to reference-count its owners.This is the production caller the v0.41.25 disconnect instrumentation (
src/core/audit/db-disconnect-audit.ts, "only this audit tells us WHO is actually calling disconnect mid-process") was added to find.Reproduction
gbrain dreamagainst a PgBouncer-pooled Postgres (Supabase-style transaction pooler). The cycle completes "partial":Root cause (traced with live stacks)
The module-level
sqlsingleton insrc/core/db.tsis shared by every module-style engine in the process.gbrain dreamconnects more than one engine (the main cycle engine plus a transient probe/migration engine). When the transient engine callsdisconnect(),PostgresEngine.disconnect()(module style) callsdb.disconnect(), which runssql.end()and nulls the shared singleton — while the long-lived cycle engine is still using it.Instrumented
db.disconnect()with a stack — the offending call fires beforecycle.sync:…and then every singleton consumer in
sync/synthesizethrows:extractonly survives because the v0.41.25 batch-retry layer detects the dead connection and reconnects mid-phase (the "~150 lost rows" symptom).sync/synthesizehave no such retry and hard-fail.Fix
Reference-count the module singleton in
src/core/db.ts:connect()increments the count (and sets it to1on a fresh open; resets to0on connect failure).disconnect()decrements, and only runssql.end()+ nulls the singleton when the last owner releases it.A transient engine's
disconnect()now decrements without tearing down the connection the cycle engine still depends on. ~17 lines, no behavior change for the single-owner path.Verification
After the fix, the same
gbrain dreamrun:Tests
test/db-singleton-refcount.test.ts(mockspostgres, no live DB): assertsconnect → connect → disconnectkeeps the pool alive, and the finaldisconnecttears it down exactly once.connection-resilience+connection-managersuites: 51 pass, 0 fail (no regression).Notes
poolSize(instance mode) so they never touch the singleton, but refcounting is the smaller, more general fix and keeps the module-singleton contract intact for all existing callers.