Summary
On the Postgres engine, the autopilot --inline daemon silently loses link rows during the extract.links_fs phase, and engine.getHealth() then throws "No database connection: connect() has not been called" on every cycle. Affects v0.41.0.0 (current at time of filing). This is not the same class as #1162 / PR #465 (which is about the reconnect loop crashing the daemon): it fires even when the daemon stays alive.
Repro
Stock v0.41.0.0 on a Postgres-engine brain. From an interactive shell with your GBRAIN_DATABASE_URL / ZEROENTROPY_API_KEY etc. loaded:
gbrain dream --dir /path/to/your/brain --json 2>&1 | tee /tmp/dream.log
Output around the extract phase will contain:
[cycle.extract] start
[extract.links_fs] 3/3 (100%)
batch error (2 link rows lost): No database connection: connect() has not been called. Fix: Run gbrain init --supabase or gbrain init --url <connection_string>
[extract.links_fs] 3/3 (100%) done
The cycle continues, but every subsequent phase that calls this.sql and falls back to the module-level db.getConnection() fails the same way. In autopilot's main loop this surfaces as [health] ERROR ... after the cycle completes; the daemon does not crash because logError catches it.
Root cause
PostgresEngine.connect(config) without a poolSize arg takes the module-level singleton branch (postgres-engine.ts:175-189): db.connect(config) runs, but the engine instance's own _sql stays null. From then on, this.sql returns db.getConnection() (the module singleton).
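The ownership split described above can be modeled in a few lines. This is a simplified sketch, not the gbrain source: ModelEngine, moduleSql, the Sql type and tryQuery are all invented for illustration, and db.disconnect() merely stands in for whatever nulls the singleton mid-cycle (which has not been identified).

```typescript
// Simplified model of the connection ownership described above.
// Every name here is an illustrative stand-in, not the real gbrain code.

type Sql = { query: (text: string) => string };

// Stand-in for the module-level singleton held in db.ts.
let moduleSql: Sql | null = null;

const db = {
  connect(): void {
    moduleSql = { query: (text) => `module:${text}` };
  },
  disconnect(): void {
    moduleSql = null;
  },
  getConnection(): Sql {
    if (moduleSql === null) {
      throw new Error("No database connection: connect() has not been called");
    }
    return moduleSql;
  },
};

class ModelEngine {
  private _sql: Sql | null = null;

  connect(config: { poolSize?: number } = {}): void {
    if (config.poolSize !== undefined) {
      // Instance-owned handle: unaffected by changes to the singleton.
      this._sql = { query: (text) => `instance:${text}` };
    } else {
      // Singleton branch: the instance's own _sql stays null.
      db.connect();
    }
  }

  get sql(): Sql {
    // Falls back to the module singleton when the instance owns nothing.
    return this._sql ?? db.getConnection();
  }
}

function tryQuery(engine: ModelEngine): string {
  try {
    return engine.sql.query("select 1");
  } catch (e) {
    return `error: ${(e as Error).message}`;
  }
}

const shared = new ModelEngine();
shared.connect(); // no poolSize: relies on the singleton
const owned = new ModelEngine();
owned.connect({ poolSize: 5 }); // owns its handle

db.disconnect(); // stand-in for the unidentified mid-cycle nulling

console.log(tryQuery(shared));
console.log(tryQuery(owned));
```

In this model only the engine connected without a poolSize fails after the singleton is cleared, which matches the report that per-job worker engines using poolSize are immune.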
Somewhere in the cycle's call graph, the module-level sql in db.ts gets nulled. I am still narrowing down where: I observed it firing during the extract.links_fs phase, after synthesize completes, even on a brain with no transcripts (so synthesize is a near-noop). Once that happens, every later engine call routed through the getter throws "connect() has not been called". The lost-rows behavior in extract.ts:614-628 is structural: the flush() function catches and logs the throw, but batch.length = 0 in the finally clears the un-inserted rows regardless.
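The catch-then-clear shape is easy to reproduce in isolation. The sketch below is a model of the pattern, not a copy of extract.ts: LinkRow, FlushResult, flushLossy and failingInsert are hypothetical names, and the insert is synchronous here only to keep the example short.

```typescript
// Model of a batch flush whose finally block clears the buffer even when
// the insert throws. Names are illustrative; this is not extract.ts.

type LinkRow = { from: string; to: string };

interface FlushResult {
  inserted: number;
  lost: number;
}

function flushLossy(
  batch: LinkRow[],
  insert: (rows: LinkRow[]) => void,
): FlushResult {
  const attempted = batch.length;
  let inserted = 0;
  try {
    insert(batch);
    inserted = attempted;
  } catch (e) {
    // The error is logged, so the caller never sees a throw.
    console.error(
      `batch error (${attempted} link rows lost): ${(e as Error).message}`,
    );
  } finally {
    // Runs on success and on failure alike, so failed rows are discarded.
    batch.length = 0;
  }
  return { inserted, lost: attempted - inserted };
}

// Stand-in for an insert routed through a nulled singleton.
const failingInsert = (_rows: LinkRow[]): void => {
  throw new Error("No database connection: connect() has not been called");
};

const pending: LinkRow[] = [
  { from: "a", to: "b" },
  { from: "a", to: "c" },
];

const result = flushLossy(pending, failingInsert);
console.log(result, pending.length);
```

After the call, the buffer is empty and nothing was inserted, so a retry has no rows left to work with. Any counter incremented before the flush would still report the intended total.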
This is the same bug class the source comment at src/commands/auth.ts:52 acknowledges for auth commands:
v0.32: createEngine returns a disconnected instance. PostgresEngine's sql getter falls back to db.getConnection() (the module-level singleton) when _sql is unset, which throws "connect() has not been called" when db.connect() was never invoked either.
The auth-command instance was fixed by having withConfiguredSql call engine.connect(). Long-running daemons that hold the engine across many cycle phases have a different exposure: SOMETHING during the cycle nulls the singleton out from under them.
I did not isolate the exact line that nulls sql in db.ts (the only sites that set sql = null are connect()'s catch path on line 216 and disconnect() on line 230 — neither of which any cycle phase appears to invoke directly). It could be a transient pool error that triggers the catch-path null, or a code path I missed. Filing this as a bug first so a maintainer who knows the cycle code better can localize the source.
What this affects
- gbrain autopilot --inline on Postgres: silent row loss in every extract phase that has links to insert. Status reads as partial, but the bare counts shown in the human log line (extracted=N) look correct because they reflect intent, not actual rows persisted. Operators trusting the log line will believe the daemon is healthy.
- gbrain autopilot Minions-dispatch path on Postgres: same class, fires when the dispatched job's engine routes through the singleton (per-job worker engines using poolSize are immune; the autopilot's parent engine handling intra-tick orchestration is not).
- Adaptive engine.getHealth() health check post-cycle: throws "connect() has not been called"; logError('health', e) writes the line and the daemon survives, but no adaptive backoff occurs because interval stays at baseInterval.
Relationship to PR #465
PR #465 (open, ~1 month no review) fixes the (engine as any).connect?.() no-config-arg bug in the autopilot reconnect path. That's a real fix and lands when this issue lands. But applying #465 alone leaves THIS bug fully present — I confirmed it by patching #465 locally and re-running. The daemon stays alive (good — that's what #465 ships), but extract.links_fs still loses rows on every cycle and getHealth still throws.
So this is a separate fix wave.
Suggested fix
Two reasonable shapes; I'm filing a PR for the second one because it has the narrowest scope and survives architectural disagreement about the first.
Option A — engine-level: Have PostgresEngine.connect(config) always create its own _sql regardless of poolSize. Treat poolSize as a sizing hint, not a "should we own a pool" toggle. This kills the singleton-fallback class entirely. Risk: every existing CLI caller now opens a pool instead of sharing one — small extra connection overhead, but it's the architecturally cleaner answer.
Option B (autopilot-specific, what my PR does): At autopilot startup, after connectEngine() returns, force engine.connect({...savedConfig, poolSize: 5}) so the autopilot's engine moves to an instance-owned pool. The long-running daemon is the affected blast radius; one-shot CLI commands are unaffected by the same race because they exit before the singleton has time to get nulled.
I'm not opening Option A as a PR because it's a behavior change for every caller and reviewing it deserves a separate thread.
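For reviewers who want the shape of Option B before the PR is linked, here is a rough sketch. EngineConfig, ConnectableEngine, ownedPoolConfig and promoteToOwnedPool are hypothetical names, the connection URL is a placeholder, and the real gbrain types and startup code may differ; only the engine.connect({...savedConfig, poolSize: 5}) call comes from the description above.

```typescript
// Sketch of Option B. EngineConfig and ConnectableEngine are hypothetical
// shapes for illustration; the real gbrain types may differ.

interface EngineConfig {
  url: string;
  poolSize?: number;
}

interface ConnectableEngine {
  connect(config: EngineConfig): Promise<void>;
}

const AUTOPILOT_POOL_SIZE = 5;

// Build a config that carries an explicit poolSize without mutating the
// saved one.
function ownedPoolConfig(
  saved: EngineConfig,
  poolSize: number = AUTOPILOT_POOL_SIZE,
): EngineConfig {
  return { ...saved, poolSize };
}

// Reconnect the autopilot's engine with an explicit poolSize so it takes
// the instance-owned branch instead of the singleton branch.
async function promoteToOwnedPool(
  engine: ConnectableEngine,
  saved: EngineConfig,
): Promise<void> {
  await engine.connect(ownedPoolConfig(saved));
}

// Usage with a recording fake in place of a real database engine.
const seenConfigs: EngineConfig[] = [];
const fakeEngine: ConnectableEngine = {
  async connect(config) {
    seenConfigs.push(config);
  },
};

// Placeholder URL, not a real connection string.
const savedConfig: EngineConfig = { url: "postgres://localhost/brain" };
void promoteToOwnedPool(fakeEngine, savedConfig);
```

Keeping the config construction in a pure helper makes the one behavioral change (adding poolSize) easy to test without a database.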
Environment
- gbrain v0.41.0.0 (commit at install: bun install -g github:garrytan/gbrain ~2026-05-24)
- Engine: postgres (local Postgres 17.10 via Homebrew)
- macOS 26.4.1, Bun 1.x via global bun install
- ZeroEntropy embeddings (zeroentropyai:zembed-1, 1280-dim)
- Brain has 15 pages, ~50 chunks
Companion PR
(linked in a moment)