"connect() has not been called" still reproduces on 0.41.28: concurrent minion-worker path not covered by #1570 fix (#1745)

@andrea-kingautomation

Follow-up to #1570 — "connect() has not been called" still reproduces on 0.41.28 (concurrency path the withRetry fix doesn't cover)

Thanks for the fix on #1570. We hit the same "No database connection: connect() has not been called" error on gbrain 0.41.28.0 with Supabase Postgres, but via a path the withRetry reconnect fix does not protect: a concurrent operation reads the module singleton while another caller is tearing it down.

Environment

  • gbrain 0.41.28.0, engine: Postgres
  • Supabase, direct connection db.<ref>.supabase.co:5432 (sslmode=require, IPv6) — not the pooler
  • Symptom isolated to the dream cycle's synthesize phase + the minion worker; core retrieval (search/query/embed/sync/extract) unaffected.

Symptom

Every dream/autopilot cycle, synthesize fails and synth_pages=0, with these repeating errors:

Promotion error: No database connection: connect() has not been called. ...
[extract.links_fs] connection blip, retrying (attempt 1/3): No database connection: connect() has not been called. ...
Dream cycle (partial): [InternalError/SYNTH_PHASE_FAIL] No database connection: connect() has not been called. ...
totals: ... synth_transcripts=0 synth_pages=0

Root cause (the remaining path)

#1570's fix made the retrying caller reconnect. But the singleton is still nulled out from under other in-flight operations:

  1. A transient blip triggers PostgresEngine.reconnect() (src/core/postgres-engine.ts:~4515), which does try { await this.disconnect(); } catch {} then reconnects.
  2. In module-singleton mode, this.disconnect() routes to db.disconnect() (src/core/postgres-engine.ts:~223).
  3. db.disconnect() (src/core/db.ts:227) executes if (sql) { await sql.end(); sql = null; connectedUrl = null; } — it nulls the shared module singleton.
  4. The minion worker loop runs concurrently: MinionWorker → this.queue.promoteDelayed() (src/core/minions/worker.ts:463) → engine.executeRaw → getConnection() (src/core/db.ts:~150). It reads sql during the window where it is null and throws "connect() has not been called".
  5. The synthesize phase submits its writer jobs through the same minion queue (src/core/cycle/synthesize.ts → engine.executeRaw, surfaced as SYNTH_PHASE_FAIL), so the whole phase aborts.

So withRetry reconnects the one caller it wraps, but db.disconnect() setting sql = null breaks every other operation sharing the singleton during the disconnect→reconnect window. On Postgres this is newly exposed because minions are Postgres-only (jobs work is "Postgres only"); the path never ran on PGLite. We see the in-code instrumentation comment at db.ts:228 referencing #1570 ("identify the caller that's nulling the module singleton mid-cycle") — that caller is reconnect()'s disconnect(), and the victims are the concurrent minion-queue ops.
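The race can be reproduced in isolation with a minimal sketch. Everything here is a stand-in and not gbrain's actual code: FakeSql, queueTick, demoRace and the timings are hypothetical, and only the shape of disconnect()/getConnection() follows the description above.

```typescript
// Hypothetical stand-in for a postgres.js client; only end() is modelled.
type FakeSql = { end(): Promise<void> };

// Module singleton, shared by every caller in the process.
let sql: FakeSql | null = null;

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function connect(): Promise<void> {
  await sleep(100); // simulated connection handshake (assumed latency)
  sql = { end: async () => { await sleep(5); } };
}

async function disconnect(): Promise<void> {
  if (sql) {
    await sql.end();
    sql = null; // from here on, every other caller sees null
  }
}

function getConnection(): FakeSql {
  if (!sql) {
    throw new Error("No database connection: connect() has not been called.");
  }
  return sql;
}

// Same shape as the reconnect() described above: tear down, then reconnect.
async function reconnect(): Promise<void> {
  try { await disconnect(); } catch {}
  await connect();
}

// Stand-in for a concurrent queue operation such as promoteDelayed().
async function queueTick(delayMs: number): Promise<string> {
  await sleep(delayMs);
  try {
    getConnection();
    return "ok";
  } catch (err) {
    return (err as Error).message;
  }
}

async function demoRace(): Promise<string[]> {
  await connect();
  const [, during, after] = await Promise.all([
    reconnect(),
    queueTick(40),  // lands inside the disconnect→reconnect window
    queueTick(250), // lands after reconnect() has finished
  ]);
  return [during, after];
}
```

In this sketch the first tick fails with the same error message even though reconnect() itself succeeds, and the second tick works again, which matches the intermittent pattern in the logs.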

Isolation facts

  • Not env/config: same process with GBRAIN_DATABASE_URL set runs embed, orphans, purge, extract.timeline_fs fine; only the minion-queue-backed path throws.
  • Not connection instability per se: non-minion phases reuse the same connection without blips.
  • Not dual-worker contention: reproduces with a single gbrain jobs work.

Workaround we applied (stopgap, local source patch)

Gate the singleton release in db.disconnect() so a normal reconnect() no longer nulls the shared connection — postgres.js auto-reconnects dead sockets inside the pool, so keeping the singleton alive for the process lifetime is safe:

// db.ts disconnect(): only release on an explicit full shutdown
if (sql && process.env.GBRAIN_FORCE_DISCONNECT === '1') {
  await sql.end();
  sql = null;
  connectedUrl = null;
}

After this, the error disappears (0 occurrences across many cycles) and synthesize runs to completion instead of SYNTH_PHASE_FAIL.

Suggested durable fix (your call)

Either (a) don't tear down the shared module singleton inside reconnect() — let postgres.js's pool self-heal; or (b) make getConnection() lazily reconnect (await an in-flight connect()) instead of throwing when sql is transiently null; or (c) guard concurrent queue ops against the disconnect window. Happy to test a patch against our Supabase setup.
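For option (b), a rough sketch of what a lazily reconnecting accessor could look like. This is an illustration under assumptions, not a proposed patch: FakeSql, openClient, getConnectionLazy and demoLazy are hypothetical names, and it assumes the accessor may become async, which the current synchronous getConnection() callers would have to accommodate.

```typescript
// Hypothetical stand-in for a postgres.js client.
type FakeSql = { id: number; end(): Promise<void> };

let sql: FakeSql | null = null;
let connecting: Promise<FakeSql> | null = null; // in-flight connect, if any
let connectCount = 0;

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical opener; real code would construct the actual client here.
async function openClient(): Promise<FakeSql> {
  await sleep(50); // simulated handshake
  connectCount += 1;
  return { id: connectCount, end: async () => {} };
}

// Single-flight connect: concurrent callers share one in-flight promise.
function connect(): Promise<FakeSql> {
  if (sql) return Promise.resolve(sql);
  if (!connecting) {
    connecting = openClient().then(
      (client) => { sql = client; connecting = null; return client; },
      (err) => { connecting = null; throw err; },
    );
  }
  return connecting;
}

async function disconnect(): Promise<void> {
  const old = sql;
  sql = null;
  if (old) await old.end();
}

// Async accessor: waits for (or starts) a connect instead of throwing.
async function getConnectionLazy(): Promise<FakeSql> {
  if (sql) return sql;
  return connect();
}

async function demoLazy(): Promise<{ ids: number[]; connects: number }> {
  await connect();    // first client
  await disconnect(); // singleton is null again
  const clients = await Promise.all([
    getConnectionLazy(),
    getConnectionLazy(),
    getConnectionLazy(),
  ]);
  return { ids: clients.map((c) => c.id), connects: connectCount };
}
```

With this shape, callers that arrive while the singleton is null wait on the same reconnect rather than failing, and only one new client is opened no matter how many callers arrive in the window.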

Reproduction

  1. gbrain init --supabase --url <conn> --embedding-model zeroentropyai:zembed-1 --embedding-dimensions 1280
  2. add a source, sync content, set a synthesize corpus dir
  3. run gbrain autopilot (or gbrain jobs work + gbrain dream --synthesize)
  4. observe synth_pages=0 + the "connect() has not been called" errors during a connection blip.
