fix(retry): reconnect engine pool in batchRetry onRetry — singleton-null repair by jalagrange · Pull Request #1593 · garrytan/gbrain

jalagrange · 2026-05-28T11:59:50Z

Problem

#1416 added "No database connection" to retry-matcher.ts's retryable patterns, but the retry path never re-establishes the connection pool. When engine.disconnect() runs mid-cycle (for example between dream phases, or via connection-manager lifecycle), the module-level singleton in db.ts goes null; the in-flight batch flunks with "No database connection: connect() has not been called.", withRetry faithfully fires 3/3 retries against the same null singleton, and the batch is silently lost.

Live repro

gbrain dream on v0.41.26.0 against direct Postgres (not Supavisor):

[extract.timeline_fs] 706/706 (100%)
[extract.timeline_fs] connection blip, retrying (attempt 1/3): No database connection: connect() has not been called. Fix: Run gbrain init --supabase or gbrain init --url <connection_string>
[extract.timeline_fs] connection blip, retrying (attempt 2/3): No database connection: connect() has not been called. Fix: Run gbrain init --supabase or gbrain init --url <connection_string>
[extract.timeline_fs] connection blip, retrying (attempt 3/3): No database connection: connect() has not been called. Fix: Run gbrain init --supabase or gbrain init --url <connection_string>
  batch error (13 timeline rows lost): No database connection: connect() has not been called.
[extract.timeline_fs] 706/706 (100%) done
Timeline: created 0 entries from 706 pages

Downstream doctor symptoms:

[FAIL] batch_retry_health: 192 exhausted batch retries in last 24h
       (worst: extract.links_fs = 174). Sustained circuit-breaker incident.
[FAIL] cycle_freshness: Source 'default' has never completed a full cycle.

The cycle_freshness FAIL is a downstream effect: a later phase (conversation_facts_backfill) hits the dead singleton via engine.getConfig() and aborts the cycle.

Root cause

Three load-bearing facts confirmed in source:

retry-matcher.ts:23-27 matches /No database connection/i and adds the typed problem === 'No database connection' branch — explicitly cited as the #1416 follow-up.
postgres-engine.ts:batchRetry onRetry only logs + audits — no reconnect path.
postgres-engine.ts:reconnect() already exists and is exactly what's needed; it's invoked from minions/supervisor.ts:600 but not from the retry path. _savedConfig is preserved at postgres-engine.ts:95 for this purpose.
Bonus: retry.ts:withRetry calls opts.onRetry?.(...) without await, so an async repair callback would race the next attempt regardless.

Fix

Two-file change. ~92 lines including tests.

`src/core/retry.ts`

Widen WithRetryOpts.onRetry to allow Promise<void> and await the call in withRetry. The old fire-and-forget call meant async pre-retry repair work (e.g. engine.reconnect()) could not run atomically with the retry. Sync callers are unaffected — await undefined resolves immediately.

`src/core/postgres-engine.ts`

In batchRetry's onRetry, call this.reconnect() before the next attempt sleeps. Mirrors the supervisor pool-restart pattern. engine.reconnect() no-ops when _savedConfig is null (module-singleton mode without a saved engine config), so the call is defensively safe. withRetry only invokes onRetry after isRetryableConnError matches, so every onRetry call is for a connection-class error where reconnect is the right repair.

   onRetry: async (attempt, err) => {
     // ... existing audit + log ...
+    try {
+      await this.reconnect();
+    } catch (reconnectErr) {
+      process.stderr.write(
+        `[${auditSite}] reconnect failed (will retry against existing pool): ${(reconnectErr as Error).message}\n`,
+      );
+    }
   },

Tests

In test/core/retry.test.ts:

async onRetry is awaited before the next attempt — pins the contract batchRetry now depends on. Asserts event ordering across an async onRetry with a microtask boundary; without the await patch the assertion would break.
async onRetry that throws does not crash withRetry — pins the best-effort repair contract: a throwing reconnect surfaces visibly rather than being swallowed.

$ bun test test/core/retry.test.ts
 39 pass, 0 fail, 194 expect() calls

$ bun test test/core/retry.test.ts test/core/retry-stress.slow.test.ts test/retry-matcher.test.ts test/connection-resilience.test.ts test/doctor-batch-retry.test.ts
 94 pass, 0 fail

$ bun run typecheck
$ tsc --noEmit   # clean

Full bun test run also clean against this branch (the two pglite-engine.test.ts > Stats & Health failures repro on origin/master without these changes — pre-existing, unrelated).

Scope of this PR

Intentionally narrow — only the batch path. The non-batch path (conversation_facts_backfill's direct engine.getConfig() call) has the same root cause but a different fix shape (likely engine.ensureConnected() at phase start, or wrapping getConfig itself in withRetry). Happy to follow up in a second PR if this direction is approved.

Risk

await opts.onRetry?.(...) adds at most one event-loop tick per retry. No throughput impact on the common case.
this.reconnect() is already exercised in production by the supervisor path. Calling it on every connection-class retry adds pool churn on a real connection blip but eliminates the pathological case of N retries against a dead pool. Net win.
reconnect() is _reconnecting-guarded so concurrent retries collapse into a single re-init.

Why not just block the `disconnect()` call?

Considered. engine.disconnect() mid-cycle is intentional in some lifecycle paths (Astra autopilot between phases, test teardown reuse, etc.) and the existing _connectionStyle === 'module' guard already prevents accidental clobbering between engines. Removing legitimate disconnects to paper over the retry-path gap moves the bug, doesn't fix it. The retry path is the right place to be resilient — that's what retry-matcher's "No database connection" entry already implies the design intends.

…ull repair PR garrytan#1416 added "No database connection" to retry-matcher's retryable patterns, but the retry path never re-established the pool. When engine.disconnect() runs mid-cycle (e.g. between dream phases) the module-level singleton goes null; the in-flight batch flunks with "No database connection: connect() has not been called.", retries fire 3/3 against the same null singleton, and the batch is lost. Live repro on v0.41.26.0 against direct Postgres (not Supavisor): [extract.timeline_fs] connection blip, retrying (attempt 1/3): No database connection ... [extract.timeline_fs] connection blip, retrying (attempt 2/3): No database connection ... [extract.timeline_fs] connection blip, retrying (attempt 3/3): No database connection ... batch error (13 timeline rows lost): No database connection ... Symptoms: 192 exhausted batch retries in 24h (doctor batch_retry_health FAIL), cycle_freshness FAIL because conversation_facts_backfill's first engine.getConfig() also hits the dead singleton (immediate, no retry path). Two-file change: src/core/retry.ts Widen WithRetryOpts.onRetry to allow Promise<void> and await it in withRetry. The old fire-and-forget call meant async pre-retry repair work (e.g. engine.reconnect()) could not run atomically with the retry — the next attempt raced against unfinished repair. src/core/postgres-engine.ts In batchRetry's onRetry, call this.reconnect() before the next attempt sleeps. Mirrors the supervisor pool-restart pattern at minions/supervisor.ts:600. engine.reconnect() no-ops when _savedConfig is null (module-singleton mode without a saved engine config) so the call is defensively safe. withRetry only invokes onRetry after isRetryableConnError matches, so every onRetry call is for a connection-class error where reconnect is the right repair. Tests added in test/core/retry.test.ts: - async onRetry is awaited before next attempt (pins the contract batchRetry now depends on; without await, asserts event ordering would break) - onRetry that throws propagates (best-effort repair contract — the reconnect failure surfaces visibly rather than being swallowed) All 39 retry.test.ts tests pass; 94/94 across retry-adjacent suites (retry, retry-stress, retry-matcher, connection-resilience, doctor-batch-retry). Typecheck clean. Does not address the non-batch-wrapped path (conversation_facts_backfill's direct engine.getConfig fails with the same root cause). That's a separate, smaller fix — likely engine.ensureConnected() at phase start — intentionally left out of this PR to keep it surgical.

jalagrange · 2026-05-28T15:15:13Z

Verified end-to-end on production

Pulled this branch into a live install (v0.41.26.0 base, direct Postgres, 749 pages) and ran gbrain dream. Comparing against the same brain on origin/master run earlier today:

`extract.timeline_fs` finalize batch

Before:  Timeline: created  0 entries from 706 pages   (13 timeline rows lost)
After:   Timeline: created 37 entries from 706 pages

`extract.links_fs` connection-blip handling

Audit log entries from ~/.gbrain/audit/batch-retry-2026-W22.jsonl:

Before (24h pre-patch):  192 exhausted retries, worst site extract.links_fs = 174
After (patched run):     1 event total
  ts=15:00:31  site=extract.links_fs  attempt=1  outcome=success  delay=1000ms
               err="No database connection: connect() has not been cal[led]"

The singleton went null mid-extract.links_fs; withRetry fired onRetry, my patch called this.reconnect(), the next attempt found a healthy pool, and the batch completed. One audit row, outcome success, vs. the pre-patch shape of attempt 1/2/3 all exhausted + "batch error (N rows lost)" line.

Doctor

batch_retry_health and cycle_freshness still FAIL in this snapshot — they're 24h-windowed metrics counting the pre-patch exhausted retries that haven't decayed yet. The audit-log delta is the leading indicator and it's clean.

Out-of-scope failures observed in the same run

sync and synthesize phases hit the same "No database connection" at non-batchRetry call sites — exactly what the PR description called out as separate work. Not regressions from this patch; the singleton was equally null there before. Will follow up in a second PR once this direction is approved.

rayers · 2026-06-02T04:16:44Z

Hit this exact No database connection: connect() has not been called class in production on a long-running gbrain autopilot daemon (v0.42.1.0, instance-pool engine against a shared local Postgres). Wanted to flag a piece that sits underneath this PR, because on current local/integration the batchRetry reconnect callback (reconnect: () => this.reconnect()) is already wired in — yet the crash still happens. The remaining defect is in reconnect() itself.

Symptom: worker runs a full cycle's early phases fine, then late phases throw No database connection, executeJob crashes unhandled, worker exits code=1, respawns, repeats every ~5 min. conversation_facts_backfill is the phase that crashes the worker, but it's a red herring — it's default-OFF, so its only DB touch is loadCfg's engine.getConfig() x7, i.e. it's just the first hard call after the engine already died mid-cycle. The Promotion error: No database connection lines from the worker's promoteDelayed() loop confirm _sql was already null process-wide.

Root cause: PostgresEngine.reconnect() disconnects the live pool before attempting connect():

try { await this.disconnect(); } catch {}   // _sql = null HERE
await this.connect(this._savedConfig);        // if THIS throws, _sql stays null forever

So when a transient blip fires the retry→reconnect path and Postgres is still unreachable for those few seconds, connect() throws and _sql is left null permanently. Every subsequent call uses the accessor this._sql || db.getConnection(); with _sql null it falls to the module singleton, which the autopilot process never connected → throw. Non-bulk methods (getConfig, per-phase reads) have no retry wrapper, so the first one after the failed reconnect throws unhandled and kills the worker.

This is why wiring reconnect into batchRetry (this PR) is necessary but not sufficient: the reconnect it calls can itself brick the engine.

Fix that worked for us — build-then-swap (don't tear down the old pool until the new one is proven live; restore on failure so the next query self-heals):

async reconnect(): Promise<void> {
  if (!this._savedConfig || this._reconnecting) return;
  this._reconnecting = true;
  const oldSql = this._sql;
  const oldManager = this.connectionManager;
  try {
    this._sql = null;                       // force connect() to build fresh
    await this.connect(this._savedConfig);  // validates via SELECT 1 before returning
    if (oldSql) { try { await oldSql.end({ timeout: 5 }); } catch {} }
  } catch (err) {
    if (this._sql && this._sql !== oldSql) { try { await this._sql.end({ timeout: 5 }); } catch {} }
    this._sql = oldSql;
    this.connectionManager = oldManager;
    throw err;                              // let batchRetry's backoff handle it
  } finally {
    this._reconnecting = false;
  }
}

Green on test/core/retry-reconnect.test.ts + retry.test.ts (42 pass), test/e2e/db-singleton-shared-recovery.test.ts (3 pass), and the disconnect-idempotency e2e (7 pass). Looks complementary to the module-singleton refcount work in #1754/#1758 too — that layer and this instance-pool layer are orthogonal; both seem needed. Happy to fold this into whichever PR lands as the canonical fix.

jalagrange · 2026-06-05T16:40:24Z

@rayers thank you, this is excellent root-cause work, and it matches what I see on master: the instance-pool branch of reconnect() runs disconnect() (which nulls _sql) before connect(), and the catch only logs and rethrows, so a connect() failure during the blip leaves _sql null with no self-heal. Build-then-swap is the right shape: hold the old pool until the new one validates, restore it on failure so the next query recovers. I'd land that as its own PR since it is cleanly scoped and orthogonal to the #1754 refcount work.

Closing this PR as superseded. The change it proposed (await onRetry plus wiring reconnect into the batch retry path) has since landed on master independently: retry.ts now awaits onRetry and exposes a dedicated reconnect opt, and batchRetry passes reconnect: (ctx) => this.reconnect(ctx). So this diff is redundant against current master.

The two real remaining layers, both confirmed, are now split out:

reconnect() itself is not crash-safe on the instance-pool path (your build-then-swap), yours to land when ready.
The non-batch call sites (getConfig, per-phase reads) had no retry wrapper, so the first hard DB call after a mid-cycle disconnect threw unhandled and crashed the worker. I took that one as fix(retry): reconnect on null instance pool in non-batch config reads (#1593 follow-up) #1891.

Thanks again for the deep dig.

… can't brick the engine (garrytan#1593) The instance-pool reconnect() did disconnect() (nulling _sql) BEFORE connect(), so a connect() failure during a transient Postgres blip left _sql null for the rest of the process — every subsequent non-retry-wrapped call then fell through to the never-connected module singleton and threw 'No database connection', crashing the autopilot worker into a respawn loop. Build-then-swap: snapshot the live pool, build a fresh one, end the old only once the new validates, restore on failure. Keeps upstream's reap-detection + pool-recovery audit. Confirmed by jalagrange on closed PR garrytan#1593; this is the Layer-1 fix he left to us. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

rayers · 2026-06-05T22:30:13Z

Landed the build-then-swap reconnect crash-safety as #1906 (Layer 1 from your split). Rebased onto current master, keeps the reap-detection + pool-recovery audit, and updates the recovery-contract guard. Complementary to your #1891 (non-batch call sites). Thanks for the clean split here.

jalagrange mentioned this pull request May 28, 2026

fix(engine): module-singleton shared-ownership — joiner must not tear down shared pool #1599

Closed

spiky02plateau mentioned this pull request Jun 2, 2026

fix(db): refcount module singleton to fix mid-cycle "connect() has not been called" (#1570) #1754

Closed

jalagrange mentioned this pull request Jun 5, 2026

fix(retry): reconnect on null instance pool in non-batch config reads (#1593 follow-up) #1891

Open

jalagrange closed this Jun 5, 2026

rayers mentioned this pull request Jun 5, 2026

fix(postgres-engine): build-then-swap reconnect() so a failed rebuild can't brick the engine #1906

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(retry): reconnect engine pool in batchRetry onRetry — singleton-null repair#1593

fix(retry): reconnect engine pool in batchRetry onRetry — singleton-null repair#1593
jalagrange wants to merge 1 commit into
garrytan:masterfrom
jalagrange:fix/reconnect-on-singleton-null

jalagrange commented May 28, 2026

Uh oh!

jalagrange commented May 28, 2026

Uh oh!

rayers commented Jun 2, 2026

Uh oh!

jalagrange commented Jun 5, 2026

Uh oh!

rayers commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jalagrange commented May 28, 2026

Problem

Live repro

Root cause

Fix

src/core/retry.ts

src/core/postgres-engine.ts

Tests

Scope of this PR

Risk

Why not just block the disconnect() call?

Uh oh!

jalagrange commented May 28, 2026

Verified end-to-end on production

extract.timeline_fs finalize batch

extract.links_fs connection-blip handling

Doctor

Out-of-scope failures observed in the same run

Uh oh!

rayers commented Jun 2, 2026

Uh oh!

jalagrange commented Jun 5, 2026

Uh oh!

rayers commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`src/core/retry.ts`

`src/core/postgres-engine.ts`

Why not just block the `disconnect()` call?

`extract.timeline_fs` finalize batch

`extract.links_fs` connection-blip handling