Skip to content

fix(retry): reconnect engine pool in batchRetry onRetry — singleton-null repair#1593

Closed
jalagrange wants to merge 1 commit into
garrytan:masterfrom
jalagrange:fix/reconnect-on-singleton-null
Closed

fix(retry): reconnect engine pool in batchRetry onRetry — singleton-null repair#1593
jalagrange wants to merge 1 commit into
garrytan:masterfrom
jalagrange:fix/reconnect-on-singleton-null

Conversation

@jalagrange

Copy link
Copy Markdown

Problem

#1416 added "No database connection" to retry-matcher.ts's retryable patterns, but the retry path never re-establishes the connection pool. When engine.disconnect() runs mid-cycle (for example between dream phases, or via connection-manager lifecycle), the module-level singleton in db.ts goes null; the in-flight batch flunks with "No database connection: connect() has not been called.", withRetry faithfully fires 3/3 retries against the same null singleton, and the batch is silently lost.

Live repro

gbrain dream on v0.41.26.0 against direct Postgres (not Supavisor):

[extract.timeline_fs] 706/706 (100%)
[extract.timeline_fs] connection blip, retrying (attempt 1/3): No database connection: connect() has not been called. Fix: Run gbrain init --supabase or gbrain init --url <connection_string>
[extract.timeline_fs] connection blip, retrying (attempt 2/3): No database connection: connect() has not been called. Fix: Run gbrain init --supabase or gbrain init --url <connection_string>
[extract.timeline_fs] connection blip, retrying (attempt 3/3): No database connection: connect() has not been called. Fix: Run gbrain init --supabase or gbrain init --url <connection_string>
  batch error (13 timeline rows lost): No database connection: connect() has not been called.
[extract.timeline_fs] 706/706 (100%) done
Timeline: created 0 entries from 706 pages

Downstream doctor symptoms:

[FAIL] batch_retry_health: 192 exhausted batch retries in last 24h
       (worst: extract.links_fs = 174). Sustained circuit-breaker incident.
[FAIL] cycle_freshness: Source 'default' has never completed a full cycle.

The cycle_freshness FAIL is a downstream effect: a later phase (conversation_facts_backfill) hits the dead singleton via engine.getConfig() and aborts the cycle.

Root cause

Three load-bearing facts confirmed in source:

  • retry-matcher.ts:23-27 matches /No database connection/i and adds the typed problem === 'No database connection' branch — explicitly cited as the #1416 follow-up.
  • postgres-engine.ts:batchRetry onRetry only logs + audits — no reconnect path.
  • postgres-engine.ts:reconnect() already exists and is exactly what's needed; it's invoked from minions/supervisor.ts:600 but not from the retry path. _savedConfig is preserved at postgres-engine.ts:95 for this purpose.
  • Bonus: retry.ts:withRetry calls opts.onRetry?.(...) without await, so an async repair callback would race the next attempt regardless.

Fix

Two-file change. ~92 lines including tests.

src/core/retry.ts

Widen WithRetryOpts.onRetry to allow Promise<void> and await the call in withRetry. The old fire-and-forget call meant async pre-retry repair work (e.g. engine.reconnect()) could not run atomically with the retry. Sync callers are unaffected — await undefined resolves immediately.

src/core/postgres-engine.ts

In batchRetry's onRetry, call this.reconnect() before the next attempt sleeps. Mirrors the supervisor pool-restart pattern. engine.reconnect() no-ops when _savedConfig is null (module-singleton mode without a saved engine config), so the call is defensively safe. withRetry only invokes onRetry after isRetryableConnError matches, so every onRetry call is for a connection-class error where reconnect is the right repair.

   onRetry: async (attempt, err) => {
     // ... existing audit + log ...
+    try {
+      await this.reconnect();
+    } catch (reconnectErr) {
+      process.stderr.write(
+        `[${auditSite}] reconnect failed (will retry against existing pool): ${(reconnectErr as Error).message}\n`,
+      );
+    }
   },

Tests

In test/core/retry.test.ts:

  • async onRetry is awaited before the next attempt — pins the contract batchRetry now depends on. Asserts event ordering across an async onRetry with a microtask boundary; without the await patch the assertion would break.
  • async onRetry that throws does not crash withRetry — pins the best-effort repair contract: a throwing reconnect surfaces visibly rather than being swallowed.
$ bun test test/core/retry.test.ts
 39 pass, 0 fail, 194 expect() calls

$ bun test test/core/retry.test.ts test/core/retry-stress.slow.test.ts test/retry-matcher.test.ts test/connection-resilience.test.ts test/doctor-batch-retry.test.ts
 94 pass, 0 fail

$ bun run typecheck
$ tsc --noEmit   # clean

Full bun test run also clean against this branch (the two pglite-engine.test.ts > Stats & Health failures repro on origin/master without these changes — pre-existing, unrelated).

Scope of this PR

Intentionally narrow — only the batch path. The non-batch path (conversation_facts_backfill's direct engine.getConfig() call) has the same root cause but a different fix shape (likely engine.ensureConnected() at phase start, or wrapping getConfig itself in withRetry). Happy to follow up in a second PR if this direction is approved.

Risk

  • await opts.onRetry?.(...) adds at most one event-loop tick per retry. No throughput impact on the common case.
  • this.reconnect() is already exercised in production by the supervisor path. Calling it on every connection-class retry adds pool churn on a real connection blip but eliminates the pathological case of N retries against a dead pool. Net win.
  • reconnect() is _reconnecting-guarded so concurrent retries collapse into a single re-init.

Why not just block the disconnect() call?

Considered. engine.disconnect() mid-cycle is intentional in some lifecycle paths (Astra autopilot between phases, test teardown reuse, etc.) and the existing _connectionStyle === 'module' guard already prevents accidental clobbering between engines. Removing legitimate disconnects to paper over the retry-path gap moves the bug, doesn't fix it. The retry path is the right place to be resilient — that's what retry-matcher's "No database connection" entry already implies the design intends.

…ull repair

PR garrytan#1416 added "No database connection" to retry-matcher's retryable
patterns, but the retry path never re-established the pool. When
engine.disconnect() runs mid-cycle (e.g. between dream phases) the
module-level singleton goes null; the in-flight batch flunks with
"No database connection: connect() has not been called.", retries fire
3/3 against the same null singleton, and the batch is lost.

Live repro on v0.41.26.0 against direct Postgres (not Supavisor):

  [extract.timeline_fs] connection blip, retrying (attempt 1/3): No database connection ...
  [extract.timeline_fs] connection blip, retrying (attempt 2/3): No database connection ...
  [extract.timeline_fs] connection blip, retrying (attempt 3/3): No database connection ...
    batch error (13 timeline rows lost): No database connection ...

Symptoms: 192 exhausted batch retries in 24h (doctor batch_retry_health
FAIL), cycle_freshness FAIL because conversation_facts_backfill's first
engine.getConfig() also hits the dead singleton (immediate, no retry path).

Two-file change:

  src/core/retry.ts
    Widen WithRetryOpts.onRetry to allow Promise<void> and await it in
    withRetry. The old fire-and-forget call meant async pre-retry repair
    work (e.g. engine.reconnect()) could not run atomically with the
    retry — the next attempt raced against unfinished repair.

  src/core/postgres-engine.ts
    In batchRetry's onRetry, call this.reconnect() before the next
    attempt sleeps. Mirrors the supervisor pool-restart pattern at
    minions/supervisor.ts:600. engine.reconnect() no-ops when
    _savedConfig is null (module-singleton mode without a saved engine
    config) so the call is defensively safe. withRetry only invokes
    onRetry after isRetryableConnError matches, so every onRetry call
    is for a connection-class error where reconnect is the right repair.

Tests added in test/core/retry.test.ts:

  - async onRetry is awaited before next attempt (pins the contract
    batchRetry now depends on; without await, asserts event ordering
    would break)
  - onRetry that throws propagates (best-effort repair contract — the
    reconnect failure surfaces visibly rather than being swallowed)

All 39 retry.test.ts tests pass; 94/94 across retry-adjacent suites
(retry, retry-stress, retry-matcher, connection-resilience,
doctor-batch-retry). Typecheck clean.

Does not address the non-batch-wrapped path (conversation_facts_backfill's
direct engine.getConfig fails with the same root cause). That's a
separate, smaller fix — likely engine.ensureConnected() at phase start
— intentionally left out of this PR to keep it surgical.
@jalagrange

Copy link
Copy Markdown
Author

Verified end-to-end on production

Pulled this branch into a live install (v0.41.26.0 base, direct Postgres, 749 pages) and ran gbrain dream. Comparing against the same brain on origin/master run earlier today:

extract.timeline_fs finalize batch

Before:  Timeline: created  0 entries from 706 pages   (13 timeline rows lost)
After:   Timeline: created 37 entries from 706 pages

extract.links_fs connection-blip handling

Audit log entries from ~/.gbrain/audit/batch-retry-2026-W22.jsonl:

Before (24h pre-patch):  192 exhausted retries, worst site extract.links_fs = 174
After (patched run):     1 event total
  ts=15:00:31  site=extract.links_fs  attempt=1  outcome=success  delay=1000ms
               err="No database connection: connect() has not been cal[led]"

The singleton went null mid-extract.links_fs; withRetry fired onRetry, my patch called this.reconnect(), the next attempt found a healthy pool, and the batch completed. One audit row, outcome success, vs. the pre-patch shape of attempt 1/2/3 all exhausted + "batch error (N rows lost)" line.

Doctor

batch_retry_health and cycle_freshness still FAIL in this snapshot — they're 24h-windowed metrics counting the pre-patch exhausted retries that haven't decayed yet. The audit-log delta is the leading indicator and it's clean.

Out-of-scope failures observed in the same run

sync and synthesize phases hit the same "No database connection" at non-batchRetry call sites — exactly what the PR description called out as separate work. Not regressions from this patch; the singleton was equally null there before. Will follow up in a second PR once this direction is approved.

@rayers

rayers commented Jun 2, 2026

Copy link
Copy Markdown

Hit this exact No database connection: connect() has not been called class in production on a long-running gbrain autopilot daemon (v0.42.1.0, instance-pool engine against a shared local Postgres). Wanted to flag a piece that sits underneath this PR, because on current local/integration the batchRetry reconnect callback (reconnect: () => this.reconnect()) is already wired in — yet the crash still happens. The remaining defect is in reconnect() itself.

Symptom: worker runs a full cycle's early phases fine, then late phases throw No database connection, executeJob crashes unhandled, worker exits code=1, respawns, repeats every ~5 min. conversation_facts_backfill is the phase that crashes the worker, but it's a red herring — it's default-OFF, so its only DB touch is loadCfg's engine.getConfig() x7, i.e. it's just the first hard call after the engine already died mid-cycle. The Promotion error: No database connection lines from the worker's promoteDelayed() loop confirm _sql was already null process-wide.

Root cause: PostgresEngine.reconnect() disconnects the live pool before attempting connect():

try { await this.disconnect(); } catch {}   // _sql = null HERE
await this.connect(this._savedConfig);        // if THIS throws, _sql stays null forever

So when a transient blip fires the retry→reconnect path and Postgres is still unreachable for those few seconds, connect() throws and _sql is left null permanently. Every subsequent call uses the accessor this._sql || db.getConnection(); with _sql null it falls to the module singleton, which the autopilot process never connected → throw. Non-bulk methods (getConfig, per-phase reads) have no retry wrapper, so the first one after the failed reconnect throws unhandled and kills the worker.

This is why wiring reconnect into batchRetry (this PR) is necessary but not sufficient: the reconnect it calls can itself brick the engine.

Fix that worked for us — build-then-swap (don't tear down the old pool until the new one is proven live; restore on failure so the next query self-heals):

async reconnect(): Promise<void> {
  if (!this._savedConfig || this._reconnecting) return;
  this._reconnecting = true;
  const oldSql = this._sql;
  const oldManager = this.connectionManager;
  try {
    this._sql = null;                       // force connect() to build fresh
    await this.connect(this._savedConfig);  // validates via SELECT 1 before returning
    if (oldSql) { try { await oldSql.end({ timeout: 5 }); } catch {} }
  } catch (err) {
    if (this._sql && this._sql !== oldSql) { try { await this._sql.end({ timeout: 5 }); } catch {} }
    this._sql = oldSql;
    this.connectionManager = oldManager;
    throw err;                              // let batchRetry's backoff handle it
  } finally {
    this._reconnecting = false;
  }
}

Green on test/core/retry-reconnect.test.ts + retry.test.ts (42 pass), test/e2e/db-singleton-shared-recovery.test.ts (3 pass), and the disconnect-idempotency e2e (7 pass). Looks complementary to the module-singleton refcount work in #1754/#1758 too — that layer and this instance-pool layer are orthogonal; both seem needed. Happy to fold this into whichever PR lands as the canonical fix.

@jalagrange

Copy link
Copy Markdown
Author

@rayers thank you, this is excellent root-cause work, and it matches what I see on master: the instance-pool branch of reconnect() runs disconnect() (which nulls _sql) before connect(), and the catch only logs and rethrows, so a connect() failure during the blip leaves _sql null with no self-heal. Build-then-swap is the right shape: hold the old pool until the new one validates, restore it on failure so the next query recovers. I'd land that as its own PR since it is cleanly scoped and orthogonal to the #1754 refcount work.

Closing this PR as superseded. The change it proposed (await onRetry plus wiring reconnect into the batch retry path) has since landed on master independently: retry.ts now awaits onRetry and exposes a dedicated reconnect opt, and batchRetry passes reconnect: (ctx) => this.reconnect(ctx). So this diff is redundant against current master.

The two real remaining layers, both confirmed, are now split out:

  1. reconnect() itself is not crash-safe on the instance-pool path (your build-then-swap), yours to land when ready.
  2. The non-batch call sites (getConfig, per-phase reads) had no retry wrapper, so the first hard DB call after a mid-cycle disconnect threw unhandled and crashed the worker. I took that one as fix(retry): reconnect on null instance pool in non-batch config reads (#1593 follow-up) #1891.

Thanks again for the deep dig.

@jalagrange jalagrange closed this Jun 5, 2026
rayers added a commit to rayers/gbrain that referenced this pull request Jun 5, 2026
… can't brick the engine (garrytan#1593)

The instance-pool reconnect() did disconnect() (nulling _sql) BEFORE connect(),
so a connect() failure during a transient Postgres blip left _sql null for the
rest of the process — every subsequent non-retry-wrapped call then fell through
to the never-connected module singleton and threw 'No database connection',
crashing the autopilot worker into a respawn loop. Build-then-swap: snapshot the
live pool, build a fresh one, end the old only once the new validates, restore
on failure. Keeps upstream's reap-detection + pool-recovery audit. Confirmed by
jalagrange on closed PR garrytan#1593; this is the Layer-1 fix he left to us.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rayers

rayers commented Jun 5, 2026

Copy link
Copy Markdown

Landed the build-then-swap reconnect crash-safety as #1906 (Layer 1 from your split). Rebased onto current master, keeps the reap-detection + pool-recovery audit, and updates the recovery-contract guard. Complementary to your #1891 (non-batch call sites). Thanks for the clean split here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants