fix(retry): reconnect engine pool in batchRetry onRetry — singleton-null repair#1593
fix(retry): reconnect engine pool in batchRetry onRetry — singleton-null repair#1593jalagrange wants to merge 1 commit into
Conversation
…ull repair PR garrytan#1416 added "No database connection" to retry-matcher's retryable patterns, but the retry path never re-established the pool. When engine.disconnect() runs mid-cycle (e.g. between dream phases) the module-level singleton goes null; the in-flight batch flunks with "No database connection: connect() has not been called.", retries fire 3/3 against the same null singleton, and the batch is lost. Live repro on v0.41.26.0 against direct Postgres (not Supavisor): [extract.timeline_fs] connection blip, retrying (attempt 1/3): No database connection ... [extract.timeline_fs] connection blip, retrying (attempt 2/3): No database connection ... [extract.timeline_fs] connection blip, retrying (attempt 3/3): No database connection ... batch error (13 timeline rows lost): No database connection ... Symptoms: 192 exhausted batch retries in 24h (doctor batch_retry_health FAIL), cycle_freshness FAIL because conversation_facts_backfill's first engine.getConfig() also hits the dead singleton (immediate, no retry path). Two-file change: src/core/retry.ts Widen WithRetryOpts.onRetry to allow Promise<void> and await it in withRetry. The old fire-and-forget call meant async pre-retry repair work (e.g. engine.reconnect()) could not run atomically with the retry — the next attempt raced against unfinished repair. src/core/postgres-engine.ts In batchRetry's onRetry, call this.reconnect() before the next attempt sleeps. Mirrors the supervisor pool-restart pattern at minions/supervisor.ts:600. engine.reconnect() no-ops when _savedConfig is null (module-singleton mode without a saved engine config) so the call is defensively safe. withRetry only invokes onRetry after isRetryableConnError matches, so every onRetry call is for a connection-class error where reconnect is the right repair. Tests added in test/core/retry.test.ts: - async onRetry is awaited before next attempt (pins the contract batchRetry now depends on; without await, asserts event ordering would break) - onRetry that throws propagates (best-effort repair contract — the reconnect failure surfaces visibly rather than being swallowed) All 39 retry.test.ts tests pass; 94/94 across retry-adjacent suites (retry, retry-stress, retry-matcher, connection-resilience, doctor-batch-retry). Typecheck clean. Does not address the non-batch-wrapped path (conversation_facts_backfill's direct engine.getConfig fails with the same root cause). That's a separate, smaller fix — likely engine.ensureConnected() at phase start — intentionally left out of this PR to keep it surgical.
Verified end-to-end on productionPulled this branch into a live install (v0.41.26.0 base, direct Postgres, 749 pages) and ran
|
|
Hit this exact Symptom: worker runs a full cycle's early phases fine, then late phases throw Root cause: try { await this.disconnect(); } catch {} // _sql = null HERE
await this.connect(this._savedConfig); // if THIS throws, _sql stays null foreverSo when a transient blip fires the retry→reconnect path and Postgres is still unreachable for those few seconds, This is why wiring reconnect into batchRetry (this PR) is necessary but not sufficient: the reconnect it calls can itself brick the engine. Fix that worked for us — build-then-swap (don't tear down the old pool until the new one is proven live; restore on failure so the next query self-heals): async reconnect(): Promise<void> {
if (!this._savedConfig || this._reconnecting) return;
this._reconnecting = true;
const oldSql = this._sql;
const oldManager = this.connectionManager;
try {
this._sql = null; // force connect() to build fresh
await this.connect(this._savedConfig); // validates via SELECT 1 before returning
if (oldSql) { try { await oldSql.end({ timeout: 5 }); } catch {} }
} catch (err) {
if (this._sql && this._sql !== oldSql) { try { await this._sql.end({ timeout: 5 }); } catch {} }
this._sql = oldSql;
this.connectionManager = oldManager;
throw err; // let batchRetry's backoff handle it
} finally {
this._reconnecting = false;
}
}Green on |
|
@rayers thank you, this is excellent root-cause work, and it matches what I see on master: the instance-pool branch of Closing this PR as superseded. The change it proposed (await The two real remaining layers, both confirmed, are now split out:
Thanks again for the deep dig. |
… can't brick the engine (garrytan#1593) The instance-pool reconnect() did disconnect() (nulling _sql) BEFORE connect(), so a connect() failure during a transient Postgres blip left _sql null for the rest of the process — every subsequent non-retry-wrapped call then fell through to the never-connected module singleton and threw 'No database connection', crashing the autopilot worker into a respawn loop. Build-then-swap: snapshot the live pool, build a fresh one, end the old only once the new validates, restore on failure. Keeps upstream's reap-detection + pool-recovery audit. Confirmed by jalagrange on closed PR garrytan#1593; this is the Layer-1 fix he left to us. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
#1416 added
"No database connection"toretry-matcher.ts's retryable patterns, but the retry path never re-establishes the connection pool. Whenengine.disconnect()runs mid-cycle (for example between dream phases, or viaconnection-managerlifecycle), the module-level singleton indb.tsgoes null; the in-flight batch flunks with"No database connection: connect() has not been called.",withRetryfaithfully fires 3/3 retries against the same null singleton, and the batch is silently lost.Live repro
gbrain dreamon v0.41.26.0 against direct Postgres (not Supavisor):Downstream doctor symptoms:
The
cycle_freshnessFAIL is a downstream effect: a later phase (conversation_facts_backfill) hits the dead singleton viaengine.getConfig()and aborts the cycle.Root cause
Three load-bearing facts confirmed in source:
retry-matcher.ts:23-27matches/No database connection/iand adds the typedproblem === 'No database connection'branch — explicitly cited as the #1416 follow-up.postgres-engine.ts:batchRetryonRetryonly logs + audits — no reconnect path.postgres-engine.ts:reconnect()already exists and is exactly what's needed; it's invoked fromminions/supervisor.ts:600but not from the retry path._savedConfigis preserved atpostgres-engine.ts:95for this purpose.retry.ts:withRetrycallsopts.onRetry?.(...)withoutawait, so an async repair callback would race the next attempt regardless.Fix
Two-file change. ~92 lines including tests.
src/core/retry.tsWiden
WithRetryOpts.onRetryto allowPromise<void>andawaitthe call inwithRetry. The old fire-and-forget call meant async pre-retry repair work (e.g.engine.reconnect()) could not run atomically with the retry. Sync callers are unaffected —await undefinedresolves immediately.src/core/postgres-engine.tsIn
batchRetry'sonRetry, callthis.reconnect()before the next attempt sleeps. Mirrors the supervisor pool-restart pattern.engine.reconnect()no-ops when_savedConfigis null (module-singleton mode without a saved engine config), so the call is defensively safe.withRetryonly invokesonRetryafterisRetryableConnErrormatches, so everyonRetrycall is for a connection-class error where reconnect is the right repair.onRetry: async (attempt, err) => { // ... existing audit + log ... + try { + await this.reconnect(); + } catch (reconnectErr) { + process.stderr.write( + `[${auditSite}] reconnect failed (will retry against existing pool): ${(reconnectErr as Error).message}\n`, + ); + } },Tests
In
test/core/retry.test.ts:async onRetry is awaited before the next attempt— pins the contractbatchRetrynow depends on. Asserts event ordering across an asynconRetrywith a microtask boundary; without theawaitpatch the assertion would break.async onRetry that throws does not crash withRetry— pins the best-effort repair contract: a throwing reconnect surfaces visibly rather than being swallowed.Full
bun testrun also clean against this branch (the twopglite-engine.test.ts > Stats & Healthfailures repro onorigin/masterwithout these changes — pre-existing, unrelated).Scope of this PR
Intentionally narrow — only the batch path. The non-batch path (
conversation_facts_backfill's directengine.getConfig()call) has the same root cause but a different fix shape (likelyengine.ensureConnected()at phase start, or wrappinggetConfigitself in withRetry). Happy to follow up in a second PR if this direction is approved.Risk
await opts.onRetry?.(...)adds at most one event-loop tick per retry. No throughput impact on the common case.this.reconnect()is already exercised in production by the supervisor path. Calling it on every connection-class retry adds pool churn on a real connection blip but eliminates the pathological case of N retries against a dead pool. Net win.reconnect()is_reconnecting-guarded so concurrent retries collapse into a single re-init.Why not just block the
disconnect()call?Considered.
engine.disconnect()mid-cycle is intentional in some lifecycle paths (Astra autopilot between phases, test teardown reuse, etc.) and the existing_connectionStyle === 'module'guard already prevents accidental clobbering between engines. Removing legitimate disconnects to paper over the retry-path gap moves the bug, doesn't fix it. The retry path is the right place to be resilient — that's whatretry-matcher's"No database connection"entry already implies the design intends.