fix(postgres-engine): build-then-swap reconnect() so a failed rebuild can't brick the engine by rayers · Pull Request #1906 · garrytan/gbrain

rayers · 2026-06-05T22:13:34Z

Context

Follow-up to #1593 (closed). That PR wired reconnect into batchRetry's onRetry — which has since landed on master independently. As @jalagrange and I confirmed there, the wiring is necessary but not sufficient: the reconnect() it calls can itself brick the engine. Juan split the remaining work into two layers and left this one (the reconnect() crash-safety) to me; the non-batch-call-sites layer is his #1891.

Problem

PostgresEngine.reconnect() on the instance-pool path does disconnect() (which nulls _sql) before attempting connect():

try { await this.disconnect(); } catch {}   // _sql = null HERE
await this.connect(this._savedConfig);        // if THIS throws, _sql stays null forever

So when a transient blip fires the retry→reconnect path and Postgres is unreachable for those few seconds, connect() throws and _sql is left null permanently. Every subsequent call uses the accessor this._sql || db.getConnection(); with _sql null it falls through to the module singleton, which a long-running instance-pool process (e.g. the autopilot worker) never connected → throw. Non-bulk methods (getConfig, per-phase reads) have no retry wrapper, so the first one after the failed reconnect throws unhandled and crashes the worker, which respawns and repeats every ~5 min.

Observed in production on a long-running gbrain autopilot daemon (direct local Postgres): worker runs a cycle's early phases fine, late phases throw No database connection: connect() has not been called, executeJob crashes unhandled, exit code 1, respawn loop.

Fix — build-then-swap

Snapshot the live pool, build a fresh one, and only tear the old one down once the new one is proven live (connect() validates via SELECT 1 before returning). If the rebuild fails, restore the old pool so the engine stays usable — postgres.js pools self-heal on the next query once Postgres is back, and batchRetry's backoff retries. Keeps the existing reap-detection + pool-recovery audit logging untouched; only the teardown ordering changes.

Test

test/connection-resilience.test.ts (25 pass) + test/core/retry-reconnect.test.ts (5 pass), typecheck clean. The recovery-contract guard is updated to assert the build-then-swap old-pool teardown rather than the removed disconnect() call.

Complementary to #1891 (non-batch call-site retry wrapper) and orthogonal to the module-singleton ownership work — this is the instance-pool layer.

… can't brick the engine (garrytan#1593) The instance-pool reconnect() did disconnect() (nulling _sql) BEFORE connect(), so a connect() failure during a transient Postgres blip left _sql null for the rest of the process — every subsequent non-retry-wrapped call then fell through to the never-connected module singleton and threw 'No database connection', crashing the autopilot worker into a respawn loop. Build-then-swap: snapshot the live pool, build a fresh one, end the old only once the new validates, restore on failure. Keeps upstream's reap-detection + pool-recovery audit. Confirmed by jalagrange on closed PR garrytan#1593; this is the Layer-1 fix he left to us. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

rayers mentioned this pull request Jun 5, 2026

fix(retry): reconnect engine pool in batchRetry onRetry — singleton-null repair #1593

Closed

maxpetrusenkoagent mentioned this pull request Jun 9, 2026

fix(minions): reconnect worker after promote connection loss #2025

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(postgres-engine): build-then-swap reconnect() so a failed rebuild can't brick the engine#1906

fix(postgres-engine): build-then-swap reconnect() so a failed rebuild can't brick the engine#1906
rayers wants to merge 1 commit into
garrytan:masterfrom
rayers:fix/reconnect-build-then-swap

rayers commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rayers commented Jun 5, 2026

Context

Problem

Fix — build-then-swap

Test

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant