Skip to content

fix(postgres-engine): build-then-swap reconnect() so a failed rebuild can't brick the engine#1906

Open
rayers wants to merge 1 commit into
garrytan:masterfrom
rayers:fix/reconnect-build-then-swap
Open

fix(postgres-engine): build-then-swap reconnect() so a failed rebuild can't brick the engine#1906
rayers wants to merge 1 commit into
garrytan:masterfrom
rayers:fix/reconnect-build-then-swap

Conversation

@rayers

@rayers rayers commented Jun 5, 2026

Copy link
Copy Markdown

Context

Follow-up to #1593 (closed). That PR wired reconnect into batchRetry's onRetry — which has since landed on master independently. As @jalagrange and I confirmed there, the wiring is necessary but not sufficient: the reconnect() it calls can itself brick the engine. Juan split the remaining work into two layers and left this one (the reconnect() crash-safety) to me; the non-batch-call-sites layer is his #1891.

Problem

PostgresEngine.reconnect() on the instance-pool path does disconnect() (which nulls _sql) before attempting connect():

try { await this.disconnect(); } catch {}   // _sql = null HERE
await this.connect(this._savedConfig);        // if THIS throws, _sql stays null forever

So when a transient blip fires the retry→reconnect path and Postgres is unreachable for those few seconds, connect() throws and _sql is left null permanently. Every subsequent call uses the accessor this._sql || db.getConnection(); with _sql null it falls through to the module singleton, which a long-running instance-pool process (e.g. the autopilot worker) never connected → throw. Non-bulk methods (getConfig, per-phase reads) have no retry wrapper, so the first one after the failed reconnect throws unhandled and crashes the worker, which respawns and repeats every ~5 min.

Observed in production on a long-running gbrain autopilot daemon (direct local Postgres): worker runs a cycle's early phases fine, late phases throw No database connection: connect() has not been called, executeJob crashes unhandled, exit code 1, respawn loop.

Fix — build-then-swap

Snapshot the live pool, build a fresh one, and only tear the old one down once the new one is proven live (connect() validates via SELECT 1 before returning). If the rebuild fails, restore the old pool so the engine stays usable — postgres.js pools self-heal on the next query once Postgres is back, and batchRetry's backoff retries. Keeps the existing reap-detection + pool-recovery audit logging untouched; only the teardown ordering changes.

Test

test/connection-resilience.test.ts (25 pass) + test/core/retry-reconnect.test.ts (5 pass), typecheck clean. The recovery-contract guard is updated to assert the build-then-swap old-pool teardown rather than the removed disconnect() call.

Complementary to #1891 (non-batch call-site retry wrapper) and orthogonal to the module-singleton ownership work — this is the instance-pool layer.

… can't brick the engine (garrytan#1593)

The instance-pool reconnect() did disconnect() (nulling _sql) BEFORE connect(),
so a connect() failure during a transient Postgres blip left _sql null for the
rest of the process — every subsequent non-retry-wrapped call then fell through
to the never-connected module singleton and threw 'No database connection',
crashing the autopilot worker into a respawn loop. Build-then-swap: snapshot the
live pool, build a fresh one, end the old only once the new validates, restore
on failure. Keeps upstream's reap-detection + pool-recovery audit. Confirmed by
jalagrange on closed PR garrytan#1593; this is the Layer-1 fix he left to us.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant