fix(postgres-engine): build-then-swap reconnect() so a failed rebuild can't brick the engine#1906
Open
rayers wants to merge 1 commit into
Open
fix(postgres-engine): build-then-swap reconnect() so a failed rebuild can't brick the engine#1906rayers wants to merge 1 commit into
rayers wants to merge 1 commit into
Conversation
… can't brick the engine (garrytan#1593) The instance-pool reconnect() did disconnect() (nulling _sql) BEFORE connect(), so a connect() failure during a transient Postgres blip left _sql null for the rest of the process — every subsequent non-retry-wrapped call then fell through to the never-connected module singleton and threw 'No database connection', crashing the autopilot worker into a respawn loop. Build-then-swap: snapshot the live pool, build a fresh one, end the old only once the new validates, restore on failure. Keeps upstream's reap-detection + pool-recovery audit. Confirmed by jalagrange on closed PR garrytan#1593; this is the Layer-1 fix he left to us. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
Follow-up to #1593 (closed). That PR wired
reconnectintobatchRetry'sonRetry— which has since landed on master independently. As @jalagrange and I confirmed there, the wiring is necessary but not sufficient: thereconnect()it calls can itself brick the engine. Juan split the remaining work into two layers and left this one (thereconnect()crash-safety) to me; the non-batch-call-sites layer is his #1891.Problem
PostgresEngine.reconnect()on the instance-pool path doesdisconnect()(which nulls_sql) before attemptingconnect():So when a transient blip fires the retry→reconnect path and Postgres is unreachable for those few seconds,
connect()throws and_sqlis left null permanently. Every subsequent call uses the accessorthis._sql || db.getConnection(); with_sqlnull it falls through to the module singleton, which a long-running instance-pool process (e.g. the autopilot worker) never connected → throw. Non-bulk methods (getConfig, per-phase reads) have no retry wrapper, so the first one after the failed reconnect throws unhandled and crashes the worker, which respawns and repeats every ~5 min.Observed in production on a long-running
gbrain autopilotdaemon (direct local Postgres): worker runs a cycle's early phases fine, late phases throwNo database connection: connect() has not been called,executeJobcrashes unhandled, exit code 1, respawn loop.Fix — build-then-swap
Snapshot the live pool, build a fresh one, and only tear the old one down once the new one is proven live (
connect()validates viaSELECT 1before returning). If the rebuild fails, restore the old pool so the engine stays usable — postgres.js pools self-heal on the next query once Postgres is back, andbatchRetry's backoff retries. Keeps the existing reap-detection + pool-recovery audit logging untouched; only the teardown ordering changes.Test
test/connection-resilience.test.ts(25 pass) +test/core/retry-reconnect.test.ts(5 pass), typecheck clean. The recovery-contract guard is updated to assert the build-then-swap old-pool teardown rather than the removeddisconnect()call.Complementary to #1891 (non-batch call-site retry wrapper) and orthogonal to the module-singleton ownership work — this is the instance-pool layer.