fix(retry): reconnect on null instance pool in non-batch config reads (#1593 follow-up)#1891
Open
jalagrange wants to merge 1 commit into
Open
fix(retry): reconnect on null instance pool in non-batch config reads (#1593 follow-up)#1891jalagrange wants to merge 1 commit into
jalagrange wants to merge 1 commit into
Conversation
…garrytan#1593 follow-up) getConfig / setConfig / unsetConfig / listConfigKeys touch `this.sql` directly. When an instance pool is torn down mid-cycle the `sql` getter throws a retryable "No database connection" (issue garrytan#1678) by design, so a withRetry+reconnect caller self-heals. The batch path had that wrapper; these non-batch accessors did not, so the first hard DB call after a mid-cycle disconnect (loadCfg's getConfig) threw unhandled and crashed the worker into a respawn loop (reported by rayers on garrytan#1593). Add a `connRetry` helper (single-statement sibling of batchRetry, same reconnect posture, no batch audit) and route the config accessors through it. All four are single-statement and idempotent, so retry-on-connection- error is safe; non-retryable errors still throw immediately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
#1593 wired
reconnect()into the batch retry path so a null instance pool rebuilds itself mid-cycle. The non-batch path was explicitly left as follow-up in that PR's Scope section, and @rayers confirmed it in production (#1593 (comment) #1593 (comment)): on a long-runninggbrain autopilotdaemon, late-cycle phases throwNo database connection,executeJobcrashes unhandled, the worker exits code 1 and respawns every ~5 min.The first hard DB call after a mid-cycle instance-pool teardown is
loadCfg'sengine.getConfig()(x7). That accessor, like its siblings, touchesthis.sqldirectly with no retry wrapper.Root cause
The
sqlgetter already throws a retryableNo database connectionwhen an instance pool went null (issue #1678,postgres-engine.ts:136), by design, precisely so awithRetry+reconnectcaller rebuilds the pool and recovers. The batch path (batchRetry) has that wrapper. The config accessors (getConfig,setConfig,unsetConfig,listConfigKeys) do not, so the retryable throw propagates unhandled and kills the worker.Fix
Add
connRetry, a single-statement sibling ofbatchRetry: samereconnect: (ctx) => this.reconnect(ctx)posture andBULK_RETRY_OPTStiming, but no batch-retry audit (these are not sized batches, so they must not inflate thebatch_retry_healthdoctor metric). Route the four config accessors through it. Each retried thunk re-readsthis.sql, so a reconnect between attempts is picked up.All four are single-statement and idempotent (reads; upsert; delete), so retry-on-connection-error is safe.
withRetryonly retries onisRetryableConnError, so non-connection errors (syntax, constraint) still throw immediately and are not masked.Tests
test/postgres-engine-config-reconnect.test.ts(pure, no DB; pokes private fields + stubsreconnect):getConfigreconnects + retries a null instance pool, then returns the value (exactly one reconnect).getConfigsurfaces a non-retryable error without reconnecting (no masking).listConfigKeysreconnects + retries a null instance pool, then returns keys.Scope and relationship to other PRs
reconnect()hardening @rayers proposed on fix(retry): reconnect engine pool in batchRetry onRetry — singleton-null repair #1593 (build-then-swap): that makesreconnect()itself crash-safe, this makes the non-batch callers invoke reconnect at all. Both are needed and compose cleanly.db.ts).connRetryhelper extends to other non-batch reads as obvious follow-ups.