Skip to content

fix(retry): reconnect on null instance pool in non-batch config reads (#1593 follow-up)#1891

Open
jalagrange wants to merge 1 commit into
garrytan:masterfrom
jalagrange:fix/non-batch-config-reconnect
Open

fix(retry): reconnect on null instance pool in non-batch config reads (#1593 follow-up)#1891
jalagrange wants to merge 1 commit into
garrytan:masterfrom
jalagrange:fix/non-batch-config-reconnect

Conversation

@jalagrange

Copy link
Copy Markdown

Problem

#1593 wired reconnect() into the batch retry path so a null instance pool rebuilds itself mid-cycle. The non-batch path was explicitly left as follow-up in that PR's Scope section, and @rayers confirmed it in production (#1593 (comment) #1593 (comment)): on a long-running gbrain autopilot daemon, late-cycle phases throw No database connection, executeJob crashes unhandled, the worker exits code 1 and respawns every ~5 min.

The first hard DB call after a mid-cycle instance-pool teardown is loadCfg's engine.getConfig() (x7). That accessor, like its siblings, touches this.sql directly with no retry wrapper.

Root cause

The sql getter already throws a retryable No database connection when an instance pool went null (issue #1678, postgres-engine.ts:136), by design, precisely so a withRetry+reconnect caller rebuilds the pool and recovers. The batch path (batchRetry) has that wrapper. The config accessors (getConfig, setConfig, unsetConfig, listConfigKeys) do not, so the retryable throw propagates unhandled and kills the worker.

Fix

Add connRetry, a single-statement sibling of batchRetry: same reconnect: (ctx) => this.reconnect(ctx) posture and BULK_RETRY_OPTS timing, but no batch-retry audit (these are not sized batches, so they must not inflate the batch_retry_health doctor metric). Route the four config accessors through it. Each retried thunk re-reads this.sql, so a reconnect between attempts is picked up.

All four are single-statement and idempotent (reads; upsert; delete), so retry-on-connection-error is safe. withRetry only retries on isRetryableConnError, so non-connection errors (syntax, constraint) still throw immediately and are not masked.

Tests

test/postgres-engine-config-reconnect.test.ts (pure, no DB; pokes private fields + stubs reconnect):

  • getConfig reconnects + retries a null instance pool, then returns the value (exactly one reconnect).
  • getConfig surfaces a non-retryable error without reconnecting (no masking).
  • listConfigKeys reconnects + retries a null instance pool, then returns keys.
$ bun test test/postgres-engine-config-reconnect.test.ts
 3 pass, 0 fail

$ bun test test/postgres-engine-getter-selfheal.test.ts test/config.test.ts test/config-set.test.ts test/config-unset.test.ts test/core/retry.test.ts test/connection-resilience.test.ts
 115 pass, 0 fail

$ bun run typecheck   # tsc --noEmit, clean

Scope and relationship to other PRs

…garrytan#1593 follow-up)

getConfig / setConfig / unsetConfig / listConfigKeys touch `this.sql`
directly. When an instance pool is torn down mid-cycle the `sql` getter
throws a retryable "No database connection" (issue garrytan#1678) by design, so a
withRetry+reconnect caller self-heals. The batch path had that wrapper;
these non-batch accessors did not, so the first hard DB call after a
mid-cycle disconnect (loadCfg's getConfig) threw unhandled and crashed the
worker into a respawn loop (reported by rayers on garrytan#1593).

Add a `connRetry` helper (single-statement sibling of batchRetry, same
reconnect posture, no batch audit) and route the config accessors through
it. All four are single-statement and idempotent, so retry-on-connection-
error is safe; non-retryable errors still throw immediately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant