fix(retry): reconnect on null instance pool in non-batch config reads (#1593 follow-up) by jalagrange · Pull Request #1891 · garrytan/gbrain

jalagrange · 2026-06-05T16:39:46Z

Problem

#1593 wired reconnect() into the batch retry path so a null instance pool rebuilds itself mid-cycle. The non-batch path was explicitly left as follow-up in that PR's Scope section, and @rayers confirmed it in production (#1593 (comment) #1593 (comment)): on a long-running gbrain autopilot daemon, late-cycle phases throw No database connection, executeJob crashes unhandled, the worker exits code 1 and respawns every ~5 min.

The first hard DB call after a mid-cycle instance-pool teardown is loadCfg's engine.getConfig() (x7). That accessor, like its siblings, touches this.sql directly with no retry wrapper.

Root cause

The sql getter already throws a retryable No database connection when an instance pool went null (issue #1678, postgres-engine.ts:136), by design, precisely so a withRetry+reconnect caller rebuilds the pool and recovers. The batch path (batchRetry) has that wrapper. The config accessors (getConfig, setConfig, unsetConfig, listConfigKeys) do not, so the retryable throw propagates unhandled and kills the worker.

Fix

Add connRetry, a single-statement sibling of batchRetry: same reconnect: (ctx) => this.reconnect(ctx) posture and BULK_RETRY_OPTS timing, but no batch-retry audit (these are not sized batches, so they must not inflate the batch_retry_health doctor metric). Route the four config accessors through it. Each retried thunk re-reads this.sql, so a reconnect between attempts is picked up.

All four are single-statement and idempotent (reads; upsert; delete), so retry-on-connection-error is safe. withRetry only retries on isRetryableConnError, so non-connection errors (syntax, constraint) still throw immediately and are not masked.

Tests

test/postgres-engine-config-reconnect.test.ts (pure, no DB; pokes private fields + stubs reconnect):

getConfig reconnects + retries a null instance pool, then returns the value (exactly one reconnect).
getConfig surfaces a non-retryable error without reconnecting (no masking).
listConfigKeys reconnects + retries a null instance pool, then returns keys.

$ bun test test/postgres-engine-config-reconnect.test.ts
 3 pass, 0 fail

$ bun test test/postgres-engine-getter-selfheal.test.ts test/config.test.ts test/config-set.test.ts test/config-unset.test.ts test/core/retry.test.ts test/connection-resilience.test.ts
 115 pass, 0 fail

$ bun run typecheck   # tsc --noEmit, clean

Scope and relationship to other PRs

Builds on fix(retry): reconnect engine pool in batchRetry onRetry — singleton-null repair #1593 (batch-path reconnect, now reflected on master) and is the non-batch half that PR called out as follow-up.
Orthogonal to the instance-pool reconnect() hardening @rayers proposed on fix(retry): reconnect engine pool in batchRetry onRetry — singleton-null repair #1593 (build-then-swap): that makes reconnect() itself crash-safe, this makes the non-batch callers invoke reconnect at all. Both are needed and compose cleanly.
Orthogonal to fix(db): refcount module singleton to fix mid-cycle "connect() has not been called" (#1570) #1754 (module-singleton refcount, db.ts).
Intentionally narrow: the config accessor family only. The same connRetry helper extends to other non-batch reads as obvious follow-ups.

…garrytan#1593 follow-up) getConfig / setConfig / unsetConfig / listConfigKeys touch `this.sql` directly. When an instance pool is torn down mid-cycle the `sql` getter throws a retryable "No database connection" (issue garrytan#1678) by design, so a withRetry+reconnect caller self-heals. The batch path had that wrapper; these non-batch accessors did not, so the first hard DB call after a mid-cycle disconnect (loadCfg's getConfig) threw unhandled and crashed the worker into a respawn loop (reported by rayers on garrytan#1593). Add a `connRetry` helper (single-statement sibling of batchRetry, same reconnect posture, no batch audit) and route the config accessors through it. All four are single-statement and idempotent, so retry-on-connection- error is safe; non-retryable errors still throw immediately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jalagrange mentioned this pull request Jun 5, 2026

fix(retry): reconnect engine pool in batchRetry onRetry — singleton-null repair #1593

Closed

rayers mentioned this pull request Jun 5, 2026

fix(postgres-engine): build-then-swap reconnect() so a failed rebuild can't brick the engine #1906

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(retry): reconnect on null instance pool in non-batch config reads (#1593 follow-up)#1891

fix(retry): reconnect on null instance pool in non-batch config reads (#1593 follow-up)#1891
jalagrange wants to merge 1 commit into
garrytan:masterfrom
jalagrange:fix/non-batch-config-reconnect

jalagrange commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jalagrange commented Jun 5, 2026

Problem

Root cause

Fix

Tests

Scope and relationship to other PRs

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant