fix(db-lock): self-heal cycle-lock refresh on pooler-reaped CONNECTION_ENDED#1669
Open
garrytan-agents wants to merge 1 commit into
Open
fix(db-lock): self-heal cycle-lock refresh on pooler-reaped CONNECTION_ENDED#1669garrytan-agents wants to merge 1 commit into
garrytan-agents wants to merge 1 commit into
Conversation
…n (CONNECTION_ENDED) The cycle-lock refresh tick fires every (TTL/6) ms (~5 min for the default 30-min lock), but the read pool idle_timeout is 20s, so the connection is always reaped between ticks. Against a transaction-mode pooler (PgBouncer/ Supavisor :6543) the reconnect races and postgres.js raises `write CONNECTION_ENDED`. That unhandled throw trips withRefreshingLock, sets healthOk=false, and the worker voluntarily releases + exits code 1 — an every-~5-min "crash" loop (observed 400+ crashes/24h on a prod worker, all clean safety-releases requeued with no attempt burned, but noisy and throughput-killing). Two minimal, idiomatic fixes: 1. retry-matcher: add CONNECTION_ENDED (postgres.js error code + message). The matcher previously only caught the downstream "No database connection" GBrainError after the pool was already nulled — the ROOT transient drop escaped retry. Benefits every withRetry site. 2. db-lock: wrap refresh() and the SELECT-1 heartbeat in withRetry with a duck-typed reconnect() callback (PostgresEngine-private; undefined for PGLite). A reaped socket now self-heals (rebuild pool, retry) instead of killing the lock holder. Mirrors the engine-level batchRetry idiom (garrytan#1570). Tests: +2 CONNECTION_ENDED cases in retry-matcher.test.ts. tsc clean; retry-matcher + db-lock-refresh + worker-lock-renewal suites green (57->retained pass, new cases pass).
This was referenced May 31, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
A production worker logged 400+ "crashes" / 24h — one every ~5 minutes — all with the same chain:
These aren't data-losing (jobs requeue, no attempt burned —
withRefreshingLock's designed safety release) but they are noisy and throughput-killing: the worker restarts every cycle and the autopilot queue backs up.Root cause
withRefreshingLock's refresh tick fires every(TTL/6)ms — ~5 min for the default 30-min cycle lock. But the read pool'sidle_timeoutis 20s (connection-manager.ts), so the connection is always reaped between ticks. Against a transaction-mode pooler (PgBouncer / Supavisor :6543) the reconnect races and postgres.js raiseswrite CONNECTION_ENDED.That throw escapes both
engineSelectOne(the SELECT-1 heartbeat) andhandle.refresh(), tripswithRefreshingLock's catch, setshealthOk=false, and the worker voluntarily releases + exits code 1.The existing retry matcher (
retry-matcher.ts) caught the downstreamNo database connectionGBrainError — but only after the pool was already nulled. The rootCONNECTION_ENDEDhad no matching pattern, so the first (recoverable) error escaped retry entirely.Fix (two minimal, idiomatic changes)
retry-matcher.ts— addCONNECTION_ENDED(postgres.js error code + message pattern). Benefits everywithRetrycall site, not just locks.db-lock.ts— wraprefresh()and the SELECT-1 heartbeat inwithRetrywith a duck-typedreconnect()callback (PostgresEngine-private;undefinedfor PGLite, which can't lose a socket). A reaped connection self-heals (rebuild pool, retry) instead of killing the lock holder. Mirrors the engine-levelbatchRetryidiom from bug: withRetry doesn't reconnect on 'No database connection' — batch rows lost every dream cycle #1570.Tests
+2CONNECTION_ENDED cases inretry-matcher.test.ts(code form + message form).tsc --noEmitclean.retry-matcher(17 pass) +db-lock-refresh+worker-lock-renewal+worker-lock-renewal-shapesuites green.Notes / alternatives considered
idle_timeout: rejected — fights the pooler's own reaping and just moves the race.withReservedConnection(thedb-lock.ts:552"Lane B follow-up" TODO): heavier surface;withRetry+reconnect achieves the same self-heal with far less risk and reuses the existing primitive. The reserved-connection pin is still a reasonable future hardening.