fix(db-lock): self-heal cycle-lock refresh on pooler-reaped CONNECTION_ENDED by garrytan-agents · Pull Request #1669 · garrytan/gbrain

garrytan-agents · 2026-05-30T23:14:04Z

Problem

A production worker logged 400+ "crashes" / 24h — one every ~5 minutes — all with the same chain:

[cycle.lint] done
Promotion error: No database connection: connect() has not been called.
Job NNNN (autopilot-cycle) released after infrastructure abort (lock-renewal-failed); stall detector will requeue (no attempt burned)
write CONNECTION_ENDED aws-1-...pooler.supabase.com:6543
worker_exited code=1 ... likely_cause=runtime_error

These aren't data-losing (jobs requeue, no attempt burned — withRefreshingLock's designed safety release) but they are noisy and throughput-killing: the worker restarts every cycle and the autopilot queue backs up.

Root cause

withRefreshingLock's refresh tick fires every (TTL/6) ms — ~5 min for the default 30-min cycle lock. But the read pool's idle_timeout is 20s (connection-manager.ts), so the connection is always reaped between ticks. Against a transaction-mode pooler (PgBouncer / Supavisor :6543) the reconnect races and postgres.js raises write CONNECTION_ENDED.

That throw escapes both engineSelectOne (the SELECT-1 heartbeat) and handle.refresh(), trips withRefreshingLock's catch, sets healthOk=false, and the worker voluntarily releases + exits code 1.

The existing retry matcher (retry-matcher.ts) caught the downstream No database connection GBrainError — but only after the pool was already nulled. The root CONNECTION_ENDED had no matching pattern, so the first (recoverable) error escaped retry entirely.

Fix (two minimal, idiomatic changes)

retry-matcher.ts — add CONNECTION_ENDED (postgres.js error code + message pattern). Benefits every withRetry call site, not just locks.
db-lock.ts — wrap refresh() and the SELECT-1 heartbeat in withRetry with a duck-typed reconnect() callback (PostgresEngine-private; undefined for PGLite, which can't lose a socket). A reaped connection self-heals (rebuild pool, retry) instead of killing the lock holder. Mirrors the engine-level batchRetry idiom from bug: withRetry doesn't reconnect on 'No database connection' — batch rows lost every dream cycle #1570.

Tests

+2 CONNECTION_ENDED cases in retry-matcher.test.ts (code form + message form).
tsc --noEmit clean.
retry-matcher (17 pass) + db-lock-refresh + worker-lock-renewal + worker-lock-renewal-shape suites green.

Notes / alternatives considered

Bumping idle_timeout: rejected — fights the pooler's own reaping and just moves the race.
Pinning refresh to withReservedConnection (the db-lock.ts:552 "Lane B follow-up" TODO): heavier surface; withRetry+reconnect achieves the same self-heal with far less risk and reuses the existing primitive. The reserved-connection pin is still a reasonable future hardening.

…n (CONNECTION_ENDED) The cycle-lock refresh tick fires every (TTL/6) ms (~5 min for the default 30-min lock), but the read pool idle_timeout is 20s, so the connection is always reaped between ticks. Against a transaction-mode pooler (PgBouncer/ Supavisor :6543) the reconnect races and postgres.js raises `write CONNECTION_ENDED`. That unhandled throw trips withRefreshingLock, sets healthOk=false, and the worker voluntarily releases + exits code 1 — an every-~5-min "crash" loop (observed 400+ crashes/24h on a prod worker, all clean safety-releases requeued with no attempt burned, but noisy and throughput-killing). Two minimal, idiomatic fixes: 1. retry-matcher: add CONNECTION_ENDED (postgres.js error code + message). The matcher previously only caught the downstream "No database connection" GBrainError after the pool was already nulled — the ROOT transient drop escaped retry. Benefits every withRetry site. 2. db-lock: wrap refresh() and the SELECT-1 heartbeat in withRetry with a duck-typed reconnect() callback (PostgresEngine-private; undefined for PGLite). A reaped socket now self-heals (rebuild pool, retry) instead of killing the lock holder. Mirrors the engine-level batchRetry idiom (garrytan#1570). Tests: +2 CONNECTION_ENDED cases in retry-matcher.test.ts. tsc clean; retry-matcher + db-lock-refresh + worker-lock-renewal suites green (57->retained pass, new cases pass).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(db-lock): self-heal cycle-lock refresh on pooler-reaped CONNECTION_ENDED#1669

fix(db-lock): self-heal cycle-lock refresh on pooler-reaped CONNECTION_ENDED#1669
garrytan-agents wants to merge 1 commit into
garrytan:masterfrom
garrytan-agents:fix/lock-refresh-reserved-conn

garrytan-agents commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

garrytan-agents commented May 30, 2026

Problem

Root cause

Fix (two minimal, idiomatic changes)

Tests

Notes / alternatives considered

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant