Skip to content

fix(db-lock): self-heal cycle-lock refresh on pooler-reaped CONNECTION_ENDED#1669

Open
garrytan-agents wants to merge 1 commit into
garrytan:masterfrom
garrytan-agents:fix/lock-refresh-reserved-conn
Open

fix(db-lock): self-heal cycle-lock refresh on pooler-reaped CONNECTION_ENDED#1669
garrytan-agents wants to merge 1 commit into
garrytan:masterfrom
garrytan-agents:fix/lock-refresh-reserved-conn

Conversation

@garrytan-agents

Copy link
Copy Markdown
Contributor

Problem

A production worker logged 400+ "crashes" / 24h — one every ~5 minutes — all with the same chain:

[cycle.lint] done
Promotion error: No database connection: connect() has not been called.
Job NNNN (autopilot-cycle) released after infrastructure abort (lock-renewal-failed); stall detector will requeue (no attempt burned)
write CONNECTION_ENDED aws-1-...pooler.supabase.com:6543
worker_exited code=1 ... likely_cause=runtime_error

These aren't data-losing (jobs requeue, no attempt burned — withRefreshingLock's designed safety release) but they are noisy and throughput-killing: the worker restarts every cycle and the autopilot queue backs up.

Root cause

withRefreshingLock's refresh tick fires every (TTL/6) ms — ~5 min for the default 30-min cycle lock. But the read pool's idle_timeout is 20s (connection-manager.ts), so the connection is always reaped between ticks. Against a transaction-mode pooler (PgBouncer / Supavisor :6543) the reconnect races and postgres.js raises write CONNECTION_ENDED.

That throw escapes both engineSelectOne (the SELECT-1 heartbeat) and handle.refresh(), trips withRefreshingLock's catch, sets healthOk=false, and the worker voluntarily releases + exits code 1.

The existing retry matcher (retry-matcher.ts) caught the downstream No database connection GBrainError — but only after the pool was already nulled. The root CONNECTION_ENDED had no matching pattern, so the first (recoverable) error escaped retry entirely.

Fix (two minimal, idiomatic changes)

  1. retry-matcher.ts — add CONNECTION_ENDED (postgres.js error code + message pattern). Benefits every withRetry call site, not just locks.

  2. db-lock.ts — wrap refresh() and the SELECT-1 heartbeat in withRetry with a duck-typed reconnect() callback (PostgresEngine-private; undefined for PGLite, which can't lose a socket). A reaped connection self-heals (rebuild pool, retry) instead of killing the lock holder. Mirrors the engine-level batchRetry idiom from bug: withRetry doesn't reconnect on 'No database connection' — batch rows lost every dream cycle #1570.

Tests

  • +2 CONNECTION_ENDED cases in retry-matcher.test.ts (code form + message form).
  • tsc --noEmit clean.
  • retry-matcher (17 pass) + db-lock-refresh + worker-lock-renewal + worker-lock-renewal-shape suites green.

Notes / alternatives considered

  • Bumping idle_timeout: rejected — fights the pooler's own reaping and just moves the race.
  • Pinning refresh to withReservedConnection (the db-lock.ts:552 "Lane B follow-up" TODO): heavier surface; withRetry+reconnect achieves the same self-heal with far less risk and reuses the existing primitive. The reserved-connection pin is still a reasonable future hardening.

…n (CONNECTION_ENDED)

The cycle-lock refresh tick fires every (TTL/6) ms (~5 min for the default
30-min lock), but the read pool idle_timeout is 20s, so the connection is
always reaped between ticks. Against a transaction-mode pooler (PgBouncer/
Supavisor :6543) the reconnect races and postgres.js raises
`write CONNECTION_ENDED`. That unhandled throw trips withRefreshingLock,
sets healthOk=false, and the worker voluntarily releases + exits code 1 —
an every-~5-min "crash" loop (observed 400+ crashes/24h on a prod worker,
all clean safety-releases requeued with no attempt burned, but noisy and
throughput-killing).

Two minimal, idiomatic fixes:

1. retry-matcher: add CONNECTION_ENDED (postgres.js error code + message).
   The matcher previously only caught the downstream "No database connection"
   GBrainError after the pool was already nulled — the ROOT transient drop
   escaped retry. Benefits every withRetry site.

2. db-lock: wrap refresh() and the SELECT-1 heartbeat in withRetry with a
   duck-typed reconnect() callback (PostgresEngine-private; undefined for
   PGLite). A reaped socket now self-heals (rebuild pool, retry) instead of
   killing the lock holder. Mirrors the engine-level batchRetry idiom (garrytan#1570).

Tests: +2 CONNECTION_ENDED cases in retry-matcher.test.ts. tsc clean;
retry-matcher + db-lock-refresh + worker-lock-renewal suites green
(57->retained pass, new cases pass).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant