Version: gbrain 0.41.11.1 (Postgres engine, gbrain autopilot continuous daemon)
TL;DR
When the Postgres connection drops under a long-running gbrain autopilot worker (e.g. a Supabase Supavisor session-pooler closing the socket), the autopilot-managed child worker (gbrain jobs work) never re-establishes the connection in-process. Its background interval loops (promoteDelayed, stall-detection) and the failure-recording path (failJob) just throw No database connection repeatedly, the original job error gets masked, and the worker exits code=1 and respawns. On a brain that sees periodic pooler closes this becomes a crash-loop.
This is not the RSS watchdog — no [watchdog rss=…] events occur. The initiating failure is a socket/pooler close.
Observed (sanitized)
Continuous gbrain autopilot daemon, child worker gbrain jobs work --max-rss 2048, Supabase Supavisor session pooler (<project>.pooler.supabase.com:5432).
[extract.links_fs] connection blip, retrying 7 rows in 500ms (No database connection: connect() has not been called…)
Promotion error: write CONNECTION_CLOSED <project>.pooler.supabase.com:5432
Stall detection error: write CONNECTION_CLOSED <project>.pooler.supabase.com:5432
batch error (7 link rows lost): No database connection: connect() has not been called…
Over the daemon's lifetime: No database connection ×6,350, Promotion error ×20,064 (background interval loop spamming the dead pool), worker exited code=1 ×547, crashCount reaching 5/5. The errors spray across every cycle phase (lint, sync, extract, embed…), confirming a global pool-state failure, not a single bad phase.
Root cause
- The DB connection is a single mutable module-level pool singleton: let sql (src/core/db.ts:7). getConnection() throws No database connection when it is null (db.ts:151).
- A raw CONNECTION_CLOSED does not null the singleton: the object stays non-null but its underlying socket is dead. And connect() is a blind no-op when sql is already set (db.ts:162), so nothing repairs a dead-but-non-null pool.
- The autopilot child is plain gbrain jobs work, not MinionSupervisor, so it lacks the reconnect health logic in supervisor.ts:589. Its background paths only log:
  - promoteDelayed(): src/core/minions/worker.ts:432
  - stall-detection timer: src/core/minions/worker.ts:242
  - failJob() → engine.transaction() (src/core/minions/queue.ts:855) → this._sql || db.getConnection() (src/core/postgres-engine.ts:764)
  None of these call connect()/reconnect(). When failJob tries to record a failure on the dead pool, it throws again and masks the original job error, taking the worker down.
- Retry coverage is fragmented: retry-matcher.ts (/connection.*closed/i, which would match CONNECTION_CLOSED) is not called by these paths, and connectWithRetry() uses a separate, older list (db.ts:255) that lacks connection.*closed.
A PostgresEngine.reconnect() already exists (postgres-engine.ts:4146) — it just isn't wired into the child-worker's background loops.
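To make the dead-but-non-null state concrete, here is a toy model of the singleton behaviour described above. It is a sketch, not gbrain's code: FakePool, tryQuery, and the body of reconnect() are hypothetical stand-ins, and only the shape of connect()/getConnection() follows what the report says about db.ts.

```typescript
// Toy stand-in for a pool whose socket can be closed by the pooler.
class FakePool {
  socketOpen = true;
  query(text: string): string {
    if (!this.socketOpen) throw new Error("write CONNECTION_CLOSED");
    return `ok: ${text}`;
  }
}

// Module-level singleton, mirroring the `let sql` described in the report.
let sql: FakePool | null = null;

function connect(): void {
  if (sql) return; // blind no-op once the singleton is set, even if it is dead
  sql = new FakePool();
}

function getConnection(): FakePool {
  if (!sql) throw new Error("No database connection: connect() has not been called");
  return sql;
}

// Hypothetical repair path: discard the old handle and build a fresh one.
function reconnect(): void {
  sql = null;
  connect();
}

// Helper for the demo: run a query and return the result or the error text.
function tryQuery(): string {
  try {
    return getConnection().query("SELECT 1");
  } catch (e) {
    return `error: ${(e as Error).message}`;
  }
}

connect();
const before = tryQuery();           // healthy pool
getConnection().socketOpen = false;  // pooler closes the socket; singleton stays non-null
const afterClose = tryQuery();       // fails with CONNECTION_CLOSED
connect();                           // no-op: sql is still set
const afterConnect = tryQuery();     // still fails
reconnect();                         // replaces the dead handle
const afterReconnect = tryQuery();   // healthy again
```

In this model, calling connect() again changes nothing after the close, which is the gap the report attributes to db.ts:162; only a path that replaces the handle recovers.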
Proposed minimal fix
1. worker.ts:432 + worker.ts:242: on isRetryableConnError(e), call engine.reconnect() and retry once for these idempotent queue-maintenance operations.
2. worker.ts:831: wrap queue.failJob() so a retryable DB failure reconnects and retries once; if it still fails, log both the original job error and the failure-recording error rather than masking the original.
3. db.ts:162: connect() should not be a blind no-op when the singleton exists but may be dead; reconnect() should be the standard repair path for worker daemons.
4. Unify connectWithRetry() (db.ts:255) with retry-matcher.ts so the startup matcher and the runtime matcher agree (the startup list currently misses CONNECTION_CLOSED).
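A rough sketch of what the first two items could look like, under stated assumptions: the Reconnectable interface, withReconnect, and recordFailure are hypothetical names, and this isRetryableConnError is a stand-in built on the /connection.*closed/i pattern quoted above, not the real retry-matcher.ts implementation.

```typescript
// Stand-in matcher (assumption): one shared pattern list for startup and runtime.
const RETRYABLE_PATTERNS: RegExp[] = [
  /connection.*closed/i, // matches CONNECTION_CLOSED
];

function isRetryableConnError(e: unknown): boolean {
  const msg = e instanceof Error ? e.message : String(e);
  return RETRYABLE_PATTERNS.some((p) => p.test(msg));
}

// Minimal shape this sketch needs from the engine (hypothetical interface).
interface Reconnectable {
  reconnect(): Promise<void>;
}

// Run op; on a retryable connection error, reconnect and retry exactly once.
// Only suitable for idempotent operations such as queue maintenance.
async function withReconnect<T>(engine: Reconnectable, op: () => Promise<T>): Promise<T> {
  try {
    return await op();
  } catch (e) {
    if (!isRetryableConnError(e)) throw e;
    await engine.reconnect();
    return await op(); // a second failure propagates to the caller
  }
}

// Record a job failure without letting a DB error hide the job's own error.
// Returns true if the failure was recorded, false if recording itself failed.
async function recordFailure(
  engine: Reconnectable,
  failJob: (err: Error) => Promise<void>,
  jobError: Error,
  log: (msg: string) => void = console.error,
): Promise<boolean> {
  try {
    await withReconnect(engine, () => failJob(jobError));
    return true;
  } catch (recordError) {
    const detail = recordError instanceof Error ? recordError.message : String(recordError);
    log(`job failed: ${jobError.message}`);
    log(`could not record failure: ${detail}`);
    return false;
  }
}
```

The retry is deliberately capped at one attempt so a pool that stays dead surfaces quickly instead of adding another spin loop; whether the worker should then exit or back off is left open here.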
Happy to open a PR for (1)+(2) (the highest-leverage pair) if the direction looks right.