autopilot worker crash-loops on pooler CONNECTION_CLOSED: background queue loops have no in-process reconnect #1720

@mdcruz88

Version: gbrain 0.41.11.1 (Postgres engine, gbrain autopilot continuous daemon)

TL;DR

When the Postgres connection drops under a long-running gbrain autopilot worker (e.g. a Supabase Supavisor session-pooler closing the socket), the autopilot-managed child worker (gbrain jobs work) never re-establishes the connection in-process. Its background interval loops (promoteDelayed, stall-detection) and the failure-recording path (failJob) just throw No database connection repeatedly, the original job error gets masked, and the worker exits code=1 and respawns. On a brain that sees periodic pooler closes this becomes a crash-loop.

This is not the RSS watchdog — no [watchdog rss=…] events occur. The initiating failure is a socket/pooler close.

Observed (sanitized)

Continuous gbrain autopilot daemon, child worker gbrain jobs work --max-rss 2048, Supabase Supavisor session pooler (<project>.pooler.supabase.com:5432).

[extract.links_fs] connection blip, retrying 7 rows in 500ms (No database connection: connect() has not been called…)
Promotion error: write CONNECTION_CLOSED <project>.pooler.supabase.com:5432
Stall detection error: write CONNECTION_CLOSED <project>.pooler.supabase.com:5432
  batch error (7 link rows lost): No database connection: connect() has not been called…

Over the daemon's lifetime: No database connection ×6,350, Promotion error ×20,064 (background interval loop spamming the dead pool), worker exited code=1 ×547, crashCount reaching 5/5. The errors spray across every cycle phase (lint, sync, extract, embed…), confirming a global pool-state failure, not a single bad phase.

Root cause

  1. The DB connection is a single mutable module-level pool singleton: let sql (src/core/db.ts:7). getConnection() throws No database connection when it's null (db.ts:151).

  2. A raw CONNECTION_CLOSED does not null the singleton — the object stays non-null but its underlying socket is dead. And connect() is a blind no-op when sql is already set (db.ts:162), so nothing repairs a dead-but-non-null pool.

  3. The autopilot child is plain gbrain jobs work, not MinionSupervisor, so it lacks the reconnect health logic in supervisor.ts:589. Its background paths only log:

    • promoteDelayed(): src/core/minions/worker.ts:432
    • stall-detection timer: src/core/minions/worker.ts:242
    • failJob() → engine.transaction() (src/core/minions/queue.ts:855) → this._sql || db.getConnection() (src/core/postgres-engine.ts:764)

    None of these call connect()/reconnect(). failJob recording a failure on the dead pool throws again and masks the original job error, taking the worker down.

  4. Retry coverage is fragmented: retry-matcher.ts includes /connection.*closed/i, which would match CONNECTION_CLOSED, but none of these paths call it; and connectWithRetry() uses a separate, older list (db.ts:255) that lacks connection.*closed.

A PostgresEngine.reconnect() already exists (postgres-engine.ts:4146) — it just isn't wired into the child-worker's background loops.
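
To make item 2 concrete, here is a small in-memory model of the dead-but-non-null state. It is a sketch only, not gbrain's code: FakePool, dropSocket, isHealthy and connectOrRepair are hypothetical names, and the real db.ts almost certainly differs in detail. Only the general shape (a module-level let sql, a connect() that returns early when sql is set, a getConnection() that throws on null) is taken from the description above.

```typescript
// Hypothetical in-memory model of a module-level pool singleton.
// Names are illustrative and not taken from gbrain's source.
class FakePool {
  private socketOpen = true;
  constructor(readonly id: number) {}

  // Simulates the pooler closing the socket underneath the worker.
  dropSocket(): void {
    this.socketOpen = false;
  }

  // Stand-in for issuing a query; throws once the socket is gone.
  query(text: string): string {
    if (!this.socketOpen) throw new Error("write CONNECTION_CLOSED");
    return `ok: ${text}`;
  }

  isHealthy(): boolean {
    return this.socketOpen;
  }
}

let sql: FakePool | null = null;
let nextId = 1;

// Models the behaviour described in item 2: a no-op once `sql` is set,
// even if the pool behind it is dead.
function connect(): void {
  if (sql) return;
  sql = new FakePool(nextId++);
}

function getConnection(): FakePool {
  if (!sql) throw new Error("No database connection: connect() has not been called");
  return sql;
}

// One possible repair (an assumption, not the project's API):
// replace the singleton when it is set but unhealthy.
function connectOrRepair(): void {
  if (sql && sql.isHealthy()) return;
  sql = new FakePool(nextId++);
}
```

In this model, calling connect() after dropSocket() leaves the same dead pool in place, so every later query() throws, while connectOrRepair() swaps in a fresh pool. How a real pool would report health (a ping query, an error counter, a driver event) is left open here.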

Proposed minimal fix

  1. worker.ts:432 + worker.ts:242: on isRetryableConnError(e), call engine.reconnect() and retry once for these idempotent queue-maintenance operations.
  2. worker.ts:831: wrap queue.failJob() so a retryable DB failure reconnects + retries once; if it still fails, log both the original job error and the failure-recording error rather than masking the original.
  3. db.ts:162: connect() should not be a blind no-op when the singleton exists but may be dead; reconnect() should be the standard repair path for worker daemons.
  4. Unify connectWithRetry() (db.ts:255) with retry-matcher.ts so the startup matcher and the runtime matcher agree (the startup list currently misses CONNECTION_CLOSED).
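
A rough sketch of what (1) and (2) could look like, so the direction is easier to review. Everything here is an assumption about shape rather than the actual gbrain code: the Reconnectable interface, withReconnectOnce and recordFailure are hypothetical names, reconnect() is assumed to return a promise, and the failJob callback is a simplified stand-in for queue.failJob(). The matcher holds only the pattern quoted from retry-matcher.ts.

```typescript
// Hypothetical sketch; the real engine/queue types in gbrain will differ.
interface Reconnectable {
  reconnect(): Promise<void>; // assumed async; not confirmed by the source
}

// One shared list, so startup and runtime paths agree on what is retryable.
const RETRYABLE_PATTERNS: RegExp[] = [/connection.*closed/i];

function isRetryableConnError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return RETRYABLE_PATTERNS.some((re) => re.test(message));
}

// Run `op`; on a retryable connection error, reconnect and try exactly once more.
// Only suitable for idempotent operations such as queue maintenance.
async function withReconnectOnce<T>(
  engine: Reconnectable,
  op: () => Promise<T>,
): Promise<T> {
  try {
    return await op();
  } catch (err) {
    if (!isRetryableConnError(err)) throw err;
    await engine.reconnect();
    return op();
  }
}

// Record a job failure without letting a recording error hide the job error.
// Returns true if the failure was recorded, false if recording itself failed.
async function recordFailure(
  engine: Reconnectable,
  jobError: Error,
  failJob: (reason: string) => Promise<void>,
  log: (line: string) => void = console.error,
): Promise<boolean> {
  try {
    await withReconnectOnce(engine, () => failJob(jobError.message));
    return true;
  } catch (recordingError) {
    // Log both errors so the original job error stays visible.
    log(`job failed: ${jobError.message}`);
    log(`could not record failure: ${String(recordingError)}`);
    return false;
  }
}
```

The retry is deliberately capped at one attempt: if the reconnect does not help, the error propagates (or is logged with the original job error) instead of looping against a dead pool. Whether the worker should keep running or exit after recordFailure returns false is a separate decision that this sketch does not take.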

Happy to open a PR for (1)+(2) (the highest-leverage pair) if the direction looks right.
