autopilot worker crash-loops on pooler CONNECTION_CLOSED: background queue loops have no in-process reconnect

**Version:** gbrain 0.41.11.1 (Postgres engine, `gbrain autopilot` continuous daemon)

## TL;DR

When the Postgres connection drops under a long-running `gbrain autopilot` worker (e.g. a Supabase Supavisor session pooler closing the socket), the autopilot-managed child worker (`gbrain jobs work`) never re-establishes the connection in-process. Its background interval loops (`promoteDelayed`, stall detection) and the failure-recording path (`failJob`) just throw `No database connection` repeatedly, the original job error gets masked, and the worker exits with `code=1` and respawns. On a brain that sees periodic pooler closes, this becomes a crash loop.

This is **not** the RSS watchdog — no `[watchdog rss=…]` events occur. The initiating failure is a socket/pooler close.

## Observed (sanitized)

Continuous `gbrain autopilot` daemon, child worker `gbrain jobs work --max-rss 2048`, Supabase Supavisor **session pooler** (`<project>.pooler.supabase.com:5432`).

```
[extract.links_fs] connection blip, retrying 7 rows in 500ms (No database connection: connect() has not been called…)
Promotion error: write CONNECTION_CLOSED <project>.pooler.supabase.com:5432
Stall detection error: write CONNECTION_CLOSED <project>.pooler.supabase.com:5432
  batch error (7 link rows lost): No database connection: connect() has not been called…
```

Over the daemon's lifetime: `No database connection` ×6,350, `Promotion error` ×20,064 (background interval loop spamming the dead pool), `worker exited code=1` ×547, `crashCount` reaching 5/5. The errors spray across *every* cycle phase (lint, sync, extract, embed…), confirming a global pool-state failure, not a single bad phase.

## Root cause

1. The DB connection is a single mutable module-level pool singleton: `let sql` (`src/core/db.ts:7`). `getConnection()` throws `No database connection` when it's null (`db.ts:151`).
2. A raw `CONNECTION_CLOSED` does **not** null the singleton — the object stays non-null but its underlying socket is dead. And `connect()` is a **blind no-op when `sql` is already set** (`db.ts:162`), so nothing repairs a dead-but-non-null pool.
3. The autopilot child is plain `gbrain jobs work` — **not** `MinionSupervisor` — so it lacks the reconnect health logic in `supervisor.ts:589`. Its background paths only log:
   - `promoteDelayed()` — `src/core/minions/worker.ts:432`
   - stall-detection timer — `src/core/minions/worker.ts:242`
   - `failJob()` → `engine.transaction()` (`src/core/minions/queue.ts:855`) → `this._sql || db.getConnection()` (`src/core/postgres-engine.ts:764`)

   None of these call `connect()`/`reconnect()`. `failJob` recording a failure on the dead pool throws again and **masks the original job error**, taking the worker down.
4. Retry coverage is fragmented: `retry-matcher.ts` has a `/connection.*closed/i` pattern that would match `CONNECTION_CLOSED`, but these paths never call it, and `connectWithRetry()` uses a separate, older list (`db.ts:255`) that lacks `connection.*closed`.

A `PostgresEngine.reconnect()` already exists (`postgres-engine.ts:4146`) — it just isn't wired into the child-worker's background loops.
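
To make points 1 and 2 concrete, here is a minimal standalone model of the dead-but-non-null singleton. It is a sketch under assumptions, not the code in `src/core/db.ts`: `FakePool`, `demo()`, and the function bodies are hypothetical stand-ins that only mirror the behaviour described above.

```typescript
// Illustrative stand-in for the real pool object; not part of gbrain.
class FakePool {
  closed = false;

  async query(text: string): Promise<string> {
    // Models a socket that the pooler has closed underneath us.
    if (this.closed) throw new Error("write CONNECTION_CLOSED");
    return `ok: ${text}`;
  }
}

// Single mutable module-level singleton, as in point 1.
let sql: FakePool | null = null;

// Mirrors the behaviour in point 2: returns early whenever the singleton is set.
async function connect(): Promise<void> {
  if (sql) return; // no liveness check, so a dead pool is kept as-is
  sql = new FakePool();
}

function getConnection(): FakePool {
  if (!sql) throw new Error("No database connection: connect() has not been called");
  return sql;
}

// Hypothetical repair path: drop the old pool unconditionally, then rebuild.
async function reconnect(): Promise<void> {
  sql = null;
  await connect();
}

async function demo(): Promise<string[]> {
  const log: string[] = [];
  await connect();
  log.push(await getConnection().query("SELECT 1"));

  getConnection().closed = true; // the pooler closes the socket
  await connect(); // no-op: the singleton is still non-null

  try {
    await getConnection().query("SELECT 1");
  } catch (e) {
    log.push((e as Error).message); // still failing after connect()
  }

  await reconnect(); // only an explicit rebuild repairs the pool
  log.push(await getConnection().query("SELECT 1"));
  return log;
}
```

In this model, calling `connect()` again after the close changes nothing, and only the explicit `reconnect()` restores a working pool, which is the gap the child worker's background loops currently fall into.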

## Proposed minimal fix

1. `worker.ts:432` + `worker.ts:242`: on `isRetryableConnError(e)`, call `engine.reconnect()` and retry once for these idempotent queue-maintenance operations.
2. `worker.ts:831`: wrap `queue.failJob()` so a retryable DB failure reconnects + retries once; if it still fails, log **both** the original job error and the failure-recording error rather than masking the original.
3. `db.ts:162`: `connect()` should not be a blind no-op when the singleton exists but may be dead; `reconnect()` should be the standard repair path for worker daemons.
4. Unify `connectWithRetry()` (`db.ts:255`) with `retry-matcher.ts` so the startup matcher and the runtime matcher agree (the startup list currently misses `CONNECTION_CLOSED`).
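
A rough sketch of what (1) and (2) could look like follows. Only `reconnect()`, `failJob`, `isRetryableConnError`, and the `/connection.*closed/i` pattern come from the text above; the `Reconnectable` interface, `withReconnectOnce`, `recordFailure`, `FailureRecordingError`, and all signatures are hypothetical and would need adapting to the real `worker.ts` and `queue.ts` code.

```typescript
// Hypothetical minimal view of the engine: only reconnect() is needed here.
interface Reconnectable {
  reconnect(): Promise<void>;
}

// Assumed shared pattern list; the real retry-matcher.ts may contain more entries.
const RETRYABLE_PATTERNS: RegExp[] = [/connection.*closed/i];

function isRetryableConnError(e: unknown): boolean {
  const msg = e instanceof Error ? e.message : String(e);
  return RETRYABLE_PATTERNS.some((p) => p.test(msg));
}

// Runs an idempotent operation. On a retryable connection error it
// reconnects and retries exactly once; any other error is rethrown.
async function withReconnectOnce<T>(
  engine: Reconnectable,
  op: () => Promise<T>,
): Promise<T> {
  try {
    return await op();
  } catch (e) {
    if (!isRetryableConnError(e)) throw e;
    await engine.reconnect();
    return await op();
  }
}

// Carries both errors so the original job failure is not masked.
class FailureRecordingError extends Error {
  jobError: Error;
  recordingError: unknown;

  constructor(jobError: Error, recordingError: unknown) {
    super(
      `job failed: ${jobError.message}; ` +
        `recording the failure also failed: ${String(recordingError)}`,
    );
    this.name = "FailureRecordingError";
    this.jobError = jobError;
    this.recordingError = recordingError;
  }
}

// Wraps the failure-recording call: reconnect + retry once, and if that
// still fails, surface the original job error alongside the recording error.
async function recordFailure(
  engine: Reconnectable,
  failJob: (err: Error) => Promise<void>,
  jobError: Error,
): Promise<void> {
  try {
    await withReconnectOnce(engine, () => failJob(jobError));
  } catch (recordingError) {
    throw new FailureRecordingError(jobError, recordingError);
  }
}
```

The single retry is deliberate: the maintenance operations are idempotent, so one repeat is safe, while an unbounded loop would reproduce the log spam seen above. Whether the worker should exit or continue after a `FailureRecordingError` is left open here.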

Happy to open a PR for (1)+(2) (the highest-leverage pair) if the direction looks right.
