fix(autopilot): use engine.reconnect() so DB health check doesn't wipe saved config by notjbg · Pull Request #465 · garrytan/gbrain

notjbg · 2026-04-27T04:16:55Z

The bug

Autopilot's per-cycle DB health check (autopilot.ts:248-255) tries to recover from a transient connection failure with:

```ts
try {
await engine.getConfig('version');
} catch {
try {
await engine.disconnect();
await (engine as any).connect?.(); // ← no arguments
} catch (e) { logError('reconnect', e); }
}
```

But `PostgresEngine.connect` requires a config:

```ts
async connect(config: EngineConfig & { poolSize?: number }): Promise {
this._savedConfig = config; // overwrites with undefined
if (config.poolSize) { // → "undefined is not an object (evaluating 'config.poolSize')"
```

So every transient health blip:

Wipes `_savedConfig` (corrupting state for future supervisor reconnects too).
Throws `undefined is not an object (evaluating 'config.poolSize')`.
Logs `[reconnect] ERROR ...`.
Leaves the engine without a connection. The next dispatch fails with `No database connection: connect() has not been called`.
Cycle marks failed. After 5 in a row: `Autopilot stopping (cycle-failure-cap)`. launchd respawns ~10 s later. Repeat.

The buggy pattern dates to v0.10.1 (b7e3005), well before v0.22.1's #406 added `reconnect()`. It was silent because the catch path only fires on real connection drops; on a healthy pool you never see it.

Real-world impact (one operator's data)

In ~8 days on a Supabase free-tier brain:

54 `Autopilot stopping (cycle-failure-cap)` exits.
~85 cascading `Errors.postgres` parse errors.
~70 `MaxClientsInSessionMode` pool-saturation errors as orphaned backends piled up between launchd respawns.
~2,488 cascading per-page embed failures (separate root cause, but they only kept happening because autopilot kept dying mid-cycle).

Every Supavisor pooler keepalive eviction tripped this code and broke the daemon until launchd respawned it.

The fix

Route recovery through `engine.reconnect()` (added in v0.22.1 #406), which uses the saved config instead of dropping it.

```ts
try {
await engine.getConfig('version');
} catch {
try {
if (typeof (engine as any).reconnect === 'function') {
await (engine as any).reconnect();
} else {
await engine.disconnect();
}
} catch (e) { logError('reconnect', e); }
}
```

`reconnect()` already handles the saved-config-missing case (returns early), so this is safe even if `connect()` was never called on this engine. PGLite (which doesn't implement `reconnect`) falls through to a plain `disconnect()`, preserving prior behavior.

Repro

`gbrain autopilot --repo ~/brain` against a Supabase Postgres brain.
Wait for any transient connection drop — Supavisor pooler restart, network blip, idle-eviction. On a free-tier brain this is usually within an hour.
`tail -f ~/.gbrain/autopilot.log` — you'll see `[reconnect] ERROR: undefined is not an object (evaluating 'config.poolSize')` followed by repeated `[dispatch] ERROR: No database connection` every cycle until `Autopilot stopping (cycle-failure-cap)` fires.

Tests

I didn't add a new test in this PR — the change is a single replacement. Happy to add a unit test that mocks `engine.getConfig` to throw and asserts `reconnect()` (not `connect()`) gets called. Let me know.

The companion PR fix(autopilot): cd into --repo so Bun auto-loads brain/.env #464 fixes a different daemon bug in the same area (`autopilot-run.sh` doesn't `cd` into the brain, so Bun misses `brain/.env`). Together the two fixes ended an 8-day silent-failure mode for me. Independent fixes, can land in either order.

…e saved config When engine.getConfig('version') throws (transient pool blip, Supavisor restart, network jitter), the catch block tried to recover with: await engine.disconnect(); await (engine as any).connect?.(); The connect() call passes NO ARGUMENTS, but PostgresEngine.connect signature requires a config: async connect(config: EngineConfig & { poolSize?: number }): Promise<void> { this._savedConfig = config; // overwrites with undefined if (config.poolSize) { // throws "undefined is not an object" So every health-blip recovery attempt: 1. Wipes _savedConfig (corrupting state for future supervisor reconnect) 2. Throws "undefined is not an object (evaluating 'config.poolSize')" 3. Logs [reconnect] ERROR 4. Leaves engine in broken state — next dispatch fails with "No database connection: connect() has not been called" 5. Cycle marks failed, after 5 in a row autopilot exits with cycle-failure-cap The pattern dates to v0.10.1 (b7e3005, before reconnect() existed) and was silent because the catch path only fires on real connection drops. With v0.22.1's garrytan#406 we now have engine.reconnect() that uses the saved config — this fix routes the autopilot's recovery through it. Real-world impact: in my install over 8 days I logged 54 cycle-failure-cap exits and 2,488 cascading per-page embed errors because every Supavisor keepalive eviction tripped this code path and left the daemon broken until launchd respawned it ~10s later. After the fix, transient blips recover silently in a few hundred ms. Falls back to engine.disconnect() on engines that don't implement reconnect() (e.g., PGLite), preserving prior behavior there. Repro on Postgres: 1. Run gbrain autopilot --repo ~/brain 2. Wait for any transient connection drop (Supavisor pooler restart, network blip, idle-eviction by the pooler — usually within an hour on a Supabase free-tier brain). 3. Watch ~/.gbrain/autopilot.log fill with "[reconnect] ERROR: undefined is not an object (evaluating 'config.poolSize')" every cycle until cycle-failure-cap hits. Test plan: - bun test (no new test added; the fix is a single replacement that exercises an existing code path. Happy to add a unit test that mocks engine.getConfig to throw and asserts reconnect() gets called instead of connect() — let me know if you want it).

notjbg · 2026-04-30T03:25:44Z

Friendly bump 🙏 — rebased onto current master (08746b0), still a clean one-commit diff. This one's been quietly causing autopilot to die under launchd on every Supavisor pooler eviction; would love a review whenever you have a moment. Happy to add a unit test mocking engine.getConfig to throw and asserting reconnect() (not connect()) gets called if that'd help.

archerhpagent · 2026-05-25T07:18:25Z

Applied locally to v0.41.0.0 — this PR's reconnect fix works as described and keeps the daemon alive past transient blips. Just wanted to flag: this fix alone does NOT close the deeper bug class where mid-cycle code paths null the module-level singleton, causing silent row loss in extract.links_fs and broken getHealth() post-cycle. Filed companion issue #1413 + PR #1415 covering that separate class. Independent fixes — can land in either order.

notjbg · 2026-05-27T18:49:50Z

Bump with fresh evidence — hit the same bug class today (2026-05-27, gbrain v0.41.18.0), on a DIFFERENT code site than the autopilot main loop this PR fixes.

After restoring my daemon (separate auth/cd issue — see #464), the new daemon crashed twice in 90s during extract.links_fs. Stack:

[unhandledRejection] GBrainError: No database connection: connect() has not been called.
    at getConnection (src/core/db.ts:153:15)
    at executeRaw (src/core/postgres-engine.ts:4349:18)
    at renewLock (src/core/minions/queue.ts:1024:36)
    at <anonymous> (src/core/minions/worker.ts:619:40)

The worker.ts:619 setInterval lock-keeper calls queue.ts:1024:renewLock → engine.executeRaw → db.ts:getConnection and hits the same null-singleton-throws pattern. The v0.41.11.x autopilot.ts rewrite that @archerhpagent referenced as superseding #1415 did NOT touch this worker subprocess path — the queue worker has its own engine instance and uses the same db.ts module-level singleton.

So this PR's engine.reconnect() swap is still necessary, plus a follow-up needed for the worker-side renewLock/executeRaw paths (or have getConnection attempt one-shot reconnect from _savedConfig before throwing). Happy to open a follow-up PR for the worker fix; flagging here since this PR is the closest upstream anchor for the bug class.

CI's green. Would love to land #464 + #465 together.

notjbg · 2026-05-27T21:03:30Z

Filed the worker-side follow-up: #1578 — same pattern this PR uses for the autopilot main loop, applied at the worker subprocess startup site (case 'work': in jobs.ts, where the queue worker is spawned as bun gbrain jobs work). Reproduces and fixes the trace I posted upthread (worker.ts:619:renewLock). Independent of this PR — both can land in either order.

notjbg mentioned this pull request Apr 27, 2026

feat(doctor): surface daemon silent-failure modes — per-cycle failure rate, cycle-failure-cap count, autopilot.err fingerprints #466

Closed

notjbg force-pushed the fix/autopilot-reconnect-config-undefined branch from dcc3c0a to 94bc019 Compare April 30, 2026 03:25

notjbg mentioned this pull request Apr 30, 2026

gbrain dream sync phase crashes with socket.write null on Supabase pooler at 94% — PR #211 fixes this, please merge #533

Closed

maxpetrusenkoagent mentioned this pull request Jun 9, 2026

fix(minions): reconnect worker after promote connection loss #2025

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(autopilot): use engine.reconnect() so DB health check doesn't wipe saved config#465

fix(autopilot): use engine.reconnect() so DB health check doesn't wipe saved config#465
notjbg wants to merge 1 commit into
garrytan:masterfrom
notjbg:fix/autopilot-reconnect-config-undefined

notjbg commented Apr 27, 2026

Uh oh!

notjbg commented Apr 30, 2026

Uh oh!

archerhpagent commented May 25, 2026

Uh oh!

notjbg commented May 27, 2026

Uh oh!

notjbg commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

notjbg commented Apr 27, 2026

The bug

Real-world impact (one operator's data)

The fix

Repro

Tests

Related

Uh oh!

notjbg commented Apr 30, 2026

Uh oh!

archerhpagent commented May 25, 2026

Uh oh!

notjbg commented May 27, 2026

Uh oh!

notjbg commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants