Skip to content

fix(autopilot): use engine.reconnect() so DB health check doesn't wipe saved config#465

Open
notjbg wants to merge 1 commit into
garrytan:masterfrom
notjbg:fix/autopilot-reconnect-config-undefined
Open

fix(autopilot): use engine.reconnect() so DB health check doesn't wipe saved config#465
notjbg wants to merge 1 commit into
garrytan:masterfrom
notjbg:fix/autopilot-reconnect-config-undefined

Conversation

@notjbg

@notjbg notjbg commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

The bug

Autopilot's per-cycle DB health check (autopilot.ts:248-255) tries to recover from a transient connection failure with:

```ts
try {
await engine.getConfig('version');
} catch {
try {
await engine.disconnect();
await (engine as any).connect?.(); // ← no arguments
} catch (e) { logError('reconnect', e); }
}
```

But `PostgresEngine.connect` requires a config:

```ts
async connect(config: EngineConfig & { poolSize?: number }): Promise {
this._savedConfig = config; // overwrites with undefined
if (config.poolSize) { // → "undefined is not an object (evaluating 'config.poolSize')"
```

So every transient health blip:

  1. Wipes `_savedConfig` (corrupting state for future supervisor reconnects too).
  2. Throws `undefined is not an object (evaluating 'config.poolSize')`.
  3. Logs `[reconnect] ERROR ...`.
  4. Leaves the engine without a connection. The next dispatch fails with `No database connection: connect() has not been called`.
  5. Cycle marks failed. After 5 in a row: `Autopilot stopping (cycle-failure-cap)`. launchd respawns ~10 s later. Repeat.

The buggy pattern dates to v0.10.1 (b7e3005), well before v0.22.1's #406 added `reconnect()`. It was silent because the catch path only fires on real connection drops; on a healthy pool you never see it.

Real-world impact (one operator's data)

In ~8 days on a Supabase free-tier brain:

  • 54 `Autopilot stopping (cycle-failure-cap)` exits.
  • ~85 cascading `Errors.postgres` parse errors.
  • ~70 `MaxClientsInSessionMode` pool-saturation errors as orphaned backends piled up between launchd respawns.
  • ~2,488 cascading per-page embed failures (separate root cause, but they only kept happening because autopilot kept dying mid-cycle).

Every Supavisor pooler keepalive eviction tripped this code and broke the daemon until launchd respawned it.

The fix

Route recovery through `engine.reconnect()` (added in v0.22.1 #406), which uses the saved config instead of dropping it.

```ts
try {
await engine.getConfig('version');
} catch {
try {
if (typeof (engine as any).reconnect === 'function') {
await (engine as any).reconnect();
} else {
await engine.disconnect();
}
} catch (e) { logError('reconnect', e); }
}
```

`reconnect()` already handles the saved-config-missing case (returns early), so this is safe even if `connect()` was never called on this engine. PGLite (which doesn't implement `reconnect`) falls through to a plain `disconnect()`, preserving prior behavior.

Repro

  1. `gbrain autopilot --repo ~/brain` against a Supabase Postgres brain.
  2. Wait for any transient connection drop — Supavisor pooler restart, network blip, idle-eviction. On a free-tier brain this is usually within an hour.
  3. `tail -f ~/.gbrain/autopilot.log` — you'll see `[reconnect] ERROR: undefined is not an object (evaluating 'config.poolSize')` followed by repeated `[dispatch] ERROR: No database connection` every cycle until `Autopilot stopping (cycle-failure-cap)` fires.

Tests

I didn't add a new test in this PR — the change is a single replacement. Happy to add a unit test that mocks `engine.getConfig` to throw and asserts `reconnect()` (not `connect()`) gets called. Let me know.

Related

…e saved config

When engine.getConfig('version') throws (transient pool blip, Supavisor
restart, network jitter), the catch block tried to recover with:

    await engine.disconnect();
    await (engine as any).connect?.();

The connect() call passes NO ARGUMENTS, but PostgresEngine.connect signature
requires a config:

    async connect(config: EngineConfig & { poolSize?: number }): Promise<void> {
      this._savedConfig = config;          // overwrites with undefined
      if (config.poolSize) {                // throws "undefined is not an object"

So every health-blip recovery attempt:
1. Wipes _savedConfig (corrupting state for future supervisor reconnect)
2. Throws "undefined is not an object (evaluating 'config.poolSize')"
3. Logs [reconnect] ERROR
4. Leaves engine in broken state — next dispatch fails with
   "No database connection: connect() has not been called"
5. Cycle marks failed, after 5 in a row autopilot exits with cycle-failure-cap

The pattern dates to v0.10.1 (b7e3005, before reconnect() existed) and was
silent because the catch path only fires on real connection drops. With
v0.22.1's garrytan#406 we now have engine.reconnect() that uses the saved config —
this fix routes the autopilot's recovery through it.

Real-world impact: in my install over 8 days I logged 54 cycle-failure-cap
exits and 2,488 cascading per-page embed errors because every Supavisor
keepalive eviction tripped this code path and left the daemon broken until
launchd respawned it ~10s later. After the fix, transient blips recover
silently in a few hundred ms.

Falls back to engine.disconnect() on engines that don't implement
reconnect() (e.g., PGLite), preserving prior behavior there.

Repro on Postgres:
1. Run gbrain autopilot --repo ~/brain
2. Wait for any transient connection drop (Supavisor pooler restart,
   network blip, idle-eviction by the pooler — usually within an hour
   on a Supabase free-tier brain).
3. Watch ~/.gbrain/autopilot.log fill with "[reconnect] ERROR: undefined
   is not an object (evaluating 'config.poolSize')" every cycle until
   cycle-failure-cap hits.

Test plan:
- bun test (no new test added; the fix is a single replacement that
  exercises an existing code path. Happy to add a unit test that mocks
  engine.getConfig to throw and asserts reconnect() gets called instead
  of connect() — let me know if you want it).
@notjbg notjbg force-pushed the fix/autopilot-reconnect-config-undefined branch from dcc3c0a to 94bc019 Compare April 30, 2026 03:25
@notjbg

notjbg commented Apr 30, 2026

Copy link
Copy Markdown
Contributor Author

Friendly bump 🙏 — rebased onto current master (08746b0), still a clean one-commit diff. This one's been quietly causing autopilot to die under launchd on every Supavisor pooler eviction; would love a review whenever you have a moment. Happy to add a unit test mocking engine.getConfig to throw and asserting reconnect() (not connect()) gets called if that'd help.

@archerhpagent

Copy link
Copy Markdown

Applied locally to v0.41.0.0 — this PR's reconnect fix works as described and keeps the daemon alive past transient blips. Just wanted to flag: this fix alone does NOT close the deeper bug class where mid-cycle code paths null the module-level singleton, causing silent row loss in extract.links_fs and broken getHealth() post-cycle. Filed companion issue #1413 + PR #1415 covering that separate class. Independent fixes — can land in either order.

@notjbg

notjbg commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

Bump with fresh evidence — hit the same bug class today (2026-05-27, gbrain v0.41.18.0), on a DIFFERENT code site than the autopilot main loop this PR fixes.

After restoring my daemon (separate auth/cd issue — see #464), the new daemon crashed twice in 90s during extract.links_fs. Stack:

[unhandledRejection] GBrainError: No database connection: connect() has not been called.
    at getConnection (src/core/db.ts:153:15)
    at executeRaw (src/core/postgres-engine.ts:4349:18)
    at renewLock (src/core/minions/queue.ts:1024:36)
    at <anonymous> (src/core/minions/worker.ts:619:40)

The worker.ts:619 setInterval lock-keeper calls queue.ts:1024:renewLockengine.executeRawdb.ts:getConnection and hits the same null-singleton-throws pattern. The v0.41.11.x autopilot.ts rewrite that @archerhpagent referenced as superseding #1415 did NOT touch this worker subprocess path — the queue worker has its own engine instance and uses the same db.ts module-level singleton.

So this PR's engine.reconnect() swap is still necessary, plus a follow-up needed for the worker-side renewLock/executeRaw paths (or have getConnection attempt one-shot reconnect from _savedConfig before throwing). Happy to open a follow-up PR for the worker fix; flagging here since this PR is the closest upstream anchor for the bug class.

CI's green. Would love to land #464 + #465 together.

@notjbg

notjbg commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

Filed the worker-side follow-up: #1578 — same pattern this PR uses for the autopilot main loop, applied at the worker subprocess startup site (case 'work': in jobs.ts, where the queue worker is spawned as bun gbrain jobs work). Reproduces and fixes the trace I posted upthread (worker.ts:619:renewLock). Independent of this PR — both can land in either order.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants