fix(autopilot): use engine.reconnect() so DB health check doesn't wipe saved config#465
fix(autopilot): use engine.reconnect() so DB health check doesn't wipe saved config#465notjbg wants to merge 1 commit into
Conversation
…e saved config
When engine.getConfig('version') throws (transient pool blip, Supavisor
restart, network jitter), the catch block tried to recover with:
await engine.disconnect();
await (engine as any).connect?.();
The connect() call passes NO ARGUMENTS, but PostgresEngine.connect signature
requires a config:
async connect(config: EngineConfig & { poolSize?: number }): Promise<void> {
this._savedConfig = config; // overwrites with undefined
if (config.poolSize) { // throws "undefined is not an object"
So every health-blip recovery attempt:
1. Wipes _savedConfig (corrupting state for future supervisor reconnect)
2. Throws "undefined is not an object (evaluating 'config.poolSize')"
3. Logs [reconnect] ERROR
4. Leaves engine in broken state — next dispatch fails with
"No database connection: connect() has not been called"
5. Cycle marks failed, after 5 in a row autopilot exits with cycle-failure-cap
The pattern dates to v0.10.1 (b7e3005, before reconnect() existed) and was
silent because the catch path only fires on real connection drops. With
v0.22.1's garrytan#406 we now have engine.reconnect() that uses the saved config —
this fix routes the autopilot's recovery through it.
Real-world impact: in my install over 8 days I logged 54 cycle-failure-cap
exits and 2,488 cascading per-page embed errors because every Supavisor
keepalive eviction tripped this code path and left the daemon broken until
launchd respawned it ~10s later. After the fix, transient blips recover
silently in a few hundred ms.
Falls back to engine.disconnect() on engines that don't implement
reconnect() (e.g., PGLite), preserving prior behavior there.
Repro on Postgres:
1. Run gbrain autopilot --repo ~/brain
2. Wait for any transient connection drop (Supavisor pooler restart,
network blip, idle-eviction by the pooler — usually within an hour
on a Supabase free-tier brain).
3. Watch ~/.gbrain/autopilot.log fill with "[reconnect] ERROR: undefined
is not an object (evaluating 'config.poolSize')" every cycle until
cycle-failure-cap hits.
Test plan:
- bun test (no new test added; the fix is a single replacement that
exercises an existing code path. Happy to add a unit test that mocks
engine.getConfig to throw and asserts reconnect() gets called instead
of connect() — let me know if you want it).
dcc3c0a to
94bc019
Compare
|
Friendly bump 🙏 — rebased onto current |
|
Applied locally to v0.41.0.0 — this PR's reconnect fix works as described and keeps the daemon alive past transient blips. Just wanted to flag: this fix alone does NOT close the deeper bug class where mid-cycle code paths null the module-level singleton, causing silent row loss in |
|
Bump with fresh evidence — hit the same bug class today (2026-05-27, gbrain v0.41.18.0), on a DIFFERENT code site than the autopilot main loop this PR fixes. After restoring my daemon (separate auth/cd issue — see #464), the new daemon crashed twice in 90s during The So this PR's |
|
Filed the worker-side follow-up: #1578 — same pattern this PR uses for the autopilot main loop, applied at the worker subprocess startup site ( |
The bug
Autopilot's per-cycle DB health check (autopilot.ts:248-255) tries to recover from a transient connection failure with:
```ts
try {
await engine.getConfig('version');
} catch {
try {
await engine.disconnect();
await (engine as any).connect?.(); // ← no arguments
} catch (e) { logError('reconnect', e); }
}
```
But `PostgresEngine.connect` requires a config:
```ts
async connect(config: EngineConfig & { poolSize?: number }): Promise {
this._savedConfig = config; // overwrites with undefined
if (config.poolSize) { // → "undefined is not an object (evaluating 'config.poolSize')"
```
So every transient health blip:
The buggy pattern dates to v0.10.1 (b7e3005), well before v0.22.1's #406 added `reconnect()`. It was silent because the catch path only fires on real connection drops; on a healthy pool you never see it.
Real-world impact (one operator's data)
In ~8 days on a Supabase free-tier brain:
Every Supavisor pooler keepalive eviction tripped this code and broke the daemon until launchd respawned it.
The fix
Route recovery through `engine.reconnect()` (added in v0.22.1 #406), which uses the saved config instead of dropping it.
```ts
try {
await engine.getConfig('version');
} catch {
try {
if (typeof (engine as any).reconnect === 'function') {
await (engine as any).reconnect();
} else {
await engine.disconnect();
}
} catch (e) { logError('reconnect', e); }
}
```
`reconnect()` already handles the saved-config-missing case (returns early), so this is safe even if `connect()` was never called on this engine. PGLite (which doesn't implement `reconnect`) falls through to a plain `disconnect()`, preserving prior behavior.
Repro
Tests
I didn't add a new test in this PR — the change is a single replacement. Happy to add a unit test that mocks `engine.getConfig` to throw and asserts `reconnect()` (not `connect()`) gets called. Let me know.
Related