Skip to content

fix(autopilot): switch main engine to instance-owned pool so mid-cycle singleton-null doesn't lose rows + break getHealth#1415

Closed
archerhpagent wants to merge 1 commit into
garrytan:masterfrom
archerhpagent:fix/autopilot-instance-pool
Closed

fix(autopilot): switch main engine to instance-owned pool so mid-cycle singleton-null doesn't lose rows + break getHealth#1415
archerhpagent wants to merge 1 commit into
garrytan:masterfrom
archerhpagent:fix/autopilot-instance-pool

Conversation

@archerhpagent

Copy link
Copy Markdown

The bug

Companion to issue #1413.

On Postgres engine, gbrain autopilot silently loses link rows during the extract.links_fs phase, and engine.getHealth() post-cycle throws No database connection: connect() has not been called on every tick.

Root cause: connectEngine() in cli.ts calls engine.connect(config) without a poolSize arg, which routes PostgresEngine.connect() to the module-level singleton branch (postgres-engine.ts:175-189). The engine's own _sql stays null and this.sql falls back to db.getConnection(). Something in the cycle's call graph nulls that singleton mid-tick, after which every later this.sql access throws.

Confirmed reproducible on v0.41.0.0 with gbrain dream --dir <repo> --json. The log shows:

[cycle.extract] start
[extract.links_fs] 3/3 (100%)
  batch error (2 link rows lost): No database connection: connect() has not been called. Fix: Run gbrain init --supabase or gbrain init --url <connection_string>
[extract.links_fs] 3/3 (100%) done

The reported extracted=N count in the human log line looks healthy because it reflects intent, not actual rows persisted. extract.ts:614-628's flush() catches the throw, logs the line, and batch.length = 0 in the finally clears the un-inserted rows.

Same bug class as the source comment at src/commands/auth.ts:52 — auth-command instance was fixed by having withConfiguredSql call engine.connect(); this PR closes the autopilot instance.

The fix

Re-call engine.connect({ ...savedConfig, poolSize: 5 }) at autopilot startup so the engine moves to instance-owned pool (postgres-engine.ts:127-174). With poolSize set, _sql is created and the getter returns it directly; the module singleton is no longer load-bearing for this daemon.

PGLite engines are unaffected — poolSize is ignored on that engine.

The switch is gated on engine.kind === 'postgres' and skipped when _sql is already non-null (a future caller could wire the engine with poolSize upstream of autopilot; the guard avoids a wasted reconnect). The connect call is wrapped in try/catch so a transient pool-init blip logs and continues rather than killing the supervisor.

+  if (engine.kind === 'postgres') {
+    const savedCfg = (engine as unknown as { _savedConfig?: EngineConfig })._savedConfig;
+    const hasInstancePool = (engine as unknown as { _sql?: unknown })._sql != null;
+    if (savedCfg && !hasInstancePool) {
+      try {
+        await engine.connect({ ...savedCfg, poolSize: 5 } as EngineConfig & { poolSize: number });
+      } catch (e) {
+        console.error(`[autopilot] could not switch engine to instance pool: ${(e as Error).message}`);
+      }
+    }
+  }

Private-field access goes through a narrow structural cast rather than as any so the rest of the block keeps its type coverage. If you'd prefer a public method on BrainEngine (e.g. ensureInstancePool(size: number)), happy to split that into a follow-up — the broader interface change felt out of scope for the first fix.

Tests

New test/autopilot-instance-pool.test.ts follows the static-shape style of test/autopilot-supervisor-wiring.test.ts (7 cases): kind gate, _savedConfig read shape, _sql != null guard, poolSize: 5 cast shape, try/catch wrap, source-ordering before mode/spawn, import of EngineConfig. Full behavioral test would need a Postgres fixture; the static guards catch refactor regressions cheaply.

All 94 existing autopilot tests still pass.

Verified locally

Applied to v0.41.0.0 against a Postgres-engine brain. Every cycle phase runs end-to-end (16 phases including extract_facts / consolidate / propose_takes / grade_takes / calibration_profile / orphans / schema_suggest / purge). No batch error lines. No [health] ERROR lines. engine.getHealth() returns a real brain_score. extract.links_fs writes real rows.

Relationship to PR #465

PR #465 (open, ~1 month no review) fixes the (engine as any).connect?.() no-config-arg bug in the autopilot reconnect path. It's a real fix and lands when this lands. But #465 alone leaves THIS bug fully present — the daemon stays alive (good — that's what #465 ships) but every cycle still loses rows and getHealth still throws.

Independent fixes; can land in either order.

The autopilot daemon's main engine connects via gbrain's module-level
singleton path because `connectEngine()` calls `engine.connect(config)`
without `poolSize`. The engine's own `_sql` stays null and `this.sql`
falls back to `db.getConnection()`.

Any mid-cycle code path that nulls the module-level singleton makes every
later engine call through that getter throw "connect() has not been
called". Two production symptoms:

1. Silent row loss during `extract.links_fs`. `addLinksBatch` throws the
   error; `extract.ts:614-628` catches it, logs `batch error (N link rows
   lost): ...`, and the `finally` block clears the un-inserted rows so the
   reported `extracted=N` count looks healthy while N rows actually drop
   on the floor.

2. `engine.getHealth()` post-cycle throws the same error. `logError('health',
   e)` catches it so the daemon survives, but the adaptive-interval logic
   that depends on the health score never runs, and the log fills with
   `[health] ERROR` once per cycle indefinitely.

Re-calling `engine.connect({ ...savedConfig, poolSize: 5 })` at autopilot
startup forces `PostgresEngine` onto the instance-owned pool path
(`postgres-engine.ts:127-174`). The engine's `_sql` is set, the getter
returns it directly, and the module singleton is no longer load-bearing.
PGLite engines are unaffected — `poolSize` is ignored on that engine.

Companion to the upstream issue describing this bug class. PR garrytan#465 (open,
no review) fixes a related-but-different bug in the reconnect path; this
PR closes a different symptom of the same singleton-fallback antipattern.

The test pins the load-bearing pieces (kind gate, _savedConfig read, _sql
guard, poolSize:5 cast shape, try/catch wrap, ordering before mode/spawn)
in the same static-shape style as test/autopilot-supervisor-wiring.test.ts.
A full behavioral test needs a Postgres fixture; the static-shape guards
catch refactor regressions cheaply.

Verified locally on v0.41.0.0 against a Postgres-engine brain: every cycle
phase runs end-to-end, no `batch error` lines, no `[health] ERROR` lines,
`engine.getHealth()` returns a real brain_score, `extract.links_fs` writes
real rows instead of losing them.

Files: src/commands/autopilot.ts, test/autopilot-instance-pool.test.ts
@archerhpagent

Copy link
Copy Markdown
Author

Closing this — upstream v0.41.x autopilot.ts already eliminates both (engine as any).connect?.() and the module-singleton-fallback issue this PR targeted. Upgraded my archer install from v0.41.0.0 → v0.41.11.1 today and confirmed: no batch-error rows, no [health] ERROR lines, cycles run cleanly. The narrow autopilot-only fix this PR proposed is no longer needed. Thanks for the cleanup wave.

@archerhpagent

Copy link
Copy Markdown
Author

Superseded by upstream v0.41.x autopilot.ts rewrites. Confirmed end-to-end clean cycles on v0.41.11.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants