fix(autopilot): switch main engine to instance-owned pool so mid-cycle singleton-null doesn't lose rows + break getHealth by archerhpagent · Pull Request #1415 · garrytan/gbrain

archerhpagent · 2026-05-25T07:18:01Z

The bug

Companion to issue #1413.

On Postgres engine, gbrain autopilot silently loses link rows during the extract.links_fs phase, and engine.getHealth() post-cycle throws No database connection: connect() has not been called on every tick.

Root cause: connectEngine() in cli.ts calls engine.connect(config) without a poolSize arg, which routes PostgresEngine.connect() to the module-level singleton branch (postgres-engine.ts:175-189). The engine's own _sql stays null and this.sql falls back to db.getConnection(). Something in the cycle's call graph nulls that singleton mid-tick, after which every later this.sql access throws.

Confirmed reproducible on v0.41.0.0 with gbrain dream --dir <repo> --json. The log shows:

[cycle.extract] start
[extract.links_fs] 3/3 (100%)
  batch error (2 link rows lost): No database connection: connect() has not been called. Fix: Run gbrain init --supabase or gbrain init --url <connection_string>
[extract.links_fs] 3/3 (100%) done

The reported extracted=N count in the human log line looks healthy because it reflects intent, not actual rows persisted. extract.ts:614-628's flush() catches the throw, logs the line, and batch.length = 0 in the finally clears the un-inserted rows.

Same bug class as the source comment at src/commands/auth.ts:52 — auth-command instance was fixed by having withConfiguredSql call engine.connect(); this PR closes the autopilot instance.

The fix

Re-call engine.connect({ ...savedConfig, poolSize: 5 }) at autopilot startup so the engine moves to instance-owned pool (postgres-engine.ts:127-174). With poolSize set, _sql is created and the getter returns it directly; the module singleton is no longer load-bearing for this daemon.

PGLite engines are unaffected — poolSize is ignored on that engine.

The switch is gated on engine.kind === 'postgres' and skipped when _sql is already non-null (a future caller could wire the engine with poolSize upstream of autopilot; the guard avoids a wasted reconnect). The connect call is wrapped in try/catch so a transient pool-init blip logs and continues rather than killing the supervisor.

+  if (engine.kind === 'postgres') {
+    const savedCfg = (engine as unknown as { _savedConfig?: EngineConfig })._savedConfig;
+    const hasInstancePool = (engine as unknown as { _sql?: unknown })._sql != null;
+    if (savedCfg && !hasInstancePool) {
+      try {
+        await engine.connect({ ...savedCfg, poolSize: 5 } as EngineConfig & { poolSize: number });
+      } catch (e) {
+        console.error(`[autopilot] could not switch engine to instance pool: ${(e as Error).message}`);
+      }
+    }
+  }

Private-field access goes through a narrow structural cast rather than as any so the rest of the block keeps its type coverage. If you'd prefer a public method on BrainEngine (e.g. ensureInstancePool(size: number)), happy to split that into a follow-up — the broader interface change felt out of scope for the first fix.

Tests

New test/autopilot-instance-pool.test.ts follows the static-shape style of test/autopilot-supervisor-wiring.test.ts (7 cases): kind gate, _savedConfig read shape, _sql != null guard, poolSize: 5 cast shape, try/catch wrap, source-ordering before mode/spawn, import of EngineConfig. Full behavioral test would need a Postgres fixture; the static guards catch refactor regressions cheaply.

All 94 existing autopilot tests still pass.

Verified locally

Applied to v0.41.0.0 against a Postgres-engine brain. Every cycle phase runs end-to-end (16 phases including extract_facts / consolidate / propose_takes / grade_takes / calibration_profile / orphans / schema_suggest / purge). No batch error lines. No [health] ERROR lines. engine.getHealth() returns a real brain_score. extract.links_fs writes real rows.

Relationship to PR #465

PR #465 (open, ~1 month no review) fixes the (engine as any).connect?.() no-config-arg bug in the autopilot reconnect path. It's a real fix and lands when this lands. But #465 alone leaves THIS bug fully present — the daemon stays alive (good — that's what #465 ships) but every cycle still loses rows and getHealth still throws.

Independent fixes; can land in either order.

The autopilot daemon's main engine connects via gbrain's module-level singleton path because `connectEngine()` calls `engine.connect(config)` without `poolSize`. The engine's own `_sql` stays null and `this.sql` falls back to `db.getConnection()`. Any mid-cycle code path that nulls the module-level singleton makes every later engine call through that getter throw "connect() has not been called". Two production symptoms: 1. Silent row loss during `extract.links_fs`. `addLinksBatch` throws the error; `extract.ts:614-628` catches it, logs `batch error (N link rows lost): ...`, and the `finally` block clears the un-inserted rows so the reported `extracted=N` count looks healthy while N rows actually drop on the floor. 2. `engine.getHealth()` post-cycle throws the same error. `logError('health', e)` catches it so the daemon survives, but the adaptive-interval logic that depends on the health score never runs, and the log fills with `[health] ERROR` once per cycle indefinitely. Re-calling `engine.connect({ ...savedConfig, poolSize: 5 })` at autopilot startup forces `PostgresEngine` onto the instance-owned pool path (`postgres-engine.ts:127-174`). The engine's `_sql` is set, the getter returns it directly, and the module singleton is no longer load-bearing. PGLite engines are unaffected — `poolSize` is ignored on that engine. Companion to the upstream issue describing this bug class. PR garrytan#465 (open, no review) fixes a related-but-different bug in the reconnect path; this PR closes a different symptom of the same singleton-fallback antipattern. The test pins the load-bearing pieces (kind gate, _savedConfig read, _sql guard, poolSize:5 cast shape, try/catch wrap, ordering before mode/spawn) in the same static-shape style as test/autopilot-supervisor-wiring.test.ts. A full behavioral test needs a Postgres fixture; the static-shape guards catch refactor regressions cheaply. Verified locally on v0.41.0.0 against a Postgres-engine brain: every cycle phase runs end-to-end, no `batch error` lines, no `[health] ERROR` lines, `engine.getHealth()` returns a real brain_score, `extract.links_fs` writes real rows instead of losing them. Files: src/commands/autopilot.ts, test/autopilot-instance-pool.test.ts

archerhpagent · 2026-05-26T02:02:55Z

Closing this — upstream v0.41.x autopilot.ts already eliminates both (engine as any).connect?.() and the module-singleton-fallback issue this PR targeted. Upgraded my archer install from v0.41.0.0 → v0.41.11.1 today and confirmed: no batch-error rows, no [health] ERROR lines, cycles run cleanly. The narrow autopilot-only fix this PR proposed is no longer needed. Thanks for the cleanup wave.

archerhpagent · 2026-05-26T02:02:58Z

Superseded by upstream v0.41.x autopilot.ts rewrites. Confirmed end-to-end clean cycles on v0.41.11.1.

This was referenced May 25, 2026

autopilot: silent link-row loss + getHealth() breakage from module-singleton null mid-cycle (separate class from #1162 / #465) #1413

Open

fix(autopilot): use engine.reconnect() so DB health check doesn't wipe saved config #465

Open

zaycruz mentioned this pull request May 25, 2026

fix(postgres): prevent transient module engines from disconnecting a shared singleton #1425

Closed

archerhpagent closed this May 26, 2026

seaneoliver mentioned this pull request May 26, 2026

gbrain dream — all DB-dependent phases fail with "connect() has not been called" #1404

Open

notjbg mentioned this pull request May 27, 2026

fix(jobs work): switch worker engine to instance-owned pool (Issue #1413 worker-side) #1578

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(autopilot): switch main engine to instance-owned pool so mid-cycle singleton-null doesn't lose rows + break getHealth#1415

fix(autopilot): switch main engine to instance-owned pool so mid-cycle singleton-null doesn't lose rows + break getHealth#1415
archerhpagent wants to merge 1 commit into
garrytan:masterfrom
archerhpagent:fix/autopilot-instance-pool

archerhpagent commented May 25, 2026

Uh oh!

archerhpagent commented May 26, 2026

Uh oh!

archerhpagent commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

archerhpagent commented May 25, 2026

The bug

The fix

Tests

Verified locally

Relationship to PR #465

Uh oh!

archerhpagent commented May 26, 2026

Uh oh!

archerhpagent commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants