fix(autopilot): switch main engine to instance-owned pool so mid-cycle singleton-null doesn't lose rows + break getHealth#1415
Closed
archerhpagent wants to merge 1 commit into
Conversation
The autopilot daemon's main engine connects via gbrain's module-level
singleton path because `connectEngine()` calls `engine.connect(config)`
without `poolSize`. The engine's own `_sql` stays null and `this.sql`
falls back to `db.getConnection()`.
Any mid-cycle code path that nulls the module-level singleton makes every
later engine call through that getter throw "connect() has not been
called". Two production symptoms:
1. Silent row loss during `extract.links_fs`. `addLinksBatch` throws the
error; `extract.ts:614-628` catches it, logs `batch error (N link rows
lost): ...`, and the `finally` block clears the un-inserted rows so the
reported `extracted=N` count looks healthy while N rows actually drop
on the floor.
2. `engine.getHealth()` post-cycle throws the same error. `logError('health',
e)` catches it so the daemon survives, but the adaptive-interval logic
that depends on the health score never runs, and the log fills with
`[health] ERROR` once per cycle indefinitely.
Re-calling `engine.connect({ ...savedConfig, poolSize: 5 })` at autopilot
startup forces `PostgresEngine` onto the instance-owned pool path
(`postgres-engine.ts:127-174`). The engine's `_sql` is set, the getter
returns it directly, and the module singleton is no longer load-bearing.
PGLite engines are unaffected — `poolSize` is ignored on that engine.
Companion to the upstream issue describing this bug class. PR garrytan#465 (open,
no review) fixes a related-but-different bug in the reconnect path; this
PR closes a different symptom of the same singleton-fallback antipattern.
The test pins the load-bearing pieces (kind gate, _savedConfig read, _sql
guard, poolSize:5 cast shape, try/catch wrap, ordering before mode/spawn)
in the same static-shape style as test/autopilot-supervisor-wiring.test.ts.
A full behavioral test needs a Postgres fixture; the static-shape guards
catch refactor regressions cheaply.
Verified locally on v0.41.0.0 against a Postgres-engine brain: every cycle
phase runs end-to-end, no `batch error` lines, no `[health] ERROR` lines,
`engine.getHealth()` returns a real brain_score, `extract.links_fs` writes
real rows instead of losing them.
Files: src/commands/autopilot.ts, test/autopilot-instance-pool.test.ts
Author
|
Closing this — upstream v0.41.x autopilot.ts already eliminates both |
Author
|
Superseded by upstream v0.41.x autopilot.ts rewrites. Confirmed end-to-end clean cycles on v0.41.11.1. |
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The bug
Companion to issue #1413.
On Postgres engine,
gbrain autopilotsilently loses link rows during theextract.links_fsphase, andengine.getHealth()post-cycle throwsNo database connection: connect() has not been calledon every tick.Root cause:
connectEngine()in cli.ts callsengine.connect(config)without apoolSizearg, which routesPostgresEngine.connect()to the module-level singleton branch (postgres-engine.ts:175-189). The engine's own_sqlstays null andthis.sqlfalls back todb.getConnection(). Something in the cycle's call graph nulls that singleton mid-tick, after which every laterthis.sqlaccess throws.Confirmed reproducible on v0.41.0.0 with
gbrain dream --dir <repo> --json. The log shows:The reported
extracted=Ncount in the human log line looks healthy because it reflects intent, not actual rows persisted.extract.ts:614-628'sflush()catches the throw, logs the line, andbatch.length = 0in thefinallyclears the un-inserted rows.Same bug class as the source comment at
src/commands/auth.ts:52— auth-command instance was fixed by havingwithConfiguredSqlcallengine.connect(); this PR closes the autopilot instance.The fix
Re-call
engine.connect({ ...savedConfig, poolSize: 5 })at autopilot startup so the engine moves to instance-owned pool (postgres-engine.ts:127-174). WithpoolSizeset,_sqlis created and the getter returns it directly; the module singleton is no longer load-bearing for this daemon.PGLite engines are unaffected —
poolSizeis ignored on that engine.The switch is gated on
engine.kind === 'postgres'and skipped when_sqlis already non-null (a future caller could wire the engine withpoolSizeupstream of autopilot; the guard avoids a wasted reconnect). The connect call is wrapped in try/catch so a transient pool-init blip logs and continues rather than killing the supervisor.Private-field access goes through a narrow structural cast rather than
as anyso the rest of the block keeps its type coverage. If you'd prefer a public method onBrainEngine(e.g.ensureInstancePool(size: number)), happy to split that into a follow-up — the broader interface change felt out of scope for the first fix.Tests
New
test/autopilot-instance-pool.test.tsfollows the static-shape style oftest/autopilot-supervisor-wiring.test.ts(7 cases): kind gate,_savedConfigread shape,_sql != nullguard,poolSize: 5cast shape, try/catch wrap, source-ordering before mode/spawn, import ofEngineConfig. Full behavioral test would need a Postgres fixture; the static guards catch refactor regressions cheaply.All 94 existing autopilot tests still pass.
Verified locally
Applied to v0.41.0.0 against a Postgres-engine brain. Every cycle phase runs end-to-end (16 phases including extract_facts / consolidate / propose_takes / grade_takes / calibration_profile / orphans / schema_suggest / purge). No
batch errorlines. No[health] ERRORlines.engine.getHealth()returns a realbrain_score.extract.links_fswrites real rows.Relationship to PR #465
PR #465 (open, ~1 month no review) fixes the
(engine as any).connect?.()no-config-arg bug in the autopilot reconnect path. It's a real fix and lands when this lands. But #465 alone leaves THIS bug fully present — the daemon stays alive (good — that's what #465 ships) but every cycle still loses rows andgetHealthstill throws.Independent fixes; can land in either order.