fix(jobs work): switch worker engine to instance-owned pool (Issue #1413 worker-side)#1578
Closed
notjbg wants to merge 1 commit into
Closed
fix(jobs work): switch worker engine to instance-owned pool (Issue #1413 worker-side)#1578notjbg wants to merge 1 commit into
notjbg wants to merge 1 commit into
Conversation
Forces the worker subprocess engine into instance-owned-pool mode at startup so `_sql` is non-null and the module-level singleton fallback at `db.ts:getConnection` stops being load-bearing for every `executeRaw` call site inside the worker. The specific failure: when a transient Supabase pool eviction nulls the module-level `sql` singleton (db.ts:216's connect() catch path), the next `worker.ts:619` lock-keeper setInterval fires `queue.ts:renewLock` -> `engine.executeRaw` -> `db.ts:getConnection` which throws `No database connection: connect() has not been called`. That either unhandled-reject-crashes the worker (autopilot KeepAlive respawns, observed 2x in 90s) or degrades it silently (queue.poll returns nothing, cycle stops making progress while the daemon stays alive, observed 110+ minutes). Confirmed reproducible on v0.41.18.0 against a 42k-page Supabase pooler brain. After this patch the same brain runs cleanly under the same workload conditions. Companion to Issue garrytan#1413 / closed PR garrytan#1415, which fixed the autopilot main loop the same way but did not touch this worker-subprocess code path. The worker is a separate process (\`bun gbrain jobs work --max-rss N\`) with its own engine and its own connection lifecycle; PR garrytan#1415's autopilot.ts startup hook does not reach it. Same pattern as PR garrytan#1415: - Gate on \`engine.kind === 'postgres'\` (PGLite unaffected) - Read \`_savedConfig\` via narrow structural cast (not \`as any\`) - Skip when \`_connectionStyle === 'instance'\` (idempotent on re-run or when caller already wired engine with poolSize upstream) - Wrap in try/catch so a transient pool-init blip logs and continues rather than killing the worker startup No new dependencies. ~14 LOC. Idempotent.
Contributor
Author
|
Closing as obsolete: upstream v0.41.28.0's retry self-heal (reconnect() wired into batchRetry, #1570) supersedes this worker-side patch — verified in production that connection blips now recover at attempt 1/3 with zero row loss, so the instance-pool workaround is no longer needed. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Forces the queue-worker subprocess engine into instance-owned-pool mode at startup, mirroring the pattern @archerhpagent shipped in #1415 for the autopilot main loop, but applied at the separate worker code path (
bun gbrain jobs work --max-rss N).The bug
When a transient Supabase pool eviction nulls the module-level
sqlsingleton (the catch path atsrc/core/db.ts:216setssql = nullon connect failure), the next firing ofsrc/core/minions/worker.ts:619'slockTimersetInterval callsqueue.ts:renewLock→engine.executeRaw→db.ts:getConnection, which throws:That either:
queue.poll()returns nothing, the cycle stops making progress, but the daemon process stays alive andgbrain autopilot --statusreportsinstalled. Observed 110+ minutes of zero ingest progression on the same brain before manual restart.Both states observed on v0.41.18.0 against a 42k-page Supabase pooler brain (Autopilot run,
extract.links_fsphase). Logs at #1413 (comment) (my comment in the same incident thread).Relationship to #1415 and #1413
@archerhpagent's #1415 fixed the same module-singleton-null bug class for the autopilot main loop by forcing
engine.connect({...savedConfig, poolSize: 5})at autopilot startup. That PR was closed 2026-05-26 on the claim that v0.41.11.x autopilot.ts rewrite eliminated the bug — which is true for the autopilot path, but the queue worker subprocess is a separate process with its own engine instance and its ownconnectEngine()call. Per-subprocess engine initialization defaults to the module-singleton fallback (db.connect(config)→ module-levelsqlshared with the worker'sexecuteRawcallers), so the worker inherits the same bug class.This PR closes the bug at the worker startup site. Same pattern, different code path.
The fix
Inserted right before
new MinionWorker(engine, {...})incase 'work':(aroundjobs.ts:814). 14 LOC of executable code + 17 LOC of explanatory comment.Properties:
engine.kind === 'postgres'— PGLite engines unaffected (they ignorepoolSize)._savedConfigvia narrow structural cast (mirrors PR fix(autopilot): switch main engine to instance-owned pool so mid-cycle singleton-null doesn't lose rows + break getHealth #1415 style; noas anyblast radius)._connectionStyle === 'instance'— idempotent on re-run, and a no-op when a future caller wires the engine withpoolSizeupstream ofcase 'work':.try/catchwraps theconnect()call so a transient pool-init blip logs and continues rather than killing the worker startup. The error message names the daemon process the operator's eyes are already on.Verification
Confirmed locally on the same production brain that hit the bug:
extract.links_fs.[jobs work] switched engine to instance-owned pool (size=5)once at startup. Worker survivesextract.links_fs(42k pages) cleanly. New ingest entries land on schedule. (Monitor still running at PR-submit time — will update this PR with the final clean-runtime number when it lands.)Tests
The bug is hard to unit-test deterministically — reproducing it needs real Supabase + PgBouncer transaction-mode + connection pool eviction (or a brittle simulation of the same). I did NOT add a behavioral test for the new code path; the static guard
_connectionStyle !== 'instance' && _savedConfig defined → call engine.connect with poolSizeis exercised by the engine's existingconnect()tests via thepoolSizebranch.Happy to add a static-shape test (mirror of
test/autopilot-supervisor-wiring.test.ts) that asserts the patch wiresengine.connect({..., poolSize: 5})betweenqueue.ensureSchema()andnew MinionWorker(), if that'd help unblock review.Test plan
bun run typecheckon the branch — no errors introduced)🤖 Generated with Claude Code