fix(minions): route lock claim/renewLock through direct session pool#1816
Closed
garrytan-agents wants to merge 1 commit into
Closed
fix(minions): route lock claim/renewLock through direct session pool#1816garrytan-agents wants to merge 1 commit into
garrytan-agents wants to merge 1 commit into
Conversation
The Minion lock hot-path (claim + renewLock) ran every UPDATE through engine.executeRaw(), which is hardcoded to the read pool (this.sql) — on Supabase that's the transaction-mode pooler (port 6543). Transaction-mode recycles connections per-transaction, so a lock heartbeat held open for minutes periodically gets its socket reaped mid-flight: bun's postgres client throws CONNECTION_ENDED, the lock looks expired, the worker force-evicts the job, and the claim loop wedges (alive, no errors, claiming nothing). Observed as repeated silent worker wedges under enrich fan-out. gbrain already ships a dual-pool design (ConnectionManager.ddl()/bulk() → direct session pool, port 5432 via GBRAIN_DIRECT_DATABASE_URL) but the lock path never used it. The session pool holds the connection for the worker's lifetime, so heartbeats survive (verified: 4/4 beats over 8s on 5432 vs CONNECTION_ENDED storms on 6543). Fix: add BrainEngine.executeRawDirect() — same contract as executeRaw but routes to the direct pool when dual-pool is active (no-op delegation on PGLite / non-Supabase / kill-switch). Point claim() and renewLock() at it. Single-statement UPDATEs only; same idempotency guarantees, so the Codex garrytan#1 double-claim guard and the renewLock no-retry contract are preserved. Transaction-scoped engine clones keep using their tx connection (never reroute a statement off an open transaction). Tests: queue-lock-retry (updated mock to executeRawDirect, + guard that claim never falls back to executeRaw), minions.test.ts 169/169 pass via real PGLiteEngine delegation. typecheck clean. worker-pool-atomicity, lock-renewal-shape, no-double-retry, no-legacy-getconnection all green.
Owner
|
Superseded: the full change set has been pulled into the |
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The wedge, root-caused
The Minion lock hot-path —
claim()andrenewLock()inqueue.ts— ran every UPDATE throughengine.executeRaw(), which is hardcoded to the read pool (this.sql). On Supabase that's the transaction-mode pooler (port 6543), which recycles connections per-transaction.A lock heartbeat is held open for minutes. So the pooler periodically reaps its socket mid-flight → bun's postgres client throws
CONNECTION_ENDED→ the lock looks expired → the worker force-evicts the job → the claim loop wedges (process alive, no errors, claiming nothing). This showed up as repeated silent worker wedges, especially underenrichfan-out load.The fix was already half-built
gbrain already ships a dual-pool design:
ConnectionManager.ddl()/bulk()route to the direct session-mode pool (port 5432,GBRAIN_DIRECT_DATABASE_URL), which holds a connection for the worker's lifetime. The lock path just never used it.Verified empirically: a single connection heartbeating every 2s survived 4/4 beats over 8s on 5432, vs.
CONNECTION_ENDEDstorms on 6543.What this PR does
BrainEngine.executeRawDirect()— same contract asexecuteRaw, but routes to the direct pool when dual-pool is active. No-op delegation on PGLite / non-Supabase / kill-switch (GBRAIN_DISABLE_DIRECT_POOL).claim()andrenewLock()at it.transaction(), so it always reaches the direct pool there.Testing
test/minions.test.ts— 169/169 pass (claim/renewLock via realPGLiteEnginedelegation path)test/queue-lock-retry.test.ts— pass; mock updated toexecuteRawDirect, plus a new guard thatclaimnever falls back toexecuteRawtsc --noEmit— cleancheck-worker-pool-atomicity,check-worker-lock-renewal-shape,check-no-double-retry,check-no-legacy-getconnectionNote
No new infra needed —
GBRAIN_DIRECT_DATABASE_URLis already configured in our env. This just makes the lock path use the pool that was built for exactly this workload.