Skip to content

fix(minions): route lock claim/renewLock through direct session pool#1816

Closed
garrytan-agents wants to merge 1 commit into
garrytan:masterfrom
garrytan-agents:fix/minion-locks-direct-pool
Closed

fix(minions): route lock claim/renewLock through direct session pool#1816
garrytan-agents wants to merge 1 commit into
garrytan:masterfrom
garrytan-agents:fix/minion-locks-direct-pool

Conversation

@garrytan-agents

Copy link
Copy Markdown
Contributor

The wedge, root-caused

The Minion lock hot-path — claim() and renewLock() in queue.ts — ran every UPDATE through engine.executeRaw(), which is hardcoded to the read pool (this.sql). On Supabase that's the transaction-mode pooler (port 6543), which recycles connections per-transaction.

A lock heartbeat is held open for minutes. So the pooler periodically reaps its socket mid-flight → bun's postgres client throws CONNECTION_ENDED → the lock looks expired → the worker force-evicts the job → the claim loop wedges (process alive, no errors, claiming nothing). This showed up as repeated silent worker wedges, especially under enrich fan-out load.

The fix was already half-built

gbrain already ships a dual-pool design: ConnectionManager.ddl()/bulk() route to the direct session-mode pool (port 5432, GBRAIN_DIRECT_DATABASE_URL), which holds a connection for the worker's lifetime. The lock path just never used it.

Verified empirically: a single connection heartbeating every 2s survived 4/4 beats over 8s on 5432, vs. CONNECTION_ENDED storms on 6543.

What this PR does

  • Adds BrainEngine.executeRawDirect() — same contract as executeRaw, but routes to the direct pool when dual-pool is active. No-op delegation on PGLite / non-Supabase / kill-switch (GBRAIN_DISABLE_DIRECT_POOL).
  • Points claim() and renewLock() at it.
  • Single-statement UPDATEs only — same idempotency guarantees, so the Codex feat: GBrain v0.1.0 — Postgres-native personal knowledge brain #1 double-claim guard and the renewLock no-inline-retry contract are preserved.
  • Transaction-scoped engine clones keep using their tx connection (never reroute a statement off an open transaction — that would break atomicity). The lock hot-path doesn't run inside transaction(), so it always reaches the direct pool there.

Testing

  • test/minions.test.ts169/169 pass (claim/renewLock via real PGLiteEngine delegation path)
  • test/queue-lock-retry.test.ts — pass; mock updated to executeRawDirect, plus a new guard that claim never falls back to executeRaw
  • tsc --noEmit — clean
  • Guards green: check-worker-pool-atomicity, check-worker-lock-renewal-shape, check-no-double-retry, check-no-legacy-getconnection

Note

No new infra needed — GBRAIN_DIRECT_DATABASE_URL is already configured in our env. This just makes the lock path use the pool that was built for exactly this workload.

The Minion lock hot-path (claim + renewLock) ran every UPDATE through
engine.executeRaw(), which is hardcoded to the read pool (this.sql) — on
Supabase that's the transaction-mode pooler (port 6543). Transaction-mode
recycles connections per-transaction, so a lock heartbeat held open for
minutes periodically gets its socket reaped mid-flight: bun's postgres
client throws CONNECTION_ENDED, the lock looks expired, the worker
force-evicts the job, and the claim loop wedges (alive, no errors, claiming
nothing). Observed as repeated silent worker wedges under enrich fan-out.

gbrain already ships a dual-pool design (ConnectionManager.ddl()/bulk() →
direct session pool, port 5432 via GBRAIN_DIRECT_DATABASE_URL) but the lock
path never used it. The session pool holds the connection for the worker's
lifetime, so heartbeats survive (verified: 4/4 beats over 8s on 5432 vs
CONNECTION_ENDED storms on 6543).

Fix: add BrainEngine.executeRawDirect() — same contract as executeRaw but
routes to the direct pool when dual-pool is active (no-op delegation on
PGLite / non-Supabase / kill-switch). Point claim() and renewLock() at it.
Single-statement UPDATEs only; same idempotency guarantees, so the Codex garrytan#1
double-claim guard and the renewLock no-retry contract are preserved.
Transaction-scoped engine clones keep using their tx connection (never
reroute a statement off an open transaction).

Tests: queue-lock-retry (updated mock to executeRawDirect, + guard that
claim never falls back to executeRaw), minions.test.ts 169/169 pass via
real PGLiteEngine delegation. typecheck clean. worker-pool-atomicity,
lock-renewal-shape, no-double-retry, no-legacy-getconnection all green.
@garrytan

garrytan commented Jun 3, 2026

Copy link
Copy Markdown
Owner

Superseded: the full change set has been pulled into the garrytan/minion-locks-session-pool worktree (Conductor workspace cape-town-v6), where it will ship as part of that branch's PR. Closing this one to avoid a duplicate. All five files apply cleanly and the verification (typecheck, minions.test.ts, queue-lock-retry, four lock/pool guards) reproduces green there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants