Skip to content

v0.42.24.0 fix(minions): route lock claim/renewLock through direct session pool#1822

Merged
garrytan merged 7 commits into
masterfrom
garrytan/minion-locks-session-pool
Jun 4, 2026
Merged

v0.42.24.0 fix(minions): route lock claim/renewLock through direct session pool#1822
garrytan merged 7 commits into
masterfrom
garrytan/minion-locks-session-pool

Conversation

@garrytan

@garrytan garrytan commented Jun 3, 2026

Copy link
Copy Markdown
Owner

Summary

Fix (fix(minions): route lock claim/renewLock through direct session pool): The Minion lock heartbeat (claim + renewLock in src/core/minions/queue.ts) ran every UPDATE through engine.executeRaw(), hardcoded to the read pool. On Supabase that is the transaction-mode pooler (6543), which recycles connections per transaction. A lock is held for minutes, so the pooler periodically reaps the socket mid-heartbeat → CONNECTION_ENDED → the lock looks expired → the worker force-evicts its own job and the claim loop wedges silently (process alive, no errors, claiming nothing). Showed up most under enrich fan-out.

The fix adds BrainEngine.executeRawDirect() — same contract as executeRaw, but routes to the direct session-mode pool (5432, GBRAIN_DIRECT_DATABASE_URL) when dual-pool is active, holding the connection open for the worker's life so heartbeats survive. No-op delegation on PGLite / non-Supabase / kill-switch (GBRAIN_DISABLE_DIRECT_POOL). claim/renewLock now use it. Single-statement UPDATEs only, so the double-claim guard and the renewLock no-inline-retry contract are preserved. Statements inside an open transaction keep their tx connection (in-transaction guard keys on peekReadPool() !== _sql); the lock hot-path never runs inside transaction(). The Postgres impl shares its cancellation plumbing with executeRaw via a private runUnsafe helper.

No new infrastructure — GBRAIN_DIRECT_DATABASE_URL was already configured; this just makes the lock path use the pool built for exactly this workload. Supersedes PR #1816 (closed).

Test Coverage

New test/postgres-execute-raw-direct.test.ts covers the Postgres routing decision (dual-pool on/off × in-tx/not, plus abort short-circuit) with stubbed connections — runs in CI without a live Postgres, the branch the fix exists to exercise. test/queue-lock-retry.test.ts gains a guard that claim can never fall back to executeRaw.

  • bun run typecheck — clean repo-wide on the merged state
  • Targeted suite — 200 pass / 0 fail (minions, queue-lock-retry, postgres-execute-raw-direct, worker-lock-renewal, postgres-engine-getter-selfheal, postgres-reconnect-singleton)
  • 5 project guards green: check-worker-pool-atomicity, check-worker-lock-renewal-shape, check-no-double-retry, check-no-legacy-getconnection, check-jsonb-pattern
  • Verified the inTransaction guard survives v0.42.20.0's reconnect-race rework: the instance reconnect path re-runs setReadPool(this._sql), so peekReadPool() === _sql still holds for a live worker.

Pre-Landing Review

/plan-eng-review cleared the change this session. Three findings applied before ship: DRY refactor (shared runUnsafe helper), the routing unit test above, and a P3 follow-up TODO for direct-pool sizing under enrich fan-out.

Design Review

No frontend files changed — design review skipped.

Eval Results

No prompt-related files changed — evals skipped.

Plan Completion

No plan file detected (diff-scoped change reviewed via /plan-eng-review).

TODOS

  • Filed P3 — "Size the direct session pool for enrich fan-out" (TODOS.md, Minion-lock direct-pool follow-up section).

Documentation

  • docs/architecture/KEY_FILES.md — documented the new BrainEngine.executeRawDirect() contract method (lock-hot-path sibling of executeRaw; routes to the direct session-mode pool when dual-pool is active, PGLite delegates straight to executeRaw) on the src/core/engine.ts entry; noted on the src/core/minions/queue.ts entry that claim/renewLock now issue their UPDATE via executeRawDirect.

Documentation Debt (pre-existing, not introduced here)

  • docs/ENGINES.md lists only the transaction pooler (6543) and never documents the dual-pool / direct session-mode (5432) design. Candidate for a follow-up /document-generate pass.

Test plan

  • typecheck clean repo-wide
  • Targeted lock/engine suite: 200 pass / 0 fail
  • 5 lock/pool/jsonb guards green
  • CI runs the Postgres e2e suite (engine-parity, reconnect-singleton) which needs a live Postgres service unavailable locally

🤖 Generated with Claude Code

garrytan and others added 7 commits June 3, 2026 08:45
The Minion lock heartbeat (claim + renewLock) ran every UPDATE through
engine.executeRaw(), which is hardcoded to the read pool. On Supabase that
is the transaction-mode pooler (6543), which recycles connections per
transaction. A lock is held for minutes, so the pooler periodically reaps
the socket mid-heartbeat -> CONNECTION_ENDED -> the lock looks expired ->
the worker force-evicts its own job and the claim loop wedges silently.

Add BrainEngine.executeRawDirect(): same contract as executeRaw, but routes
to the direct session-mode pool (5432, GBRAIN_DIRECT_DATABASE_URL) when
dual-pool is active. No-op delegation on PGLite / non-Supabase / kill-switch.
claim/renewLock now use it. Single-statement UPDATEs only, so the double-claim
guard and the renewLock no-inline-retry contract are preserved. Statements
inside an open transaction keep their tx connection (in-transaction guard keys
on peekReadPool() !== _sql); the lock hot-path never runs inside transaction().

The Postgres impl shares its cancellation plumbing with executeRaw via a
private runUnsafe helper. New test/postgres-execute-raw-direct.test.ts covers
the routing decision (dual-pool on/off x in-tx/not + abort short-circuit)
without a live Postgres; queue-lock-retry.test.ts gains a guard that claim
can never fall back to executeRaw.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document executeRawDirect on the BrainEngine contract and the
claim/renewLock direct-session-pool routing in KEY_FILES.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The DRY refactor in this PR extracted executeRaw/executeRawDirect's shared
cancellation plumbing into a private runUnsafe(conn, ...) helper, so the single
conn.unsafe() call moved out of executeRaw's body. The D3 guard read
executeRaw's source and asserted conn.unsafe( appeared exactly once there,
which now fails (it's zero — executeRaw delegates).

The D3 invariant (no per-call retry wrapper) is unchanged; it just spans the
delegate now. Update the guard to check both public methods delegate to
runUnsafe without reconnect/retry, and assert the exactly-once conn.unsafe +
cancel-only catch in runUnsafe. Also extends coverage to executeRawDirect so
the lock hot-path can't reintroduce a retry wrapper either.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve VERSION/package.json (keep 0.42.24.0, higher than master's 0.42.21.0),
CHANGELOG (both entries kept, 0.42.24.0 on top), and the postgres-engine.ts /
connection-resilience.test.ts auto-merges. Trio audit green; typecheck clean;
lock-path + D3-guard + routing tests pass on the merged state.
Resolve VERSION/package.json (keep 0.42.24.0 > master's 0.42.22.0), CHANGELOG
(both entries, 0.42.24.0 on top), and KEY_FILES.md (keep my queue.ts entry with
executeRawDirect; take master's newer worker/supervisor/child-worker-supervisor
entries from the #1801 progress-watchdog work). queue.ts auto-merge kept both
executeRawDirect call sites. Trio green; typecheck clean; lock-path + D3-guard +
routing tests pass on the merged state.
Resolve VERSION/package.json (keep 0.42.24.0 > master's 0.42.23.0) and CHANGELOG
(both entries, 0.42.24.0 on top). KEY_FILES auto-merged cleanly (my queue.ts
executeRawDirect entry + master's wedge work both intact). Trio green; typecheck
clean; lock-path + D3-guard + routing tests pass on the merged state.
@garrytan garrytan merged commit f868257 into master Jun 4, 2026
21 checks passed
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 5, 2026
* upstream/master:
  v0.42.26.0 docs(supabase): update connection-string setup to new UI + Transaction pooler (garrytan#1848) (garrytan#1875)
  v0.42.25.0 fix(pricing): unify chat-model pricing into one canonical source; add Opus 4.8 (garrytan#1819) (garrytan#1827)
  v0.42.24.0 fix(minions): route lock claim/renewLock through direct session pool (garrytan#1822)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant