v0.42.37.0 fix(jobs): reap stale locks, bound disconnect, complete cooperative-abort (#1972)#2015

Merged
garrytan merged 7 commits into master from garrytan/reap-stale-job-locks
Jun 10, 2026

garrytan (Owner) commented Jun 9, 2026

Closes #1972. Three independent job-layer bugs, each traced to source, fixed in one PR with atomic per-bug commits. Hardened by /plan-eng-review + a Codex outside-voice round (11 findings folded in).

The three bugs

1. Stale locks never reaped. A crashed sync (OOM/recycle/SIGKILL) stranded its gbrain_cycle_locks row. Reclaim happened only on contention, with no background sweep, so a low-traffic source looked "syncing" indefinitely. Added reapDeadHolderLocks: a host-scoped sweep at cycle start that deletes locks whose holder PID is provably dead. It is scoped to the gbrain-sync:* / gbrain-cycle* namespaces only (never elections/supervisor/reindex; that was a Codex blast-radius catch), uses a snapshot-matched delete (date_trunc on acquired_at) that's TOCTOU-safe against PID reuse, and keeps the existing 60s grace. gbrain doctor --fix runs the same reaper for no-autopilot brains. DRY: selectLockRows + a shared mapper now back inspectLock/listStaleLocks.
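The reaper's eligibility rule (sync/cycle namespace, same host, past the 60s grace, holder PID gone) can be sketched as a pure filter. This is a minimal illustration, not the code from this PR: the `LockRow` shape, the column names in `REAP_SQL`, and the injected `ProbeFn` are all assumptions made so the sketch is testable without real processes. In Node the probe would typically be `(pid) => process.kill(pid, 0)`, which throws `ESRCH` when no such process exists.

```typescript
// Hypothetical row shape; the real gbrain_cycle_locks schema is not shown in this PR.
interface LockRow {
  id: string; // lock name, e.g. "gbrain-sync:notes"
  holderPid: number;
  holderHost: string;
  acquiredAt: Date;
}

// Signal-0 style liveness probe, injected so the sketch needs no real PIDs.
type ProbeFn = (pid: number) => void;

const REAPABLE_PREFIXES = ["gbrain-sync:", "gbrain-cycle"];
const GRACE_MS = 60_000;

function isPidDead(pid: number, probe: ProbeFn): boolean {
  try {
    probe(pid);
    return false; // probe succeeded, so the process exists
  } catch (err) {
    // ESRCH means "no such process". Any other error (e.g. EPERM: the process
    // exists but is owned by someone else) is NOT proof of death.
    return (err as { code?: string }).code === "ESRCH";
  }
}

function selectReapable(
  rows: LockRow[],
  thisHost: string,
  now: Date,
  probe: ProbeFn,
): LockRow[] {
  return rows.filter(
    (row) =>
      REAPABLE_PREFIXES.some((prefix) => row.id.startsWith(prefix)) &&
      row.holderHost === thisHost && // a PID is only meaningful on its own host
      now.getTime() - row.acquiredAt.getTime() >= GRACE_MS &&
      isPidDead(row.holderPid, probe),
  );
}

// Snapshot-matched delete (column names are assumptions). Matching on the
// acquired_at that was inspected means a lock re-acquired in the meantime,
// possibly by a new process that reused the PID, is left alone. Truncating to
// milliseconds lets a JS Date (ms precision) compare equal to a timestamptz.
const REAP_SQL = `
  DELETE FROM gbrain_cycle_locks
   WHERE id = $1
     AND holder_pid = $2
     AND date_trunc('milliseconds', acquired_at) = $3`;
```

Injecting the probe keeps the decision logic separate from the OS call, so the namespace, host, and grace rules can each be exercised in a unit test.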

2. disconnect() hung ~10s and ate CLI output (#1959). pool.end() never drained against PgBouncer transaction-mode, so teardown blocked until the CLI's 10s force-exit fired and called process.exit() mid-write, truncating stdout (which is why a relational query appeared to return nothing even though the query itself worked). Added a gbrain-owned endPoolBounded (Promise.race of pool.end({timeout}) against a hard timer), applied across db.ts/postgres-engine.ts/connection-manager.ts (the two conn-manager pools close concurrently so their bounds don't stack).
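The bounded-teardown idea can be sketched as follows. This is an illustration under stated assumptions rather than the actual endPoolBounded from db.ts: the `Endable` interface, the 3000 ms default, the return values, and the `closeAllBounded` helper are hypothetical. The only library behavior relied on is that postgres.js's `end()` accepts a `timeout` option expressed in seconds.

```typescript
// Minimal structural type; a postgres.js client's end({ timeout }) fits it.
interface Endable {
  end(opts?: { timeout?: number }): Promise<void>;
}

type EndResult = "drained" | "timed_out";

async function endPoolBounded(pool: Endable, boundMs = 3000): Promise<EndResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Hard timer: resolves no matter what the driver does.
  const hardTimer = new Promise<EndResult>((resolve) => {
    timer = setTimeout(() => resolve("timed_out"), boundMs);
  });

  // Ask the driver to drain with its own timeout (seconds in postgres.js).
  const drain = pool
    .end({ timeout: Math.ceil(boundMs / 1000) })
    .then((): EndResult => "drained");

  try {
    return await Promise.race([drain, hardTimer]);
  } finally {
    clearTimeout(timer); // avoid keeping the process alive after a fast drain
  }
}

// Closing several pools concurrently keeps the total wait near one bound
// instead of the sum of the per-pool bounds.
function closeAllBounded(pools: Endable[], boundMs = 3000): Promise<EndResult[]> {
  return Promise.all(pools.map((pool) => endPoolBounded(pool, boundMs)));
}
```

With a pool whose `end()` never settles, `endPoolBounded(pool, 50)` resolves to `"timed_out"` after roughly 50 ms, which is the property that keeps the CLI from reaching its force-exit mid-write.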

3. Cooperative-abort was incomplete (#1737 follow-up). v0.42.29 covered only embed. Threaded the signal into every cycle-reachable long loop: extract (incremental extractForSlugs + full-walk extractLinksFromDir/extractTimelineFromDir; a Codex correction, my first scope named the wrong functions), extract_facts (per-page loop + embed signal + the phantom-redirect 30s lock-retry), and consolidate's bucket loop. lint now yields and checks abort every 200 pages (the loop is synchronous, so the yield is what lets the signal land). Added a terminal abort check so a cancelled cycle never stamps last_full_cycle_at as a completed run, and a per-phase duration warning that names any phase overrunning the 30s force-evict.
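Why a synchronous loop needs the yield: an abort triggered from a timer or I/O callback cannot run while the loop holds the event loop, so `signal.aborted` stays false until control is handed back. The sketch below shows the yield-then-check pattern and a terminal guard. The function names, the result shapes, and the use of `setTimeout(…, 0)` as the yield are assumptions for illustration, not the code in this PR.

```typescript
const YIELD_EVERY = 200;

// Hand control back to the event loop so pending callbacks (including the one
// that calls abort()) get a chance to run.
const yieldToEventLoop = (): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, 0));

async function lintPages(
  pages: string[],
  lintOne: (page: string) => void, // synchronous per-page work
  signal?: AbortSignal,
): Promise<{ processed: number; aborted: boolean }> {
  let processed = 0;
  for (const page of pages) {
    if (processed > 0 && processed % YIELD_EVERY === 0) {
      await yieldToEventLoop();
      if (signal?.aborted) return { processed, aborted: true };
    }
    lintOne(page);
    processed++;
  }
  return { processed, aborted: false };
}

type Phase = (signal: AbortSignal) => Promise<void>;

// Terminal guard: only stamp the cycle as completed if no abort landed,
// even when the abort arrived during the final phase.
async function runCycle(
  phases: Phase[],
  signal: AbortSignal,
  stampCompleted: () => void,
): Promise<"completed" | "aborted"> {
  for (const phase of phases) {
    if (signal.aborted) break;
    await phase(signal);
  }
  if (signal.aborted) return "aborted";
  stampCompleted();
  return "completed";
}
```

The check interval is a trade-off: yielding on every page would add scheduling overhead to a tight loop, while a larger batch delays how quickly an abort is noticed.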

Scope decisions

  • Reaper scoped to sync/cycle namespaces, not all locks (election/supervisor TTL-failover untouched).
  • One deferred item, gated not punted: findBacklinkGaps is a synchronous double-walk with no await seam; its async refactor is filed in TODOS.md, to be done only if the new force-evict attribution log shows backlinks actually crossing 30s in production.
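The attribution log that gates the deferred refactor could look roughly like the wrapper below. It is a sketch only: `timePhase`, its parameters, and the warning wording are hypothetical, and the injectable clock exists purely to make the example testable. The 30s threshold is the force-evict deadline named above.

```typescript
const FORCE_EVICT_MS = 30_000;

// Runs one phase and warns, naming the phase, if it outlived the deadline.
async function timePhase<T>(
  name: string,
  run: () => Promise<T>,
  warn: (message: string) => void,
  now: () => number = () => Date.now(),
): Promise<T> {
  const startedAt = now();
  try {
    return await run();
  } finally {
    // Measured in finally so a phase that throws is still attributed.
    const elapsedMs = now() - startedAt;
    if (elapsedMs > FORCE_EVICT_MS) {
      warn(
        `phase "${name}" ran ${elapsedMs}ms, past the ${FORCE_EVICT_MS}ms force-evict deadline`,
      );
    }
  }
}
```

A log like this turns "some phase ignored the abort" into a named phase, which is the evidence the backlinks refactor is waiting on.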

Version

Requested v0.42.37.0 collided with master's just-landed release of the same version; queue-advanced to v0.42.38.0 (clean next slot, no active sibling collision).

Tests

🤖 Generated with Claude Code

garrytan and others added 6 commits June 9, 2026 08:35
A crashed sync (OOM, recycle, SIGKILL) stranded its gbrain_cycle_locks row
until something contended for it — reclaim was on-contention only. Add a
host-scoped background reaper: reapDeadHolderLocks deletes locks whose holder
PID is provably dead on this host, scoped to the gbrain-sync:*/gbrain-cycle*
namespaces only (never elections/supervisor/reindex), with a snapshot-matched
delete (date_trunc on acquired_at) that is TOCTOU-safe against PID reuse.
Reuses isHolderDeadLocally (same-host + ESRCH + 60s grace). doctor --fix now
auto-reaps for no-autopilot brains. DRY: selectLockRows + shared mapper now
back inspectLock + listStaleLocks (killed the triplication).

pool.end() against PgBouncer transaction-mode never drained, so disconnect
blocked until the CLI's 10s force-exit fired and process.exit()'d mid-write,
truncating stdout (e.g. #1959's relational query returned empty). Add a
gbrain-owned endPoolBounded(pool): Promise.race of pool.end({timeout}) against
a hard timer, so teardown is bounded regardless of what postgres.js does and is
testable. connection-manager ends its direct + read pools concurrently so the
per-pool bounds don't stack. PGLite disconnect is unaffected.
…1972)

v0.42.29 made only the embed phase honor the abort signal; a 24h pull still
showed force-evicts from a long non-embed phase ignoring it. Thread the signal
into every cycle-reachable long loop: extract (extractForSlugs + the full-walk
extractLinksFromDir/extractTimelineFromDir), extract_facts (per-page loop +
embed signal + the phantom-redirect 30s lock-retry), and consolidate's bucket
loop. Add a terminal abort check so an aborted cycle never stamps
last_full_cycle_at as a completed run (Codex #9). lint now yields + checks
abort every 200 pages (it's synchronous; the yield is what lets the signal
land). New phase-duration force-evict attribution log names any phase that
crosses the 30s deadline. Wire reapDeadHolderLocks at cycle start.

#1972: stale-lock reaper, bounded pool disconnect, and complete
cooperative-abort coverage across cycle phases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ct, abort coverage

document-release: update db-lock.ts (reapDeadHolderLocks + selectLockRows DRY),
db.ts (endPoolBounded), and abort-check.ts (coverage now spans extract/
extract_facts/consolidate/lint + terminal guard) entries to current truth.

garrytan changed the title from "v0.42.38.0 fix(jobs): reap stale locks, bound disconnect, complete cooperative-abort (#1972)" to "v0.42.37.0 fix(jobs): reap stale locks, bound disconnect, complete cooperative-abort (#1972)" on Jun 10, 2026

…iles

Adding 3 new test files reshuffled the hash-based shards, exposing two
pre-existing test-isolation bugs:

- cycle-consolidate.test.ts assumed the global legacy-embedding preload's
  1536-d gateway config still held at initSchema, but a co-sharded test that
  calls resetGateway() in teardown nulls it, so initSchema fell back to the
  1280-d default and built a halfvec(1280) facts column its 1536-d fixtures
  can't fill. Re-pin the legacy OpenAI/1536 config in beforeAll (the pattern
  legacy-embedding-preload.ts documents for 1536-d fixture tests).
- db-lock-heartbeat-takeover.test.ts (merged from master's #1794) mutated
  process.env.GBRAIN_LOCK_STEAL_GRACE_SECONDS raw, tripping check:test-isolation
  rule R1. Convert to withEnv().
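
The raw-mutation problem in the second bullet comes down to restore-on-exit: a test that sets process.env directly and then throws, or simply forgets to reset it, leaks the value into whatever test shares its shard. The repository's own withEnv() helper is not shown here, so the version below is a hypothetical stand-in with an assumed signature, meant only to illustrate the pattern.

```typescript
// Hypothetical stand-in for the repo's withEnv(); the real signature may differ.
async function withEnv<T>(
  overrides: Record<string, string | undefined>,
  fn: () => T | Promise<T>,
): Promise<T> {
  const saved = new Map<string, string | undefined>();

  // Remember the previous value of every key before touching it.
  for (const key of Object.keys(overrides)) {
    saved.set(key, process.env[key]);
    const value = overrides[key];
    if (value === undefined) delete process.env[key];
    else process.env[key] = value;
  }

  try {
    return await fn();
  } finally {
    // Restore even if fn throws, so later tests see the original environment.
    saved.forEach((value, key) => {
      if (value === undefined) delete process.env[key];
      else process.env[key] = value;
    });
  }
}
```

A test would then wrap its body, for example `await withEnv({ GBRAIN_LOCK_STEAL_GRACE_SECONDS: "0" }, async () => { /* assertions */ })`, and the variable returns to its prior state when the callback settles.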

Development

Successfully merging this pull request may close these issues.

Compendium (0.42.34.0): stale dead-PID cycle locks never reaped, disconnect() 10s hang on PgBouncer txn-mode CLI, and incomplete #1737 cooperative-abort