v0.42.37.0 fix(jobs): reap stale locks, bound disconnect, complete cooperative-abort (#1972) #2015
Merged
A crashed sync (OOM, recycle, SIGKILL) stranded its gbrain_cycle_locks row until something contended for it — reclaim was on-contention only. Add a host-scoped background reaper: reapDeadHolderLocks deletes locks whose holder PID is provably dead on this host, scoped to the gbrain-sync:*/gbrain-cycle* namespaces only (never elections/supervisor/reindex), with a snapshot-matched delete (date_trunc on acquired_at) that is TOCTOU-safe against PID reuse. Reuses isHolderDeadLocally (same-host + ESRCH + 60s grace). doctor --fix now auto-reaps for no-autopilot brains. DRY: selectLockRows + shared mapper now back inspectLock + listStaleLocks (killed the triplication).
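The reaper's decision rule and its snapshot-matched delete can be sketched in memory. This is a minimal illustration under stated assumptions, not the code in `db-lock.ts`: the `LockRow` shape, the `isReapable` and `reapIfUnchanged` names, the injected `pidIsDead` probe, and measuring the 60s grace from acquisition time are all hypothetical. A `Map` stands in for the `gbrain_cycle_locks` table.

```typescript
// Hypothetical row shape for illustration; the real gbrain_cycle_locks columns may differ.
interface LockRow {
  id: string;          // lock name, e.g. "gbrain-sync:source-a"
  holderHost: string;
  holderPid: number;
  acquiredAt: number;  // epoch milliseconds
}

// Only these namespaces are swept; elections/supervisor/reindex locks are never touched.
const REAPABLE_PREFIXES = ["gbrain-sync:", "gbrain-cycle"];
const GRACE_MS = 60_000; // assumption: the grace period is measured from acquisition

// Decide whether a lock row is safe to reap from this host.
// `pidIsDead` is injected so the sketch stays testable; a real probe might
// call process.kill(pid, 0) and treat ESRCH as "dead".
function isReapable(
  row: LockRow,
  thisHost: string,
  now: number,
  pidIsDead: (pid: number) => boolean,
): boolean {
  const inScope = REAPABLE_PREFIXES.some((prefix) => row.id.startsWith(prefix));
  if (!inScope) return false;
  if (row.holderHost !== thisHost) return false; // a remote holder's death cannot be proven here
  if (now - row.acquiredAt < GRACE_MS) return false;
  return pidIsDead(row.holderPid);
}

// Snapshot-matched delete: remove the row only if it still matches what was
// inspected. If the lock was re-acquired in between (possibly by a process
// that reused the PID), acquiredAt differs and nothing is deleted.
function reapIfUnchanged(table: Map<string, LockRow>, snapshot: LockRow): boolean {
  const current = table.get(snapshot.id);
  if (
    current === undefined ||
    current.holderPid !== snapshot.holderPid ||
    current.acquiredAt !== snapshot.acquiredAt
  ) {
    return false;
  }
  table.delete(snapshot.id);
  return true;
}
```

Matching on the acquisition timestamp as well as the PID is what closes the check-then-delete window: a PID match alone could delete a live lock held by an unrelated process that happened to reuse the number.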
pool.end() against PgBouncer transaction-mode never drained, so disconnect blocked until the CLI's 10s force-exit fired and process.exit()'d mid-write, truncating stdout (e.g. #1959's relational query returned empty). Add a gbrain-owned endPoolBounded(pool): Promise.race of pool.end({timeout}) against a hard timer, so teardown is bounded regardless of what postgres.js does and is testable. connection-manager ends its direct + read pools concurrently so the per-pool bounds don't stack. PGLite disconnect is unaffected.
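The bounded teardown described above can be sketched as a race between the pool's own shutdown and a hard timer. This is a hedged sketch rather than the function in `db.ts`: the `EndablePool` interface, the `EndOutcome` values, the `endAll` helper, and the 3000 ms default are assumptions made for illustration. Only the `end({ timeout })` call shape comes from the commit message.

```typescript
// Minimal pool shape assumed for the sketch.
interface EndablePool {
  end(opts?: { timeout?: number }): Promise<void>;
}

type EndOutcome = "drained" | "failed" | "timed_out";

// Race the pool's graceful shutdown against a hard timer so teardown always
// settles, even if end() never resolves.
async function endPoolBounded(pool: EndablePool, hardLimitMs = 3000): Promise<EndOutcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const hardTimer = new Promise<EndOutcome>((resolve) => {
    timer = setTimeout(() => resolve("timed_out"), hardLimitMs);
  });
  // Assumption: the driver's own timeout is expressed in seconds.
  const graceful: Promise<EndOutcome> = pool
    .end({ timeout: Math.ceil(hardLimitMs / 1000) })
    .then((): EndOutcome => "drained", (): EndOutcome => "failed");
  try {
    return await Promise.race([graceful, hardTimer]);
  } finally {
    // Clear the timer so it cannot keep the process alive after a clean drain.
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Ending several pools concurrently keeps the total wait near one bound
// instead of the sum of the per-pool bounds.
async function endAll(pools: EndablePool[], hardLimitMs = 3000): Promise<EndOutcome[]> {
  return Promise.all(pools.map((pool) => endPoolBounded(pool, hardLimitMs)));
}
```

Because the bound lives in the caller rather than in the driver, a test can pass a pool whose `end()` never settles and still assert that teardown returns.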
…1972)

v0.42.29 made only the embed phase honor the abort signal; a 24h pull still showed force-evicts from a long non-embed phase ignoring it. Thread the signal into every cycle-reachable long loop: extract (extractForSlugs + the full-walk extractLinksFromDir/extractTimelineFromDir), extract_facts (per-page loop + embed signal + the phantom-redirect 30s lock-retry), and consolidate's bucket loop. Add a terminal abort check so an aborted cycle never stamps last_full_cycle_at as a completed run (Codex #9). lint now yields + checks abort every 200 pages (it's synchronous; the yield is what lets the signal land). New phase-duration force-evict attribution log names any phase that crosses the 30s deadline. Wire reapDeadHolderLocks at cycle start.
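Why a synchronous loop needs a yield before an abort check can take effect is easiest to see in a small sketch. The names `lintPages`, `yieldToEventLoop`, and the result shape are hypothetical and not taken from the repository; only the 200-page interval and the yield-then-check pattern come from the commit message.

```typescript
const YIELD_EVERY = 200;

// Hand control back to the event loop so timers and signal handlers
// (including whatever triggers the abort) get a chance to run.
const yieldToEventLoop = (): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, 0));

interface LoopResult {
  processed: number;
  aborted: boolean;
}

// Run a synchronous per-page function over many pages, pausing every
// YIELD_EVERY pages to yield and then check the abort signal.
async function lintPages(
  pages: string[],
  lintOne: (page: string) => void,
  signal?: AbortSignal,
): Promise<LoopResult> {
  let processed = 0;
  for (const page of pages) {
    if (processed % YIELD_EVERY === 0) {
      await yieldToEventLoop();
      if (signal?.aborted) return { processed, aborted: true };
    }
    lintOne(page);
    processed++;
  }
  // Terminal check: an abort that landed after the last in-loop check is still reported.
  return { processed, aborted: signal?.aborted ?? false };
}
```

A caller can use the `aborted` flag as its terminal guard, for example by skipping the step that records a completed run whenever the flag is true. Checking at `processed === 0` means a signal that is already aborted stops the loop before any page is touched.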
#1972: stale-lock reaper, bounded pool disconnect, and complete cooperative-abort coverage across cycle phases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ct, abort coverage

document-release: update db-lock.ts (reapDeadHolderLocks + selectLockRows DRY), db.ts (endPoolBounded), and abort-check.ts (coverage now spans extract/extract_facts/consolidate/lint + terminal guard) entries to current truth.
…iles

Adding 3 new test files reshuffled the hash-based shards, exposing two pre-existing test-isolation bugs:

- cycle-consolidate.test.ts assumed the global legacy-embedding preload's 1536-d gateway config still held at initSchema, but a co-sharded test that calls resetGateway() in teardown nulls it, so initSchema fell back to the 1280-d default and built a halfvec(1280) facts column its 1536-d fixtures can't fill. Re-pin the legacy OpenAI/1536 config in beforeAll (the pattern legacy-embedding-preload.ts documents for 1536-d fixture tests).
- db-lock-heartbeat-takeover.test.ts (merged from master's #1794) mutated process.env.GBRAIN_LOCK_STEAL_GRACE_SECONDS raw, tripping check:test-isolation rule R1. Convert to withEnv().
garrytan added a commit that referenced this pull request Jun 10, 2026
Closes #1972. Three independent job-layer bugs, each traced to source, fixed in one PR with atomic per-bug commits. Hardened by `/plan-eng-review` plus a Codex outside-voice round (11 findings folded in).

## The three bugs

1. Stale locks never reaped. A crashed sync (OOM/recycle/SIGKILL) stranded its `gbrain_cycle_locks` row: reclaim was on-contention only, with no background sweep, so a low-traffic source looked "syncing" indefinitely. Added `reapDeadHolderLocks`: a host-scoped sweep at cycle start that deletes locks whose holder PID is provably dead, scoped to the `gbrain-sync:*`/`gbrain-cycle*` namespaces only (never elections/supervisor/reindex; that was a Codex blast-radius catch), with a snapshot-matched delete (`date_trunc` on `acquired_at`) that's TOCTOU-safe against PID reuse, plus the existing 60s grace. `gbrain doctor --fix` runs the same reaper for no-autopilot brains. DRY: `selectLockRows` + a shared mapper now back `inspectLock`/`listStaleLocks`.

2. `disconnect()` hung ~10s and ate CLI output (#1959). `pool.end()` never drained against PgBouncer transaction-mode, so teardown blocked until the CLI's 10s force-exit fired and `process.exit()`'d mid-write, truncating stdout (which is why a relational query returned empty even though the query worked). Added a gbrain-owned `endPoolBounded` (`Promise.race` of `pool.end({timeout})` against a hard timer), applied across `db.ts`/`postgres-engine.ts`/`connection-manager.ts` (the two conn-manager pools close concurrently so bounds don't stack).

3. Cooperative-abort was incomplete (#1737 follow-up). v0.42.29 covered only embed. Threaded the signal into every cycle-reachable long loop: `extract` (incremental `extractForSlugs` + full-walk `extractLinksFromDir`/`extractTimelineFromDir`; a Codex correction, since my first scope named the wrong functions), `extract_facts` (per-page loop + `embed` signal + the phantom-redirect 30s lock-retry), and `consolidate`'s bucket loop. `lint` yields + checks abort every 200 pages (it's synchronous; the yield is what lets the signal land). Added a terminal abort check so a cancelled cycle never stamps `last_full_cycle_at` as a completed run, and a per-phase duration warning that names any phase overrunning the 30s force-evict.

## Scope decisions

`findBacklinkGaps` is a synchronous double-walk with no await seam; its async refactor is filed in `TODOS.md`, to be done only if the new force-evict attribution log shows backlinks actually crossing 30s in production.

## Version

Requested `v0.42.37.0` collided with master's just-landed release of the same version; queue-advanced to v0.42.38.0 (clean next slot, no active sibling collision).

## Tests

`db-lock-reap` (namespace scope, TOCTOU, grace, cross-host), `postgres-disconnect-bounded` (resolves even when `.end()` never settles), `extract-abort` (pre-abort → 0 processed, full-walk + incremental). Extended `cycle-abort` (terminal guard + source guards).

… `db-lock.ts` alongside the reaper; verified semantically (66 overlap tests incl. master's `db-lock-heartbeat-takeover`, plus 147 of master's newly-merged tests, all green; typecheck clean).

🤖 Generated with Claude Code