v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave #1545
Merged
Replaces the per-hash transcript loop (7K SQL roundtrips on big brains) with one batch query using `frontmatter->>'source_hash' = ANY($2::text[])`. Migration v104 adds the partial expression index that keeps the new query O(log n) at scale (mirrors v97 pattern: CONCURRENTLY + invalid-remnant pre-drop on Postgres, plain CREATE INDEX on PGLite). Helper exported so test/cycle/extract-atoms-batch.test.ts can drive it directly without orchestrating the full phase. Fail-open posture preserved from the prior per-hash helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
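
A minimal sketch of the batch shape described in this commit, not the shipped helper. The table name `pages`, the `source_id` column bound to `$1`, and both function names are assumptions made for illustration; only the `frontmatter->>'source_hash' = ANY($2::text[])` predicate comes from the commit message.

```typescript
// Hypothetical row shape: only the extracted hash is selected.
interface HashRow {
  source_hash: string;
}

interface BatchQuery {
  text: string;
  values: [string, string[]];
}

// Build ONE parameterized query for all hashes instead of one query per hash.
// Assumption: $1 is a source id; the real first parameter is not shown in the PR.
function buildExistingHashesQuery(sourceId: string, hashes: string[]): BatchQuery {
  const unique = Array.from(new Set(hashes)); // de-duplicate before binding
  return {
    text:
      "SELECT frontmatter->>'source_hash' AS source_hash " +
      "FROM pages " +
      "WHERE source_id = $1 " +
      "AND frontmatter->>'source_hash' = ANY($2::text[])",
    values: [sourceId, unique],
  };
}

// Split candidate hashes into already-extracted (skip) and still-to-do.
function partitionByExisting(hashes: string[], rows: HashRow[]) {
  const existing = new Set(rows.map((r) => r.source_hash));
  return {
    skip: hashes.filter((h) => existing.has(h)),
    todo: hashes.filter((h) => !existing.has(h)),
  };
}
```

One reading of "fail-open" under this sketch: if the query throws, the caller passes an empty `rows` array, so every hash lands in `todo` and extraction proceeds rather than blocking.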
…ring Issue 3 + Issue 2 of the v0.41.20.0 ops-fix-wave.
Codex caught during plan review that yieldBetweenPhases (the existing external hook) does NOT refresh the cycle DB lock; it's just a setImmediate() from jobs.ts:1405 / autopilot.ts:632, and lock.refresh() was never called from inside runCycle. Combined with the 30min TTL, crashed cycles wedged the lock for the full window before another worker could take over.
Three coordinated changes:
1. LOCK_TTL_MINUTES 30 → 5 (src/core/cycle.ts). Crash recovers in ≤5 min instead of ≤30 min.
2. buildYieldDuringPhase(lock, outer): exported closure that calls lock.refresh() AND the existing yieldBetweenPhases hook on every fire. Passed to both long phases (extract_atoms, synthesize_concepts) as their yieldDuringPhase opt.
3. maybeYield helper inside both phases: 30s throttle, fires inside the main work loop AND immediately after every `await chat()` LLM call (codex hardening: a single long LLM await could otherwise sit past TTL).
Progress reporter wired through to both phases too (Issue 2): extract_atoms emits `[cycle.extract_atoms] N atoms / M skipped` ticks every ~1s; synthesize_concepts ticks per concept group. Cycle.ts owns start()/finish(); phases only call tick() and heartbeat() on the same reporter (NOT a child; that would produce the path collision `cycle.extract_atoms.extract_atoms.work`).
LockHandle interface exported for tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
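
The refresh closure and the 30s throttle could be sketched roughly as follows. This is an illustration under assumptions, not the code in src/core/cycle.ts: the `LockHandle` shape is reduced to a single `refresh()` method, `createThrottle` and `createMaybeYield` are hypothetical names, and the choice to start the throttle window at construction time is a guess.

```typescript
// Reduced shapes; the exported LockHandle interface may carry more members.
interface LockHandle {
  refresh(): Promise<void>;
}
type YieldHook = () => Promise<void>;

// Every fire refreshes the lock AND runs the pre-existing outer hook.
function buildYieldDuringPhase(lock: LockHandle, outer?: YieldHook): YieldHook {
  return async () => {
    await lock.refresh();
    if (outer) await outer();
  };
}

// Time gate: ready() returns true at most once per intervalMs.
// The clock is injectable so tests do not have to sleep.
function createThrottle(intervalMs: number, now: () => number = Date.now) {
  let last = now();
  return {
    ready(): boolean {
      const t = now();
      if (t - last < intervalMs) return false;
      last = t;
      return true;
    },
  };
}

// Phase-side helper: cheap enough to call on every loop iteration
// and right after every LLM await.
function createMaybeYield(
  yieldDuringPhase: YieldHook,
  intervalMs: number = 30 * 1000,
  now: () => number = Date.now,
): YieldHook {
  const gate = createThrottle(intervalMs, now);
  return async () => {
    if (gate.ready()) await yieldDuringPhase();
  };
}
```

A phase would then call `await maybeYield()` inside its work loop and again after each `await chat(...)`, so a single long LLM call is followed promptly by a refresh attempt once the throttle window has elapsed.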
Issue 4 of the v0.41.20.0 ops-fix-wave. On a 322K-page brain the sweep
takes 10+ hours; if it died at 87% the user redid 87% on restart.
Wires the existing `op_checkpoints` framework into extractMentionsFromDb
with a flushAndCheckpoint ordering that closes the four codex-flagged
correctness bugs at once:
1. Lost-links-on-crash — flush batch links to DB FIRST, commit page
keys to checkpoint SECOND, persist THIRD. A crash between
batch.push() and flushBatch() leaves the page un-checkpointed so
resume re-scans it (no silently lost mention links).
2. Dry-run resume contradiction — dry-run does NOT load or persist
the checkpoint. Verification path uses non-dry-run kill-and-resume.
3. Gazetteer hash in fingerprint — entity pages added mid-pause shift
the gazetteer hash → new fingerprint → fresh scan against the new
gazetteer. Without this, resumed runs would silently skip pages
against a new entity set.
4. Filtered pages get checkpointed too — pages skipped by `--type` /
`--since` / empty body / no-mentions all get marked completed so
resume doesn't re-fetch them.
Persist cadence: every 1000 items OR every 30s, whichever comes first
(~322 persists / ~24s total overhead on the 322K-page brain). A crash
re-scans at most 1000 pages on resume (about 0.3% of the sweep).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
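
The persist cadence and its arithmetic can be checked with a short sketch. The constant and function names are hypothetical, 322K is taken as exactly 322,000 pages, and the persist count considers only the item-count trigger (the 30s trigger can only add persists).

```typescript
const PERSIST_EVERY_ITEMS = 1000;
const PERSIST_EVERY_MS = 30 * 1000;

// Persist when either threshold is reached, whichever comes first.
function shouldPersist(itemsSincePersist: number, msSincePersist: number): boolean {
  return itemsSincePersist >= PERSIST_EVERY_ITEMS || msSincePersist >= PERSIST_EVERY_MS;
}

// Back-of-envelope figures for a sweep of totalPages pages.
function cadenceStats(totalPages: number, totalOverheadMs: number) {
  const persists = Math.ceil(totalPages / PERSIST_EVERY_ITEMS);
  return {
    persists,
    msPerPersist: totalOverheadMs / persists,
    worstCaseRedoFraction: PERSIST_EVERY_ITEMS / totalPages,
  };
}

const stats = cadenceStats(322000, 24000);
console.log(stats.persists);              // 322
console.log(stats.worstCaseRedoFraction); // roughly 0.0031
```

Under these assumptions the quoted ~24s of overhead works out to roughly 75 ms per persist.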
Issue 5 of the v0.41.20.0 ops-fix-wave. Multi-source brains see a paste-ready `gbrain sync --all --parallel 4 --workers 4 --skip-failed` in `gbrain doctor` output instead of maintaining two staggered per-source cron entries with manual deconfliction.
New checkSyncConsolidation surfaces the recommendation when 2+ active sources exist; "not applicable" for single-source brains. Own try/catch returns warn on SQL failure, because relying on the outer doctor catch wasn't a safe assumption.
`skills/cron-scheduler/SKILL.md` gains a "Multi-source brains" recipe block documenting the pattern + connection-budget math (parallel × workers × 2 ≈ 32 connections at default 4/4).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
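
A rough sketch of how such a doctor check and the connection-budget math might look. This is not the shipped checkSyncConsolidation: the `CheckResult` shape, the `"ok" | "warn"` status set, the synchronous `countActiveSources` callback (standing in for the SQL query), and the message wording are all assumptions.

```typescript
type CheckStatus = "ok" | "warn";

interface CheckResult {
  name: string;
  status: CheckStatus;
  message: string;
}

// Connection-budget rule of thumb from the recipe: parallel x workers x 2.
function connectionBudget(parallel: number, workers: number): number {
  return parallel * workers * 2;
}

// countActiveSources is a hypothetical stand-in for the SQL count query.
function checkSyncConsolidation(
  countActiveSources: () => number,
  parallel = 4,
  workers = 4,
): CheckResult {
  const name = "sync_consolidation";
  try {
    const active = countActiveSources();
    if (active < 2) {
      return { name, status: "ok", message: "not applicable (single-source brain)" };
    }
    const cmd = `gbrain sync --all --parallel ${parallel} --workers ${workers} --skip-failed`;
    const budget = connectionBudget(parallel, workers);
    return {
      name,
      status: "ok",
      message: `${active} active sources; consider one cron entry: ${cmd} (~${budget} connections)`,
    };
  } catch (err) {
    // Own try/catch: degrade to a warning instead of relying on an outer handler.
    const reason = err instanceof Error ? err.message : String(err);
    return { name, status: "warn", message: `could not count sources: ${reason}` };
  }
}
```

At the default 4/4 the budget function returns 4 × 4 × 2 = 32, matching the figure in the recipe.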
Two pre-existing tests assumed a clean ~/.gbrain/config.json and a free
~/.gbrain/cycle.lock — both shared across all gbrain processes on the
machine. Sibling Conductor worktrees running their own gbrain tests
poisoned the shared state, causing flakes:
- test/cycle-last-full-cycle-at.test.ts test 5 timed out at 5s
because runCycle returned 'skipped' (file lock held by a parallel
test process), and last_full_cycle_at exit hook silently no-oped.
Fix: each test wraps its body in `withEnv({GBRAIN_HOME: tmpdir})`
so the file lock path becomes per-test.
- test/schema-cli.test.ts `schema active reports default resolution`
failed exit 1 because another worktree had set
`schema_pack: gbrain-base-v2` in the shared config (a pack that
doesn't exist in the bundle). Fix: gbrain() helper defaults
GBRAIN_HOME to a per-file tempdir (beforeAll-owned), so subprocess
invocations get an isolated config dir unless tests explicitly
override.
Both fixes confirmed via deliberate pollution + retest: 12/12
schema-cli tests pass under simulated `schema_pack: gbrain-base-v2`
contamination; cycle-LFCA test 5 completes <2s with isolated home.
Discovered during v0.41.20.0 ship while investigating parallel-worktree
flake. Not caused by the ops-fix-wave but found via it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
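
The isolation technique in both fixes is a scoped environment override. A minimal sketch of such a helper follows; it is not the repository's `withEnv` (whose signature is only known here from the call `withEnv({GBRAIN_HOME: tmpdir})`). This hypothetical variant takes the environment object explicitly so it can be exercised without touching the real process environment, and it is synchronous, so an async test body would need a variant that awaits `fn` before restoring.

```typescript
type Env = Record<string, string | undefined>;

// Apply overrides to env, run fn, then restore the previous values
// even if fn throws. Keys that were absent before are removed again.
function withEnv<T>(env: Env, overrides: Env, fn: () => T): T {
  const saved: Env = {};
  for (const key of Object.keys(overrides)) {
    saved[key] = env[key];
    const value = overrides[key];
    if (value === undefined) delete env[key];
    else env[key] = value;
  }
  try {
    return fn();
  } finally {
    for (const key of Object.keys(saved)) {
      const previous = saved[key];
      if (previous === undefined) delete env[key];
      else env[key] = previous;
    }
  }
}
```

In a test this would be called with the process environment and a fresh temp directory, for example `withEnv(process.env, { GBRAIN_HOME: dir }, body)`, so the lock file and config path resolve under a per-test home.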
Five daily-driver ops pains fixed in one wave:
1. extract_atoms 7K-roundtrip overhead → 1 batch query + index
2. silent long-running phases → progress ticks every ~1s
3. 30-min crashed-cycle lock TTL → 5 min + active in-phase refresh
4. by-mention restarts from page 0 → resumes via op_checkpoints
5. multi-source cron → doctor surfaces `sync --all --parallel` nudge
Two follow-up TODOs filed under v0.41.19.0 ops-fix-wave block (will be renamed at follow-up time):
- `gbrain sync print-cron` subcommand (P2 ergonomics)
- Lock-loss detection in DbLockHandle.refresh() (P2 contract change)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Folds the v0.41.20.0 wave annotations into the cycle/extract/op-checkpoint key-files block: batch idempotency via atomsExistingForHashes, shorter cycle lock TTL with buildYieldDuringPhase active refresh, progress wiring through extract_atoms + synthesize_concepts, by-mention resume via mentionsFingerprint with flushAndCheckpoint ordering, sync_consolidation doctor check, and the 44-case test suite pinning every contract. Regenerated llms-full.txt to match (CLAUDE.md edit invariant).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	TODOS.md
Two CI-only failures caught on PR #1545 (v0.41.21.0 ops-fix-wave):
1. doctor-categories drift guard: new `sync_consolidation` check from T6 wasn't categorized in src/core/doctor-categories.ts. Added under OPS_CHECK_NAMES (it surfaces an operator-cron recommendation, not a brain-data quality signal).
2. facts-engine `embedding cosine ordering when both sides have embeddings`: passed locally, failed under CI's parallel shard. Bun's truncated assertion output didn't surface which expect() fired; hardened the test against unknown leak vectors by:
- per-run unique entity_slug (`embed-test-<random8>`) instead of the static `embed-test`, so any future cross-test pollution is structurally impossible
- `findIndex` + `aIdx < bIdx` assertion that pins the cosine RELATIONSHIP (A closer than B because cos(A,Q)=1.0 vs cos(B,Q)=0.0) instead of the brittle `result[0].fact === 'A'` position check. The new shape matches the test name's contract verbatim ("ordering when both sides have embeddings"), so any unrelated row in the result set can no longer flip the test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
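
The difference between the positional and the relational assertion can be shown with a small self-contained sketch. The `Fact` shape, `rankByCosine`, and `assertRankedBefore` are hypothetical helpers written for this illustration; they are not the facts-engine API or the actual test.

```typescript
interface Fact {
  fact: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Most similar first; does not mutate the input.
function rankByCosine(facts: Fact[], query: number[]): Fact[] {
  return facts
    .slice()
    .sort((x, y) => cosine(y.embedding, query) - cosine(x.embedding, query));
}

// Relational assertion: `first` must appear before `second`,
// regardless of what other rows are in the result set.
function assertRankedBefore(results: Fact[], first: string, second: string): void {
  const aIdx = results.findIndex((r) => r.fact === first);
  const bIdx = results.findIndex((r) => r.fact === second);
  if (aIdx < 0 || bIdx < 0) throw new Error("expected both facts in the result set");
  if (!(aIdx < bIdx)) throw new Error(`expected ${first} to rank before ${second}`);
}

const query = [1, 0];
const ranked = rankByCosine(
  [
    { fact: "stray", embedding: [2, 0] }, // unrelated row that also scores 1.0
    { fact: "A", embedding: [1, 0] },     // cos(A, Q) = 1.0
    { fact: "B", embedding: [0, 1] },     // cos(B, Q) = 0.0
  ],
  query,
);
assertRankedBefore(ranked, "A", "B"); // holds even with the stray row present
```

A check of the form `ranked[0].fact === 'A'` depends on nothing else tying with or outranking A, which is exactly what a leaked row from another test can break; the index comparison only depends on A and B themselves.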
garrytan
added a commit
that referenced
this pull request
May 27, 2026
Master advanced to v0.41.21.0 (5 daily-driver ops pains wave, PR #1545). Resolved 4 conflicts:
- VERSION + package.json: 0.41.22.0 stays (higher than master's 0.41.21.0)
- CHANGELOG: my v0.41.22.0 entry on top, master's v0.41.21.0 entry below in proper version-descending order
- src/core/migrate.ts: MIGRATION VERSION COLLISION. Master's v0.41.21 claimed v104 for `pages_atom_source_hash_idx`. Bumped my slug_aliases migration to v105 (the canonical "claim next available slot" pattern).
Updated all slug_aliases-related v104 doc/comment references:
- src/core/postgres-engine.ts: "Pre-v104 brain" → "Pre-v105 brain"
- src/core/onboard/checks.ts: "pre-v104 brains" → "pre-v105 brains"
- src/core/search/hybrid.ts: "pre-v104 brains" → "pre-v105 brains"
- docs/architecture/pack-upgrade-mechanism.md: migrate.ts:104 → :105
- CHANGELOG.md (my entry): v104 → v105 with rebump rationale
- CHANGELOG.md "To take advantage" block: migration v105
Master's v104 (atom source-hash index, `pages_atom_source_hash_idx`) preserved verbatim. Both migrations now coexist correctly.
Typecheck clean. bun run verify: 28/28 checks pass. 33/33 slug_aliases-touching tests pass (page-to-alias, resolve-slug-with-alias, unify-types-handler, onboard checks, search boost, E2E full flow).
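
The "claim next available slot" step can be illustrated with a short sketch. The `Migration` shape and both helpers are hypothetical and are not taken from src/core/migrate.ts; only the two migration names and the v104/v105 numbers come from the commit message.

```typescript
interface Migration {
  version: number;
  name: string;
}

// Versions claimed by more than one migration, in ascending order.
function findCollisions(migrations: Migration[]): number[] {
  const seen = new Set<number>();
  const duplicated = new Set<number>();
  for (const m of migrations) {
    if (seen.has(m.version)) duplicated.add(m.version);
    seen.add(m.version);
  }
  return Array.from(duplicated).sort((a, b) => a - b);
}

// Next free slot: one past the highest version already claimed.
function nextFreeVersion(claimed: Migration[]): number {
  return claimed.reduce((max, m) => Math.max(max, m.version), 0) + 1;
}

// After the merge, two migrations both claim v104.
const master: Migration[] = [{ version: 104, name: "pages_atom_source_hash_idx" }];
const branch: Migration = { version: 104, name: "slug_aliases" };

const collisions = findCollisions(master.concat([branch]));
// Master's slot is kept; the branch migration moves to the next free one.
const rebumped: Migration = { version: nextFreeVersion(master), name: branch.name };
```

Here `collisions` contains 104 and `rebumped.version` is 105, which mirrors the resolution described above: the already-merged migration keeps its number and the incoming one is renumbered.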
mgunnin
added a commit
to mgunnin/gbrain
that referenced
this pull request
May 28, 2026
* upstream/master:
  v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572)
  v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571)
  v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566)
  v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543)
  v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541)
  v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562)
  v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542)
  v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545)
  v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544)
  feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537)
  v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521)
  v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519)
  v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510)
  v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)
Summary
Five daily-driver ops pains fixed in one bisectable wave:
Performance + Visibility
- `extract_atoms` startup on conversation-transcript-heavy brains: 5-10 min of silent overhead → <1s. Replaces the per-hash SQL loop (7K roundtrips on a brain with that many transcripts) with one batch `WHERE frontmatter->>'source_hash' = ANY($2::text[])` query backed by migration v104's partial expression index.
- `extract_atoms` + `synthesize_concepts` were silent for 10+ min mid-phase. Now: tick every ~1s with running counts (`[cycle.extract_atoms] N atoms / M skipped`).

Reliability
- `buildYieldDuringPhase(lock, outer)` closure (fires `lock.refresh()` + the existing `yieldBetweenPhases` hook on every 30s `maybeYield` AND immediately after every `await chat()` LLM call). Crash recovery 6× faster (lock TTL 30 min → 5 min). Codex caught the pre-existing bug class during plan review: `yieldBetweenPhases` was just `setImmediate()` from jobs.ts/autopilot.ts and `lock.refresh()` was never called from inside `runCycle`, so dropping TTL alone would have made lock-stealing worse.

Resumability
- `gbrain extract links --by-mention` now resumes from where it died via `op_checkpoints`. Codex caught 4 correctness bugs in the original design: lost-links-on-crash (flush BEFORE committing pending page keys to the checkpoint), contradictory dry-run resume (dry-run skips both load + persist), gazetteer-not-in-fingerprint (entity-page changes invalidate the checkpoint), filtered-page restart pain (skip-decisions get checkpointed too). All four closed via `flushAndCheckpoint` ordering + `mentionsFingerprint({source, type, since, gazetteerHash})`.

Operator UX
- `gbrain doctor` now surfaces a paste-ready `gbrain sync --all --parallel 4 --workers 4 --skip-failed` recommendation for multi-source brains via new `sync_consolidation` check. Cron-scheduler skill grows a "Multi-source brains" recipe.

Discovered during ship, not caused by the wave: two pre-existing test-isolation flakes fixed too. `test/cycle-last-full-cycle-at.test.ts` and `test/schema-cli.test.ts` now isolate `GBRAIN_HOME` per test so sibling Conductor worktrees' parallel test runs don't poison shared state.

Test Coverage
- `test/cycle/extract-atoms-batch.test.ts` (5 cases)
- `test/cycle/extract-atoms-progress.test.ts` (4), `synthesize-concepts-progress.test.ts` (3)
- `cycle-lock-ttl.test.ts` (1), `yield-during-phase-refresh.test.ts` (7), `yield-during-phase-throttle.test.ts` (3)
- `extract-by-mention-resume.test.ts` (5), `op-checkpoint-mentions-fingerprint.test.ts` (7)
- `doctor-sync-consolidation.test.ts` (6)
- `cycle-last-full-cycle-at.test.ts` (5, all now under withEnv), `schema-cli.test.ts` (12, default-isolated home)

Full suite: 11454 pass / 0 fail / 0 skip across 4 parallel shards + 47 serial files (elapsed 1371s).
Pre-Landing Review
Plan went through `/plan-eng-review` (6 architecture decisions absorbed) and `/codex consult` (14 findings: 4 load-bearing absorbed via plan rework, 5 mechanical fixes auto-applied, 1 deferred to TODO). Codex specifically killed the original Issue 3 design when it proved the `yieldBetweenPhases` claim was wrong. Plan rebuilt around `buildYieldDuringPhase` closure.
No critical findings remain.
Plan Completion
All 7 implementation tasks (T1–T7) shipped + 2 pre-existing test-isolation flakes fixed + 2 follow-up TODOs filed (`gbrain sync print-cron` subcommand and lock-loss detection in `DbLockHandle.refresh()`).

Documentation
`CLAUDE.md` key-files block updated with the consolidated v0.41.21.0 entry covering all six shipped fixes plus the 44-case test inventory and 2 follow-up TODO references. `llms-full.txt` regenerated to match per the mandatory invariant.

Verification
Test plan
🤖 Generated with Claude Code