v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave by garrytan · Pull Request #1545 · garrytan/gbrain

garrytan · 2026-05-27T07:04:35Z

Summary

Five daily-driver ops pains fixed in one bisectable wave:

Performance + Visibility

extract_atoms startup on conversation-transcript-heavy brains: 5-10 min of silent overhead → <1s. Replaces the per-hash SQL loop (7K roundtrips on a brain with that many transcripts) with one batch WHERE frontmatter->>'source_hash' = ANY($2::text[]) query backed by migration v104's partial expression index.
extract_atoms + synthesize_concepts were silent for 10+ min mid-phase. Now: tick every ~1s with running counts ([cycle.extract_atoms] N atoms / M skipped).
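The batch lookup described above can be sketched roughly as follows. This is an illustrative sketch, not the shipped helper: the `pages` table name, the `source_id` column bound to `$1`, and both function names are assumptions. Only the `frontmatter->>'source_hash' = ANY($2::text[])` predicate comes from this PR.

```typescript
// Hypothetical sketch: one batch lookup instead of one query per hash.
// Table and column names other than frontmatter->>'source_hash' are assumptions.
interface BatchQuery {
  text: string;
  values: [string, string[]];
}

function buildExistingHashQuery(sourceId: string, hashes: string[]): BatchQuery {
  return {
    text:
      "SELECT frontmatter->>'source_hash' AS source_hash " +
      "FROM pages " +
      "WHERE source_id = $1 " +
      "AND frontmatter->>'source_hash' = ANY($2::text[])",
    // De-duplicate so the array parameter stays as small as possible.
    values: [sourceId, Array.from(new Set(hashes))],
  };
}

// Split candidate hashes into already-extracted and still-to-do,
// given the rows that the single batch query returned.
function partitionByExisting(
  hashes: string[],
  rows: { source_hash: string }[],
): { existing: string[]; missing: string[] } {
  const seen = new Set(rows.map((r) => r.source_hash));
  return {
    existing: hashes.filter((h) => seen.has(h)),
    missing: hashes.filter((h) => !seen.has(h)),
  };
}
```

With this shape the number of roundtrips no longer grows with the number of transcripts: one query goes out, and the partition happens in memory.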

Reliability

Cycle lock TTL dropped 30→5 min, combined with active in-phase refresh via the new buildYieldDuringPhase(lock, outer) closure, which fires lock.refresh() plus the existing yieldBetweenPhases hook on every 30s maybeYield AND immediately after every await chat() LLM call. Crash recovery is 6× faster. Codex caught the pre-existing bug class during plan review: yieldBetweenPhases was just setImmediate() from jobs.ts/autopilot.ts and lock.refresh() was never called from inside runCycle, so dropping the TTL alone would have made lock-stealing worse.
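A minimal sketch of the refresh-on-yield idea, assuming a lock handle that exposes only `refresh()`. The `YieldHook` type, the `makeMaybeYield` factory, and the injectable clock are illustrative assumptions. The real `LockHandle` interface and `maybeYield` helper in `src/core/cycle.ts` may look different.

```typescript
// Hypothetical shapes: the real LockHandle and hook signatures may differ.
interface LockHandle {
  refresh(): Promise<void>;
}
type YieldHook = () => Promise<void>;

// Refresh the cycle lock, then run the outer yield hook, on every fire.
function buildYieldDuringPhase(lock: LockHandle, outer?: YieldHook): YieldHook {
  return async () => {
    await lock.refresh();
    if (outer) await outer();
  };
}

// Throttle: call `fire` at most once per `intervalMs`. `now` is injectable
// so the throttle can be exercised without real timers.
function makeMaybeYield(
  fire: YieldHook,
  intervalMs: number = 30000,
  now: () => number = Date.now,
): () => Promise<boolean> {
  let last = now();
  return async () => {
    if (now() - last < intervalMs) return false;
    last = now();
    await fire();
    return true;
  };
}
```

A phase would call the throttled function both inside its work loop and right after each awaited LLM call, so that a single long await is followed by a refresh at the first opportunity.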

Resumability

gbrain extract links --by-mention now resumes from where it died via op_checkpoints. Codex caught 4 correctness bugs in the original design: lost-links-on-crash (fixed by flushing links BEFORE committing pending page keys to the checkpoint), contradictory dry-run resume (dry-run now skips both load and persist), gazetteer-not-in-fingerprint (entity-page changes now invalidate the checkpoint), and filtered-page restart pain (skip decisions get checkpointed too). All four are closed via flushAndCheckpoint ordering plus mentionsFingerprint({source, type, since, gazetteerHash}).
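To illustrate why the gazetteer hash belongs in the fingerprint, here is a small sketch. The field names follow the `mentionsFingerprint({source, type, since, gazetteerHash})` call named above. The serialization, the FNV-1a hash (a dependency-free stand-in for whatever hash the real code uses), and `canResume` are assumptions made for the example.

```typescript
// Hypothetical sketch of a checkpoint fingerprint. Field names follow the PR;
// the serialization and the FNV-1a hash are stand-ins, not the shipped code.
interface MentionsFingerprintInput {
  source: string;
  type?: string;
  since?: string;
  gazetteerHash: string;
}

// 32-bit FNV-1a over UTF-16 code units, rendered as 8 hex characters.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("0000000" + h.toString(16)).slice(-8);
}

function mentionsFingerprint(input: MentionsFingerprintInput): string {
  // Fixed field order so equal inputs always serialize identically.
  const canonical = JSON.stringify([
    input.source,
    input.type ?? null,
    input.since ?? null,
    input.gazetteerHash,
  ]);
  return fnv1a(canonical);
}

// A checkpoint is only reusable when its fingerprint matches the current run.
function canResume(saved: string | null, current: string): boolean {
  return saved !== null && saved === current;
}
```

If an entity page is added while the sweep is paused, the gazetteer hash changes, the fingerprint changes with it, and the old checkpoint is ignored in favor of a fresh scan.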

Operator UX

gbrain doctor now surfaces a paste-ready gbrain sync --all --parallel 4 --workers 4 --skip-failed recommendation for multi-source brains via new sync_consolidation check. Cron-scheduler skill grows a "Multi-source brains" recipe.

Discovered during ship, not caused by the wave: two pre-existing test-isolation flakes fixed too — test/cycle-last-full-cycle-at.test.ts and test/schema-cli.test.ts now isolate GBRAIN_HOME per test so sibling Conductor worktrees' parallel test runs don't poison shared state.

Test Coverage

| Scope | Tests added | Status |
|---|---|---|
| extract_atoms batch | `test/cycle/extract-atoms-batch.test.ts` (5 cases) | ✓ |
| Progress wiring | `test/cycle/extract-atoms-progress.test.ts` (4), `synthesize-concepts-progress.test.ts` (3) | ✓ |
| Lock TTL + yield | `cycle-lock-ttl.test.ts` (1), `yield-during-phase-refresh.test.ts` (7), `yield-during-phase-throttle.test.ts` (3) | ✓ |
| by-mention checkpoint | `extract-by-mention-resume.test.ts` (5), `op-checkpoint-mentions-fingerprint.test.ts` (7) | ✓ |
| Doctor consolidation | `doctor-sync-consolidation.test.ts` (6) | ✓ |
| Test-isolation fixes | `cycle-last-full-cycle-at.test.ts` (5, all now under withEnv), `schema-cli.test.ts` (12, default-isolated home) | ✓ |

Full suite: 11454 pass / 0 fail / 0 skip across 4 parallel shards + 47 serial files (elapsed 1371s).

Pre-Landing Review

Plan went through /plan-eng-review (6 architecture decisions absorbed) and /codex consult (14 findings — 4 load-bearing absorbed via plan rework, 5 mechanical fixes auto-applied, 1 deferred to TODO). Codex specifically killed the original Issue 3 design when it proved the yieldBetweenPhases claim was wrong. Plan rebuilt around buildYieldDuringPhase closure.

No critical findings remain.

Plan Completion

All 7 implementation tasks (T1–T7) shipped + 2 pre-existing test-isolation flakes fixed + 2 follow-up TODOs filed (gbrain sync print-cron subcommand and lock-loss detection in DbLockHandle.refresh()).

Documentation

CLAUDE.md key-files block updated with the consolidated v0.41.21.0 entry covering all six shipped fixes plus the 44-case test inventory and 2 follow-up TODO references. llms-full.txt regenerated to match per the mandatory invariant.

Verification

# Issue 1: extract_atoms transcript idempotency fast
time gbrain dream --phase extract_atoms --dry-run --json

# Issue 2: progress visible
gbrain dream --phase extract_atoms 2>&1 | head -50

# Issue 3: lock TTL is 5min, crashes auto-recover
gbrain dream --phase orphans &
PID1=$!; sleep 1; kill -9 $PID1
sleep 310 && gbrain dream --phase orphans  # should succeed once the 5 min lock TTL has expired

# Issue 4: by-mention resumes
gbrain extract --by-mention 2>&1 | tee /tmp/run1.log
# Ctrl-C at ~10%
gbrain extract --by-mention 2>&1 | tee /tmp/run2.log
# Expect "resuming: 32K/322K already scanned"

# Issue 5: doctor nudge
gbrain doctor --json | jq '.checks[] | select(.name=="sync_consolidation")'

Test plan

Full unit suite: 11454 pass / 0 fail
Both pre-existing test flakes investigated + fixed
Plan-eng-review CLEARED (6 decisions, 0 unresolved)
Codex consult ran (14 findings, 4 load-bearing absorbed)
Documentation synced (CLAUDE.md + llms-full.txt)

🤖 Generated with Claude Code

Replaces the per-hash transcript loop (7K SQL roundtrips on big brains) with one batch query using `frontmatter->>'source_hash' = ANY($2::text[])`. Migration v104 adds the partial expression index that keeps the new query O(log n) at scale (mirrors v97 pattern: CONCURRENTLY + invalid-remnant pre-drop on Postgres, plain CREATE INDEX on PGLite). Helper exported so test/cycle/extract-atoms-batch.test.ts can drive it directly without orchestrating the full phase. Fail-open posture preserved from the prior per-hash helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ring Issue 3 + Issue 2 of the v0.41.20.0 ops-fix-wave.

Codex caught during plan review that yieldBetweenPhases (the existing external hook) does NOT refresh the cycle DB lock: it's just a setImmediate() from jobs.ts:1405 / autopilot.ts:632, and lock.refresh() was never called from inside runCycle. Combined with the 30min TTL, crashed cycles wedged the lock for the full window before another worker could take over.

Three coordinated changes:

1. LOCK_TTL_MINUTES 30 → 5 (src/core/cycle.ts). Crash recovers in ≤5 min instead of ≤30 min.
2. buildYieldDuringPhase(lock, outer): exported closure that calls lock.refresh() AND the existing yieldBetweenPhases hook on every fire. Passed to both long phases (extract_atoms, synthesize_concepts) as their yieldDuringPhase opt.
3. maybeYield helper inside both phases: 30s throttle, fires inside the main work loop AND immediately after every `await chat()` LLM call (codex hardening: a single long LLM await could otherwise sit past TTL).

Progress reporter wired through to both phases too (Issue 2): extract_atoms emits `[cycle.extract_atoms] N atoms / M skipped` ticks every ~1s; synthesize_concepts ticks per concept group. Cycle.ts owns start()/finish(); phases only call tick() and heartbeat() on the same reporter (NOT a child, which would produce path collision `cycle.extract_atoms.extract_atoms.work`). LockHandle interface exported for tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
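The ~1s progress tick can be pictured with a sketch like the one below. It is not the repo's reporter: `makeProgressTicker`, the `emit` callback, and the injectable clock are assumptions made so the example stands alone. Only the `[cycle.extract_atoms] N atoms / M skipped` line format comes from the commit message, and `heartbeat()` is left out.

```typescript
// Hypothetical sketch of a throttled progress tick; the real reporter owns
// start()/finish() elsewhere and has its own interface.
function makeProgressTicker(
  path: string,
  emit: (line: string) => void,
  intervalMs: number = 1000,
  now: () => number = Date.now,
) {
  let atoms = 0;
  let skipped = 0;
  let lastEmit = now();
  return {
    // Count one unit of work; emit a running total at most once per interval.
    tick(kind: "atom" | "skipped"): void {
      if (kind === "atom") atoms++;
      else skipped++;
      if (now() - lastEmit >= intervalMs) {
        lastEmit = now();
        emit(`[${path}] ${atoms} atoms / ${skipped} skipped`);
      }
    },
  };
}
```

Counting on every call but emitting on a timer keeps the output rate steady regardless of how fast the phase processes items.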

Issue 4 of the v0.41.20.0 ops-fix-wave. On a 322K-page brain the sweep takes 10+ hours; if it died at 87% the user redid 87% on restart.

Wires the existing `op_checkpoints` framework into extractMentionsFromDb with a flushAndCheckpoint ordering that closes the four codex-flagged correctness bugs at once:

1. Lost-links-on-crash: flush batch links to DB FIRST, commit page keys to checkpoint SECOND, persist THIRD. A crash between batch.push() and flushBatch() leaves the page un-checkpointed so resume re-scans it (no silently lost mention links).
2. Dry-run resume contradiction: dry-run does NOT load or persist the checkpoint. Verification path uses non-dry-run kill-and-resume.
3. Gazetteer hash in fingerprint: entity pages added mid-pause shift the gazetteer hash → new fingerprint → fresh scan against the new gazetteer. Without this, resumed runs would silently skip pages against a new entity set.
4. Filtered pages get checkpointed too: pages skipped by `--type` / `--since` / empty body / no-mentions all get marked completed so resume doesn't re-fetch them.

Persist cadence: every 1000 items OR every 30s, whichever first (~322 persists / ~24s total overhead on the 322K-page brain). Crash window capped at 1000 pages (<0.3% loss).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
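The flush-first ordering and the persist cadence can be sketched as below. Everything except the `flushAndCheckpoint` name is hypothetical: `createMentionSweep`, the `SweepDeps` interface, and the string-typed links and page keys stand in for the real DB and `op_checkpoints` plumbing.

```typescript
// Hypothetical sketch of flush-then-checkpoint ordering. The deps interface
// is assumed; real code would talk to the DB and the op_checkpoints store.
interface SweepDeps {
  writeLinks(links: string[]): Promise<void>; // durable link insert
  persistCheckpoint(completed: string[]): Promise<void>; // durable checkpoint write
  now(): number;
}

function createMentionSweep(deps: SweepDeps, maxItems: number = 1000, maxMs: number = 30000) {
  let pendingLinks: string[] = [];
  let pendingPages: string[] = [];
  const completed: string[] = [];
  let lastPersist = deps.now();

  async function flushAndCheckpoint(): Promise<void> {
    // 1. Links reach the DB first; a throw here leaves pages un-checkpointed.
    await deps.writeLinks(pendingLinks);
    pendingLinks = [];
    // 2. Only then are the page keys marked completed.
    completed.push(...pendingPages);
    pendingPages = [];
    // 3. Finally the checkpoint itself is persisted.
    await deps.persistCheckpoint(completed.slice());
    lastPersist = deps.now();
  }

  // Record a scanned page (including pages that yielded no links),
  // then flush if either the item count or the time budget is reached.
  async function pageDone(pageKey: string, links: string[]): Promise<void> {
    pendingLinks.push(...links);
    pendingPages.push(pageKey);
    const due = pendingPages.length >= maxItems || deps.now() - lastPersist >= maxMs;
    if (due) await flushAndCheckpoint();
  }

  return { pageDone, flushAndCheckpoint };
}
```

Because a page key is only recorded after its links are written, a crash at any point can cause a page to be re-scanned but not to be skipped with its links missing.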

Issue 5 of the v0.41.20.0 ops-fix-wave. Multi-source brains see a paste-ready `gbrain sync --all --parallel 4 --workers 4 --skip-failed` in `gbrain doctor` output instead of maintaining two staggered per-source cron entries with manual deconfliction. New checkSyncConsolidation surfaces the recommendation when 2+ active sources exist; "not applicable" for single-source brains. Own try/catch returns warn on SQL failure — outer doctor catch wasn't a safe assumption. `skills/cron-scheduler/SKILL.md` gains a "Multi-source brains" recipe block documenting the pattern + connection-budget math (parallel × workers × 2 ≈ 32 connections at default 4/4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
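As a rough illustration of the check and of the connection-budget arithmetic (4 × 4 × 2 = 32), consider the sketch below. The `CheckResult` shape, the status strings, and the `countActiveSources` callback are assumptions. Only the check name, the recommended command, and the `parallel × workers × 2` estimate come from the commit message.

```typescript
// Hypothetical sketch: the result shape and status strings are assumptions,
// not the doctor's real types.
type CheckStatus = "info" | "warn" | "not_applicable";

interface CheckResult {
  name: string;
  status: CheckStatus;
  message: string;
}

// Connection-budget estimate from the recipe: parallel x workers x 2.
function estimateConnections(parallel: number, workers: number): number {
  return parallel * workers * 2;
}

function checkSyncConsolidation(
  countActiveSources: () => number, // stands in for the real SQL lookup
  parallel: number = 4,
  workers: number = 4,
): CheckResult {
  const name = "sync_consolidation";
  try {
    const active = countActiveSources();
    if (active < 2) {
      return { name, status: "not_applicable", message: "single-source brain" };
    }
    const cmd = `gbrain sync --all --parallel ${parallel} --workers ${workers} --skip-failed`;
    const budget = estimateConnections(parallel, workers);
    return {
      name,
      status: "info",
      message: `${active} active sources; consider: ${cmd} (~${budget} connections)`,
    };
  } catch (err) {
    // Own try/catch: a failed lookup degrades to a warning, never a crash.
    return { name, status: "warn", message: `could not count sources: ${String(err)}` };
  }
}
```

Raising either `--parallel` or `--workers` multiplies the estimate, so the budget is worth checking against the database's connection limit before changing the defaults.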

Two pre-existing tests assumed a clean ~/.gbrain/config.json and a free ~/.gbrain/cycle.lock, both shared across all gbrain processes on the machine. Sibling Conductor worktrees running their own gbrain tests poisoned the shared state, causing flakes:

- test/cycle-last-full-cycle-at.test.ts test 5 timed out at 5s because runCycle returned 'skipped' (file lock held by a parallel test process), and last_full_cycle_at exit hook silently no-oped. Fix: each test wraps its body in `withEnv({GBRAIN_HOME: tmpdir})` so the file lock path becomes per-test.
- test/schema-cli.test.ts `schema active reports default resolution` failed exit 1 because another worktree had set `schema_pack: gbrain-base-v2` in the shared config (a pack that doesn't exist in the bundle). Fix: gbrain() helper defaults GBRAIN_HOME to a per-file tempdir (beforeAll-owned), so subprocess invocations get an isolated config dir unless tests explicitly override.

Both fixes confirmed via deliberate pollution + retest: 12/12 schema-cli tests pass under simulated `schema_pack: gbrain-base-v2` contamination; cycle-LFCA test 5 completes <2s with isolated home.

Discovered during v0.41.20.0 ship while investigating parallel-worktree flake. Not caused by the ops-fix-wave but found via it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
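For readers unfamiliar with the pattern, a `withEnv`-style helper might look like the sketch below. This is not the repo's helper: the signature here takes the environment object as an explicit first argument (an assumption made so the example does not depend on `process.env`), whereas the commit message shows a `withEnv({GBRAIN_HOME: tmpdir})` call shape.

```typescript
// Hypothetical sketch of a withEnv-style helper; the repo's real helper may differ.
// `env` is injectable here; a real test would pass process.env.
type Env = Record<string, string | undefined>;

async function withEnv<T>(
  env: Env,
  overrides: Env,
  body: () => Promise<T> | T,
): Promise<T> {
  const saved: Env = {};
  for (const key of Object.keys(overrides)) {
    saved[key] = env[key];
    const value = overrides[key];
    if (value === undefined) delete env[key];
    else env[key] = value;
  }
  try {
    return await body();
  } finally {
    // Restore even when the body throws, so later tests see clean state.
    for (const key of Object.keys(saved)) {
      const value = saved[key];
      if (value === undefined) delete env[key];
      else env[key] = value;
    }
  }
}
```

Restoring in `finally` is what keeps a failing test from leaking its temporary `GBRAIN_HOME` into the tests that run after it.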

Five daily-driver ops pains fixed in one wave:

1. extract_atoms 7K-roundtrip overhead → 1 batch query + index
2. silent long-running phases → progress ticks every ~1s
3. 30-min crashed-cycle lock TTL → 5 min + active in-phase refresh
4. by-mention restarts from page 0 → resumes via op_checkpoints
5. multi-source cron → doctor surfaces `sync --all --parallel` nudge

Two follow-up TODOs filed under v0.41.19.0 ops-fix-wave block (will be renamed at follow-up time):

- `gbrain sync print-cron` subcommand (P2 ergonomics)
- Lock-loss detection in DbLockHandle.refresh() (P2 contract change)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Folds the v0.41.20.0 wave annotations into the cycle/extract/op-checkpoint key-files block: batch idempotency via atomsExistingForHashes, shorter cycle lock TTL with buildYieldDuringPhase active refresh, progress wiring through extract_atoms + synthesize_concepts, by-mention resume via mentionsFingerprint with flushAndCheckpoint ordering, sync_consolidation doctor check, and the 44-case test suite pinning every contract. Regenerated llms-full.txt to match (CLAUDE.md edit invariant). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# Conflicts: # CHANGELOG.md # TODOS.md

Two CI-only failures caught on PR #1545 (v0.41.21.0 ops-fix-wave):

1. doctor-categories drift guard: new `sync_consolidation` check from T6 wasn't categorized in src/core/doctor-categories.ts. Added under OPS_CHECK_NAMES (it surfaces an operator-cron recommendation, not a brain-data quality signal).
2. facts-engine `embedding cosine ordering when both sides have embeddings`: passed locally, failed under CI's parallel shard. Bun's truncated assertion output didn't surface which expect() fired; hardened the test against unknown leak vectors by:
   - per-run unique entity_slug (`embed-test-<random8>`) instead of the static `embed-test`, so any future cross-test pollution is structurally impossible
   - `findIndex` + `aIdx < bIdx` assertion that pins the cosine RELATIONSHIP (A closer than B because cos(A,Q)=1.0 vs cos(B,Q)=0.0) instead of the brittle `result[0].fact === 'A'` position check.

The new shape matches the test name's contract verbatim ("ordering when both sides have embeddings"), so any unrelated row in the result set can no longer flip the test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
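The difference between a position check and a relationship check can be shown with a small standalone sketch. The `Fact` shape, `rankByCosine`, and `ranksBefore` are illustrative names and not the facts-engine API. The vectors are chosen so that cos(A,Q)=1.0 and cos(B,Q)=0.0, matching the commit message.

```typescript
// Hypothetical sketch: assert relative order, not absolute position.
interface Fact {
  fact: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Highest cosine similarity to the query first.
function rankByCosine(facts: Fact[], query: number[]): Fact[] {
  return facts
    .slice()
    .sort((x, y) => cosine(y.embedding, query) - cosine(x.embedding, query));
}

// True when both facts are present and `first` ranks ahead of `second`.
function ranksBefore(results: Fact[], first: string, second: string): boolean {
  const aIdx = results.findIndex((r) => r.fact === first);
  const bIdx = results.findIndex((r) => r.fact === second);
  return aIdx >= 0 && bIdx >= 0 && aIdx < bIdx;
}

const query = [1, 0];
const ranked = rankByCosine(
  [
    { fact: "leaked-row", embedding: [2, 0] }, // unrelated row that ties with A
    { fact: "B", embedding: [0, 1] }, // cos(B, Q) = 0.0
    { fact: "A", embedding: [1, 0] }, // cos(A, Q) = 1.0
  ],
  query,
);
```

Here `ranksBefore(ranked, "A", "B")` holds whether or not the leaked row lands ahead of A, which is the property a `result[0]` position check cannot guarantee.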

Master advanced to v0.41.21.0 (5 daily-driver ops pains wave, PR #1545). Resolved 4 conflicts:

- VERSION + package.json: 0.41.22.0 stays (higher than master's 0.41.21.0)
- CHANGELOG: my v0.41.22.0 entry on top, master's v0.41.21.0 entry below in proper version-descending order
- src/core/migrate.ts: MIGRATION VERSION COLLISION. Master's v0.41.21 claimed v104 for `pages_atom_source_hash_idx`. Bumped my slug_aliases migration to v105 (the canonical "claim next available slot" pattern).

Updated all slug_aliases-related v104 doc/comment references:

- src/core/postgres-engine.ts: "Pre-v104 brain" → "Pre-v105 brain"
- src/core/onboard/checks.ts: "pre-v104 brains" → "pre-v105 brains"
- src/core/search/hybrid.ts: "pre-v104 brains" → "pre-v105 brains"
- docs/architecture/pack-upgrade-mechanism.md: migrate.ts:104 → :105
- CHANGELOG.md (my entry): v104 → v105 with rebump rationale
- CHANGELOG.md "To take advantage" block: migration v105

Master's v104 (atom source-hash index, `pages_atom_source_hash_idx`) preserved verbatim. Both migrations now coexist correctly.

Typecheck clean. bun run verify: 28/28 checks pass. 33/33 slug_aliases-touching tests pass (page-to-alias, resolve-slug-with-alias, unify-types-handler, onboard checks, search boost, E2E full flow).

* upstream/master:
  v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572)
  v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571)
  v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566)
  v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543)
  v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541)
  v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562)
  v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542)
  v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545)
  v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544)
  feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537)
  v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521)
  v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519)
  v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510)
  v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)

garrytan and others added 9 commits May 26, 2026 23:59

Merge remote-tracking branch 'origin/master' into garrytan/ship

e1963e1

# Conflicts: # CHANGELOG.md # TODOS.md

garrytan merged commit 543f9a7 into master May 27, 2026
21 checks passed
