
v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave #1545

Merged: garrytan merged 9 commits into master from garrytan/ship on May 27, 2026

@garrytan

Owner

Summary

Five daily-driver ops pains fixed in one bisectable wave:

Performance + Visibility

  • extract_atoms startup on conversation-transcript-heavy brains: 5-10 min of silent overhead → <1s. Replaces the per-hash SQL loop (7K roundtrips on a brain with that many transcripts) with one batch WHERE frontmatter->>'source_hash' = ANY($2::text[]) query backed by migration v104's partial expression index.
  • extract_atoms + synthesize_concepts were silent for 10+ min mid-phase. Now: tick every ~1s with running counts ([cycle.extract_atoms] N atoms / M skipped).
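The batch lookup in the first bullet could look roughly like the sketch below. Only the `frontmatter->>'source_hash' = ANY($2::text[])` predicate comes from the PR; the `QueryFn` type, the `pages` table name, the `source_id` first parameter, and the function name `existingHashes` are assumptions made for illustration, and the fail-open branch is one interpretation of the posture the commit message mentions.

```typescript
// Hypothetical query function: runs SQL with positional params and returns rows.
type QueryFn = (
  sql: string,
  params: unknown[],
) => Promise<Array<{ source_hash: string }>>;

// One roundtrip for all hashes instead of one roundtrip per hash.
// Assumption: $1 is a source identifier; the PR only shows the $2 predicate.
async function existingHashes(
  query: QueryFn,
  sourceId: string,
  hashes: string[],
): Promise<Set<string>> {
  if (hashes.length === 0) return new Set<string>(); // nothing to look up
  try {
    const rows = await query(
      `SELECT frontmatter->>'source_hash' AS source_hash
         FROM pages
        WHERE source_id = $1
          AND frontmatter->>'source_hash' = ANY($2::text[])`,
      [sourceId, hashes],
    );
    return new Set<string>(rows.map((r) => r.source_hash));
  } catch {
    // Fail-open (assumed meaning): on a query error, report nothing as already
    // extracted so the phase re-processes instead of silently skipping.
    return new Set<string>();
  }
}
```

The caller then filters its transcript list against the returned set in memory, so the number of SQL roundtrips no longer grows with the number of transcripts.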

Reliability

  • Cycle lock TTL dropped from 30 to 5 min, combined with active in-phase refresh via the new buildYieldDuringPhase(lock, outer) closure, which fires lock.refresh() plus the existing yieldBetweenPhases hook on every 30s maybeYield and immediately after every await chat() LLM call. Crash recovery is 6× faster. Codex caught the pre-existing bug class during plan review: yieldBetweenPhases was just setImmediate() from jobs.ts/autopilot.ts, and lock.refresh() was never called from inside runCycle, so dropping the TTL alone would have made lock-stealing worse.
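A minimal sketch of how the refresh closure and the 30s throttle might fit together. The names `buildYieldDuringPhase`, `maybeYield`, `LockHandle`, and `lock.refresh()` appear in the PR, but the signatures and bodies below are assumptions rather than the shipped code; `makeMaybeYield` and the injectable `now` clock are hypothetical.

```typescript
// Assumed minimal shape; the PR exports a LockHandle interface but does not show it.
interface LockHandle {
  refresh(): Promise<void>;
}

type YieldHook = () => Promise<void>;

// Combine the lock refresh with the pre-existing outer hook, so every yield
// also extends the lock's TTL.
function buildYieldDuringPhase(lock: LockHandle, outer?: YieldHook): YieldHook {
  return async () => {
    await lock.refresh();
    if (outer) await outer();
  };
}

// Throttle: run the hook at most once per intervalMs.
// `now` is injectable so the throttle can be exercised without real timers.
function makeMaybeYield(
  hook: YieldHook,
  intervalMs = 30000,
  now: () => number = Date.now,
): YieldHook {
  let last = now();
  return async () => {
    const t = now();
    if (t - last < intervalMs) return; // too soon since the last fire
    last = t;
    await hook();
  };
}
```

Under these assumptions a phase would call `await maybeYield()` once per loop iteration and again right after each `await chat()`, so a single long LLM call cannot leave the lock unrefreshed for longer than one call plus the throttle interval.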

Resumability

  • gbrain extract links --by-mention now resumes from where it died via op_checkpoints. Codex caught 4 correctness bugs in the original design: lost-links-on-crash (fixed by flushing links BEFORE committing pending page keys to the checkpoint), contradictory dry-run resume (dry-run now skips both load and persist), gazetteer-not-in-fingerprint (entity-page changes now invalidate the checkpoint), and filtered-page restart pain (skip-decisions get checkpointed too). All four are closed via flushAndCheckpoint ordering + mentionsFingerprint({source, type, since, gazetteerHash}).
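To illustrate why putting the gazetteer hash in the fingerprint invalidates stale checkpoints, here is a small sketch. The four field names come from the PR; the field types, the JSON serialization, and the `canResume` helper are assumptions, and the real `mentionsFingerprint` may well hash its input instead.

```typescript
// Field names come from the PR; types and serialization are assumptions.
interface MentionsFingerprintInput {
  source: string;
  type?: string;
  since?: string;
  gazetteerHash: string;
}

// Canonical string with a fixed key order: the same inputs always yield the
// same fingerprint, and any change (including a new gazetteer hash) yields a
// different one.
function mentionsFingerprint(input: MentionsFingerprintInput): string {
  const canonical = {
    source: input.source,
    type: input.type ?? null,
    since: input.since ?? null,
    gazetteerHash: input.gazetteerHash,
  };
  return JSON.stringify(canonical);
}

// Hypothetical helper: a checkpoint is reusable only when its stored
// fingerprint matches the current run's fingerprint.
function canResume(
  storedFingerprint: string | null,
  current: MentionsFingerprintInput,
): boolean {
  return storedFingerprint === mentionsFingerprint(current);
}
```

With this shape, adding an entity page between runs changes `gazetteerHash`, `canResume` returns false, and the sweep starts fresh against the new entity set rather than skipping pages that were scanned against the old one.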

Operator UX

  • gbrain doctor now surfaces a paste-ready gbrain sync --all --parallel 4 --workers 4 --skip-failed recommendation for multi-source brains via new sync_consolidation check. Cron-scheduler skill grows a "Multi-source brains" recipe.

Discovered during ship, not caused by the wave: two pre-existing test-isolation flakes fixed too — test/cycle-last-full-cycle-at.test.ts and test/schema-cli.test.ts now isolate GBRAIN_HOME per test so sibling Conductor worktrees' parallel test runs don't poison shared state.

Test Coverage

| Scope | Tests added | Status |
|---|---|---|
| extract_atoms batch | test/cycle/extract-atoms-batch.test.ts (5 cases) | |
| Progress wiring | test/cycle/extract-atoms-progress.test.ts (4), synthesize-concepts-progress.test.ts (3) | |
| Lock TTL + yield | cycle-lock-ttl.test.ts (1), yield-during-phase-refresh.test.ts (7), yield-during-phase-throttle.test.ts (3) | |
| by-mention checkpoint | extract-by-mention-resume.test.ts (5), op-checkpoint-mentions-fingerprint.test.ts (7) | |
| Doctor consolidation | doctor-sync-consolidation.test.ts (6) | |
| Test-isolation fixes | cycle-last-full-cycle-at.test.ts (5, all now under withEnv), schema-cli.test.ts (12, default-isolated home) | |

Full suite: 11454 pass / 0 fail / 0 skip across 4 parallel shards + 47 serial files (elapsed 1371s).

Pre-Landing Review

Plan went through /plan-eng-review (6 architecture decisions absorbed) and /codex consult (14 findings: 4 load-bearing absorbed via plan rework, 5 mechanical fixes auto-applied, 1 deferred to TODO). Codex killed the original Issue 3 design by showing that its yieldBetweenPhases claim was wrong, so the plan was rebuilt around the buildYieldDuringPhase closure.

No critical findings remain.

Plan Completion

All 7 implementation tasks (T1–T7) shipped + 2 pre-existing test-isolation flakes fixed + 2 follow-up TODOs filed (gbrain sync print-cron subcommand and lock-loss detection in DbLockHandle.refresh()).

Documentation

CLAUDE.md key-files block updated with the consolidated v0.41.21.0 entry covering all six shipped fixes plus the 44-case test inventory and 2 follow-up TODO references. llms-full.txt regenerated to match per the mandatory invariant.

Verification

# Issue 1: extract_atoms transcript idempotency fast
time gbrain dream --phase extract_atoms --dry-run --json

# Issue 2: progress visible
gbrain dream --phase extract_atoms 2>&1 | head -50

# Issue 3: lock TTL is 5min, crashes auto-recover
gbrain dream --phase orphans &
PID1=$!; sleep 1; kill -9 $PID1
sleep 310 && gbrain dream --phase orphans  # should succeed: the 5 min TTL has expired

# Issue 4: by-mention resumes
gbrain extract --by-mention 2>&1 | tee /tmp/run1.log
# Ctrl-C at ~10%
gbrain extract --by-mention 2>&1 | tee /tmp/run2.log
# Expect "resuming: 32K/322K already scanned"

# Issue 5: doctor nudge
gbrain doctor --json | jq '.checks[] | select(.name=="sync_consolidation")'

Test plan

  • Full unit suite: 11454 pass / 0 fail
  • Both pre-existing test flakes investigated + fixed
  • Plan-eng-review CLEARED (6 decisions, 0 unresolved)
  • Codex consult ran (14 findings, 4 load-bearing absorbed)
  • Documentation synced (CLAUDE.md + llms-full.txt)

🤖 Generated with Claude Code

garrytan and others added 9 commits May 26, 2026 23:59
Replaces the per-hash transcript loop (7K SQL roundtrips on big brains)
with one batch query using `frontmatter->>'source_hash' = ANY($2::text[])`.
Migration v104 adds the partial expression index that keeps the new query
O(log n) at scale (mirrors v97 pattern: CONCURRENTLY + invalid-remnant
pre-drop on Postgres, plain CREATE INDEX on PGLite).

Helper exported so test/cycle/extract-atoms-batch.test.ts can drive it
directly without orchestrating the full phase. Fail-open posture
preserved from the prior per-hash helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ring

Issue 3 + Issue 2 of the v0.41.20.0 ops-fix-wave.

Codex caught during plan review that yieldBetweenPhases (the existing
external hook) does NOT refresh the cycle DB lock — it's just a
setImmediate() from jobs.ts:1405 / autopilot.ts:632, and lock.refresh()
was never called from inside runCycle. Combined with the 30min TTL,
crashed cycles wedged the lock for the full window before another
worker could take over.

Three coordinated changes:

1. LOCK_TTL_MINUTES 30 → 5 (src/core/cycle.ts). Crash recovers in
   ≤5 min instead of ≤30 min.

2. buildYieldDuringPhase(lock, outer) — exported closure that calls
   lock.refresh() AND the existing yieldBetweenPhases hook on every
   fire. Passed to both long phases (extract_atoms,
   synthesize_concepts) as their yieldDuringPhase opt.

3. maybeYield helper inside both phases — 30s throttle, fires inside
   the main work loop AND immediately after every `await chat()` LLM
   call (codex hardening: a single long LLM await could otherwise sit
   past TTL).

Progress reporter wired through to both phases too (Issue 2):
extract_atoms emits `[cycle.extract_atoms] N atoms / M skipped` ticks
every ~1s; synthesize_concepts ticks per concept group. Cycle.ts owns
start()/finish(); phases only call tick() and heartbeat() on the same
reporter (NOT a child — that would produce path collision
`cycle.extract_atoms.extract_atoms.work`).

LockHandle interface exported for tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue 4 of the v0.41.20.0 ops-fix-wave. On a 322K-page brain the sweep
takes 10+ hours; if it died at 87% the user redid 87% on restart.

Wires the existing `op_checkpoints` framework into extractMentionsFromDb
with a flushAndCheckpoint ordering that closes the four codex-flagged
correctness bugs at once:

  1. Lost-links-on-crash — flush batch links to DB FIRST, commit page
     keys to checkpoint SECOND, persist THIRD. A crash between
     batch.push() and flushBatch() leaves the page un-checkpointed so
     resume re-scans it (no silently lost mention links).
  2. Dry-run resume contradiction — dry-run does NOT load or persist
     the checkpoint. Verification path uses non-dry-run kill-and-resume.
  3. Gazetteer hash in fingerprint — entity pages added mid-pause shift
     the gazetteer hash → new fingerprint → fresh scan against the new
     gazetteer. Without this, resumed runs would silently skip pages
     against a new entity set.
  4. Filtered pages get checkpointed too — pages skipped by `--type` /
     `--since` / empty body / no-mentions all get marked completed so
     resume doesn't re-fetch them.
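The three-step ordering from item 1 can be sketched as follows. `flushAndCheckpoint` is the name used in this commit message, but the `Link`, `Checkpoint`, and `Deps` shapes and the function signature are hypothetical stand-ins, not the actual gbrain types.

```typescript
// Hypothetical shapes for illustration only.
interface Link { from: string; to: string }
interface Checkpoint { completed: Set<string> }

interface Deps {
  flushLinks(links: Link[]): Promise<void>;          // write batched links to the DB
  persistCheckpoint(cp: Checkpoint): Promise<void>;  // durably save the checkpoint
}

async function flushAndCheckpoint(
  batch: Link[],
  pendingKeys: string[],
  cp: Checkpoint,
  deps: Deps,
): Promise<void> {
  // 1. Links reach the DB first. If this throws, no page key has been marked
  //    done, so a resumed run re-scans those pages.
  await deps.flushLinks(batch);
  batch.length = 0;
  // 2. Only now record the pages (including filtered or skipped ones) as completed.
  for (const key of pendingKeys) cp.completed.add(key);
  pendingKeys.length = 0;
  // 3. Persist the updated checkpoint.
  await deps.persistCheckpoint(cp);
}
```

The design choice is that a crash at any point can only cause extra re-scanning, never a page that is marked completed while its links are missing.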

Persist cadence: every 1000 items OR every 30s, whichever first
(~322 persists / ~24s total overhead on the 322K-page brain). Crash
window capped at 1000 pages (about 0.3% of the sweep re-scanned on resume).
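A sketch of the "every 1000 items OR every 30s" cadence, with the arithmetic behind the persist count. `makePersistGate` and its parameters are hypothetical names; only the two thresholds and the 322K-page figure come from the commit message.

```typescript
// Returns a function to call once per processed item; it answers true when
// either threshold is reached, then resets both counters.
function makePersistGate(
  maxItems = 1000,
  maxMs = 30000,
  now: () => number = Date.now, // injectable clock for testing
): () => boolean {
  let items = 0;
  let last = now();
  return function shouldPersist(): boolean {
    items += 1;
    const t = now();
    if (items >= maxItems || t - last >= maxMs) {
      items = 0;
      last = t;
      return true;
    }
    return false;
  };
}

// Worked numbers: 322K pages at one persist per 1000 items.
const persistCount = Math.ceil(322000 / 1000); // 322 persists
```

At the quoted ~24s of total overhead, 322 persists works out to roughly 75 ms each.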

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue 5 of the v0.41.20.0 ops-fix-wave. Multi-source brains see a
paste-ready `gbrain sync --all --parallel 4 --workers 4 --skip-failed`
in `gbrain doctor` output instead of maintaining two staggered
per-source cron entries with manual deconfliction.

New checkSyncConsolidation surfaces the recommendation when 2+ active
sources exist; "not applicable" for single-source brains. Own
try/catch returns warn on SQL failure — outer doctor catch wasn't a
safe assumption.
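A hedged sketch of the check's control flow. The check name, the 2+ sources rule, and the warn-on-SQL-failure behavior come from this commit message; the `CheckResult` shape, the status strings, the message wording, and the `countActiveSources` accessor are assumptions.

```typescript
// Assumed result shape; the real doctor check type is not shown in the PR.
interface CheckResult {
  name: string;
  status: "ok" | "warn";
  message: string;
}

async function checkSyncConsolidation(
  countActiveSources: () => Promise<number>, // hypothetical DB accessor
): Promise<CheckResult> {
  const name = "sync_consolidation";
  try {
    const n = await countActiveSources();
    if (n < 2) {
      return { name, status: "ok", message: "not applicable (single-source brain)" };
    }
    return {
      name,
      status: "ok",
      message: `${n} active sources: consider gbrain sync --all --parallel 4 --workers 4 --skip-failed`,
    };
  } catch (err) {
    // Own try/catch: the check degrades to a warning instead of relying on
    // an outer doctor-level handler.
    return { name, status: "warn", message: `could not count sources: ${String(err)}` };
  }
}
```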

`skills/cron-scheduler/SKILL.md` gains a "Multi-source brains"
recipe block documenting the pattern + connection-budget math
(parallel × workers × 2 ≈ 32 connections at default 4/4).
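The connection-budget arithmetic above, written out as a tiny function. The function and parameter names are hypothetical, and the meaning of the factor of 2 is not stated here, so it is kept as an opaque multiplier.

```typescript
// parallel × workers × 2, per the formula quoted in the recipe.
function connectionBudget(parallel: number, workers: number, factor = 2): number {
  return parallel * workers * factor;
}

const defaultBudget = connectionBudget(4, 4); // 4 × 4 × 2 = 32
```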

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two pre-existing tests assumed a clean ~/.gbrain/config.json and a free
~/.gbrain/cycle.lock — both shared across all gbrain processes on the
machine. Sibling Conductor worktrees running their own gbrain tests
poisoned the shared state, causing flakes:

  - test/cycle-last-full-cycle-at.test.ts test 5 timed out at 5s
    because runCycle returned 'skipped' (file lock held by a parallel
    test process), and last_full_cycle_at exit hook silently no-oped.
    Fix: each test wraps its body in `withEnv({GBRAIN_HOME: tmpdir})`
    so the file lock path becomes per-test.

  - test/schema-cli.test.ts `schema active reports default resolution`
    failed exit 1 because another worktree had set
    `schema_pack: gbrain-base-v2` in the shared config (a pack that
    doesn't exist in the bundle). Fix: gbrain() helper defaults
    GBRAIN_HOME to a per-file tempdir (beforeAll-owned), so subprocess
    invocations get an isolated config dir unless tests explicitly
    override.
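The per-test isolation pattern can be sketched as below. This is not the repository's `withEnv`; it is a hypothetical `withEnvOn` variant that takes the environment object explicitly so the sketch needs no Node-specific imports. In a real test the target would presumably be `process.env`.

```typescript
type Env = Record<string, string | undefined>;

// Apply overrides, run fn, then restore the previous values even if fn throws.
async function withEnvOn<T>(
  env: Env,
  overrides: Env,
  fn: () => Promise<T> | T,
): Promise<T> {
  const saved: Env = {};
  for (const key of Object.keys(overrides)) {
    saved[key] = env[key];
    const value = overrides[key];
    if (value === undefined) delete env[key];
    else env[key] = value;
  }
  try {
    return await fn();
  } finally {
    // Restoring in finally keeps one test from leaking state into the next.
    for (const key of Object.keys(saved)) {
      const prev = saved[key];
      if (prev === undefined) delete env[key];
      else env[key] = prev;
    }
  }
}
```

Pointing `GBRAIN_HOME` at a fresh temp directory inside such a wrapper makes both the config file and the lock file path private to the test, which is the property the two fixes rely on.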

Both fixes confirmed via deliberate pollution + retest: 12/12
schema-cli tests pass under simulated `schema_pack: gbrain-base-v2`
contamination; cycle-LFCA test 5 completes <2s with isolated home.

Discovered during v0.41.20.0 ship while investigating parallel-worktree
flake. Not caused by the ops-fix-wave but found via it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five daily-driver ops pains fixed in one wave:
  1. extract_atoms 7K-roundtrip overhead → 1 batch query + index
  2. silent long-running phases → progress ticks every ~1s
  3. 30-min crashed-cycle lock TTL → 5 min + active in-phase refresh
  4. by-mention restarts from page 0 → resumes via op_checkpoints
  5. multi-source cron → doctor surfaces `sync --all --parallel` nudge

Two follow-up TODOs filed under v0.41.19.0 ops-fix-wave block (will
be renamed at follow-up time):
  - `gbrain sync print-cron` subcommand (P2 ergonomics)
  - Lock-loss detection in DbLockHandle.refresh() (P2 contract change)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Folds the v0.41.20.0 wave annotations into the cycle/extract/op-checkpoint
key-files block: batch idempotency via atomsExistingForHashes, shorter
cycle lock TTL with buildYieldDuringPhase active refresh, progress wiring
through extract_atoms + synthesize_concepts, by-mention resume via
mentionsFingerprint with flushAndCheckpoint ordering, sync_consolidation
doctor check, and the 44-case test suite pinning every contract.

Regenerated llms-full.txt to match (CLAUDE.md edit invariant).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two CI-only failures caught on PR #1545 (v0.41.21.0 ops-fix-wave):

1. doctor-categories drift guard — new `sync_consolidation` check from
   T6 wasn't categorized in src/core/doctor-categories.ts. Added under
   OPS_CHECK_NAMES (it surfaces an operator-cron recommendation, not a
   brain-data quality signal).

2. facts-engine `embedding cosine ordering when both sides have
   embeddings` — passed locally, failed under CI's parallel shard.
   Bun's truncated assertion output didn't surface which expect()
   fired; hardened the test against unknown leak vectors by:
     - per-run unique entity_slug (`embed-test-<random8>`) instead of
       the static `embed-test`, so any future cross-test pollution is
       structurally impossible
     - `findIndex` + `aIdx < bIdx` assertion that pins the cosine
       RELATIONSHIP (A closer than B because cos(A,Q)=1.0 vs cos(B,Q)=0.0)
       instead of the brittle `result[0].fact === 'A'` position check.
   The new shape matches the test name's contract verbatim ("ordering
   when both sides have embeddings"), so any unrelated row in the
   result set can no longer flip the test.
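A self-contained illustration of why the relationship assertion is sturdier than the position check. The rows, the vectors, and the `cosine` helper are invented for the example; only the `findIndex` + `aIdx < bIdx` idea and the cos(A,Q)=1.0 vs cos(B,Q)=0.0 setup come from this commit message.

```typescript
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

interface Row { fact: string; embedding: number[] }

const queryVec = [1, 0];
const rows: Row[] = [
  { fact: "unrelated", embedding: [2, 0] }, // stray row from another test, cos = 1.0
  { fact: "B", embedding: [0, 1] },         // cos = 0.0
  { fact: "A", embedding: [1, 0] },         // cos = 1.0
];

// Sort by descending similarity to the query (Array.prototype.sort is stable).
const result = [...rows].sort(
  (x, y) => cosine(y.embedding, queryVec) - cosine(x.embedding, queryVec),
);

const aIdx = result.findIndex((r) => r.fact === "A");
const bIdx = result.findIndex((r) => r.fact === "B");

// Relationship check: A ranks ahead of B regardless of what else is present.
const orderingHolds = aIdx !== -1 && bIdx !== -1 && aIdx < bIdx;

// Position check: breaks here because the stray row ties with A and sorts first.
const positionHolds = result[0].fact === "A";
```

In this setup `orderingHolds` is true while `positionHolds` is false, which is the failure mode the hardened test is meant to avoid.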

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit 543f9a7 into master May 27, 2026
21 checks passed
garrytan added a commit that referenced this pull request May 27, 2026
Master advanced to v0.41.21.0 (5 daily-driver ops pains wave, PR #1545).
Resolved 4 conflicts:

- VERSION + package.json: 0.41.22.0 stays (higher than master's 0.41.21.0)
- CHANGELOG: my v0.41.22.0 entry on top, master's v0.41.21.0 entry below
  in proper version-descending order
- src/core/migrate.ts: MIGRATION VERSION COLLISION — master's v0.41.21
  claimed v104 for `pages_atom_source_hash_idx`. Bumped my slug_aliases
  migration to v105 (the canonical "claim next available slot" pattern).
  Updated all slug_aliases-related v104 doc/comment references:
    - src/core/postgres-engine.ts: "Pre-v104 brain" → "Pre-v105 brain"
    - src/core/onboard/checks.ts: "pre-v104 brains" → "pre-v105 brains"
    - src/core/search/hybrid.ts: "pre-v104 brains" → "pre-v105 brains"
    - docs/architecture/pack-upgrade-mechanism.md: migrate.ts:104 → :105
    - CHANGELOG.md (my entry): v104 → v105 with rebump rationale
    - CHANGELOG.md "To take advantage" block: migration v105

Master's v104 (atom source-hash index, `pages_atom_source_hash_idx`)
preserved verbatim. Both migrations now coexist correctly.

Typecheck clean. bun run verify: 28/28 checks pass. 33/33
slug_aliases-touching tests pass (page-to-alias, resolve-slug-with-alias,
unify-types-handler, onboard checks, search boost, E2E full flow).
mgunnin added a commit to mgunnin/gbrain that referenced this pull request May 28, 2026
* upstream/master:
  v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572)
  v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571)
  v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566)
  v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543)
  v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541)
  v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562)
  v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542)
  v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545)
  v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544)
  feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537)
  v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521)
  v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519)
  v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510)
  v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)