Skip to content

fix: pass sourceId in cycle sync phase to prevent full reimport#475

Merged
garrytan merged 4 commits intomasterfrom
fix/sync-cycle-source-id
Apr 27, 2026
Merged

fix: pass sourceId in cycle sync phase to prevent full reimport#475
garrytan merged 4 commits intomasterfrom
fix/sync-cycle-source-id

Conversation

@garrytan
Copy link
Copy Markdown
Owner

Problem

cycle.ts:runPhaseSync() calls performSync(engine, { repoPath: brainDir, ... }) without passing sourceId. This means the sync phase always reads the global config.sync.last_commit key instead of the per-source sources.last_commit anchor.

The global anchor can drift out of git history when:

  • A force push rewrites history and the old commit gets garbage-collected
  • A rebase squashes commits that included the anchor
  • The global key was written by an older gbrain version and never updated

When git cat-file -t <anchor> fails, sync concludes there was a force push and triggers a full reimport of all files — on a production brain with 78K+ files this takes 30+ minutes and blocks the entire autopilot cycle.

Meanwhile, the per-source anchor in sources.last_commit is updated on every successful sync and stays aligned with repo history. It was 668 commits behind HEAD but still valid. The system had the right answer in the right place — the cycle just wasn't reading it.

Timeline

  1. Global config.sync.last_commit pointed to commit 472f34d5
  2. That commit was garbage-collected from the repo (likely after a squash/rebase chain)
  3. Every autopilot cycle attempted a full reimport of 78,797 files
  4. The cycle appeared to "hang" — it wasn't stuck, just scanning every file from scratch
  5. sources.default.last_commit had a valid ancestor (00a62e50) the whole time

Error Log

The DB state showed the divergence:

What We Tried

  1. Diagnosed the hang — ran sync with --verbose, saw it scanning 78K files (not stuck, just slow)
  2. Checked both anchors — found the dual-storage divergence between config and sources tables
  3. Verified the per-source anchorgit cat-file -t succeeded, merge-base --is-ancestor confirmed it
  4. Manual fix — updated both anchors to HEAD, sync returned up_to_date in <1s
  5. Deployed operational guard — cron script that copies per-source anchor to global every 5 min (band-aid)
  6. This PR — the proper fix: pass sourceId so cycle reads the right anchor

Solution

1 file changed: src/core/cycle.ts

Added resolveSourceForDir() — queries the sources table for a row whose local_path matches the brain directory. Returns the source id when found, undefined otherwise.

runPhaseSync() now calls this resolver and passes the result as sourceId to performSync(). When a source exists, sync reads sources.last_commit (per-source, always updated). When no source matches (pre-v0.18 brains), it falls through to the global config.sync.last_commit key — backward compatible.

Behavior matrix

Sources table has matching row sourceId passed Anchor read from Backward compatible
✅ Yes (default) "default" sources.last_commit Yes
❌ No (pre-v0.18 brain) undefined config.sync.last_commit Yes
❌ sources table missing undefined (catch) config.sync.last_commit Yes

Results

Before fix:

  • Every autopilot cycle triggered full import (78,797 files, 30+ min)
  • Sync never completed within the job timeout

After fix:

  • Sync reads per-source anchor → incremental diff → completes in <1s
  • Full reimport only triggers on genuine first sync or --full flag

Testing

  • Verified bun build src/core/cycle.ts --no-bundle compiles clean
  • Manually tested the resolver query against production Supabase
  • Confirmed resolveSourceForDir returns "default" for the production brain path
  • Confirmed graceful fallback when source doesn't exist (returns undefined)

cycle.ts calls performSync without sourceId, so it always reads
the global config.sync.last_commit key instead of the per-source
sources.last_commit. When the global anchor gets garbage-collected
(after a force push or rebase), sync falls back to a full reimport
of all files — on a large brain this takes 30+ minutes and blocks
the autopilot cycle.

The fix resolves the source id from the brain directory by querying
the sources table. When a matching source exists, sync reads the
per-source anchor which is updated on every successful sync and
stays in sync with the repo history. Falls back gracefully to the
global config path for pre-v0.18 brains without a sources table.
ramybarsoum added a commit to ramybarsoum/RBrain that referenced this pull request Apr 27, 2026
Adds 6 regression tests in test/core/cycle.test.ts pinning the new
resolveSourceForDir() helper added to src/core/cycle.ts in this PR:

1. Seeded sources row → performSync receives matching sourceId
2. No matching row → sourceId=undefined (falls through to global key)
3. Different brainDir than registered source → undefined (no cross-match)
4. sources table missing (very old brain) → catch returns undefined,
   sync still runs. Uses a fresh PGLiteEngine because initSchema() only
   re-runs PENDING migrations; DROP TABLE on the shared engine would
   leave it permanently degraded for every later test in the file.
   (Codex review caught this landmine.)
5. Multiple rows with same local_path → resolver returns one matching
   id (non-deterministic; SQL has no ORDER BY). Documents the contract
   for the v0.23 UNIQUE-constraint follow-up.
6. Empty-string id row → resolver propagates "" (defensive case Codex
   flagged: schema PK prevents NULL but '' can be inserted).

Extends the performSync mock at line 51-65 to also capture sourceId.

Bumps:
- VERSION: 0.22.4 → 0.22.5
- package.json: 0.22.4 → 0.22.5
- CHANGELOG.md: new [0.22.5] entry following v0.22.4 voice (release
  summary + numbers table + behavior matrix + To-take-advantage block
  + itemized changes + for-contributors)
- CLAUDE.md: annotates src/core/cycle.ts entry with v0.22.5 (#475) note
- llms-full.txt: regenerated via bun run build:llms

Test results:
- Unit: 28 pass / 0 fail in test/core/cycle.test.ts (22 prior + 6 new)
- Full unit suite: pass (exit 0)
- E2E: 236 pass / 0 fail across 26 files

Plan + codex outside-voice review at:
~/.claude/plans/whimsical-bubbling-goose.md

Follow-up TODOs filed for v0.23:
- Normalize brainDir + sources.local_path before SQL compare
- Add UNIQUE index on sources.local_path
- Narrow resolveSourceForDir's catch to PG 42P01 (undefined_table)
- Add doctor check for config.sync.last_commit / sources divergence

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan
Copy link
Copy Markdown
Owner Author

Adding tests + v0.22.5 version bump on top of this PR (commit 0f93cb2).

What I added:

  • 6 regression tests in test/core/cycle.test.ts covering every branch of resolveSourceForDir(). Test 4 uses a fresh PGLiteEngine rather than the shared one, because initSchema() only re-runs PENDING migrations — DROP TABLE on the shared engine would have left it permanently degraded for every later test in the file. Codex review caught this landmine.
  • Test 6 covers the empty-string id defensive case (PK prevents NULL, but '' can be inserted; codex flagged that the resolver's rows[0]?.id would silently treat falsy id as "no source" and fall back to the global key — the bug we're fixing one layer up).
  • VERSION + package.json + bun.lock → 0.22.5.
  • CHANGELOG.md [0.22.5] entry following the v0.22.4 voice (release summary + numbers table + behavior matrix + To-take-advantage block + itemized changes + for-contributors).
  • CLAUDE.md annotation on the src/core/cycle.ts Key Files entry.
  • llms-full.txt regenerated.

Test results:

  • bun test test/core/cycle.test.ts: 28 pass / 0 fail (22 prior + 6 new, ~1.4s)
  • Full bun test: exit 0
  • bun run test:e2e against pgvector pg16 docker: 236 pass / 0 fail across 26 files

Plan + outside-voice review: ~/.claude/plans/whimsical-bubbling-goose.md (private). Codex consult mode at model_reasoning_effort="high" reviewed the 26-line diff + the test plan. Caught test 4's shared-engine landmine; flagged the empty-string-id defensive case; pushed harder on the bare catch (filed as TODO 3 below). Final codex verdict: "Ship the fix ASAP; fix the test plan before pretending it's solid." Test plan revised accordingly.

Follow-up TODOs filed for v0.23:

  1. Normalize brainDir and sources.local_path before the SQL compare (trailing slash, symlink resolution → silent fallback to global key today).
  2. Add UNIQUE index on sources.local_path so duplicate-path resolution is deterministic.
  3. Narrow resolveSourceForDir's bare catch to PG 42P01 (undefined_table); rethrow everything else so real DB problems aren't masked into the global-fallback path.
  4. gbrain doctor check that warns when config.sync.last_commit and sources.default.last_commit diverge — the signal that was missed in production.

🤖 Generated with Claude Code

garrytan and others added 2 commits April 27, 2026 14:59
CI typecheck failed because `toContain()` on `string[]` rejects the
`string | undefined` produced by `syncCalls.at(-1)?.sourceId`'s optional
chain. Tests 1, 4, and 6 use `toBe()` which accepts `string | undefined`
through its overload, but `toContain()` is stricter.

Fix: pull the value into a typed variable, assert it's defined, then
check membership. Makes the contract explicit ("resolver returned a
defined sourceId, and it was one of the matching ids") instead of
relying on a silent undefined → no-match-in-array assertion.

Locally:
- bun run typecheck: clean
- bun test test/core/cycle.test.ts: 28 pass / 0 fail (75 expect calls)
- All CI gate scripts: OK (jsonb, progress-to-stdout, wasm-embedded)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #475's Tier 1 (Mechanical) CI job hit a 5000.09ms beforeAll hook
timeout in `E2E: Tags > (unnamed)`. Cause: scripts/run-e2e.sh invokes
`bun test "$f"` without a --timeout flag, falling back to bun's 5s
default. setupDB() does TRUNCATE CASCADE on ~30 tables, and on a CI
runner under load that can exceed 5s.

Match what the unit suite uses (--timeout=60000 in package.json's
"test" script). Same 1m ceiling, no behavior change for healthy runs;
just removes the artificial 5s floor on hooks.

Verified locally: bun test --timeout=60000 test/e2e/mechanical.test.ts
runs 78 pass / 0 fail in 27.99s against a fresh pgvector pg16 docker
container.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit e734937 into master Apr 27, 2026
4 checks passed
garrytan added a commit that referenced this pull request Apr 28, 2026
Merges 5 master commits since last merge: v0.22.1 autopilot fix wave (#447),
v0.22.2 minions worker reliability (#458), v0.22.4 frontmatter-guard (#448),
sourceId in cycle sync phase (#475), and post-migration schema verification (#488).

Conflict resolutions:
- VERSION: kept this branch's reserved 0.27.0 slot (master at 0.22.6).
- CHANGELOG.md: kept v0.27.0 entry at top, then master's v0.22.6 → v0.21.0 entries below in order.
- CLAUDE.md: merged the v0.27 cycle bullet (8 phases, synthesize, patterns, transcript-discovery, dream CLI flags) with master's v0.22.1/v0.22.5 cycle additions (signal: AbortSignal, willRunExtractPhase, resolveSourceForDir).
- src/core/cycle.ts: kept v0.27 yieldDuringPhase + synthInputFile/synthDate/synthFrom/synthTo CycleOpts fields AND added master's v0.22.1 signal: AbortSignal field. Both coexist.
- llms-full.txt: regenerated against the merged tree.

The dream_verdicts schema migration moved v25 → v30 in the prior merge.
Master ended at v29 (cathedral_ii_code_edges_rls); v30 is uncontested.

Tests pass post-merge: 105/105 dream + cycle tests across 9 files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request Apr 28, 2026
CODEX-1: resolve sourceId at handler entry by looking up sources.local_path.
Mirrors cycle.ts:480's autopilot-cycle fix (PR #475). Without this, every
Minion sync job on a multi-source brain reads global config.sync.last_commit
instead of the per-source anchor, which on a regularly-GC'd repo can drop
out of git history and trigger 30-min full reimports every cycle.

The handler accepts an optional sourceId job param for callers that want
to override; falls back to the resolveSourceForDir lookup when absent.

CODEX-4: replace the hardcoded concurrency=4 default with the shared
autoConcurrency policy. Behavior is now consistent between CLI sync,
the Minion handler, and the autopilot cycle's sync phase. Jobs that
request a specific concurrency via job.data.concurrency still win.

noEmbed default stays at true — embed is a separate job (submit
gbrain embed --stale, OR rely on the autopilot cycle's embed phase).
The doc comment makes that contract explicit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request Apr 30, 2026
* feat: parallel sync — bounded concurrent imports (#489)

gbrain sync --concurrency N (alias --workers N) parallelizes the import
phase using per-worker Postgres engine instances with an atomic queue
index (same proven pattern as gbrain import --workers N).

Auto-concurrency: when a sync touches >100 files and the user didn't
explicitly set --concurrency, defaults to 4 workers. Small incremental
syncs (<50 files) stay serial. Full syncs auto-detect Postgres and
default to 4 workers.

Minion sync handler defaults to concurrency=4, configurable via job
params: {"concurrency": 8}.

Delete and rename phases remain serial (order-dependent, fast).
PGLite falls back to serial automatically (single-connection engine).

Changes:
- src/commands/sync.ts: SyncOpts.concurrency, parallel import loop in
  performSync incremental path, --workers passthrough in performFullSync
- src/commands/jobs.ts: sync handler accepts concurrency param (default 4)
- CHANGELOG.md: v0.23.0 parallel sync entry

All 37 existing sync tests pass. Typecheck clean.

* feat: shared concurrency policy + db-lock primitive

src/core/sync-concurrency.ts — single source of truth for autoConcurrency()
+ parseWorkers() + shouldRunParallel() + constants. Replaces three drifted
call-site policies (performSync, performFullSync, jobs handler).

src/core/db-lock.ts — generic tryAcquireDbLock(engine, lockId, ttlMinutes)
over the existing gbrain_cycle_locks table. Parameterized lock id so
performSync (gbrain-sync) can nest cleanly under cycle.ts (gbrain-cycle)
without deadlock.

test/sync-concurrency.test.ts — 17 cases covering PGLite-forces-serial,
explicit override clamping, auto-path threshold, parseWorkers validation
(rejects 0, negatives, NaN, decimals, trailing chars).

No consumers yet; subsequent commits wire sync.ts, import.ts, and jobs.ts
to use these helpers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: harden performSync — writer lock, head-drift gate, engine.kind

CODEX-2: wrap performSync body in a gbrain-sync DB lock so two concurrent
syncs (manual + autopilot, two terminals, two Conductor workspaces) cannot
both read last_commit, both write it unconditionally, and let the last
writer win. cycle.ts continues to hold gbrain-cycle for its broader scope;
the two ids nest cleanly.

CODEX-3: capture git HEAD at sync entry, re-rev-parse after the import
phase, refuse to advance last_commit if HEAD drifted (someone ran
git checkout / git pull mid-sync). Vanished files now go into failedFiles
instead of silent-skip — same gating mechanism, no more bookmark advance
past unimported work.

A1: replace both PGLite detection sites with engine.kind === 'pglite'.
The constructor.name sniff is gone (breaks under bundling) and so is the
inconsistent config?.engine string check.

A2: connect worker engines serially into an array, run inside try/finally
so disconnect always fires — even on partial connect failure, OOM, or
mid-import abort. Prior Promise.all(...disconnect) leaked the 8 worker
connections on any panic path.

Q1: explicit --workers / opts.concurrency now bypasses the >50-file floor.
User opt-in beats the auto-path safety net.

Q3: drop the config!.database_url! non-null assertions; fall back to serial
when database_url is unset instead of crashing on TypeError.

Q4: worker-count banner moves from console.log to console.error so stdout
stays clean for --json output.

test/sync-parallel.test.ts — 7 cases over PGLite covering the bookmark
gate under concurrency request, the head-drift gate, vanished-file
failure capture, PGLite-stays-serial, and the writer-lock contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: import.ts — engine.kind discriminator, worker try/finally, parseWorkers

A1: replace the config?.engine === 'pglite' string sniff with
engine.kind === 'pglite' to match sync.ts and the v0.13.1 contract.

A2: wrap worker engine creation + the parallel loop in try/finally so
disconnects always fire — same pattern as sync.ts. Worker engines now
push onto an array as they connect (rather than Promise.all) so the
finally block can clean up partial-connect state.

Q2: route --workers parsing through the shared parseWorkers() helper.
parseInt-with-no-validation is gone — '0', '-3', 'foo', '1.5' now exit
with a clear error message instead of silently falling through.

Q3: drop the config!.database_url! non-null assertion; fall back to
serial when database_url is unset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: jobs.ts sync handler — resolve sourceId, autoConcurrency

CODEX-1: resolve sourceId at handler entry by looking up sources.local_path.
Mirrors cycle.ts:480's autopilot-cycle fix (PR #475). Without this, every
Minion sync job on a multi-source brain reads global config.sync.last_commit
instead of the per-source anchor, which on a regularly-GC'd repo can drop
out of git history and trigger 30-min full reimports every cycle.

The handler accepts an optional sourceId job param for callers that want
to override; falls back to the resolveSourceForDir lookup when absent.

CODEX-4: replace the hardcoded concurrency=4 default with the shared
autoConcurrency policy. Behavior is now consistent between CLI sync,
the Minion handler, and the autopilot cycle's sync phase. Jobs that
request a specific concurrency via job.data.concurrency still win.

noEmbed default stays at true — embed is a separate job (submit
gbrain embed --stale, OR rely on the autopilot cycle's embed phase).
The doc comment makes that contract explicit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: e2e parallel sync against real Postgres + benchmark

DATABASE_URL-gated E2E coverage that PGLite-only tests can't reach:

T2 — happy path: 60 files imported at concurrency=4, all 60 pages land
in the DB, with a pg_stat_activity probe before/after to confirm worker
engines (4 × 2 connections) actually disconnected.

P4 — benchmark: 120-file fixture, serial vs concurrency=4 timing.
Emits a single-line `SYNC_PARALLEL_BENCH 120 files | serial=Xms |
parallel(4)=Yms | speedup=Zx` so the CHANGELOG can quote a real
number instead of an unbacked '~4×' claim. Asserts parallel <=
serial * 1.5 to allow for noisy CI but fail genuine regressions.

Skips gracefully when DATABASE_URL is unset (consistent with the rest
of test/e2e/).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: v0.22.10 release notes + sync follow-up TODO

VERSION + package.json + bun.lock: 0.22.5/0.22.6 → 0.22.10. Repo had
existing drift between VERSION and package.json on master; this commit
brings them back in sync at the bumped value.

CHANGELOG.md: v0.22.10 entry replaces the unfinished v0.23.0 stub from
PR #490's original commit. Voice-rule clean (no em dashes, no AI
vocabulary), real benchmark numbers from the new E2E test
(serial=289ms parallel(4)=221ms speedup=1.31x), additive worker-pool
note (A3), 'To take advantage of v0.22.10' self-repair block per
CLAUDE.md convention.

TODOS.md: A4 follow-up filed — plumb resolved database_url through
SyncOpts so performSync / performFullSync / import.ts don't each call
loadConfig() separately. Deferred to a future patch; not on the
v0.22.10 critical path.

Patch (not minor) framing held even though new CLI surface lands here;
release-notes prose names the behavior change explicitly so users know
to read them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update CLAUDE.md + README for v0.22.10 sync hardening

CLAUDE.md:
- New "Key files" entries for src/core/sync-concurrency.ts and
  src/core/db-lock.ts (both v0.22.10).
- New "Key files" entry for src/commands/sync.ts (covers the lock,
  head-drift gate, engine.kind discriminator, vanished-file failure
  capture, parallel branch wiring).
- Updated src/commands/jobs.ts entry with v0.22.10 sourceId
  resolution + autoConcurrency policy + noEmbed contract.
- Added test/sync-concurrency.test.ts and test/sync-parallel.test.ts
  to the unit-test list with case counts.
- Added test/e2e/sync-parallel.test.ts to the E2E section with the
  SYNC_PARALLEL_BENCH grep marker for CHANGELOG quoting.
- Added "Key commands added in v0.22.10" section: gbrain sync --workers,
  gbrain import --workers (parseWorkers validation).

README.md: added --workers flag to the IMPORT section's gbrain sync
and gbrain import lines, with the >100-file auto-parallelize note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version slot to v0.22.13

VERSION 0.22.10 → 0.22.13. Master moved to 0.22.8 plus claimed slots
0.22.9-0.22.12 in sibling workspaces; 0.22.13 is the next free slot for
this PR's parallel-sync hardening work.

Updated all v0.22.10 references in CHANGELOG.md (release header +
self-repair block), TODOS.md (D-PR490-1 follow-up tag), CLAUDE.md
(Key files entries + tests + commands subsection), and the inline
v0.22.10 markers in src/core/sync-concurrency.ts, src/core/db-lock.ts,
src/commands/sync.ts, src/commands/import.ts, src/commands/jobs.ts,
test/sync-parallel.test.ts, test/e2e/sync-parallel.test.ts.

No behavioral change. CHANGELOG header rewrite, content unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: regenerate llms-full.txt for v0.22.13 doc updates

CI's build-llms generator test failed because llms-full.txt was stale
relative to the README + CLAUDE.md updates this PR added (--workers
flag in the IMPORT section, sync-concurrency.ts/db-lock.ts/sync.ts
entries in the Key files section).

Per CLAUDE.md: "Run \`bun run build:llms\` after adding a new doc."
The test test/build-llms.test.ts:67 verifies committed bundles match
generator output — now they do again.

llms.txt was already in sync (no curated config additions); only
llms-full.txt needed the regen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: root <root@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant