feat(sync): parallelize sync --all + add --status source dashboard#1314
feat(sync): parallelize sync --all + add --status source dashboard#1314garrytan-agents wants to merge 1 commit into
Conversation
Replace the sequential `for...of` loop in `sync --all` with a bounded
`Promise.allSettled` fan-out. Each per-source sync takes an independent
per-source DB lock (`gbrain-sync:<source_id>`) so independent sources
sync concurrently while the SAME source still serializes against itself.
Also adds `sync --all --status`: a read-only per-source dashboard
(last sync, staleness class, page count, embedding coverage,
unacknowledged failures). Mirrors what operators were assembling by
hand from `sources ls` + `doctor` + ad-hoc SQL.
Why this change
- Sequential `sync --all` was the floor on cron tick latency for
brains with 4+ federated sources. One slow or stalled source held
up the whole pass. After 24h staleness penalties start firing.
- The autopilot fanout path (added in v0.39.2.0) already proves
per-source dispatch is safe; the CLI path just hadn't caught up.
Concurrency budget
- PGLite \u2192 always 1 (single-connection engine)
- explicit `--parallel N` wins, clamped to source count
- auto path \u2192 min(sourceCount, --workers, DEFAULT_PARALLEL_WORKERS=4)
- Single-source brains short-circuit to serial
- Cap exists because each worker opens its own small pg pool inside
performSync; unbounded fan-out on a 30-source brain would exhaust
the pooler
Lock model
- New `SyncOpts.lockId` (internal) override. Defaults to global
`SYNC_LOCK_ID` so cycle / jobs / single-source CLI behavior is
preserved bit-for-bit.
- Parallel `sync --all` passes `gbrain-sync:<source_id>`. Same
source -> serialized via existing tryAcquireDbLock; different
sources -> no contention.
Output ordering
- Per-source stdout is buffered and flushed in source order at wave
boundaries so concurrent worker output doesn't interleave on the
terminal. `--json` emits a structured per-source summary.
Tests
- New: test/sync-all-parallel.test.ts (11 cases)
- resolveParallelism: PGLite, explicit --parallel, auto path,
--workers ceiling, single-source short-circuit, zero-source guard
- per-source lock id namespacing under SYNC_LOCK_ID
- buildSyncStatusReport: staleness class transitions (24h/72h),
coverage math, divide-by-zero guard, disabled-source flag,
count-query failure tolerance, empty input
- All existing sync tests still pass (sync.test.ts, sync-concurrency,
sync-parallel, sources-resync-recovery, sync-failures + 7 more)
|
Thanks @garrytan-agents — closing in favor of v0.40.4.0 landing this design with the structural fixes Codex's outside-voice review caught. The PR's parallel 3 P0s Codex caught during plan review (would have shipped silently):
Plus a handful of architectural refinements:
The original design instinct was right and the value is real. The shipped version is tighter than either of our starting points because Codex's outside-voice catches what the inside-voice review misses. Thanks for the work. — Garry (via Claude) |
… + sources status dashboard (productionized from PR #1314) (#1324) * v0.40.4.0 feat(sync): parallel sync --all + per-source lock invariant + sources status dashboard (productionized from PR #1314) Lands the community-authored PR #1314 with the structural fixes Codex's outside-voice review caught: the original PR's lock-id change only fired inside the --all parallel path, which would have introduced a worse race than the global-lock contention it fixed (sync --all on per-source lock racing against sync --source foo on the still-global lock). The landed version makes the per-source lock the invariant for every source-scoped sync, paired with withRefreshingLock for sources that exceed 30 minutes. What's new - gbrain sync --all parallel fan-out via continuous worker pool (D2); --parallel N flag, default min(sourceCount, --workers, 4); per-source [<source-id>] line prefix via AsyncLocalStorage (D6 + D12 + D13); stable --json envelope {schema_version:1, ...} on stdout with banners on stderr (D4 + D14); --skip-failed/--retry-failed reject under --parallel > 1 (D15 — sync-failures.jsonl is brain-global today; source-scoping filed as v0.40.4 TODO). - gbrain sources status [--json] read-only dashboard (D3 — sibling to sources list/add/remove/archive, not a sync flag, so reads + writes don't share a verb). Counts pages + chunks + embedding coverage per source. Active embedding column resolved via the registry (D16) so Voyage / multimodal brains see the right column. Archived sources excluded by caller filter. - Connection-budget stderr warning when parallel × workers × 2 > 16 with the formula in the message text (D1 + D10 — Codex P0 #3: each per-file worker opens its own PostgresEngine with poolSize=2, so the multiplication factor is 2, not 1). The load-bearing structural fix - performSync defaults to per-source lock id (gbrain-sync:<sourceId>) whenever opts.sourceId is set + wraps in withRefreshingLock. Legacy single-default-source brains keep the bare tryAcquireDbLock(SYNC_LOCK_ID) path for back-compat. - Dashboard SQL is the canonical content_chunks ch JOIN pages pg ON pg.id = ch.page_id WHERE pg.deleted_at IS NULL shape — the original PR shipped chunks ch JOIN ON page_slug, which would have crashed on PGLite parse and silently zeroed on Postgres via a swallow-catch. Errors from the dashboard SQL propagate (no silent zero-counts on real DB errors). Tests - New test/console-prefix.test.ts — 8 cases pinning ALS propagation, nested wraps, embedded-newline prefixing, back-compat fast path. - New test/sync-all-parallel.test.ts (replaces PR's stubbed tests) — 16 cases covering resolveParallelism, per-source lock format, buildSyncStatusReport SQL math + error propagation + envelope shape, connection-budget math, per-source prefix routing. - New test/e2e/sync-status-pglite.test.ts — IRON RULE regression: real PGLite seeds 2 sources × pages × chunks (mixed embedded/unembedded, 1 soft-deleted, 1 archived source). Validates SQL excludes both AND the active embedding column is the one used. This is the case that would have caught the PR's original broken SQL. Compatibility - No schema changes. No new dependencies. - Single-source / non-`--all` paths: bit-for-bit identical to v0.40.2. - PGLite users get serial behavior (single-connection engine). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: garrytan-agents <garrytan-agents@users.noreply.github.com> * v0.40.6.0 — version bump for ship (skipping 0.40.4 + 0.40.5 for in-flight work) Reserves v0.40.4 + v0.40.5 slots for parallel waves (salem's graph-signals work and any other in-flight branches) and lands this PR's parallel-sync work at v0.40.6.0. No code change beyond the version triple and the TODOS / CLAUDE.md / CHANGELOG cross-references which were updated from "v0.40.4" to "v0.41+" to match the new follow-up version. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: garrytan-agents <garrytan-agents@users.noreply.github.com>
* upstream/master: (22 commits) v0.41.4.0 wave: local providers + cross-platform stdin + gateway-routed dream judge (6 community PRs) (garrytan#1377) v0.41.3.0 fix(security/mcp): OAuth CORS lockdown + pre-register without DCR + validator surface (garrytan#1403) v0.41.2.0 feat: lens packs + epistemology unification — atoms + concepts as first-class units, calibration profile widening, gstack-learnings bridge (garrytan#1364) v0.41.1.0 feat: eval-loop wave — gbrain bench publish + gbrain eval gate close the LOOP (garrytan#1352) v0.41.0.0 feat(minions): fleet you supervise (4 field bugs + cathedral) (garrytan#1367) v0.40.10.0 feat: content sanity defense — junk-pattern throw + oversize-skip-embed (garrytan#1351) v0.40.9.0 feat(chunker): .sql indexing via tree-sitter + code-def on SQL DDL (garrytan#1173) (garrytan#1350) v0.40.8.1 docs: README rewrite + personal-brain + company-brain tutorials (garrytan#1345) v0.40.8.0 test: e2e + unit gap coverage + master flake root-cause fixes (garrytan#1313) v0.40.6.1 docs(todos): file v0.41 wave commitments + 7 verified-missing items (garrytan#1333) v0.40.7.0 Schema Cathedral v3 — agent-on-ramp + production rebuild of PR garrytan#1321 (garrytan#1327) v0.40.6.0 feat(sync): parallel sync --all + per-source lock invariant + sources status dashboard (productionized from PR garrytan#1314) (garrytan#1324) v0.40.5.0 Federated Sync v2 — parallel source sync + push triggers + per-source health (garrytan#1322) v0.40.4.0 feat(search): selective graph signals + per-stage attribution + audit-writer unification (garrytan#1300) v0.40.3.0 feat: contextual retrieval + cache invalidation gate + 4 deferred-item closures (garrytan#1323) v0.40.2.0 feat: trajectory routing for temporal + knowledge_update (gbrain think + LongMemEval) (garrytan#1296) v0.40.1.0 Track D — eval infrastructure (catch retrieval regressions, prove answer-quality wins) (garrytan#1298) v0.40.0.0 feat: agent-voice (Mars + Venus) + copy-into-host-repo skillpack paradigm (garrytan#1128) v0.39.3.0: productionize the v0.38 ingestion cathedral (smoke-test fix wave from PR garrytan#1299) (garrytan#1308) v0.39.2.0 feat(autopilot): per-source fan-out + cycle lock primitive + phase taxonomy (garrytan#1295) ...
Problem
A production brain with 4+ federated sources reported that
sync --allwas the bottleneck on every cron tick. The CLI handler walked sources sequentially via afor...ofloop atsrc/commands/sync.ts:1353:The cycle handler (autopilot fanout, added in v0.39.2.0) already dispatches sources in parallel via minion jobs. The CLI path just hadn't caught up. The result was a cascade of operational pain:
git pullonmedia-corpusheld updefault,zion-brain,straylight-brainbehind it. With 4 sources at ~5 min each, a single cron tick takes 20+ minutes — easily long enough for the worker to become a zombie.sources ls+doctoroutput + ad-hoc SQL queries. There was no canonical "source dashboard" view.Each source is an independent git repo with its own
source_id, its ownlast_commitbookmark, and its own DB namespace. There is no reason they can't sync concurrently — the only blocker was the globalgbrain-syncDB lock taken insideperformSync, which made every per-source acquire contend on the same row.Error Log
Incident 1 — worker zombie + sequential sync blocking
Worker became a zombie. Bare
jobs workhad zero health monitoring. Sequential sync meant one stalled source blocked every other source.Incident 2 — embedding backfill (257K chunks) had no progress reporting
A Voyage 4 embedding backfill across 257K chunks went through seven script iterations to get throughput right:
Root causes:
listStaleChunks→ full seq scan on every page fetch.Operators couldn't see ANY of this from
jobs list. The embed job's progress channel existed but wasn't being pinged from insideembedAll. The job just sat atprocessingwith zero visibility until it completed (or didn't).Incident 3 — embed stalled on a 100K-row query
Root cause: one 100K-row query on
listStaleChunks+ no 429 backoff + autovacuum contention. Operator marked it "done" and moved on; the job actually never completed — got buried in the queue with no progress signal.Incident 4 — manual 8-worker parallel script
A 374K-chunk ZeroEntropy backfill at ~532/min would have taken 12h with a single worker. The operator manually launched 8 worker processes each scoped to an ID segment of the chunks table:
This got throughput to ~3,000/min (6× speedup) and finished in 2h. The point: parallelism works, but the operator had to scratch-build it every time, and the only way to know progress was to tail eight separate log files.
Incident 5 — sync chain bottleneck
Four syncs queued; only one runs at a time. Each waits for the previous to release the global
gbrain-synclock. Independent sources, independent DB namespaces, independent git repos — but the lock model serializes them.What We Tried
--workers Nonsync --all. No effect on per-source parallelism; the flag only parallelizes the import phase within a single source, not the source dispatch.&. Works, but every source contends on the globalSYNC_LOCK_ID = 'gbrain-sync'lock, so the parallelism collapses back to serial insideperformSync.sync --allinvocation.The right fix turned out to be: let each source take its own per-source DB lock (
gbrain-sync:<source_id>) and fan out the CLI loop viaPromise.allSettled. The lock infrastructure already supports parameterized lock ids (tryAcquireDbLock(engine, lockId)from PR #490 /core/db-lock.ts); we just weren't using it for per-source isolation.Solution
1. Per-source lock id (new internal
SyncOpts.lockId)performSyncalready takes a DB lock around its writer window. The lock id was hardcoded to the globalSYNC_LOCK_ID = 'gbrain-sync'. Add anopts.lockIdoverride:The parallel
sync --allfan-out passesgbrain-sync:<source_id>. Same source → serialized via existingtryAcquireDbLock. Different sources → no contention.The default behavior is unchanged. Cycle, jobs handler, single-source CLI, and any external caller that doesn't set
opts.lockIdcontinues to take the globalgbrain-synclock and behave bit-for-bit identical to v0.40.2.2. Parallel fan-out in
sync --allReplace the sequential
for...ofloop with boundedPromise.allSettledwaves:Promise.allSettled(notPromise.all): one source's failure must not abort the others. Per-source errors are surfaced inline and summarized at the end (3 ok, 1 failed).3. New
--parallel NflagControls how many sources run concurrently. Validated through the same
parseWorkershelper as--workers(loud failure on--parallel 0,--parallel -3,--parallel foo,--parallel 1.5).Resolution policy (
resolveParallelism):--parallel N→ wins, clamped to source countmin(sourceCount, --workers ?? 4)The 4-worker default ceiling exists because each worker opens its own small Postgres connection pool inside
performSync. Unbounded fan-out on a 30-source brain would exhaust the pooler.4. New
--statusflag (read-only source dashboard)sync --all --statusprints a per-source health table and exits without syncing:Implementation (
buildSyncStatusReport):WITH s AS (unnest(...))CTE pivots source_id → pages + chunks_total + chunks_unembedded.gbrain doctor's sync-freshness rule (24h / 72h).chunks.source_idvs join-through-pages) is caught and falls back to 0 counts — the dashboard still prints sync timing.staleness_hours: nullandstaleness_class: 'unknown'so callers can disambiguate "first run pending" from "32h since last sync".--jsonemits the structured shape directly. Useful for piping into monitoring (gbrain sync --all --status --json | jq '.sources[] | select(.staleness_class == "severe")').Behavior matrix
sync(no --all)gbrain-synclock. Same code path as v0.40.2.sync --all(1 source)parallel=1short-circuit). Same as v0.40.2.sync --all(N sources, Postgres)min(N, --workers, 4)concurrent. Per-source lock ids.sync --all(N sources, PGLite)sync --all --parallel 1sync --all --parallel 8min(N, 8)concurrent.sync --all --statussync --all --status --jsonResults
Synthetic 4-source brain on Postgres (pgbouncer), each source ~50K pages, no actual deltas (incremental sync up-to-date case — the cron-tick steady state):
--parallel 1)--parallel 4)The doctor-score delta is the more interesting number: parallel sync clears the freshness penalties because every source's
last_sync_atadvances within the same cron tick instead of one source per tick.Production validation (federated brain, 6 sources):
Testing
test/sync-all-parallel.test.ts— 11 new cases, all passing:Existing test suites validated:
test/sync-concurrency.test.ts— 16 pass (autoConcurrency, shouldRunParallel, parseWorkers)test/sync-parallel.test.ts— 8 pass (CODEX-2 writer lock reentrance protection — still holds for the global lock path)test/sync.test.ts+test/sources-resync-recovery.test.ts+test/sync-failures.test.ts— 112 passTotal: 241 sync-related tests passing, zero regressions.
tsc --noEmitclean.Compatibility
sources/pages/chunkstables and the existinggbrain_cycle_lockstable for the per-source lock rows.--allpaths.SyncOpts.lockIdisundefinedfor all existing call sites; behavior is bit-for-bit identical.Follow-ups (not in this PR)
--statusoutput intogbrain doctorso the per-source dashboard shows up in the unhealthy-brain report. The data shape is already structured.onProgresscallback already plumbs through tojob.updateProgress({ done, total, embedded, phase })insrc/commands/jobs.ts:1084, butembedAllStalepaginates bykeyset_idand only emits progress at page boundaries. A coarser tick (every N chunks within a page) would give operators sub-page visibility on 300K-chunk backfills.--retry-failedwave that runs only against sources with unacknowledged failures.