rfc: --workers N for all bulk operations #1473
extract-conversation-facts takes ~50 hours when run serially on a 197K-page brain. Manually spawning 5 processes gave a 5x speedup, because the work is I/O-bound on the LLM API. This RFC proposes extracting embed's sliding worker pool into a shared utility and adding --workers N to every page-iterating bulk command. Real-world numbers from the backfill are included.
Closing as productionized into v0.41.15.0 — the substance shipped:
Real-world numbers from the operator's 197K-page brain landed in the release notes: 50hr → ~3hr at
Thanks for the well-shaped RFC + the production data. Architecture decisions for the structural pieces (per-page lock, refreshing TTL, BudgetExhausted bypass, doctor + preflight scope) went through /plan-eng-review with codex outside-voice; 21 captured decisions in the plan at `~/.claude/plans/system-instruction-you-are-working-fancy-creek.md`.
CHANGELOG entry: https://github.com/garrytan/gbrain/blob/master/CHANGELOG.md#041150---2026-05-26 (after the v0.41.15.0 ship lands).
Follow-ups deferred per /plan-eng-review decisions (filed in TODOS.md):
Problem
extract-conversation-facts processes 6,594 conversation pages serially. At ~2 pages/min, the full backfill takes ~50 hours. The LLM extraction is I/O-bound (waiting on API responses), which makes it a good fit for concurrency, but there's no --workers flag.
The workaround today is manually launching 5 separate OS processes and relying on terminal audit rows as a distributed lock. This works (confirmed 5x speedup to ~10 pages/min) but it's brittle:
What this PR does
Adds docs/rfc-bulk-concurrency.md, an RFC proposing:
1. Extract embed's sliding worker pool into src/core/worker-pool.ts. embed already solved this correctly with jittered backoff, shared budget tracking, and a sliding queue. Make it reusable.
2. Add --workers N to every bulk command, default 1 (backward compatible). Priority list:
   - extract-conversation-facts (50hr backfill)
   - dream, extract
   - edges-backfill, reindex-multimodal
   - reindex, reindex-code, reindex-frontmatter
3. Make --background jobs inherit --workers
Real-world numbers
From the first backfill on a 197K-page brain (6,594 conversation pages):
The 5x improvement from naive multi-process confirms the bottleneck is pure I/O wait. A proper worker pool with rate-limit-aware backoff should do better than raw process spawning.
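To make the worker-pool idea concrete, here is a minimal sketch of a sliding pool with jittered exponential backoff. It is not the code in embed or in src/core/worker-pool.ts: the names runPool, PoolOptions, and jitteredDelay are hypothetical, the retry policy treats every error as retryable, and shared budget tracking is left out entirely.

```typescript
// Illustrative sketch only; every name here is hypothetical, not gbrain's API.
type Task<T, R> = (item: T) => Promise<R>;

interface PoolOptions {
  workers: number;     // concurrency, as a --workers N style flag would set it
  maxRetries: number;  // retries per item before the failure is recorded
  baseDelayMs: number; // base for the exponential backoff window
}

interface PoolResult<R> {
  results: (R | undefined)[];                 // aligned with the input order
  errors: { index: number; error: unknown }[];
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Full jitter: a random delay in [0, base * 2^attempt).
function jitteredDelay(baseMs: number, attempt: number): number {
  return Math.random() * baseMs * Math.pow(2, attempt);
}

async function runPool<T, R>(
  items: T[],
  task: Task<T, R>,
  opts: PoolOptions,
): Promise<PoolResult<R>> {
  const results: (R | undefined)[] = new Array(items.length).fill(undefined);
  const errors: { index: number; error: unknown }[] = [];
  let next = 0; // shared cursor; safe because JS callbacks run on one thread

  // Each worker pulls the next unclaimed item as soon as it finishes one,
  // so a slow item never holds up a whole batch (the "sliding" part).
  async function worker(): Promise<void> {
    while (true) {
      const index = next++;
      if (index >= items.length) return;
      for (let attempt = 0; ; attempt++) {
        try {
          results[index] = await task(items[index]);
          break;
        } catch (error) {
          if (attempt >= opts.maxRetries) {
            errors.push({ index, error });
            break;
          }
          await sleep(jitteredDelay(opts.baseDelayMs, attempt));
        }
      }
    }
  }

  const count = Math.max(1, Math.min(opts.workers, items.length));
  await Promise.all(Array.from({ length: count }, () => worker()));
  return { results, errors };
}
```

With workers set to 1 this degenerates to the current serial loop, which is what makes a default of 1 backward compatible. Unlike separate OS processes, the workers share one cursor, so no external lock is needed to keep two of them off the same page within a single process.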
Also noted: dimension mismatch
The facts table ships with halfvec(1536) but zembed-1 produces 1280-dim vectors. Had to manually ALTER TABLE facts ALTER COLUMN embedding TYPE halfvec(1280) before extraction could insert. The schema should read the configured embedding_dimensions instead of hardcoding 1536.
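One way the schema could follow the configured dimension is to build the column type from it rather than embedding a literal. The sketch below assumes the embedding_dimensions value is available as a number; the helper names embeddingColumnType and alterFactsEmbeddingSql are hypothetical and not taken from the codebase.

```typescript
// Hypothetical helpers; the real schema code may be organized differently.

// Builds the pgvector column type from the configured dimension.
// The integer check also keeps arbitrary input out of the interpolated SQL.
function embeddingColumnType(dimensions: number): string {
  if (!Number.isInteger(dimensions) || dimensions <= 0) {
    throw new Error(`invalid embedding dimensions: ${dimensions}`);
  }
  return `halfvec(${dimensions})`;
}

// Produces the same statement that had to be run by hand for the 1280-dim case.
function alterFactsEmbeddingSql(dimensions: number): string {
  return `ALTER TABLE facts ALTER COLUMN embedding TYPE ${embeddingColumnType(dimensions)}`;
}
```

Calling alterFactsEmbeddingSql(1280) yields the manual fix quoted above, and the same helper could feed the initial CREATE TABLE so new installs never see the mismatch.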