rfc: --workers N for all bulk operations by garrytan-agents · Pull Request #1473 · garrytan/gbrain

garrytan-agents · 2026-05-26T02:56:43Z

Problem

extract-conversation-facts processes 6,594 conversation pages serially. At ~2 pages/min, the full backfill takes ~50 hours. The LLM extraction is I/O-bound (waiting on API responses) — perfect for concurrency — but there's no --workers flag.

The workaround today is manually launching 5 separate OS processes and relying on terminal audit rows as a distributed lock. This works (confirmed 5x speedup to ~10 pages/min) but it's brittle:

- No coordination: all 5 enumerate the full page list independently. Two workers can claim the same page before either writes the checkpoint → duplicate LLM calls and duplicate fact rows.
- No backpressure: 5 processes × N segments × embedding calls can spike API rate limits with no shared rate limiter.
- No progress reporting: each process logs independently. No unified "X of Y pages done, $Z spent."
- Manual lifecycle: if one crashes, nobody notices.

What this PR does

Adds docs/rfc-bulk-concurrency.md — an RFC proposing:

1. Extract embed's sliding worker pool into src/core/worker-pool.ts. embed already solved this correctly with jittered backoff, shared budget tracking, and a sliding queue; make it reusable.
2. Add --workers N to every bulk command, default 1 (backward compatible). Priority list:
   - P0: extract-conversation-facts (50hr backfill)
   - P1: dream, extract
   - P2: edges-backfill, reindex-multimodal
   - P3: reindex, reindex-code, reindex-frontmatter
3. Make --background jobs inherit --workers.
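
A minimal sketch of what a sliding worker pool looks like, for readers unfamiliar with the pattern. This is not the code in embed or src/core/worker-pool.ts; `runPool` and `PoolResult` are hypothetical names, and the jittered backoff and budget tracking mentioned above are left out. The idea is that N lanes pull from one shared cursor, so a slow page never holds up the other lanes.

```typescript
// Hypothetical sketch of a sliding worker pool; names are illustrative,
// not the actual src/core/worker-pool.ts API.
interface PoolResult<R> {
  results: R[];
  errors: { index: number; error: unknown }[];
}

async function runPool<T, R>(
  items: readonly T[],
  workers: number,
  task: (item: T, index: number) => Promise<R>,
): Promise<PoolResult<R>> {
  const results: R[] = [];
  const errors: { index: number; error: unknown }[] = [];
  let next = 0; // shared cursor over the item list

  async function lane(): Promise<void> {
    while (true) {
      // Claim the next index synchronously, before any await, so two
      // lanes in the same process can never take the same item.
      const index = next++;
      if (index >= items.length) return;
      try {
        results[index] = await task(items[index], index);
      } catch (error) {
        // Per-item failure: record it and keep the lane alive.
        errors.push({ index, error });
      }
    }
  }

  const lanes = Math.max(1, Math.min(workers, items.length));
  await Promise.all(Array.from({ length: lanes }, () => lane()));
  return { results, errors };
}
```

With `workers` set to 1 this degrades to the current serial loop, which is why a default of 1 stays backward compatible. The synchronous claim only protects lanes inside one process; separate OS processes still need a lock or claim row in the database.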

Real-world numbers

From the first backfill on a 197K-page brain (6,594 conversation pages):

| Workers | Pages/min | Est. total time | Method |
|---|---|---|---|
| 1 | ~2 | ~50 hours | Current default |
| 5 | ~10 | ~11 hours | Manual multi-process hack |
| 10 (projected) | ~20 | ~5.5 hours | With proper worker pool |
| 20 (projected) | ~35 | ~3 hours | May need rate limit tuning |

The 5x improvement from naive multi-process confirms the bottleneck is pure I/O wait. A proper worker pool with rate-limit-aware backoff should do better than raw process spawning.
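
The time estimates in the table follow from dividing the page count by the throughput. A small sketch of that arithmetic, using only the figures quoted above (`estimatedHours` is an illustrative helper, not part of gbrain):

```typescript
// Page count and rates are taken from the table above.
const CONVERSATION_PAGES = 6594;

// hours = pages / (pages per minute) / 60
function estimatedHours(pages: number, pagesPerMinute: number): number {
  return pages / pagesPerMinute / 60;
}

for (const rate of [2, 10, 20, 35]) {
  const hours = estimatedHours(CONVERSATION_PAGES, rate);
  console.log(`${rate} pages/min -> ${hours.toFixed(1)} h`);
}
```

This gives roughly 55, 11, 5.5, and 3.1 hours. The serial row comes out nearer 55 hours than 50, so the table's figures are best read as rounded from approximate throughput measurements.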

Also noted: dimension mismatch

The facts table ships with halfvec(1536) but zembed-1 produces 1280-dim vectors. Had to manually ALTER TABLE facts ALTER COLUMN embedding TYPE halfvec(1280) before extraction could insert. The schema should read the configured embedding_dimensions instead of hardcoding 1536.
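
One way to catch this before the first insert fails is to compare the column's declared width against the configured dimension at startup. The sketch below assumes the column type is available as a string such as `halfvec(1280)` (for example, what PostgreSQL's `format_type` reports for the column); `parseEmbeddingDim` and `assertDimMatches` are hypothetical names, not gbrain functions.

```typescript
// Hypothetical helpers: names and error messages are illustrative.

// Extract N from "vector(N)" or "halfvec(N)"; null for anything else.
function parseEmbeddingDim(columnType: string): number | null {
  const match = /^(?:vector|halfvec)\((\d+)\)$/i.exec(columnType.trim());
  return match ? Number(match[1]) : null;
}

// Fail fast with a readable message instead of a mid-run insert error.
function assertDimMatches(columnType: string, configuredDim: number): void {
  const actual = parseEmbeddingDim(columnType);
  if (actual === null) {
    throw new Error(`unrecognized embedding column type: ${columnType}`);
  }
  if (actual !== configuredDim) {
    throw new Error(
      `facts.embedding is ${columnType} but config expects ${configuredDim} dimensions`,
    );
  }
}
```

For the case described here, `assertDimMatches("halfvec(1536)", 1280)` would throw before any LLM call is spent on a page whose facts cannot be stored.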

extract-conversation-facts takes ~50 hours serial on a 197K-page brain. Manually spawning 5 processes proved 5x speedup (I/O-bound on LLM API). Proposes extracting embed's sliding worker pool into a shared utility and adding --workers N to every page-iterating bulk command. Real-world numbers from the backfill included.

garrytan · 2026-05-26T18:58:20Z

Closing as productionized into v0.41.15.0 — the substance shipped:

- New src/core/worker-pool.ts shared sliding-pool + bounded-semaphore primitive, extracted from the gold-standard embed.ts inline pool. Atomicity invariant pinned by scripts/check-worker-pool-atomicity.sh (wired into bun run verify).
- --workers N on every priority bulk command from the RFC:
  - P0: gbrain extract-conversation-facts (the motivator)
  - P1: gbrain extract (fs + db sources)
  - P2: gbrain edges-backfill, gbrain reindex --multimodal
  - P3: gbrain reindex --markdown, gbrain reindex-code, gbrain reindex-frontmatter
- gbrain extract-conversation-facts --background --workers 20 round-trips through the Minion job envelope.
- Per-page advisory lock via withRefreshingLock + delete-orphans-first replay safety closes the cross-process double-extraction class structurally (the failure mode the 5-process hack was vulnerable to). Lock TTL 2min with 20s refresh; lock-busy is skip-and-continue with rate-limited log + exit 3 + pages_lock_skipped counter.
- BudgetExhausted bypasses the helper's onError + hard-aborts the pool, propagating AbortController.abort() to in-flight workers. The budget cap is a structural ceiling under concurrency, not a per-caller convention.
- The secondary halfvec(1536) vs halfvec(1280) dim-mismatch bug is closed: new doctor check facts_embedding_width_consistency + extraction-startup preflight via assertFactsEmbeddingDimMatchesConfig (cached per process) + write-path cast match (the pre-fix hardcoded ::vector cast worked on pgvector >=0.7 via implicit auto-cast but would fail on older pgvector). All three reuse the same readFactsEmbeddingDim helper, which covers both vector(N) and halfvec(N) shapes per migration v40's pgvector-version fallback.
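
The BudgetExhausted behaviour in the list above can be sketched as follows. This is an illustration under assumptions, not the shipped code: BudgetTracker and BudgetExhausted are named in this thread, but the shapes used here (a USD cap, a synchronous `reserve(costUsd)`) and the `runWithBudget` helper are hypothetical. The point is that a budget error is treated differently from an ordinary per-item error: it aborts the shared signal instead of being recorded and skipped.

```typescript
// Illustrative only: class shapes and runWithBudget are assumptions,
// not the real gbrain API.
class BudgetExhausted extends Error {}

class BudgetTracker {
  private spent = 0;
  private readonly capUsd: number;
  constructor(capUsd: number) {
    this.capUsd = capUsd;
  }
  get spentUsd(): number {
    return this.spent;
  }
  reserve(costUsd: number): void {
    if (this.spent + costUsd > this.capUsd) {
      throw new BudgetExhausted(`budget cap of ${this.capUsd} USD reached`);
    }
    this.spent += costUsd;
  }
}

async function runWithBudget<T>(
  items: readonly T[],
  workers: number,
  task: (item: T, signal: AbortSignal) => Promise<void>,
): Promise<{ done: number; aborted: boolean }> {
  const controller = new AbortController();
  let next = 0;
  let done = 0;

  async function lane(): Promise<void> {
    // Stop claiming new items as soon as any lane has aborted the pool.
    while (!controller.signal.aborted) {
      const index = next++;
      if (index >= items.length) return;
      try {
        // The signal is handed to the task so in-flight work can cancel.
        await task(items[index], controller.signal);
        done++;
      } catch (error) {
        if (error instanceof BudgetExhausted) {
          controller.abort(); // hard stop, bypassing per-item handling
          return;
        }
        // Any other error is per-item: skip it and keep going.
      }
    }
  }

  await Promise.all(Array.from({ length: Math.max(1, workers) }, () => lane()));
  return { done, aborted: controller.signal.aborted };
}
```

A task would typically call `reserve()` before each paid API call and pass `signal` to the request, so that one lane hitting the cap cancels the others rather than letting each spend a little more.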

Real-world numbers from the operator's 197K-page brain landed in the release notes: 50hr → ~3hr at --workers 20 projected, replacing the 5-process manual hack (11hr observed).

Thanks for the well-shaped RFC + the production data. Architecture decisions for the structural pieces (per-page lock, refreshing TTL, BudgetExhausted bypass, doctor + preflight scope) went through /plan-eng-review with codex outside-voice; 21 captured decisions in the plan at `~/.claude/plans/system-instruction-you-are-working-fancy-creek.md`.

CHANGELOG entry: https://github.com/garrytan/gbrain/blob/master/CHANGELOG.md#041150---2026-05-26 (after the v0.41.15.0 ship lands).

Follow-ups deferred per /plan-eng-review decisions (filed in TODOS.md):

- dream queue-layer recoupling for the dream --execution-concurrency flag that would actually bound subagent execution (D14/D21; today's only knob is gbrain jobs work --concurrency)
- AIMD-style auto-tune from observed rate-limit headers (D19, RFC non-goal)
- Per-tracker mutex on BudgetTracker.reserve() for exact-ceiling compliance (D20; D3's documented overshoot is single-digit dollars at any realistic cap)
- Reactive auto-ALTER on facts dim drift (D18 explicitly skipped: doctor + preflight is enough; auto-ALTER on a 100M-row facts table is hours-long)
- extractLinksForSlugs / extractTimelineForSlugs sync-integration hooks get --workers parity (the CLI-facing paths got it now)
- Deeper resolveSymbolEdgesIncremental intra-source parallelism (edges-backfill got cross-source parallelism via --workers under --all-sources)
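
For context on the deferred AIMD item: additive-increase/multiplicative-decrease raises the concurrency limit slowly while requests succeed and halves it when the API signals a rate limit. The sketch below is a generic illustration of that rule, not a design for gbrain; the `AimdLimit` class, its bounds, and the choice of halving are all assumptions.

```typescript
// Generic AIMD controller; every name and constant here is illustrative.
class AimdLimit {
  private limit: number;
  private readonly min: number;
  private readonly max: number;

  constructor(initial: number, min = 1, max = 64) {
    this.min = min;
    this.max = max;
    this.limit = Math.min(max, Math.max(min, initial));
  }

  // Whole number of workers allowed to run right now.
  get current(): number {
    return Math.floor(this.limit);
  }

  // Additive increase: about +1 worker per `limit` successful requests.
  onSuccess(): void {
    this.limit = Math.min(this.max, this.limit + 1 / this.limit);
  }

  // Multiplicative decrease: halve on a rate-limit response.
  onRateLimited(): void {
    this.limit = Math.max(this.min, this.limit / 2);
  }
}
```

A pool would consult `current` before starting each item, call `onSuccess()` after a normal response, and call `onRateLimited()` when the provider returns a rate-limit status.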

garrytan closed this May 26, 2026

garrytan mentioned this pull request May 27, 2026: v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity #1519 (Merged, 7 tasks)

garrytan-agents wants to merge 1 commit into garrytan:master from garrytan-agents:feat/bulk-concurrency
