rfc: --workers N for all bulk operations #1473

Closed
garrytan-agents wants to merge 1 commit into garrytan:master from garrytan-agents:feat/bulk-concurrency

@garrytan-agents garrytan-agents commented May 26, 2026

Contributor

Problem

extract-conversation-facts processes 6,594 conversation pages serially. At ~2 pages/min, the full backfill takes ~50 hours. The LLM extraction is I/O-bound (waiting on API responses) — perfect for concurrency — but there's no --workers flag.

The workaround today is manually launching 5 separate OS processes and relying on terminal audit rows as a distributed lock. This works (confirmed 5x speedup to ~10 pages/min) but it's brittle:

  • No coordination: all 5 enumerate the full page list independently. Two workers can claim the same page before either writes the checkpoint → duplicate LLM calls and duplicate fact rows.
  • No backpressure: 5 processes × N segments × embedding calls can spike API rate limits with no shared rate limiter.
  • No progress reporting: each process logs independently. No unified "X of Y pages done, $Z spent."
  • Manual lifecycle: if one crashes, nobody notices.

What this PR does

Adds docs/rfc-bulk-concurrency.md — an RFC proposing:

  1. Extract embed's sliding worker pool into src/core/worker-pool.ts. The existing pool in embed already handles this correctly, with jittered backoff, shared budget tracking, and a sliding queue. Make it reusable.

  2. Add --workers N to every bulk command — default 1 (backward compatible). Priority list:

    • P0: extract-conversation-facts (50hr backfill)
    • P1: dream, extract
    • P2: edges-backfill, reindex-multimodal
    • P3: reindex, reindex-code, reindex-frontmatter
  3. Make --background jobs inherit --workers
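A sliding pool of the kind item 1 describes can be sketched as follows. This is a minimal illustration, not the code in embed or the proposed worker-pool.ts: the name `runPool` and its signature are hypothetical, and jittered backoff and budget tracking are left out.

```typescript
// Hypothetical helper: illustrative only, not the actual worker-pool.ts API.
// Runs `task` over `items` with at most `workers` calls in flight at once.
async function runPool<T, R>(
  items: readonly T[],
  workers: number,
  task: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // shared cursor over the work list

  async function worker(): Promise<void> {
    while (true) {
      // Claim the next index synchronously, before any await, so two
      // workers in the same process can never take the same item.
      const i = next++;
      if (i >= items.length) return;
      results[i] = await task(items[i], i);
    }
  }

  // "Sliding": each worker pulls a new item as soon as its previous one
  // finishes, instead of waiting for a whole batch to complete.
  const n = Math.max(1, Math.min(workers, items.length));
  await Promise.all(Array.from({ length: n }, () => worker()));
  return results; // same order as `items`
}
```

With `workers` set to 1 this degenerates to the current serial behavior, which is why a default of 1 stays backward compatible. Note that this sketch only coordinates workers inside one process; it does nothing about separately launched OS processes.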

Real-world numbers

From the first backfill on a 197K-page brain (6,594 conversation pages):

| Workers | Pages/min | Est. total time | Method |
|---|---|---|---|
| 1 | ~2 | ~50 hours | Current default |
| 5 | ~10 | ~11 hours | Manual multi-process hack |
| 10 (proj) | ~20 | ~5.5 hours | With proper worker pool |
| 20 (proj) | ~35 | ~3 hours | May need rate limit tuning |

The 5x improvement from naive multi-process confirms the bottleneck is pure I/O wait. A proper worker pool with rate-limit-aware backoff should do better than raw process spawning.
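The time estimates follow directly from page count divided by throughput. A small sketch of that arithmetic, using only the figures quoted above (the function name is illustrative):

```typescript
// Estimated wall-clock hours for a backfill at a sustained throughput.
function estimateHours(pages: number, pagesPerMin: number): number {
  return pages / pagesPerMin / 60;
}

const PAGES = 6594; // conversation pages, from the RFC

// [workers, approximate pages/min] pairs as reported or projected above.
const scenarios: Array<[number, number]> = [[1, 2], [5, 10], [10, 20], [20, 35]];
for (const [workers, rate] of scenarios) {
  const hours = estimateHours(PAGES, rate);
  console.log(`${workers} workers at ~${rate} pages/min: ~${hours.toFixed(1)} h`);
}
```

At exactly 2 pages/min the serial run works out closer to 55 hours than 50, so the "~2 pages/min" figure is presumably rounded down from roughly 2.2.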

Also noted: dimension mismatch

The facts table ships with halfvec(1536) but zembed-1 produces 1280-dim vectors. Had to manually ALTER TABLE facts ALTER COLUMN embedding TYPE halfvec(1280) before extraction could insert. The schema should read the configured embedding_dimensions instead of hardcoding 1536.
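One way to catch this before any insert fails is to compare the column's declared width with the configured dimension at startup. The sketch below assumes the column type is available as a string such as `halfvec(1536)` (for example from the database catalog); both function names are hypothetical and do not refer to existing gbrain code.

```typescript
// Hypothetical helper: extracts N from "vector(N)" or "halfvec(N)".
// Returns null for any other type string.
function parseEmbeddingDim(columnType: string): number | null {
  const m = /^(?:vector|halfvec)\((\d+)\)$/.exec(columnType.trim().toLowerCase());
  return m ? Number(m[1]) : null;
}

// Hypothetical preflight: throws with an actionable message on mismatch.
function assertDimMatches(columnType: string, configuredDim: number): void {
  const actual = parseEmbeddingDim(columnType);
  if (actual === null) {
    throw new Error(`unrecognized embedding column type: ${columnType}`);
  }
  if (actual !== configuredDim) {
    throw new Error(
      `facts.embedding is ${columnType} but the configured model produces ` +
        `${configuredDim}-dim vectors`,
    );
  }
}
```

With the values from this report, `assertDimMatches("halfvec(1536)", 1280)` would throw up front rather than letting extraction run until the first insert.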

extract-conversation-facts takes ~50 hours serial on a 197K-page brain.
Manually spawning 5 processes proved 5x speedup (I/O-bound on LLM API).

Proposes extracting embed's sliding worker pool into a shared utility
and adding --workers N to every page-iterating bulk command.

Real-world numbers from the backfill included.
@garrytan

Owner

Closing as productionized into v0.41.15.0 — the substance shipped:

  • New src/core/worker-pool.ts shared sliding-pool + bounded-semaphore primitive, extracted from the gold-standard embed.ts inline pool. Atomicity invariant pinned by scripts/check-worker-pool-atomicity.sh (wired into bun run verify).
  • --workers N on every priority bulk command from the RFC:
    • P0: gbrain extract-conversation-facts (the motivator)
    • P1: gbrain extract (fs + db sources)
    • P2: gbrain edges-backfill, gbrain reindex --multimodal
    • P3: gbrain reindex --markdown, gbrain reindex-code, gbrain reindex-frontmatter
  • gbrain extract-conversation-facts --background --workers 20 round-trips through the Minion job envelope.
  • Per-page advisory lock via withRefreshingLock + delete-orphans-first replay safety closes the cross-process double-extraction class structurally (the failure mode the 5-process hack was vulnerable to). Lock TTL 2min with 20s refresh; lock-busy is skip-and-continue with rate-limited log + exit 3 + pages_lock_skipped counter.
  • BudgetExhausted bypasses the helper's onError + hard-aborts the pool, propagating AbortController.abort() to in-flight workers — the budget cap is a structural ceiling under concurrency, not a per-caller convention.
  • The secondary halfvec(1536) vs halfvec(1280) dim-mismatch bug is closed: new doctor check facts_embedding_width_consistency + extraction-startup preflight via assertFactsEmbeddingDimMatchesConfig (cached per process) + write-path cast match (the pre-fix hardcoded ::vector cast worked on pgvector >=0.7 via implicit auto-cast but would fail on older pgvector). All three reuse the same readFactsEmbeddingDim helper which covers both vector(N) and halfvec(N) shapes per migration v40's pgvector-version fallback.
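The per-page lock semantics described above (a TTL that the holder keeps extending, with other workers skipping a busy page) can be illustrated with an in-memory stand-in. This is a sketch under stated assumptions, not the `withRefreshingLock` implementation: the class, its methods, and the injected `now` parameter are all hypothetical, and a real lock would live in the database so that separate processes see it.

```typescript
// In-memory stand-in for a per-page lock table (hypothetical names throughout).
type Lock = { owner: string; expiresAt: number };

class PageLocks {
  private readonly locks = new Map<string, Lock>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Succeeds if the page is unlocked, already ours, or the old lock expired.
  // A false return is the "lock-busy" case: the caller skips and continues.
  tryAcquire(page: string, owner: string, now: number): boolean {
    const held = this.locks.get(page);
    if (held && held.owner !== owner && held.expiresAt > now) return false;
    this.locks.set(page, { owner, expiresAt: now + this.ttlMs });
    return true;
  }

  // Extends the TTL only while we still hold an unexpired lock.
  refresh(page: string, owner: string, now: number): boolean {
    const held = this.locks.get(page);
    if (!held || held.owner !== owner || held.expiresAt <= now) return false;
    held.expiresAt = now + this.ttlMs;
    return true;
  }

  release(page: string, owner: string): void {
    if (this.locks.get(page)?.owner === owner) this.locks.delete(page);
  }
}

// Values from the comment above: 2 min TTL, refreshed every 20 s.
const TTL_MS = 120_000;
const REFRESH_MS = 20_000;
```

The point of refreshing well inside the TTL is that a slow extraction keeps its claim, while a crashed worker stops refreshing and its page becomes claimable again once the TTL lapses.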

Real-world numbers from the operator's 197K-page brain landed in the release notes: 50hr → ~3hr at --workers 20 projected, replacing the 5-process manual hack (11hr observed).

Thanks for the well-shaped RFC + the production data. Architecture decisions for the structural pieces (per-page lock, refreshing TTL, BudgetExhausted bypass, doctor + preflight scope) went through /plan-eng-review with codex outside-voice; 21 captured decisions in the plan at `~/.claude/plans/system-instruction-you-are-working-fancy-creek.md`.

CHANGELOG entry: https://github.com/garrytan/gbrain/blob/master/CHANGELOG.md#041150---2026-05-26 (after the v0.41.15.0 ship lands).

Follow-ups deferred per /plan-eng-review decisions (filed in TODOS.md):

  • dream queue-layer recoupling for the dream --execution-concurrency flag that would actually bound subagent execution (D14/D21 — today's only knob is gbrain jobs work --concurrency)
  • AIMD-style auto-tune from observed rate-limit headers (D19, RFC non-goal)
  • Per-tracker mutex on BudgetTracker.reserve() for exact-ceiling compliance (D20; D3's documented overshoot is single-digit dollars at any realistic cap)
  • Reactive auto-ALTER on facts dim drift (D18 explicitly skipped — doctor + preflight is enough; auto-ALTER on a 100M-row facts table is hours-long)
  • extractLinksForSlugs / extractTimelineForSlugs sync-integration hooks get --workers parity (the CLI-facing paths got it now)
  • Deeper resolveSymbolEdgesIncremental intra-source parallelism (edges-backfill got cross-source parallelism via --workers under --all-sources)
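For readers unfamiliar with the deferred AIMD item (D19): additive-increase/multiplicative-decrease raises the worker count slowly while requests succeed and cuts it sharply when a rate limit is hit. The function below sketches the general technique only. It is not a planned design from this PR, and its name, the halving factor, and the bounds are assumptions.

```typescript
// Hypothetical AIMD step for a worker count.
// - No rate-limit signal in the last window: add one worker.
// - Rate-limit signal seen: halve the worker count.
// The result is clamped to [min, max].
function nextWorkers(
  current: number,
  rateLimited: boolean,
  min: number = 1,
  max: number = 20,
): number {
  const proposed = rateLimited ? Math.floor(current / 2) : current + 1;
  return Math.max(min, Math.min(max, proposed));
}
```

Under these assumptions, a run at 10 workers that hits a rate limit drops to 5, then climbs back one worker per clean window.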
