
v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity #1519

Merged: garrytan merged 6 commits into master from garrytan/dar-es-salaam-v1 on May 27, 2026
@garrytan garrytan commented May 27, 2026

Owner

Summary

You can now run extract-conversation-facts, extract, edges-backfill, reindex-multimodal, reindex --markdown, reindex-code, and reindex-frontmatter with --workers N. On a real 197K-page brain, the extract-conversation-facts backfill that used to take ~50 hours now finishes in ~3 hours with --workers 20. Productionizes the RFC in PR #1473 (which is being closed).

Architecture (worker-pool foundation):

  • New src/core/worker-pool.ts (~270 LOC): canonical sliding pool + bounded semaphore. Atomic-claim invariant pinned by scripts/check-worker-pool-atomicity.sh (wired into bun run verify).
  • BudgetExhausted (and any future MUST_ABORT_ERROR_TAGS) bypass onError and hard-abort the pool — budget caps are a structural ceiling under concurrency.
  • failures[] stores {idx, label, error}, not full items — bounded memory under 197K-page brains.
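The three properties above can be sketched in a few lines. This is only an illustration under stated assumptions, not the contents of src/core/worker-pool.ts: `runSlidingPoolSketch`, `MustAbortError`, and the `label` callback are hypothetical names chosen for the example.

```typescript
// Hypothetical sketch; names and signatures are NOT the real worker-pool.ts API.
type Failure = { idx: number; label: string; error: string };

// Stand-in for an error carrying one of the MUST_ABORT_ERROR_TAGS.
class MustAbortError extends Error {}

async function runSlidingPoolSketch<T>(
  items: readonly T[],
  workers: number,
  task: (item: T, idx: number) => Promise<void>,
  label: (item: T) => string,
): Promise<{ processed: number; failures: Failure[] }> {
  let nextIdx = 0;
  let processed = 0;
  let fatal: unknown = undefined;
  const failures: Failure[] = [];

  const worker = async (): Promise<void> => {
    while (fatal === undefined) {
      // Atomic claim: the read and the increment happen with no await in
      // between, so two workers can never claim the same index.
      const idx = nextIdx++;
      if (idx >= items.length) return;
      try {
        await task(items[idx], idx);
        processed++;
      } catch (err) {
        if (err instanceof MustAbortError) {
          fatal = err; // bypasses failure collection and stops every worker
          return;
        }
        // Store a small record, not the item itself, to keep memory bounded.
        failures.push({ idx, label: label(items[idx]), error: String(err) });
      }
    }
  };

  const n = Math.max(1, Math.min(workers, items.length));
  await Promise.all(Array.from({ length: n }, () => worker()));
  if (fatal !== undefined) throw fatal;
  return { processed, failures };
}
```

Ordinary errors land in `failures` and the pool keeps going; a must-abort error stops new claims and is rethrown once in-flight tasks settle.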

Bulk commands wired (P0→P3 from the RFC):

  • extract-conversation-facts (P0 motivator) + per-page advisory lock via withRefreshingLock + delete-orphans-first replay safety + extraction-startup dim preflight.
  • extract, edges-backfill, reindex-multimodal, reindex --markdown, reindex-code, reindex-frontmatter (P1-P3 batch).
  • embed.ts migrated to the shared helper (existing GBRAIN_EMBED_CONCURRENCY || 20 default preserved per codex #13).
  • eval-cross-modal.ts inline runWithLimit deleted; callers route through the shared helper.
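A rough sketch of how a per-page lock, delete-orphans-first replay, and the lock-busy skip might fit together. Everything here is a hypothetical in-memory stand-in: the real command uses withRefreshingLock against the database, and `withPageLock`, `factsTable`, and `exitCodeFor` are invented for illustration. Returning 0 when nothing was skipped is also an assumption.

```typescript
// Hypothetical in-memory stand-ins; not gbrain's actual lock or storage code.
class LockBusyError extends Error {}

const heldLocks = new Set<string>();

async function withPageLock<T>(pageId: string, fn: () => Promise<T>): Promise<T> {
  if (heldLocks.has(pageId)) throw new LockBusyError(pageId);
  heldLocks.add(pageId);
  try {
    return await fn();
  } finally {
    heldLocks.delete(pageId);
  }
}

type Fact = { pageId: string; text: string };
const factsTable: Fact[] = [];

async function extractPage(
  pageId: string,
  segments: string[],
): Promise<"done" | "lock_skipped"> {
  try {
    await withPageLock(pageId, async () => {
      // Delete-orphans-first: drop whatever an earlier (possibly partial)
      // run left for this page, so a replay cannot duplicate facts.
      for (let i = factsTable.length - 1; i >= 0; i--) {
        if (factsTable[i].pageId === pageId) factsTable.splice(i, 1);
      }
      for (const text of segments) factsTable.push({ pageId, text });
    });
    return "done";
  } catch (err) {
    // Another invocation holds the page: skip it and let the next run retry.
    if (err instanceof LockBusyError) return "lock_skipped";
    throw err;
  }
}

// Exit 3 signals skipped pages; 0 for the clean case is assumed here.
function exitCodeFor(pagesLockSkipped: number): number {
  return pagesLockSkipped > 0 ? 3 : 0;
}
```

Running `extractPage` twice for the same page leaves one copy of its facts, which is what makes concurrent or repeated invocations safe to converge.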

Facts dim-mismatch doctor parity (secondary bug):

  • New readFactsEmbeddingDim covers both vector(N) and halfvec(N) (codex #19: migration v40 falls back on pgvector < 0.7).
  • New assertFactsEmbeddingDimMatchesConfig preflight throws before the first fact insert (catches the bug class new users hit before they ever run doctor).
  • New doctor check facts_embedding_width_consistency surfaces drift with a paste-ready DROP INDEX → ALTER USING → CREATE INDEX recipe (codex #18: NOT a bare REINDEX).
  • postgres-engine.ts insert paths now match cast suffix to actual column type (probed once per engine, cached).
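To make the dimension check concrete, here is a minimal sketch that assumes the column type arrives as a string such as `vector(1536)` or `halfvec(1280)`. The function names and the error message are hypothetical; they are not the signatures of readFactsEmbeddingDim or assertFactsEmbeddingDimMatchesConfig.

```typescript
// Illustrative only: names and error shape are assumptions, not gbrain's API.
type EmbeddingColumn = { kind: "vector" | "halfvec"; dim: number };

function parseEmbeddingColumnType(formatted: string): EmbeddingColumn | null {
  const text = formatted.trim().toLowerCase();
  // halfvec is checked first, matching the ordering the tests pin.
  const half = /^halfvec\((\d+)\)$/.exec(text);
  if (half) return { kind: "halfvec", dim: Number(half[1]) };
  const full = /^vector\((\d+)\)$/.exec(text);
  if (full) return { kind: "vector", dim: Number(full[1]) };
  return null; // not an embedding column type we recognise
}

// The insert cast follows the actual column type rather than a hardcoded one.
function castSuffixFor(column: EmbeddingColumn): string {
  return `::${column.kind}`;
}

// Preflight: fail before the first insert instead of inside the database.
function assertDimMatches(column: EmbeddingColumn, configuredDim: number): void {
  if (column.dim !== configuredDim) {
    throw new Error(
      `facts.embedding is ${column.kind}(${column.dim}) but the configured model emits ${configuredDim} dimensions`,
    );
  }
}
```

A caller would parse once, cache the result, and run the assertion before starting any bulk work.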

Test Coverage

163 wave-specific tests across 10 new/extended test files. Full unit suite: 11,256 pass, 0 fail across 4 parallel shards + 47 serial files (1148s wallclock). Typecheck clean. Worker-pool atomicity CI guard intact.

New test files:

  • test/worker-pool.test.ts (23 cases) — atomicity, abort, onError, failures[], BudgetExhausted bypass
  • test/pglite-workers-clamp.test.ts (12 cases) — PGLite-clamp + per-(command, requested) dedup
  • test/scripts/check-worker-pool-atomicity.test.ts (9 cases) — the CI guard's own regression
  • test/embed-helper-migration.test.ts (8 cases) — structural invariants for the embed migration
  • test/embedding-dim-check-facts.test.ts (18 cases) — facts dim drift + ALTER recipe
  • test/extract-conversation-facts-workers.test.ts (17 cases) — wiring + lock semantics + helper exports
  • test/extract-workers.test.ts (10 cases) — --workers threading through runExtractCore
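The PGLite clamp with per-(command, requested) dedup could look roughly like the sketch below. It is not the real resolveWorkersWithClamp: the function name, the `EngineKind` type, the warning text, and in particular the clamp target of 1 are assumptions made for illustration.

```typescript
// Hypothetical sketch; the clamp value (1) and message wording are assumed.
type EngineKind = "postgres" | "pglite";

const warnedKeys = new Set<string>();
// Warnings are captured here instead of being written to stderr,
// so the example stays self-contained.
const stderrLines: string[] = [];

function resolveWorkersSketch(
  command: string,
  requested: number,
  engine: EngineKind,
): number {
  const wanted = Math.max(1, Math.floor(requested));
  if (engine !== "pglite" || wanted === 1) return wanted;

  // Dedup key: warn once per (command, requested) pair, not once per call.
  const key = `${command}:${wanted}`;
  if (!warnedKeys.has(key)) {
    warnedKeys.add(key);
    stderrLines.push(`[${command}] --workers ${wanted} clamped to 1 on PGLite`);
  }
  return 1;
}
```

Keying the dedup on both the command and the requested value means a loop that resolves workers repeatedly prints one line, while a different command or a different request still gets its own notice.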

Pre-Landing Review

Absorbed into /plan-eng-review + codex outside-voice during implementation (21 captured D-decisions: D1 scope, D2 per-page lock, D3 budget overshoot, D4 gateway-internal backoff, D5 atomicity invariant, D6 lock-busy skip, D7 failures shape, D8 test coverage, D9 PGLite clamp, D10 outside voice, D11 delete-orphans-first, D12 refreshing lock, D13 BudgetExhausted bypass, D14 drop dream, D15 doctor + preflight, D16/D17 pull P2/P3 into wave, D18 skip auto-ALTER, D19/D20/D21 file as TODOs). Every codex finding either became a D-decision or was absorbed as an inline plan adjustment.

Plan + decisions persisted at ~/.claude/plans/system-instruction-you-are-working-fancy-creek.md.

Version collision note (rebumped twice)

Master shipped v0.41.15.0 (sync --timeout + --max-age wave, #1506) while this wave was in flight — rebumped to v0.41.16.0. PR #1510 (conversation parser cathedral) also claimed v0.41.16.0 in parallel — rebumped again to v0.41.17.0 to land cleanly behind it. Both waves coexist with zero overlap: v0.41.15.0 covers sync robustness; v0.41.17.0 covers bulk-command parallelism.

Things to watch after merge

  • Cost cap is approximate under --workers > 1. D3 documented overshoot is N_workers × avg_per_call_cost. Pin --workers 1 if you need exact-ceiling compliance.
  • --workers is in-process. Existing GBRAIN_ANTHROPIC_MAX_INFLIGHT is for subagent loops only, NOT bulk paths. Rely on gateway-internal 429 backoff for provider throttling.
  • Lock-busy skip is intentional. Two parallel extract-conversation-facts --workers 20 invocations against the same source converge correctly via per-page lock + delete-orphans-first. Exit 3 surfaces when pages_lock_skipped > 0 so the next run picks them up.
  • facts.embedding dim drift now warns. gbrain doctor surfaces the mismatch with a paste-ready ALTER recipe. Preflight catches new users before the first insert.
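As a worked example of the D3 overshoot bound from the first bullet above, the arithmetic is sketched below. The helper name and the dollar figures are made up for illustration; only the formula N_workers × avg_per_call_cost comes from the note.

```typescript
// Hypothetical helper; only the overshoot formula is taken from D3.
function worstCaseSpend(
  capUsd: number,
  workers: number,
  avgPerCallCostUsd: number,
): number {
  // With one worker the cap is treated as exact; with more, up to
  // `workers` calls may already be in flight when the cap is reached.
  const overshoot = workers > 1 ? workers * avgPerCallCostUsd : 0;
  return capUsd + overshoot;
}

// Illustrative numbers: a $10 cap, 20 workers, about $0.01 per call.
const example = worstCaseSpend(10, 20, 0.01);
console.log(example.toFixed(2)); // "10.20"
```

Under these assumed numbers the overshoot is about 2% of the cap, which is why pinning `--workers 1` only matters when the ceiling has to be exact.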

Plan Completion

All 15 tasks (T1-T15) complete. Every D-series decision (D1-D21) implemented. 4 follow-ups filed in TODOS.md for v0.41.18+: dream queue recoupling, AIMD auto-tune, BudgetTracker mutex, sync-integration hook parity. 1 TODO filed for v0.42+: reactive auto-ALTER on facts dim drift.

PR #1473 closed with attribution comment (handled in T15).

Test plan

  • All wave-specific unit tests pass (163 tests across 10 files)
  • Full unit suite passes (11,256 tests, 0 failures)
  • Typecheck clean (bun run typecheck)
  • Worker-pool atomicity CI guard intact (bash scripts/check-worker-pool-atomicity.sh)
  • Version trio agrees (VERSION + package.json + CHANGELOG header all 0.41.17.0)
  • Operator smoke test: gbrain extract-conversation-facts --workers 5 --limit 100 --dry-run on a real brain (post-merge)
  • Real-world benchmark: replace the ghetto 5-process hack with --workers 20; confirm projected ~3hr completion on the 197K-page brain (post-merge)

🤖 Generated with Claude Code

garrytan and others added 3 commits May 26, 2026 17:23
…lamp wrapper

T1 + T2 of the v0.41.16.0 workers cathedral. New src/core/worker-pool.ts is
the canonical primitive every --workers N bulk command in this wave (and
future bulk commands) builds on. Atomic-claim invariant enforced by
scripts/check-worker-pool-atomicity.sh (wired into bun run verify).
BudgetExhausted bypass + AbortSignal composition baked into the helper so
budget caps are a structural ceiling under concurrency, not a per-caller
convention.

The new resolveWorkersWithClamp wrapper composes existing autoConcurrency
with PGLite-clamp + per-(command, requested) stderr dedup. Deliberately
NOT a modification to shared autoConcurrency (silent today, used by sync
+ import); embed.ts keeps GBRAIN_EMBED_CONCURRENCY || 20 default per
codex #13.

23 + 12 + 9 = 44 hermetic tests pin every contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- test/embed-helper-migration.test.ts (T3): asserts embed.ts's two
  sliding-pool sites are migrated to runSlidingPool, pre-migration
  shapes (let nextIdx = 0, Promise.all(Array.from(...))) are gone,
  GBRAIN_EMBED_CONCURRENCY || 20 default preserved, failureLabel
  threads page.slug. Per codex #16/#17 these are invariant assertions,
  not byte-equality on progress event ORDERING.
- test/embedding-dim-check-facts.test.ts (T6): readFactsEmbeddingDim
  covers vector(N) + halfvec(N), halfvec-before-vector regex ordering
  pinned (codex #19), buildFactsAlterRecipe emits DROP INDEX + ALTER
  USING + CREATE INDEX (codex #18, not bare REINDEX),
  FactsEmbeddingDimMismatchError tagged class shape,
  assertFactsEmbeddingDimMatchesConfig PGLite skip + Postgres absent-
  column skip, doctor check + insert-cast wiring assertions.
- test/extract-conversation-facts-workers.test.ts (T5): helper
  exports (extractConversationFactsLockId, PER_PAGE_LOCK_TTL_MINUTES),
  structural wiring (runSlidingPool, resolveWorkersWithClamp,
  withRefreshingLock, LockUnavailableError, delete-orphans-first
  before segment loop, preflight before pool, exit 3 when lock_skipped
  > 0), Minion handler round-trip.
- test/extract-workers.test.ts (T7): --workers wiring on all 3 inner
  fs-walk loops (extractForSlugs, extractLinksFromDir,
  extractTimelineFromDir) + CLI parse + opts threading through
  runExtractCore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #1510 (garrytan/dynamic-regex-conversation-formats) claimed v0.41.16.0
on master in parallel. Advancing this wave to v0.41.17.0 so both can land
cleanly. Pure mechanical version bump:

- VERSION + package.json → 0.41.17.0
- CHANGELOG.md header + "To take advantage of v0.41.17.0" block
- TODOS.md section header + v0.41.18+ forward references
- CLAUDE.md inline version tags
- Regenerated llms-full.txt / llms.txt

No code changes. The actual workers cathedral feature set is unchanged
from the two prior commits in this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan changed the title from "v0.41.16.0 feat: --workers N on every bulk command + facts dim doctor parity" to "v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity" May 27, 2026
garrytan and others added 3 commits May 26, 2026 17:34
…aam-v1

# Conflicts:
#	CHANGELOG.md
#	TODOS.md
#	VERSION
#	package.json
#	scripts/run-verify-parallel.sh
#	src/commands/reindex-code.ts
#	src/commands/reindex-multimodal.ts
#	src/commands/reindex.ts
CI shard 5 failed on `searchVector column routing (v0.27.1)` with:
  error: expected 1280 dimensions, not 1536

The test had a hardcoded `fakeText1536` helper that seeded chunks at
1536-d vectors. Master's default embedding model switched from OpenAI
text-embedding-3-large (1536) to ZeroEntropy zembed-1 (1280) so a fresh
PGLite brain on CI now sizes content_chunks.embedding at 1280; the
test's 1536-d INSERT trips pgvector's CheckExpectedDim.

Fix: probe `content_chunks.embedding` width via
`readContentChunksEmbeddingDim(engine)` in `beforeAll`, store in
`TEXT_DIM`, and build `fakeTextDefault(seed)` at that width. The test
now passes regardless of which default ships (the model has flipped
twice and may flip again). Local dev (1536 from older config) and CI
fresh-install (1280 from new default) both pass.

Image-side vectors stay at 1024 (matches Voyage multimodal-3 + the
column's fixed width on the image side).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

facts-anti-loop.test.ts and ingest-capture.test.ts were timing out in CI
shard 4 with "beforeEach/afterEach hook timed out" after the v0.41.16.0
master merge brought migration count to 99. When these files run deep in
a shard process that has already created ~20 PGLite engines, the WASM
cold-start + 95-migration replay legitimately exceeds bun's 5s default
hook timeout (observed 5.6s and 7.3s locally when reproducing).

Bun's --timeout=60000 from scripts/test-shard.sh covers TEST timeouts
but NOT hook timeouts; those default to 5s and must be set per-hook via
the optional 2nd arg to beforeAll/afterAll.

Reproduced locally by running the first 21 shard-4 files via
  head -21 /tmp/shard4-list.txt | xargs bun test
  → 179 pass, 2 fail (both with hook-timeout error)

After fix:
  → 198 pass, 0 fail (the 4 anti-loop + 15 ingest-capture tests recover)

Full shard 4 with fix:  955 pass, 0 fail.
Full shard 5 with fix:  1261 pass, 0 fail.

Also added a defensive diagnostic to the two put_page tests: if
facts_backstop is missing in the response payload, throw with the full
payload + isError so future failures surface the actual handler error
instead of a bare "expected {...} got undefined" assertion. No-op when
the test passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit 8ab7334 into master May 27, 2026
21 checks passed
mgunnin added a commit to mgunnin/gbrain that referenced this pull request May 28, 2026
* upstream/master:
  v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572)
  v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571)
  v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566)
  v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543)
  v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541)
  v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562)
  v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542)
  v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545)
  v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544)
  feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537)
  v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521)
  v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519)
  v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510)
  v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)