perf: parallelize WordIndex build (cold index 1.49×) + leaner ranked search#520
Conversation
…search Cold-index commit was dominated by the serial WordIndex.indexFile. Build it in parallel per-worker shards (reusing indexFile, backed by each worker's arena), then fold them in worker order via the new WordIndex.mergeShard: global doc_ids land in chunk ranges that reproduce the serial assignment, so postings stay doc_id-ascending and BM25 results are byte-identical (the Lucene DocumentsWriterPerThread pattern). Gated on skip_file_words — shards omit the per-file word set, so they only fit the bulk cold-index build that never needs incremental removeFile; main.zig always sets it. Threading model is unchanged (parse parallel, commit serial), so self.allocator stays single-threaded. Controlled interleaved A/B on a ~16k-file repo (same binary, shards on/off): cold index 3.96s -> 2.66s (1.49x); commit phase 1517ms -> 206ms (7.4x); user-time flat -> same total CPU work, parallelized rather than reduced. Research-backed, behavior-preserving search-path wins: - searchContentRanked: lazy top-k max-heap (std.PriorityQueue) instead of full-sorting every matched doc — identical ordering and skip-and-continue semantics, O(C + (k+skips)*logC) (bm25s top-k retrieval). - TrigramIndex.candidates: hoist per-pair packTrigram + getPtr hash probes out of the per-candidate bloom loop (bit-identical result set). - searchContentRanked: U32Context (std.hash.int) for the three u32-keyed aggregation maps instead of Wyhash over 4 key bytes. Adds a workers=1-vs-4 parity test asserting identical word hits AND BM25 ranked results. 671/671 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 460f48d979
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| .outline = outline_copy, | ||
| .skip_trigram = file.skip_trigram, | ||
| }) catch continue; | ||
| if (word_shard) |ws| ws.indexFile(file.path, content_copy) catch {}; |
There was a problem hiding this comment.
Propagate shard indexing failures
When the parallel shard path is enabled, a file is appended to the worker results and later committed, but any failure from ws.indexFile is swallowed here. Under allocation pressure this can persist a word index that silently omits files even though their outlines/contents/trigrams were committed, so codedb_word and multi-word ranked search miss them; the serial path would return the indexing error instead of publishing an incomplete index.
Useful? React with 👍 / 👎.
Summary
Research-backed cold-index and search-path optimizations (from a 13-agent profile+literature workflow). Headline: the cold-index
commitphase was dominated by the serialWordIndex.indexFile; this parallelizes it.#1 — Parallel WordIndex shards (Lucene DocumentsWriterPerThread)
Each parse worker builds a private
WordIndexshard (reusingindexFile, backed by its arena); after join, the newWordIndex.mergeShardfolds them in worker order. Global doc-ids land in chunk ranges that exactly reproduce the serial assignment, so postings stay doc-id-ascending and BM25 results are byte-identical.Gated on
skip_file_words— shards omit the per-file word set, so they only fit the bulk cold-index build (which never needs incrementalremoveFile);main.zigalways sets it. Threading model unchanged (parse parallel, commit serial →self.allocatorstays single-threaded).Controlled interleaved A/B (same binary, shards on/off, ~16k-file repo, 6 iters, medians):
user-time staying flat confirms it's the same total work, parallelized rather than reduced.Search-path wins (behavior-preserving)
searchContentRanked: lazy top-k max-heap (std.PriorityQueue) instead of full-sorting every matched doc — identical ordering + skip-and-continue semantics,O(C + (k+skips)·logC)(bm25s top-k).TrigramIndex.candidates: hoist per-pairpackTrigram+getPtrhash probes out of the per-candidate bloom loop (bit-identical result set).searchContentRanked:U32Context(std.hash.int) for the three u32-keyed aggregation maps instead of Wyhash-over-4-bytes.Tests
zig build test→ 671/671. Adds aworkers=1-vs-workers=4parity test asserting identical word hits and BM25 ranked results (locks down the shard doc-id/merge correctness).Note on measurement
Absolute cold-index wall-time is noisy on Apple Silicon (efficiency-core placement under load — same binary measured 2.5s vs 37s), so the A/B above is interleaved on the same binary to cancel that out; the
user-time invariant is the reliable cross-check.