fix: use min fieldnorm for BMW skip entries in parallel build #230
Merged
The parallel build's write_posting_blocks() computed MAX fieldnorm per block instead of MIN. The block_max_norm field stores the minimum fieldnorm (shortest document) so BMW can compute a valid upper bound on block scores. Using the maximum (longest document) produced artificially low upper bounds, causing BMW to incorrectly skip blocks containing high-scoring short documents.

The serial build (segment.c) and merge (merge.c) paths already used min_norm correctly. This fix aligns the parallel build path.

Add a regression test (parallel_bmw) that deterministically reproduces the bug using a 3-tier document design: medium-length docs establish the BMW threshold in early blocks, then mixed short+long doc blocks follow. With the wrong MAX fieldnorm, the upper bound for the mixed blocks is based on the long docs and falls below the threshold, causing BMW to skip them entirely and miss the short docs that should rank highest.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
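The effect described in the commit message can be sketched with a textbook BM25 term score. This is an illustration only, not the project's scoring code: the function names (`bm25_term`, `block_upper_bound`, `bmw_can_skip`), the `k1`/`b` constants, the use of raw document length in place of an encoded fieldnorm, and the `<=` skip rule are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Textbook BM25 contribution of a single term (assumed constants k1, b).
 * For a fixed tf the score falls as doc_len grows, so the shortest
 * document in a block yields the largest possible score. */
static double bm25_term(double idf, double tf, double doc_len, double avg_len)
{
    const double k1 = 1.2, b = 0.75;
    assert(avg_len > 0.0 && doc_len >= 0.0 && tf >= 0.0);
    double norm = k1 * (1.0 - b + b * doc_len / avg_len);
    return idf * tf * (k1 + 1.0) / (tf + norm);
}

/* Hypothetical block bound: score of the block's max tf at the length
 * stored in the skip entry. It is a valid upper bound only when that
 * length is the block's MINIMUM document length. */
static double block_upper_bound(double idf, double block_max_tf,
                                double skip_entry_len, double avg_len)
{
    return bm25_term(idf, block_max_tf, skip_entry_len, avg_len);
}

/* Assumed skip rule: a block that cannot beat the current top-k
 * threshold is skipped without being decoded. */
static bool bmw_can_skip(double upper_bound, double threshold)
{
    return upper_bound <= threshold;
}
```

With an average length of 100, a threshold set by a 100-token document, and a later block mixing 5-token and 500-token documents, the bound computed from length 500 falls below the threshold and the block is skipped, while the bound computed from length 5 stays above it. That mirrors the three tiers used by the regression test.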
Summary
- Fix `write_posting_blocks()` in parallel build to compute MIN fieldnorm (shortest doc) instead of MAX (longest doc) per block
- `block_max_norm` skip entry field must store the minimum fieldnorm so BMW computes valid score upper bounds; using maximum caused BMW to incorrectly skip blocks containing high-scoring short documents
- Serial build (`segment.c`) and merge (`merge.c`) paths already used `min_norm` correctly; this aligns the parallel build path
- Add `parallel_bmw` regression test that deterministically reproduces the bug: medium-length docs set the BMW threshold in early blocks, then mixed short+long doc blocks follow where the wrong upper bound causes BMW to skip them

Test plan
- `parallel_bmw` test fails deterministically without fix (0 short docs in top-10), passes with fix (10 short docs)
- `bmw` + `parallel_build` tests
- […] `binary_io` failure)
- `make format-check` passes
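As a rough illustration of the per-block computation the Summary describes, the sketch below builds a skip entry that records the minimum fieldnorm seen in a block. Only the `block_max_norm` field name comes from this PR; the `Posting` and `SkipEntry` layouts, the other field names, the 8-bit fieldnorm width, and `make_skip_entry()` are hypothetical and do not reproduce the actual `write_posting_blocks()` code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical posting layout, for illustration only. */
typedef struct {
    uint32_t doc_id;
    uint16_t tf;
    uint8_t  fieldnorm;      /* assumed encoding: smaller = shorter doc */
} Posting;

/* Hypothetical skip entry; only block_max_norm is named in the PR. */
typedef struct {
    uint32_t last_doc_id;
    uint16_t block_max_tf;
    uint8_t  block_max_norm; /* despite the name, holds the MIN fieldnorm */
} SkipEntry;

/* Summarize one block of postings for BMW: the highest tf and the
 * smallest fieldnorm (shortest document) together bound the best
 * score any document in the block can reach. */
static SkipEntry make_skip_entry(const Posting *block, size_t n)
{
    assert(block != NULL && n > 0);
    SkipEntry e = { block[n - 1].doc_id, 0, UINT8_MAX };
    for (size_t i = 0; i < n; i++) {
        if (block[i].tf > e.block_max_tf)
            e.block_max_tf = block[i].tf;
        /* The bug class fixed here: tracking the largest fieldnorm
         * instead of the smallest at this step. */
        if (block[i].fieldnorm < e.block_max_norm)
            e.block_max_norm = block[i].fieldnorm;
    }
    return e;
}
```

Starting the running minimum at `UINT8_MAX` means the first posting always replaces it, so a block of one document records that document's own fieldnorm.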