fix: widen TpDictEntry.block_count from uint16 to uint32 (#266)
Merged
Conversation
block_count was uint16, capping each term at 65,535 blocks (~8.4M postings). For corpora >50M docs with common terms, a compacted segment can exceed this limit, and the cast to uint16 silently wraps, corrupting query results.

The fix is binary-compatible: the old V4 layout used uint16 at bytes 8-9 with a reserved uint16 (always 0) at bytes 10-11. On little-endian platforms (x86-64, ARM64) this reads identically as uint32. No format version bump is needed; existing indexes work without rebuild.

Changes across 8 files:

- TpDictEntry: `uint16 block_count` + `uint16 reserved` -> `uint32 block_count`
- Remove all `(uint16)` casts on num_blocks assignments
- Remove all `entry.reserved = 0` lines for dict entries
- Widen block_count/block_idx locals in the BMW, scan, and merge paths
- Widen the tp_segment_read_skip_entry block_idx parameter to uint32

V3 segments (TpDictEntryV3) are unchanged -- their uint32 offset limit makes it impossible for a single term to reach the 65K block cap.
Force-pushed from d61f56c to 412d8bf
tjgreen42 added a commit that referenced this pull request on Mar 6, 2026:
Add BM25 ground truth precomputation and validation scripts for the MS-MARCO v2 (138M) dataset, analogous to the existing v1 pipeline.

Validates 20 curated queries: 10 with high-frequency terms (doc_freq > 8.4M, targeting the block_count overflow bug in #266) and 10 with low-frequency terms (baseline correctness).

Key design choices for 138M scale:

- Single-pass doc_freq computation (~20 min vs ~13 min/term)
- Materialized doc_term_data and doc_lengths tables to eliminate correlated subqueries that block parallel execution
- fieldnorm_quantize marked PARALLEL SAFE for parallel scans
- Total precompute time: ~2 hours (vs ~5+ hours with the serial approach)

Validation results:

- Without the #266 fix: 4/10 high-freq queries fail (doc mismatches)
- With the #266 fix: 20/20 queries pass (worst diff: 0.000002)
tjgreen42 added a commit that referenced this pull request on Mar 6, 2026:
Expand validation from 20 curated queries to 400: 10 high-frequency term queries, 10 low-frequency term queries, and 380 randomly sampled from the full dev set. Random selection uses hashint4(query_id + 42) for reproducibility.

All 400 queries validated against the BM25 index with the PR #266 fix: 100% docs match, 100% scores match, worst absolute diff 0.000003. Total precomputation time: ~14.5 hours on 138M rows.
tjgreen42 added a commit that referenced this pull request on Mar 6, 2026:
## Summary

- Add BM25 ground truth precomputation and validation scripts for MS-MARCO v2 (138M docs)
- Validates 20 curated queries: 10 high-frequency terms (targeting the #266 block_count overflow) + 10 low-frequency terms
- Includes `ground_truth_pg17.tsv` with precomputed reference scores

## Validation Results

| Condition | Docs Match | Scores Match |
|-----------|------------|--------------|
| Without #266 fix | 16/20 (80%) | 20/20 (100%) |
| With #266 fix | 20/20 (100%) | 20/20 (100%) |

The 4 failing queries without #266 all contain high-frequency terms (>8.4M postings), exactly the terms affected by the uint16 block_count overflow.

## Design Notes

At 138M rows, naive approaches don't work:

- Materialized `doc_term_data` and `doc_lengths` tables eliminate correlated subqueries that block parallel execution
- `fieldnorm_quantize()` marked `PARALLEL SAFE` (it defaults to UNSAFE, which silently blocks all parallelism)
- Single-pass doc_freq computation instead of per-term COUNT
- Total precompute time: ~2 hours with 13 parallel workers

## Test plan

- [x] Precompute ground truth on PG17 (138M rows)
- [x] Validate against unfixed main: 4/10 high-freq queries fail as expected
- [x] Validate against the #266 fix: all 20 queries pass
- [ ] Review scripts for correctness
Summary
`TpDictEntry.block_count` was `uint16`, capping each term at 65,535 blocks (~8.4M postings). For corpora >50M docs, common terms in a compacted segment exceed this limit, and the silent `(uint16)` cast wraps, corrupting query results.

This PR widens `block_count` from `uint16` to `uint32` and removes the adjacent `reserved` field, keeping `TpDictEntry` at exactly 16 bytes.

The change is binary-compatible: the old V4 layout stored the count as `uint16` at bytes 8-9 with `reserved = 0` at bytes 10-11. On little-endian platforms (x86-64, ARM64) this reads identically as `uint32`. No format version bump; existing indexes work without rebuild.

V3 segments (`TpDictEntryV3`) are unchanged; their `uint32` offset limit makes it impossible for a single term to reach the 65K block cap.

Changes across 8 files
- `segment.h`: `TpDictEntry`: `uint16 block_count` + `uint16 reserved` → `uint32 block_count`
- `segment.c`: widen `TermBlockInfo.block_count`, remove `(uint16)` cast, remove `reserved` assignments
- `merge_internal.h`: widen `MergeTermBlockInfo.block_count` and `TpPostingMergeSource.block_count`
- `merge.c`: remove `(uint16)` cast, remove `reserved` assignment
- `build_context.c`: widen `TermBlockInfo` structs, remove casts and `reserved` assignments
- `bmw.c`: widen `block_count`/`block_idx`/`block`/`target_block` locals (6 variables, 4 functions)
- `scan.c`: widen `block_count`/`target_block` locals, widen `block_idx` parameter
- `segment_io.h`: widen `tp_segment_read_skip_entry` `block_idx` parameter

Test plan
- [x] `sizeof(TpDictEntry) == 16` confirmed (struct size unchanged)
- [x] `make clean && make`: zero new warnings
- [x] `make installcheck`: all 49 regression tests pass
- [x] `make format-check`: formatting clean