Skip to content

fix: widen TpDictEntry.block_count from uint16 to uint32#266

Merged
tjgreen42 merged 1 commit intomainfrom
fix/widen-block-count-uint32
Mar 6, 2026
Merged

fix: widen TpDictEntry.block_count from uint16 to uint32#266
tjgreen42 merged 1 commit intomainfrom
fix/widen-block-count-uint32

Conversation

@tjgreen42
Copy link
Copy Markdown
Collaborator

Summary

  • Fixes Prevent uint16 overflow in segment block_count #144: TpDictEntry.block_count was uint16, capping each term at 65,535 blocks (~8.4M postings). For corpora >50M docs, common terms in a compacted segment exceed this limit and the silent (uint16) cast wraps to 0, corrupting query results.
  • Widens block_count from uint16 to uint32 and removes the adjacent reserved field, keeping TpDictEntry at exactly 16 bytes.
  • Binary-compatible: old V4 segments stored uint16 at bytes 8-9 with reserved=0 at bytes 10-11. On little-endian platforms (x86-64, ARM64) this reads identically as uint32. No format version bump; existing indexes work without rebuild.
  • V3 segments (TpDictEntryV3) are unchanged — their uint32 offset limit makes it impossible for a single term to reach the 65K block cap.

Changes across 8 files

File Change
segment.h TpDictEntry: uint16 block_count + uint16 reserveduint32 block_count
segment.c Widen TermBlockInfo.block_count, remove (uint16) cast, remove reserved assignments
merge_internal.h Widen MergeTermBlockInfo.block_count and TpPostingMergeSource.block_count
merge.c Remove (uint16) cast, remove reserved assignment
build_context.c Widen both TermBlockInfo structs, remove casts and reserved assignments
bmw.c Widen block_count/block_idx/block/target_block locals (6 variables, 4 functions)
scan.c Widen block_count/target_block locals, widen block_idx parameter
segment_io.h Widen tp_segment_read_skip_entry block_idx parameter

Test plan

  • sizeof(TpDictEntry) == 16 confirmed (struct size unchanged)
  • make clean && make — zero new warnings
  • make installcheck — all 49 regression tests pass
  • make format-check — formatting clean
  • CI: PG17 + PG18 regression tests
  • CI: Address sanitizer build
  • CI: Formatting check

block_count was uint16, capping each term at 65,535 blocks (~8.4M
postings). For corpora >50M docs with common terms, a compacted
segment can exceed this limit. The cast to uint16 silently wraps,
corrupting query results.

The fix is binary-compatible: the old V4 layout used uint16 at bytes
8-9 with a reserved uint16 (always 0) at bytes 10-11. On little-endian
platforms (x86-64, ARM64) this reads identically as uint32. No format
version bump needed; existing indexes work without rebuild.

Changes across 8 files:
- TpDictEntry: uint16 block_count + uint16 reserved -> uint32 block_count
- Remove all (uint16) casts on num_blocks assignments
- Remove all entry.reserved = 0 lines for dict entries
- Widen block_count/block_idx locals in BMW, scan, merge paths
- Widen tp_segment_read_skip_entry block_idx parameter to uint32

V3 segments (TpDictEntryV3) are unchanged -- their uint32 offset limit
makes it impossible for a single term to reach the 65K block cap.
@tjgreen42 tjgreen42 force-pushed the fix/widen-block-count-uint32 branch from d61f56c to 412d8bf Compare March 5, 2026 16:46
tjgreen42 added a commit that referenced this pull request Mar 6, 2026
Add BM25 ground truth precomputation and validation scripts for the
MS-MARCO v2 (138M) dataset, analogous to the existing v1 pipeline.

Validates 20 curated queries: 10 with high-frequency terms (doc_freq >
8.4M, targeting the block_count overflow bug in #266) and 10 with
low-frequency terms (baseline correctness).

Key design choices for 138M scale:
- Single-pass doc_freq computation (~20 min vs ~13 min/term)
- Materialized doc_term_data and doc_lengths tables to eliminate
  correlated subqueries that block parallel execution
- fieldnorm_quantize marked PARALLEL SAFE for parallel scans
- Total precompute time: ~2 hours (vs ~5+ hours with serial approach)

Validation results:
- Without #266 fix: 4/10 high-freq queries fail (doc mismatches)
- With #266 fix: 20/20 queries pass (worst diff: 0.000002)
tjgreen42 added a commit that referenced this pull request Mar 6, 2026
Expand validation from 20 curated queries to 400: 10 high-frequency
term queries, 10 low-frequency term queries, and 380 randomly sampled
from the full dev set. Random selection uses hashint4(query_id + 42)
for reproducibility.

All 400 queries validated against the BM25 index with PR #266 fix:
100% docs match, 100% scores match, worst absolute diff 0.000003.

Total precomputation time: ~14.5 hours on 138M rows.
@tjgreen42 tjgreen42 merged commit 8cb6397 into main Mar 6, 2026
15 checks passed
@tjgreen42 tjgreen42 deleted the fix/widen-block-count-uint32 branch March 6, 2026 22:57
tjgreen42 added a commit that referenced this pull request Mar 6, 2026
## Summary

- Add BM25 ground truth precomputation and validation scripts for
MS-MARCO v2 (138M docs)
- Validates 20 curated queries: 10 high-frequency terms (targeting #266
block_count overflow) + 10 low-frequency terms
- Includes `ground_truth_pg17.tsv` with precomputed reference scores

## Validation Results

| Condition | Docs Match | Scores Match |
|-----------|-----------|--------------|
| Without #266 fix | 16/20 (80%) | 20/20 (100%) |
| With #266 fix | 20/20 (100%) | 20/20 (100%) |

The 4 failing queries without #266 all contain high-frequency terms
(>8.4M postings) — exactly the terms affected by the uint16 block_count
overflow.

## Design Notes

At 138M rows, naive approaches don't work:
- Materialized `doc_term_data` and `doc_lengths` tables eliminate
correlated subqueries that block parallel execution
- `fieldnorm_quantize()` marked `PARALLEL SAFE` (defaults to UNSAFE,
silently blocks all parallelism)
- Single-pass doc_freq computation instead of per-term COUNT
- Total precompute time: ~2 hours with 13 parallel workers

## Test plan

- [x] Precompute ground truth on PG17 (138M rows)
- [x] Validate against unfixed main — 4/10 high-freq queries fail as
expected
- [x] Validate against #266 fix — all 20 queries pass
- [ ] Review scripts for correctness
@tjgreen42 tjgreen42 mentioned this pull request Mar 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Prevent uint16 overflow in segment block_count

1 participant