fix: widen TpDictEntry.block_count from uint16 to uint32 (#266)
Merged
Conversation
block_count was uint16, capping each term at 65,535 blocks (~8.4M postings). For corpora >50M docs with common terms, a compacted segment can exceed this limit, and the cast to uint16 silently wraps, corrupting query results.

The fix is binary-compatible: the old V4 layout used uint16 at bytes 8-9 with a reserved uint16 (always 0) at bytes 10-11. On little-endian platforms (x86-64, ARM64) this reads identically as uint32. No format version bump is needed; existing indexes work without rebuild.

Changes across 8 files:

- TpDictEntry: `uint16 block_count` + `uint16 reserved` -> `uint32 block_count`
- Remove all `(uint16)` casts on num_blocks assignments
- Remove all `entry.reserved = 0` lines for dict entries
- Widen block_count/block_idx locals in the BMW, scan, and merge paths
- Widen the tp_segment_read_skip_entry block_idx parameter to uint32

V3 segments (TpDictEntryV3) are unchanged -- their uint32 offset limit makes it impossible for a single term to reach the 65K block cap.
Force-pushed from d61f56c to 412d8bf
tjgreen42 added a commit that referenced this pull request on Mar 6, 2026:
Add BM25 ground truth precomputation and validation scripts for the MS-MARCO v2 (138M) dataset, analogous to the existing v1 pipeline.

Validates 20 curated queries: 10 with high-frequency terms (doc_freq > 8.4M, targeting the block_count overflow bug in #266) and 10 with low-frequency terms (baseline correctness).

Key design choices for 138M scale:

- Single-pass doc_freq computation (~20 min vs ~13 min/term)
- Materialized doc_term_data and doc_lengths tables to eliminate correlated subqueries that block parallel execution
- fieldnorm_quantize marked PARALLEL SAFE for parallel scans
- Total precompute time: ~2 hours (vs ~5+ hours with the serial approach)

Validation results:

- Without the #266 fix: 4/10 high-freq queries fail (doc mismatches)
- With the #266 fix: 20/20 queries pass (worst diff: 0.000002)
tjgreen42 added a commit that referenced this pull request on Mar 6, 2026:
Expand validation from 20 curated queries to 400: 10 high-frequency term queries, 10 low-frequency term queries, and 380 randomly sampled from the full dev set. Random selection uses hashint4(query_id + 42) for reproducibility.

All 400 queries validated against the BM25 index with the PR #266 fix: 100% docs match, 100% scores match, worst absolute diff 0.000003. Total precomputation time: ~14.5 hours on 138M rows.
tjgreen42 added a commit that referenced this pull request on Mar 6, 2026:
## Summary

- Add BM25 ground truth precomputation and validation scripts for MS-MARCO v2 (138M docs)
- Validates 20 curated queries: 10 high-frequency terms (targeting the #266 block_count overflow) + 10 low-frequency terms
- Includes `ground_truth_pg17.tsv` with precomputed reference scores

## Validation Results

| Condition | Docs Match | Scores Match |
|-----------|------------|--------------|
| Without #266 fix | 16/20 (80%) | 20/20 (100%) |
| With #266 fix | 20/20 (100%) | 20/20 (100%) |

The 4 failing queries without #266 all contain high-frequency terms (>8.4M postings), exactly the terms affected by the uint16 block_count overflow.

## Design Notes

At 138M rows, naive approaches don't work:

- Materialized `doc_term_data` and `doc_lengths` tables eliminate correlated subqueries that block parallel execution
- `fieldnorm_quantize()` marked `PARALLEL SAFE` (it defaults to UNSAFE, which silently blocks all parallelism)
- Single-pass doc_freq computation instead of per-term COUNT
- Total precompute time: ~2 hours with 13 parallel workers

## Test plan

- [x] Precompute ground truth on PG17 (138M rows)
- [x] Validate against unfixed main: 4/10 high-freq queries fail as expected
- [x] Validate against the #266 fix: all 20 queries pass
- [ ] Review scripts for correctness
Summary
`TpDictEntry.block_count` was `uint16`, capping each term at 65,535 blocks (~8.4M postings). For corpora >50M docs, common terms in a compacted segment exceed this limit, and the silent `(uint16)` cast wraps, corrupting query results.

This PR widens `block_count` from `uint16` to `uint32` and removes the adjacent `reserved` field, keeping `TpDictEntry` at exactly 16 bytes.

The change is binary-compatible: the old V4 layout stored the count as `uint16` at bytes 8-9 with `reserved = 0` at bytes 10-11. On little-endian platforms (x86-64, ARM64) this reads identically as `uint32`. No format version bump; existing indexes work without rebuild.

V3 segments (`TpDictEntryV3`) are unchanged; their `uint32` offset limit makes it impossible for a single term to reach the 65K block cap.

Changes across 8 files
- `segment.h`: `TpDictEntry`: `uint16 block_count` + `uint16 reserved` → `uint32 block_count`
- `segment.c`: widen `TermBlockInfo.block_count`, remove `(uint16)` cast, remove `reserved` assignments
- `merge_internal.h`: widen `MergeTermBlockInfo.block_count` and `TpPostingMergeSource.block_count`
- `merge.c`: remove `(uint16)` cast, remove `reserved` assignment
- `build_context.c`: widen `TermBlockInfo` structs, remove casts and `reserved` assignments
- `bmw.c`: widen `block_count`/`block_idx`/`block`/`target_block` locals (6 variables, 4 functions)
- `scan.c`: widen `block_count`/`target_block` locals, widen `block_idx` parameter
- `segment_io.h`: widen `tp_segment_read_skip_entry` `block_idx` parameter

Test plan
- [x] `sizeof(TpDictEntry) == 16` confirmed (struct size unchanged)
- [x] `make clean && make`: zero new warnings
- [x] `make installcheck`: all 49 regression tests pass
- [x] `make format-check`: formatting clean