bench: add MS-MARCO v2 ground truth validation by tjgreen42 · Pull Request #268 · timescale/pg_textsearch

tjgreen42 · 2026-03-06T01:24:10Z

Summary

Add BM25 ground truth precomputation and validation scripts for MS-MARCO v2 (138M docs)
Validates 20 curated queries: 10 with high-frequency terms (targeting the block_count overflow addressed by #266, "fix: widen TpDictEntry.block_count from uint16 to uint32") and 10 with low-frequency terms
Includes ground_truth_pg17.tsv with precomputed reference scores

Validation Results

Condition	Docs Match	Scores Match
Without #266 fix	16/20 (80%)	20/20 (100%)
With #266 fix	20/20 (100%)	20/20 (100%)

The 4 failing queries without #266 all contain high-frequency terms (>8.4M postings), which are exactly the terms affected by the uint16 block_count overflow.
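To see why the failures cluster above roughly 8.4M postings, here is a minimal sketch of the overflow arithmetic. It assumes a block size of 128 postings, which is not stated in this PR but is consistent with the threshold (65,536 × 128 ≈ 8.39M). The function names are hypothetical, and the masking only models how a C conversion to uint16 wraps; it is not the extension's actual code.

```python
# Assumption: 128 postings per block (not stated in the PR; chosen because
# 65,536 * 128 is about 8.39M, matching the reported threshold).
POSTINGS_PER_BLOCK = 128
UINT16_MAX = 0xFFFF


def block_count(postings: int) -> int:
    """Number of blocks needed for a posting list (ceiling division)."""
    return -(-postings // POSTINGS_PER_BLOCK)


def stored_as_uint16(value: int) -> int:
    """Model a C conversion to uint16: the value wraps modulo 65,536."""
    return value & UINT16_MAX


def max_postings_without_overflow() -> int:
    """Largest posting count whose block count still fits in a uint16."""
    return UINT16_MAX * POSTINGS_PER_BLOCK


if __name__ == "__main__":
    for postings in (8_000_000, 9_000_000):
        true_blocks = block_count(postings)
        print(postings, true_blocks, stored_as_uint16(true_blocks))
```

Under these assumptions, a term with 9M postings needs more blocks than a uint16 can represent, so the stored count wraps to a much smaller number and most of the posting list becomes unreachable. That would explain document mismatches for high-frequency terms only.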

Design Notes

At 138M rows, naive approaches don't work:

Materialized doc_term_data and doc_lengths tables eliminate correlated subqueries that block parallel execution
fieldnorm_quantize() marked PARALLEL SAFE (functions default to PARALLEL UNSAFE, which silently disables parallel plans for any query that calls them)
Single-pass doc_freq computation instead of per-term COUNT
Total precompute time: ~2 hours with 13 parallel workers
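The single-pass doc_freq idea from the notes above can be sketched in a few lines. This is an in-memory Python illustration only; the PR's scripts operate on PostgreSQL tables, and the function names and data layout here are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def doc_freq_single_pass(docs: Iterable[List[str]]) -> Dict[str, int]:
    """Count, for every term at once, how many documents contain it.

    One scan over the corpus fills the whole table, instead of one
    scan per query term.
    """
    df: Counter = Counter()
    for terms in docs:
        # set() so a term repeated within a document is counted once
        df.update(set(terms))
    return dict(df)


def doc_freq_per_term(docs: List[List[str]], term: str) -> int:
    """Naive alternative: a full scan of the corpus for a single term."""
    return sum(1 for terms in docs if term in terms)


if __name__ == "__main__":
    corpus = [["the", "cat", "the"], ["the", "dog"], ["a", "cat"]]
    print(doc_freq_single_pass(corpus))
```

The per-term variant costs one corpus scan per distinct term, which is why it scales poorly when the set of query terms is large and the corpus has 138M rows.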

Test plan

Precompute ground truth on PG17 (138M rows)
Validate against unfixed main: 4/10 high-freq queries fail as expected
Validate against the #266 fix (widen TpDictEntry.block_count from uint16 to uint32): all 20 queries pass
Review scripts for correctness
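The validation steps above report two separate outcomes per query: whether the returned documents match and whether the scores match. A minimal sketch of such a check follows. It assumes results are compared by rank position and that scores are compared with an absolute tolerance; the tolerance value, names, and data layout are hypothetical and not taken from the PR's scripts.

```python
from dataclasses import dataclass
from typing import List, Tuple

Result = List[Tuple[int, float]]  # (doc_id, score) pairs in rank order


@dataclass
class QueryCheck:
    docs_match: bool
    scores_match: bool
    worst_diff: float


def check_query(expected: Result, actual: Result, tol: float = 1e-5) -> QueryCheck:
    """Compare one query's index results against the ground truth."""
    # Documents match only if the same IDs appear in the same order.
    docs_match = [d for d, _ in expected] == [d for d, _ in actual]
    # Scores are compared position by position (an assumption).
    diffs = [abs(e - a) for (_, e), (_, a) in zip(expected, actual)]
    worst = max(diffs, default=0.0)
    scores_match = len(expected) == len(actual) and worst <= tol
    return QueryCheck(docs_match, scores_match, worst)


if __name__ == "__main__":
    truth = [(1, 2.5), (2, 1.0)]
    got = [(1, 2.500001), (3, 1.0)]
    print(check_query(truth, got))
```

Keeping the two checks separate makes it possible for a query to fail on documents while still passing on scores, as in the "Without #266 fix" row of the results table.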

Add BM25 ground truth precomputation and validation scripts for the MS-MARCO v2 (138M) dataset, analogous to the existing v1 pipeline.

Validates 20 curated queries: 10 with high-frequency terms (doc_freq > 8.4M, targeting the block_count overflow bug in #266) and 10 with low-frequency terms (baseline correctness).

Key design choices for 138M scale:
- Single-pass doc_freq computation (~20 min vs ~13 min/term)
- Materialized doc_term_data and doc_lengths tables to eliminate correlated subqueries that block parallel execution
- fieldnorm_quantize marked PARALLEL SAFE for parallel scans
- Total precompute time: ~2 hours (vs ~5+ hours with serial approach)

Validation results:
- Without #266 fix: 4/10 high-freq queries fail (doc mismatches)
- With #266 fix: 20/20 queries pass (worst diff: 0.000002)

Expand validation from 20 curated queries to 400: 10 high-frequency term queries, 10 low-frequency term queries, and 380 randomly sampled from the full dev set. Random selection uses hashint4(query_id + 42) for reproducibility. All 400 queries validated against the BM25 index with PR #266 fix: 100% docs match, 100% scores match, worst absolute diff 0.000003. Total precomputation time: ~14.5 hours on 138M rows.
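The reproducible random selection described above can be illustrated with a hash-ordered sample. This sketch substitutes SHA-256 for PostgreSQL's hashint4, so it will not select the same queries as the PR's scripts. How the hash is used there (for example as a sort key) is also an assumption, and all names are hypothetical.

```python
import hashlib
from typing import Iterable, List


def stable_hash(query_id: int, seed: int = 42) -> int:
    """Deterministic hash of query_id + seed (SHA-256 here, not hashint4)."""
    digest = hashlib.sha256(str(query_id + seed).encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def sample_queries(query_ids: Iterable[int], n: int, seed: int = 42) -> List[int]:
    """Pick n query IDs by sorting on the hash and taking the first n.

    The result depends only on the set of IDs and the seed, not on
    input order or on any random number generator state.
    """
    ordered = sorted(query_ids, key=lambda q: (stable_hash(q, seed), q))
    return ordered[:n]


if __name__ == "__main__":
    print(sample_queries(range(1000), 5))
```

Because the ordering is a pure function of the ID and the seed, rerunning the selection on another machine yields the same sample, which is what keeps a precomputed ground-truth file valid across runs.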

tjgreen42 added 3 commits March 6, 2026 01:23

Merge branch 'main' into bench/msmarco-v2-ground-truth

218879e

tjgreen42 merged commit 9783a1f into main Mar 6, 2026
1 check passed

tjgreen42 deleted the bench/msmarco-v2-ground-truth branch March 6, 2026 22:59
