bench: add MS-MARCO v2 ground truth validation #268

Merged
tjgreen42 merged 3 commits into main from bench/msmarco-v2-ground-truth
Mar 6, 2026
@tjgreen42
Collaborator

Summary

  • Add BM25 ground truth precomputation and validation scripts for MS-MARCO v2 (138M docs)
  • Validates 20 curated queries: 10 with high-frequency terms (targeting the block_count overflow fixed by #266, "fix: widen TpDictEntry.block_count from uint16 to uint32") and 10 with low-frequency terms
  • Includes ground_truth_pg17.tsv with precomputed reference scores

Validation Results

| Condition | Docs Match | Scores Match |
|---|---|---|
| Without #266 fix | 16/20 (80%) | 20/20 (100%) |
| With #266 fix | 20/20 (100%) | 20/20 (100%) |

The 4 failing queries without #266 all contain high-frequency terms (>8.4M postings) — exactly the terms affected by the uint16 block_count overflow.
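
The 8.4M threshold is consistent with a 16-bit block counter wrapping around. The sketch below works through that arithmetic under one loud assumption: 128 postings per block, a figure this PR does not state. The helper names are hypothetical, and the masking only models how an unsigned 16-bit field truncates, not the actual index code.

```python
# ASSUMPTION: 128 postings per block. This value is not stated in the PR;
# it is chosen because 65,535 * 128 is roughly the 8.4M threshold quoted above.
POSTINGS_PER_BLOCK = 128
UINT16_MAX = 2**16 - 1  # largest value a uint16 block_count can hold


def blocks_needed(postings: int) -> int:
    """Number of blocks required to hold `postings` entries (ceiling division)."""
    return -(-postings // POSTINGS_PER_BLOCK)


def stored_block_count_uint16(postings: int) -> int:
    """Value that would survive in a uint16 field (hypothetical truncation model)."""
    return blocks_needed(postings) & 0xFFFF


# Largest posting list whose block count still fits in a uint16.
max_safe_postings = UINT16_MAX * POSTINGS_PER_BLOCK

if __name__ == "__main__":
    print(max_safe_postings)                     # 8388480
    print(stored_block_count_uint16(9_000_000))  # 4777, not the true 70313
```

Under this assumption, a term with 9M postings would appear to have only 4,777 blocks, so most of its posting list would be invisible to the scan. That matches the observed symptom of missing documents rather than wrong scores.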

Design Notes

At 138M rows, naive approaches don't work:

  • Materialized doc_term_data and doc_lengths tables eliminate correlated subqueries that block parallel execution
  • fieldnorm_quantize() marked PARALLEL SAFE (functions default to PARALLEL UNSAFE, which silently blocks all parallelism in any query that calls them)
  • Single-pass doc_freq computation instead of per-term COUNT
  • Total precompute time: ~2 hours with 13 parallel workers
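
To illustrate the single-pass doc_freq point, here is a minimal Python sketch. The actual scripts run in PostgreSQL, so this is only a stand-in for the idea, and the function names are hypothetical. A per-term count rescans the whole corpus once per term, while the single-pass version visits each document once and updates every term's counter.

```python
from collections import Counter
from typing import Iterable, List


def doc_freq_single_pass(docs: Iterable[List[str]]) -> Counter:
    """Return term -> number of documents containing it, in one corpus scan."""
    df = Counter()
    for tokens in docs:
        # set() so a term repeated within one document is counted once
        df.update(set(tokens))
    return df


def doc_freq_per_term(docs: List[List[str]], term: str) -> int:
    """Naive equivalent of a per-term COUNT: one full corpus scan per term."""
    return sum(1 for tokens in docs if term in tokens)


if __name__ == "__main__":
    corpus = [["a", "b", "a"], ["b", "c"], ["a"]]
    print(doc_freq_single_pass(corpus))
    print(doc_freq_per_term(corpus, "a"))
```

Both functions give the same counts. The difference is cost: with T distinct terms, the per-term approach does T scans of the corpus and the single-pass approach does one.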

Test plan

Add BM25 ground truth precomputation and validation scripts for the
MS-MARCO v2 (138M) dataset, analogous to the existing v1 pipeline.

Validates 20 curated queries: 10 with high-frequency terms (doc_freq >
8.4M, targeting the block_count overflow bug in #266) and 10 with
low-frequency terms (baseline correctness).

Key design choices for 138M scale:
- Single-pass doc_freq computation (~20 min vs ~13 min/term)
- Materialized doc_term_data and doc_lengths tables to eliminate
  correlated subqueries that block parallel execution
- fieldnorm_quantize marked PARALLEL SAFE for parallel scans
- Total precompute time: ~2 hours (vs ~5+ hours with serial approach)

Validation results:
- Without #266 fix: 4/10 high-freq queries fail (doc mismatches)
- With #266 fix: 20/20 queries pass (worst diff: 0.000002)

Expand validation from 20 curated queries to 400: 10 high-frequency
term queries, 10 low-frequency term queries, and 380 randomly sampled
from the full dev set. Random selection uses hashint4(query_id + 42)
for reproducibility.
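
A hash-based selection of this kind can be sketched as follows. Two assumptions are baked in: the sample is taken by ordering query IDs by their hash and keeping the first k, which the text above does not spell out, and `zlib.crc32` stands in for PostgreSQL's `hashint4`. The stand-in does not produce the same values, so this sketch would pick a different set of queries than the real script.

```python
import struct
import zlib
from typing import Iterable, List

SEED = 42  # the offset added to query_id in the text above


def stable_hash(query_id: int) -> int:
    """Deterministic hash of (query_id + SEED).

    Stand-in for PostgreSQL's hashint4: same idea, different values.
    """
    return zlib.crc32(struct.pack("<q", query_id + SEED))


def sample_queries(query_ids: Iterable[int], k: int) -> List[int]:
    """Pick k query IDs by hash order (ASSUMED selection rule).

    Ties on the hash are broken by query ID so the result never depends
    on input order.
    """
    return sorted(query_ids, key=lambda q: (stable_hash(q), q))[:k]


if __name__ == "__main__":
    picked = sample_queries(range(1000), 380)
    print(len(picked), picked[:5])
```

Because the order depends only on the IDs and the fixed offset, rerunning the selection yields the same 380 queries without storing a random seed or the sample itself.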

All 400 queries validated against the BM25 index with PR #266 fix:
100% docs match, 100% scores match, worst absolute diff 0.000003.
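
The three reported metrics (docs match, scores match, worst absolute diff) could be computed per query roughly as below. This is a sketch under assumptions, not the validation script itself: the function name, the rank-by-rank score comparison, and the `1e-5` tolerance are all hypothetical.

```python
from typing import List, Tuple

Ranked = List[Tuple[int, float]]  # (doc_id, score) in rank order


def validate_query(expected: Ranked, actual: Ranked, tol: float = 1e-5):
    """Compare index results against ground truth for one query.

    Returns (docs_match, scores_match, worst_abs_diff).
    ASSUMPTION: scores are compared rank by rank with tolerance `tol`.
    """
    docs_match = [d for d, _ in expected] == [d for d, _ in actual]
    diffs = [abs(e - a) for (_, e), (_, a) in zip(expected, actual)]
    worst = max(diffs, default=0.0)
    scores_match = len(expected) == len(actual) and worst <= tol
    return docs_match, scores_match, worst


if __name__ == "__main__":
    truth = [(1, 10.0), (2, 9.5)]
    found = [(1, 10.000002), (3, 9.5)]
    print(validate_query(truth, found))
```

Comparing scores by rank rather than by document ID is one way a run can report 100% scores match while some document lists differ, as in the results without the #266 fix.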

Total precomputation time: ~14.5 hours on 138M rows.
@tjgreen42 tjgreen42 merged commit 9783a1f into main Mar 6, 2026
1 check passed
@tjgreen42 tjgreen42 deleted the bench/msmarco-v2-ground-truth branch March 6, 2026 22:59