bench: add MS-MARCO v2 ground truth validation #268
Merged
Add BM25 ground truth precomputation and validation scripts for the MS-MARCO v2 (138M) dataset, analogous to the existing v1 pipeline. Validates 20 curated queries: 10 with high-frequency terms (doc_freq > 8.4M, targeting the block_count overflow bug in #266) and 10 with low-frequency terms (baseline correctness).

Key design choices for 138M scale:
- Single-pass doc_freq computation (~20 min vs ~13 min/term)
- Materialized doc_term_data and doc_lengths tables to eliminate correlated subqueries that block parallel execution
- fieldnorm_quantize marked PARALLEL SAFE for parallel scans
- Total precompute time: ~2 hours (vs ~5+ hours with serial approach)

Validation results:
- Without #266 fix: 4/10 high-freq queries fail (doc mismatches)
- With #266 fix: 20/20 queries pass (worst diff: 0.000002)

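The single-pass doc_freq idea from the design choices above can be illustrated with a small sketch. This is not the PR's SQL; it is a toy Python version under the assumption that documents are already tokenized, and the function names are hypothetical. It contrasts one scan over the corpus for all terms with one scan per term.

```python
from collections import Counter

def doc_freq_single_pass(docs, terms):
    """Count documents containing each term of interest in ONE corpus scan."""
    wanted = set(terms)
    counts = Counter({t: 0 for t in wanted})
    for tokens in docs:
        # intersection() yields each matching term once per document,
        # so repeated occurrences inside a document are not double counted.
        for term in wanted.intersection(tokens):
            counts[term] += 1
    return dict(counts)

def doc_freq_per_term(docs, terms):
    """Naive baseline: one full corpus scan for every term."""
    return {t: sum(1 for tokens in docs if t in tokens) for t in set(terms)}

# Toy corpus (hypothetical data, for illustration only).
docs = [
    ["the", "quick", "fox"],
    ["the", "lazy", "dog", "the"],
    ["quick", "quick", "dog"],
]
terms = ["the", "quick", "cat"]

single = doc_freq_single_pass(docs, terms)
naive = doc_freq_per_term(docs, terms)
```

Both functions return the same counts; the difference is that the baseline's cost grows with the number of terms times the corpus size, which is what makes a per-term scan impractical at 138M rows.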
Expand validation from 20 curated queries to 400: 10 high-frequency term queries, 10 low-frequency term queries, and 380 randomly sampled from the full dev set. Random selection uses hashint4(query_id + 42) for reproducibility. All 400 queries validated against the BM25 index with PR #266 fix: 100% docs match, 100% scores match, worst absolute diff 0.000003. Total precomputation time: ~14.5 hours on 138M rows.
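As a rough illustration of hash-ordered sampling for reproducibility, the sketch below picks a fixed-size sample by sorting query ids on a salted hash. It is an assumption-laden stand-in: it uses SHA-256 from Python's standard library rather than PostgreSQL's hashint4, so it will not select the same queries as the PR's scripts, and the names `stable_key` and `sample_queries` are hypothetical.

```python
import hashlib

SALT = 42  # mirrors the "+ 42" offset in the description; acts like a seed

def stable_key(query_id: int, salt: int = SALT) -> int:
    """Map a query id to a pseudo-random but fully deterministic integer."""
    digest = hashlib.sha256(str(query_id + salt).encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")

def sample_queries(query_ids, k):
    """Return the k ids with the smallest hash keys.

    The result depends only on the set of ids and the salt, not on input
    order or on any random-number-generator state.
    """
    unique_ids = set(query_ids)
    # The id itself is a tie-breaker so the ordering is total.
    ordered = sorted(unique_ids, key=lambda q: (stable_key(q), q))
    return ordered[:k]

# Hypothetical id range, for illustration only.
dev_ids = list(range(1, 5001))
sample_a = sample_queries(dev_ids, 380)
sample_b = sample_queries(list(reversed(dev_ids)), 380)
```

Because the ordering comes from a hash of the id rather than from a random generator, rerunning the selection yields the same sample, and changing the salt yields a different one.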
Summary
- ground_truth_pg17.tsv with precomputed reference scores

Validation Results
The 4 failing queries without #266 all contain high-frequency terms (>8.4M postings) — exactly the terms affected by the uint16 block_count overflow.
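The arithmetic behind the ~8.4M threshold can be sketched as follows. The block size of 128 postings is an assumption (it is not stated in this PR, though 65535 × 128 happens to land at about 8.4M), and the helper names are hypothetical; the actual storage layout fixed in #266 may differ.

```python
BLOCK_SIZE = 128     # ASSUMED postings per block; not stated in this PR
UINT16_MAX = 0xFFFF  # largest value a uint16 can hold (65535)

def block_count(doc_freq: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed to hold doc_freq postings (ceiling division)."""
    return (doc_freq + block_size - 1) // block_size

def as_uint16(value: int) -> int:
    """Value that survives when stored in a 16-bit unsigned field (wraps mod 65536)."""
    return value & UINT16_MAX

def overflows(doc_freq: int, block_size: int = BLOCK_SIZE) -> bool:
    """True if the block count no longer fits in a uint16."""
    return block_count(doc_freq, block_size) > UINT16_MAX

# Largest posting-list length whose block count still fits in a uint16.
max_safe_doc_freq = UINT16_MAX * BLOCK_SIZE  # 8,388,480

# A term with 8.4M postings needs 65,625 blocks, which wraps to 89.
true_blocks = block_count(8_400_000)
stored_blocks = as_uint16(true_blocks)
```

Under this assumption, a reader that trusts the stored count would see only 89 blocks for an 8.4M-posting term and skip most of the posting list, which would be consistent with the document mismatches reported for the high-frequency queries.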
Design Notes
At 138M rows, naive approaches don't work:
- doc_term_data and doc_lengths tables eliminate correlated subqueries that block parallel execution
- fieldnorm_quantize() marked PARALLEL SAFE (defaults to UNSAFE, silently blocks all parallelism)

Test plan