bench: add MS MARCO v2 dataset and weighted-average latency metric by tjgreen42 · Pull Request #226 · timescale/pg_textsearch

tjgreen42 · 2026-02-12T16:00:14Z

Summary

Add benchmark infrastructure for the MS MARCO v2 passage ranking dataset (138M passages) with download, load, query, and System X comparison scripts
Introduce weighted-average query latency as a new tracked metric across all benchmark datasets (MS MARCO v1, v2, Wikipedia)
Per-bucket p50 latencies are weighted by the observed MS MARCO v1 query-length distribution, measured in lexemes per query (about 1M queries; the mode is 3 tokens at 30%, the mean is 3.7 tokens), to produce a single summary number that reflects realistic workload performance
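
As a rough illustration of the weighting described above, the sketch below combines per-bucket p50 latencies into one number using the rounded percentages from the "Distribution weights" table in this PR. The function name, the bucket keys, and the example latencies are hypothetical, and dividing by the sum of the weights (the rounded percentages add up to 99.9, not 100) is an assumption; the actual computation lives in the queries.sql files and may differ.

```python
# Share of MS MARCO v1 queries per token-count bucket, in percent,
# copied from the "Distribution weights" table in this PR.
WEIGHTS_PCT = {
    "1": 3.5, "2": 16.3, "3": 30.2, "4": 26.1,
    "5": 14.2, "6": 5.9, "7": 2.2, "8+": 1.5,
}


def weighted_latency(p50_ms):
    """Return the weighted average of per-bucket p50 latencies (ms).

    p50_ms maps each bucket label in WEIGHTS_PCT to a measured p50 latency.
    The result is normalized by the total weight (an assumption), so the
    rounding of the percentages does not bias the summary number.
    """
    missing = set(WEIGHTS_PCT) - set(p50_ms)
    if missing:
        raise ValueError(f"missing latency for buckets: {sorted(missing)}")
    total_weight = sum(WEIGHTS_PCT.values())
    weighted_sum = sum(WEIGHTS_PCT[b] * p50_ms[b] for b in WEIGHTS_PCT)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical p50 values for illustration only, not benchmark results.
    example_p50_ms = {
        "1": 1.0, "2": 2.0, "3": 3.0, "4": 4.0,
        "5": 5.0, "6": 6.0, "7": 7.0, "8+": 8.0,
    }
    print(f"weighted latency: {weighted_latency(example_p50_ms):.3f} ms")
```

Because 3- and 4-token queries carry more than half of the total weight, the summary number tracks latency in those buckets far more closely than latency for very short or very long queries.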

Files changed

New (MS MARCO v2 benchmark infrastructure):

benchmarks/datasets/msmarco-v2/ — download, load, query scripts + benchmark queries TSV
benchmarks/datasets/msmarco-v2/systemx/ — System X (ParadeDB) comparison scripts

Modified (weighted-average metric):

benchmarks/datasets/msmarco/queries.sql — add weighted-average computation
benchmarks/datasets/wikipedia/queries.sql — add weighted-average computation
benchmarks/runner/extract_metrics.sh — extract WEIGHTED_LATENCY_RESULT from logs
benchmarks/runner/format_for_action.sh — report weighted latency in github-action-benchmark format
benchmarks/runner/run_benchmark.sh — add msmarco-v2 as a dataset option
benchmarks/gh-pages/methodology.html — document weighted methodology + MS MARCO v2 dataset

Distribution weights (MS-MARCO v1, 1,010,905 queries)

Tokens	Count	Weight
1	35,638	3.5%
2	165,033	16.3%
3	304,887	30.2%
4	264,177	26.1%
5	143,765	14.2%
6	59,558	5.9%
7	22,595	2.2%
8+	15,252	1.5%
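
The table above can be rechecked from the raw counts. The following sketch recomputes the total, the per-bucket weights, the mode, and a mean query length. The variable names are illustrative, and counting every query in the 8+ bucket as exactly 8 tokens is an assumption, so the computed mean is a lower bound on the true mean.

```python
# Query counts per token-count bucket, copied from the table above.
COUNTS = {
    "1": 35_638, "2": 165_033, "3": 304_887, "4": 264_177,
    "5": 143_765, "6": 59_558, "7": 22_595, "8+": 15_252,
}

total_queries = sum(COUNTS.values())

# Weight of each bucket as a percentage of all queries, rounded to 0.1%.
weights_pct = {b: round(100 * c / total_queries, 1) for b, c in COUNTS.items()}

# Most common query length (the mode of the distribution).
mode_bucket = max(COUNTS, key=COUNTS.get)

# Mean tokens per query, counting the "8+" bucket as exactly 8 tokens
# (an assumption; the true mean is at least this value).
token_value = {b: int(b.rstrip("+")) for b in COUNTS}
mean_tokens_lower_bound = (
    sum(token_value[b] * c for b, c in COUNTS.items()) / total_queries
)

if __name__ == "__main__":
    print("total queries:", total_queries)
    print("weights (%):", weights_pct)
    print("mode bucket:", mode_bucket)
    print(f"mean tokens (lower bound): {mean_tokens_lower_bound:.2f}")
```

The rounded weights sum to 99.9% rather than 100%, so any computation that uses them directly should either normalize by their sum or work from the raw counts.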

Test plan

Verify weighted-average computation is correct by manual calculation
Run MS MARCO v1 benchmark and confirm WEIGHTED_LATENCY_RESULT appears in output
Run extract_metrics.sh on output and verify JSON includes weighted_latency section
Run format_for_action.sh and verify "Weighted Latency" appears in action output
CI passes (benchmark-only changes, no C code modified)

🤖 Generated with Claude Code

Add benchmark infrastructure for the MS MARCO v2 passage ranking dataset (138M passages) with download, load, query, and System X comparison scripts. Introduce weighted-average query latency as a new tracked metric across all benchmark datasets (MS MARCO v1, v2, Wikipedia). Per-bucket p50 latencies are weighted by the observed MS-MARCO v1 lexeme distribution (1M queries: 3-token mode at 30%, mean 3.7 tokens) to produce a single summary number that reflects realistic workload performance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add a separate weighted throughput output line (weighted avg ms/query using the MS-MARCO v1 lexeme distribution) alongside the existing weighted latency. Update extract_metrics.sh and format_for_action.sh to extract and report the new metric. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

tjgreen42 and others added 2 commits February 12, 2026 15:59

tjgreen42 marked this pull request as ready for review February 17, 2026 18:51

tjgreen42 merged commit 0f9564c into main Feb 17, 2026
1 check passed

tjgreen42 deleted the benchmarks/weighted-throughput-msmarco-v2 branch February 17, 2026 18:51
