
bench: add MS MARCO v2 dataset and weighted-average latency metric#226

Merged
tjgreen42 merged 2 commits into main from benchmarks/weighted-throughput-msmarco-v2
Feb 17, 2026

Conversation

@tjgreen42
Collaborator

Summary

  • Add benchmark infrastructure for the MS MARCO v2 passage ranking dataset (138M passages) with download, load, query, and System X comparison scripts
  • Introduce weighted-average query latency as a new tracked metric across all benchmark datasets (MS MARCO v1, v2, Wikipedia)
  • Per-bucket p50 latencies are weighted by the observed MS MARCO v1 lexeme distribution (1M queries: 3-token mode at 30%, mean 3.7 tokens) to produce a single summary number that reflects realistic workload performance

Files changed

New (MS MARCO v2 benchmark infrastructure):

  • benchmarks/datasets/msmarco-v2/ — download, load, query scripts + benchmark queries TSV
  • benchmarks/datasets/msmarco-v2/systemx/ — System X (ParadeDB) comparison scripts

Modified (weighted-average metric):

  • benchmarks/datasets/msmarco/queries.sql — add weighted-average computation
  • benchmarks/datasets/wikipedia/queries.sql — add weighted-average computation
  • benchmarks/runner/extract_metrics.sh — extract WEIGHTED_LATENCY_RESULT from logs
  • benchmarks/runner/format_for_action.sh — report weighted latency in github-action-benchmark format
  • benchmarks/runner/run_benchmark.sh — add msmarco-v2 as a dataset option
  • benchmarks/gh-pages/methodology.html — document weighted methodology + MS MARCO v2 dataset
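The extraction step in `extract_metrics.sh` can be sketched as follows. This is an illustrative Python sketch only: the actual log-line format emitted by the benchmark queries is not shown in this PR, so the `WEIGHTED_LATENCY_RESULT: <value> ms` pattern below is a hypothetical assumption.

```python
import re

# Hypothetical log-line format; the real format produced by the
# queries.sql scripts is not shown in this PR, so this pattern is
# illustrative only.
LINE_RE = re.compile(r"WEIGHTED_LATENCY_RESULT:\s*([0-9.]+)\s*ms")

def extract_weighted_latency(log_text: str):
    """Return the weighted-average latency in ms from a benchmark log,
    or None if no matching line is present."""
    m = LINE_RE.search(log_text)
    return float(m.group(1)) if m else None

print(extract_weighted_latency("... WEIGHTED_LATENCY_RESULT: 12.34 ms ..."))
```

In the real pipeline this value would then be handed to `format_for_action.sh` for the github-action-benchmark JSON output.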

Distribution weights (MS MARCO v1, 1,010,905 queries)

| Tokens | Count   | Weight |
| ------ | ------- | ------ |
| 1      | 35,638  | 3.5%   |
| 2      | 165,033 | 16.3%  |
| 3      | 304,887 | 30.2%  |
| 4      | 264,177 | 26.1%  |
| 5      | 143,765 | 14.2%  |
| 6      | 59,558  | 5.9%   |
| 7      | 22,595  | 2.2%   |
| 8+     | 15,252  | 1.5%   |
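The weighted-average computation described above can be sketched with these weights. The per-bucket p50 latencies below are hypothetical placeholders for illustration; only the weights come from the table in this PR.

```python
# Bucket weights from the MS MARCO v1 query-length distribution above
# (token count -> weight). The key 8 stands in for the 8+ bucket.
WEIGHTS = {1: 0.035, 2: 0.163, 3: 0.302, 4: 0.261,
           5: 0.142, 6: 0.059, 7: 0.022, 8: 0.015}

def weighted_latency(p50_ms: dict) -> float:
    """Weighted-average latency: sum of weight * per-bucket p50,
    normalized by the total weight of the buckets actually measured
    (the table's weights sum to 99.9% due to rounding)."""
    total_w = sum(WEIGHTS[k] for k in p50_ms)
    return sum(WEIGHTS[k] * v for k, v in p50_ms.items()) / total_w

# Hypothetical per-bucket p50 latencies in ms (not from this PR):
p50 = {1: 2.0, 2: 3.0, 3: 4.5, 4: 6.0, 5: 8.0, 6: 10.0, 7: 12.0, 8: 15.0}
print(round(weighted_latency(p50), 2))
```

Because the 3- and 4-token buckets carry over half the weight, the summary number tracks mid-length queries rather than the rare long-query tail.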

Test plan

  • Verify weighted-average computation is correct by manual calculation
  • Run MS MARCO v1 benchmark and confirm WEIGHTED_LATENCY_RESULT appears in output
  • Run extract_metrics.sh on output and verify JSON includes weighted_latency section
  • Run format_for_action.sh and verify "Weighted Latency" appears in action output
  • CI passes (benchmark-only changes, no C code modified)

🤖 Generated with Claude Code

tjgreen42 and others added 2 commits February 12, 2026 15:59
Add benchmark infrastructure for the MS MARCO v2 passage ranking
dataset (138M passages) with download, load, query, and System X
comparison scripts.

Introduce weighted-average query latency as a new tracked metric
across all benchmark datasets (MS MARCO v1, v2, Wikipedia). Per-bucket
p50 latencies are weighted by the observed MS-MARCO v1 lexeme
distribution (1M queries: 3-token mode at 30%, mean 3.7 tokens) to
produce a single summary number that reflects realistic workload
performance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a separate weighted throughput output line (weighted avg ms/query
using the MS-MARCO v1 lexeme distribution) alongside the existing
weighted latency. Update extract_metrics.sh and format_for_action.sh
to extract and report the new metric.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tjgreen42 tjgreen42 marked this pull request as ready for review February 17, 2026 18:51
@tjgreen42 tjgreen42 merged commit 0f9564c into main Feb 17, 2026
1 check passed
@tjgreen42 tjgreen42 deleted the benchmarks/weighted-throughput-msmarco-v2 branch February 17, 2026 18:51
