Add public benchmark suite with MS MARCO and Wikipedia #66
Merged
Conversation
- MS MARCO Passage Ranking (8.8M passages, 10K queries)
- Wikipedia dataset loader (configurable: 10K to full 6M articles)
- Benchmark runner script with timing and reporting
- GitHub Actions workflow for weekly benchmarks

Initial local benchmark on release build:
- MS MARCO data load: 76 seconds
- BM25 index build (8.8M docs): 4 min 45 sec
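The runner's "timing and reporting" can be sketched with plain wall-clock arithmetic; the actual runner script may structure this differently, and the `sleep` here merely stands in for a real load step:

```shell
# Minimal sketch of the runner's timing pattern (the timed command is a stand-in).
start=$(date +%s)
sleep 1                      # stand-in for e.g. a psql load step
end=$(date +%s)
elapsed=$((end - start))
echo "load time: ${elapsed}s"
```

The same pattern bracketed around the data load and index build would produce the 76 s / 4 min 45 s figures reported above.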
- Add disk cleanup step to free ~25GB before running benchmarks
- Add msmarco_size input option (100K, 500K, 1M, full)
- Default to 1M passages for weekly runs (vs 8.8M full)
- Update download.sh to support subset creation
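Since MS MARCO ships as a one-passage-per-line TSV, subset creation can be as simple as taking the first N lines. A hedged sketch (file names and the `SIZE` variable are illustrative; download.sh's actual interface may differ):

```shell
# Carve a fixed-size subset out of a passages TSV (sample data for illustration).
SIZE=3
printf 'p1\tpassage one\np2\tpassage two\np3\tpassage three\np4\tpassage four\n' > collection.tsv
head -n "$SIZE" collection.tsv > "collection.${SIZE}.tsv"
wc -l < "collection.${SIZE}.tsv"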
- Rewrite queries.sql to use EXPLAIN ANALYZE for proper index scan timing
- Add queries for different query types: short, medium, long, common, rare
- Add throughput benchmark (20 queries in batch)
- Add extract_metrics.sh to parse results into JSON format
- Add GitHub job summary with key metrics table
- Include benchmark_metrics.json in artifacts for historical comparison
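The parsing step can be sketched as grep/sed over EXPLAIN ANALYZE output; the real extract_metrics.sh likely extracts more fields and may use different patterns, so treat this as an assumption-laden outline:

```shell
# Sketch: pull the execution time out of EXPLAIN ANALYZE output and emit JSON.
cat > explain.out <<'EOF'
 Planning Time: 0.412 ms
 Execution Time: 5.891 ms
EOF
exec_ms=$(grep 'Execution Time' explain.out | sed 's/[^0-9.]*//; s/ ms//')
printf '{"execution_ms": %s}\n' "$exec_ms" > benchmark_metrics.json
cat benchmark_metrics.json
```

A JSON file of this shape is what gets attached to artifacts and fed into the job summary table.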
- Fix grep patterns to extract index build and load times
- Fix RAISE NOTICE format specifiers for throughput output
- Change default msmarco_size from 1M to full (8.8M passages)
- Add Benchmarks section to CONTRIBUTING.md with:
  - On-demand benchmark commands using gh CLI
  - How to view and download results
  - Local benchmark instructions
  - Dataset descriptions
Benchmark tracking:
- Add github-action-benchmark integration for historical graphs
- Publish performance dashboard to GitHub Pages
- Alert on PRs when performance regresses >150%
- Add benchmark gate to releases (fails on >120% regression)
- Add format_for_action.sh to convert metrics JSON

Version bump to 0.1.1-dev:
- Update pg_textsearch.control default_version
- Add sql/pg_textsearch--0.1.1-dev.sql
- Add upgrade path sql/pg_textsearch--0.1.0--0.1.1-dev.sql
- Update test expected output files
- Update README status and CHANGELOG
The previous benchmark disabled index scans (enable_indexscan = off), which meant it only tested sequential scan + scoring performance and would not catch BM25 index regressions.

Changes:
- Remove enable_indexscan/enable_bitmapscan = off settings
- Add EXPLAIN ANALYZE for 5 representative queries to measure index scan latency (short, medium, long, common terms, rare terms)
- Add throughput test running all 225 standard Cranfield queries
- Add search quality validation (Precision@10 against relevance judgments)
- Add index statistics output

Local test results (1400 documents, 225 queries):
- Query latency: 3.5-6.6ms per query
- Throughput: 5.89 ms/query average (225 queries in 1.3s)
- Precision@10: 0.60
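Precision@10 is just the fraction of a query's top 10 results that appear in the relevance judgments. A small awk sketch of that computation (file names and the two-column `query doc` format are hypothetical; the Cranfield qrels format the suite actually uses may differ):

```shell
# Toy data: qrels lists relevant docs per query, results lists ranked hits.
printf 'q1 d1\nq1 d3\n' > qrels.txt
printf 'q1 d1\nq1 d2\nq1 d3\nq1 d4\n' > results.txt
# Count, per query, how many of the first 10 ranked docs are judged relevant.
awk 'NR==FNR { rel[$1" "$2]=1; next }
     { n[$1]++; if (n[$1] <= 10 && ($1" "$2) in rel) hit[$1]++ }
     END { for (q in n) printf "%s P@10=%.2f\n", q, hit[q]/10 }' qrels.txt results.txt | tee p10.txt
```

With two of the top ten judged relevant, this prints `q1 P@10=0.20`; averaged over all 225 queries, the same calculation yields the 0.60 reported above.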
The gh-pages branch doesn't exist yet, so PR benchmarks and release gates would fail trying to fetch it. Add `skip-fetch-gh-pages: true`, since these jobs don't save data anyway; they only compare/alert.
The gh-pages branch doesn't exist yet, and the benchmark action fails trying to switch to it. Switch to using the GitHub Actions cache to store benchmark history:
- PR benchmarks: restore from cache, compare, don't save (PRs don't update the baseline)
- Full benchmarks (weekly/manual): restore from cache, compare, save the new baseline to cache
- Release gate: restore from cache, compare with the strict threshold

This approach:
1. Doesn't require gh-pages branch setup
2. Keeps benchmark history in the Actions cache (persists across runs)
3. Still enables regression detection and PR comments
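The compare step underneath all three jobs reduces to ratio-against-baseline with a threshold. A hedged sketch of that check, with hypothetical file names and a flat JSON shape (the real comparison is done by github-action-benchmark, not hand-rolled shell):

```shell
# Baseline as restored from the Actions cache, vs the current run's metrics.
echo '{"execution_ms": 4.0}' > baseline.json
echo '{"execution_ms": 5.0}' > current.json
base=$(sed 's/.*: \([0-9.]*\).*/\1/' baseline.json)
curr=$(sed 's/.*: \([0-9.]*\).*/\1/' current.json)
# Express current as a percentage of baseline; fail above the PR alert threshold.
pct=$(awk -v b="$base" -v c="$curr" 'BEGIN { printf "%d", c / b * 100 }')
echo "current is ${pct}% of baseline"
if [ "$pct" -gt 150 ]; then echo "regression detected"; exit 1; fi
```

At 125% of baseline this passes; the release gate would run the same comparison with the stricter 120% threshold.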
Summary
Adds benchmark suite with historical tracking and regression detection:
Performance Tracking
Dashboard URL (after first run): https://timescale.github.io/pg_textsearch/benchmarks/
Latest Results (full MS MARCO, 8.8M passages)
Version Bump
Includes version bump to 0.1.1-dev to start the next development cycle.
Files Added/Modified
- `.github/workflows/benchmark.yml` - CI workflow with tracking
- `.github/workflows/release.yml` - Added benchmark gate
- `benchmarks/runner/format_for_action.sh` - Metrics JSON converter
- `benchmarks/datasets/msmarco/` - Download, load, and query scripts
- `benchmarks/datasets/wikipedia/` - Download, load, and query scripts
- `CONTRIBUTING.md` - Documented benchmark dashboard and alerts

Testing
One-Time Setup Required
After merging, enable GitHub Pages: