fix: run BM25 validation in benchmark CI workflow #239
Merged
Force-pushed from faef665 to 918a24f
tjgreen42 added a commit that referenced this pull request on Feb 25, 2026
## Summary

- Fix a TOCTOU race condition in `tp_leader_process_buffers()` that can silently lose documents during parallel index builds. The leader collected ready buffers *before* checking `workers_done`, allowing a window where a worker marks its final buffer READY and increments `workers_done` between the two reads — causing the leader to exit without processing the final buffer.
- Fix a latent deadlock where `shared->nworkers` was set to the *requested* worker count but never updated to the *actually launched* count. If Postgres launched fewer workers, `workers_done` could never reach `nworkers`.
- Update expected test output for the build-completed NOTICE format change from #214.

### The race

```
Leader:                              Worker (finishing):
                                     1. Mark final buffer READY
                                     2. SpinLock → workers_done++
1. Collect buffers → num_ready=0
   (missed the READY buffer)
2. Check all_done → true
3. num_ready(0) > 0 → false → skip
4. while(!all_done) → EXIT

*** FINAL BUFFER NEVER PROCESSED ***
```

### The fix

Check `workers_done` *first* (under SpinLock, providing the memory barrier), *then* collect buffers. Workers set buffer READY before the SpinLock acquire around `workers_done++`, so after the leader's SpinLock release, all final buffers are guaranteed visible. The loop now exits only when `all_done && num_ready == 0`.

### Impact

On MS-MARCO (8.8M passages), this race non-deterministically lost ~4M documents (~45% of the corpus). One CI run indexed only 4,834,444 documents while the same data loaded correctly in another run (8,841,770 documents). The root cause was discovered while investigating benchmark validation failures in #239.

## Testing

Regression tests pass (47/47). The race is non-deterministic, so it cannot be reliably reproduced in unit tests; validation should be done via the MS-MARCO benchmark pipeline, which exercises parallel builds at scale.
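The fixed ordering can be sketched as a simplified, single-process model of the leader loop. The names (`workers_done`, `nworkers`) echo the PR, but the struct layout, the `demo_final_buffer()` helper, and the pthread mutex (standing in for the Postgres SpinLock) are illustrative, not the extension's actual code.

```c
#include <pthread.h>
#include <stdbool.h>

#define NBUFFERS 8

typedef struct
{
    pthread_mutex_t lock;          /* stands in for the SpinLock */
    int             workers_done;  /* workers that have finished */
    int             nworkers;      /* workers actually launched */
    bool            ready[NBUFFERS];
} SharedState;

/* Returns the number of buffers processed.  The fixed ordering: read
 * workers_done under the lock FIRST (the unlock is the barrier that
 * publishes workers' READY writes), THEN collect ready buffers, and
 * exit only when all_done && num_ready == 0. */
int
leader_process_buffers(SharedState *shared)
{
    int processed = 0;

    for (;;)
    {
        bool all_done;
        int  num_ready = 0;

        /* 1. Completion check first, under the lock. */
        pthread_mutex_lock(&shared->lock);
        all_done = (shared->workers_done >= shared->nworkers);
        pthread_mutex_unlock(&shared->lock);

        /* 2. Any buffer marked READY before the final workers_done++
         * is now guaranteed visible, so collect after the check. */
        for (int i = 0; i < NBUFFERS; i++)
        {
            if (shared->ready[i])
            {
                shared->ready[i] = false;  /* "process" the buffer */
                num_ready++;
            }
        }
        processed += num_ready;

        if (all_done && num_ready == 0)
            break;
    }
    return processed;
}

/* Replays the scenario from the diagram: the last worker has already
 * marked its final buffer READY and bumped workers_done by the time
 * the leader enters the loop.  With the fixed ordering the final
 * buffer is still collected. */
int
demo_final_buffer(void)
{
    SharedState s = {PTHREAD_MUTEX_INITIALIZER, 2, 2, {false}};

    s.ready[0] = true;
    s.ready[NBUFFERS - 1] = true;  /* the worker's final buffer */
    return leader_process_buffers(&s);
}
```

With the old ordering (collect, then check), the same scenario could return one buffer short; here both buffers are processed before the loop observes `all_done && num_ready == 0`.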
The MS-MARCO and Wikipedia benchmark validation scripts (`validate_queries.sql`) were never invoked by the CI workflow: the workflow ran `load.sql` and `queries.sql` directly, bypassing `run_benchmark.sh`, which contains the validation logic. Add explicit validation steps that run after each benchmark's query phase, checking BM25 scores against precomputed ground truth to 4 decimal places. Validation is gated on the dataset size matching the size used to precompute the ground truth (MS-MARCO: full, Wikipedia: 100K). Both validations run automatically on the nightly schedule, which uses exactly these sizes by default.
The `precompute_ground_truth.sql` scripts for both MS-MARCO and Wikipedia were using `bm25_summarize_index()` to get `total_docs` and `avg_doc_len`, creating an unnecessary dependency on the index being present and correct. This could produce wrong ground truth if the index had incorrect statistics (e.g., due to build bugs). Compute corpus statistics directly from the raw table data by scanning all documents and counting tokens via `to_tsvector()`. Documents with empty tsvectors are excluded to match the index behavior. This makes the ground-truth computation fully independent of the index, ensuring it serves as a true reference for validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
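The statistics rule above (count documents and average their token counts, skipping empty tsvectors) can be sketched as follows. The real computation is SQL over `to_tsvector()` output; this C shape, including the `corpus_stats` name, is purely illustrative.

```c
#include <stddef.h>

typedef struct
{
    long   total_docs;
    double avg_doc_len;
} CorpusStats;

/* doc_token_counts[i] plays the role of the per-row token count the
 * SQL script derives from to_tsvector().  Documents with zero tokens
 * (empty tsvectors) are excluded, matching the index behavior. */
CorpusStats
corpus_stats(const int *doc_token_counts, size_t ndocs)
{
    CorpusStats stats = {0, 0.0};
    long        total_tokens = 0;

    for (size_t i = 0; i < ndocs; i++)
    {
        if (doc_token_counts[i] == 0)
            continue;           /* empty tsvector: excluded */
        stats.total_docs++;
        total_tokens += doc_token_counts[i];
    }
    if (stats.total_docs > 0)
        stats.avg_doc_len = (double) total_tokens / stats.total_docs;
    return stats;
}
```

Because the scan never consults the index, a build bug that corrupts index statistics cannot leak into the ground truth.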
Force-pushed from f2ab0dd to 5684855
PG18 updated the English Snowball stemmer: "adding"/"added" now stem to "add" instead of "ad" (PG17). Since CI runs PG17, the ground truth must use PG17 df values. This affected query 158 (the only validation query containing "add"). Also improve validation tolerance: use absolute threshold (0.001) instead of 4-decimal-place rounding, and handle tied-score rank swaps at the top-10 boundary.
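One way to model the relaxed check described above is the following sketch. The real logic lives in the benchmark's SQL/shell validation, so the `scores_match` API and the exact tie rule (accepting a match against the ground-truth score one rank away when those two ground-truth scores are themselves tied within tolerance) are assumptions for illustration.

```c
#include <math.h>
#include <stdbool.h>

#define SCORE_TOLERANCE 0.001   /* absolute, per the commit message */

static bool
close_enough(double a, double b)
{
    return fabs(a - b) <= SCORE_TOLERANCE;
}

/* actual/expected: top-k BM25 scores, best first.  A score passes if
 * it matches its own rank within tolerance, or matches an adjacent
 * rank whose ground-truth score is tied with its own -- a tied-score
 * rank swap. */
bool
scores_match(const double *actual, const double *expected, int k)
{
    for (int i = 0; i < k; i++)
    {
        if (close_enough(actual[i], expected[i]))
            continue;
        /* Swap with a tied neighbor in the ground truth. */
        if (i + 1 < k && close_enough(expected[i], expected[i + 1]) &&
            close_enough(actual[i], expected[i + 1]))
            continue;
        if (i > 0 && close_enough(expected[i], expected[i - 1]) &&
            close_enough(actual[i], expected[i - 1]))
            continue;
        return false;
    }
    return true;
}
```

An absolute threshold avoids the edge cases of decimal-place rounding, where two values 0.0001 apart can still round to different 4-decimal strings across a rounding boundary.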
PG17 and PG18 have different Snowball stemmers ("adding"/"added" stem to "ad" on PG17, "add" on PG18), which changes df values and BM25 scores for queries containing "add". Store separate ground truth files per PG version and select the right one in CI based on pg_config.
CI and `run_benchmark.sh` now select `ground_truth_pgNN.tsv` based on the Postgres major version and copy it to `ground_truth.tsv` at runtime. The checked-in generic file was always identical to one of the versioned files, so remove it and add it to `.gitignore`. Also fail CI explicitly if no versioned ground truth exists for the detected Postgres version, rather than silently proceeding.
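The selection rule amounts to a lookup with no generic fallback. A minimal sketch, assuming a hypothetical `ground_truth_filename` helper and treating the set of known versions (17, 18) as illustrative:

```c
#include <stdio.h>

/* Map the detected Postgres major version (as reported by pg_config)
 * to its versioned ground-truth file.  An unknown version returns
 * NULL so the caller can fail explicitly instead of silently
 * proceeding, mirroring the CI behavior described above. */
const char *
ground_truth_filename(int pg_major)
{
    static char buf[sizeof "ground_truth_pg9999.tsv"];

    if (pg_major != 17 && pg_major != 18)
        return NULL;            /* no versioned ground truth: error */

    snprintf(buf, sizeof(buf), "ground_truth_pg%d.tsv", pg_major);
    return buf;
}
```

In the workflow itself the same decision is a shell check on `pg_config --version` followed by a `cp` to `ground_truth.tsv`.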
Summary
- Add BM25 validation steps to the benchmark CI workflow for the MS-MARCO and Wikipedia datasets. The validation scripts existed but were never invoked.
- Compute corpus statistics (total_docs, avg_doc_len) from raw table data instead of bm25_summarize_index(), removing the index dependency from ground truth generation.
- Store per-version ground truth files (ground_truth_pg17.tsv, ground_truth_pg18.tsv). PG18 updated the English Snowball stemmer so "adding"/"added" stem to "add" instead of "ad" (PG17), changing df values for affected queries. The CI workflow selects the correct file via pg_config --version.
- Use an absolute tolerance (0.001) instead of 4-decimal-place rounding to account for float4/float8 precision differences. Handle tied-score rank swaps at the top-10 boundary.
Testing
Two successful benchmark runs on PG17 with versioned ground truth selection (22414808747, 22415250080).