fix: run BM25 validation in benchmark CI workflow #239
Merged
Force-pushed from faef665 to 918a24f
tjgreen42 added a commit that referenced this pull request on Feb 25, 2026
## Summary

- Fix a TOCTOU race condition in `tp_leader_process_buffers()` that can silently lose documents during parallel index builds. The leader collected ready buffers *before* checking `workers_done`, allowing a window where a worker marks its final buffer READY and increments `workers_done` between the two reads — causing the leader to exit without processing the final buffer.
- Fix a latent deadlock where `shared->nworkers` was set to the *requested* worker count but never updated to the *actually launched* count. If Postgres launched fewer workers, `workers_done` could never reach `nworkers`.
- Update expected test output for the build-completed NOTICE format change from #214.

### The race

```
Leader:                              Worker (finishing):
                                     1. Mark final buffer READY
                                     2. SpinLock → workers_done++
1. Collect buffers → num_ready=0
   (missed the READY buffer)
2. Check all_done → true
3. num_ready(0) > 0 → false → skip
4. while(!all_done) → EXIT

*** FINAL BUFFER NEVER PROCESSED ***
```

### The fix

Check `workers_done` *first* (under SpinLock, providing the memory barrier), *then* collect buffers. Workers set buffer READY before the SpinLock acquire around `workers_done++`, so after the leader's SpinLock release, all final buffers are guaranteed visible. The loop now exits only when `all_done && num_ready == 0`.

### Impact

On MS-MARCO (8.8M passages), this race non-deterministically lost ~4M documents (~45% of the corpus). One CI run indexed only 4,834,444 documents while the same data loaded correctly in another run (8,841,770 documents). The root cause was discovered while investigating benchmark validation failures in #239.

## Testing

Regression tests pass (47/47). The race is non-deterministic, so it cannot be reliably reproduced in unit tests; validation should be done via the MS-MARCO benchmark pipeline, which exercises parallel builds at scale.
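The fixed ordering can be sketched as a simplified, single-process model of the leader loop. The names (`workers_done`, `nworkers`) echo the PR, but the struct layout, the `demo_final_buffer()` helper, and the pthread mutex (standing in for the Postgres SpinLock) are illustrative, not the extension's actual code.

```c
#include <pthread.h>
#include <stdbool.h>

#define NBUFFERS 8

typedef struct
{
    pthread_mutex_t lock;          /* stands in for the SpinLock */
    int             workers_done;  /* workers that have finished */
    int             nworkers;      /* workers actually launched */
    bool            ready[NBUFFERS];
} SharedState;

/* Returns the number of buffers processed.  The fixed ordering: read
 * workers_done under the lock FIRST (the unlock is the barrier that
 * publishes workers' READY writes), THEN collect ready buffers, and
 * exit only when all_done && num_ready == 0. */
int
leader_process_buffers(SharedState *shared)
{
    int processed = 0;

    for (;;)
    {
        bool all_done;
        int  num_ready = 0;

        /* 1. Completion check first, under the lock. */
        pthread_mutex_lock(&shared->lock);
        all_done = (shared->workers_done >= shared->nworkers);
        pthread_mutex_unlock(&shared->lock);

        /* 2. Any buffer marked READY before the final workers_done++
         * is now guaranteed visible, so collect after the check. */
        for (int i = 0; i < NBUFFERS; i++)
        {
            if (shared->ready[i])
            {
                shared->ready[i] = false;  /* "process" the buffer */
                num_ready++;
            }
        }
        processed += num_ready;

        if (all_done && num_ready == 0)
            break;
    }
    return processed;
}

/* Replays the scenario from the diagram: the last worker has already
 * marked its final buffer READY and bumped workers_done by the time
 * the leader enters the loop.  With the fixed ordering the final
 * buffer is still collected. */
int
demo_final_buffer(void)
{
    SharedState s = {PTHREAD_MUTEX_INITIALIZER, 2, 2, {false}};

    s.ready[0] = true;
    s.ready[NBUFFERS - 1] = true;  /* the worker's final buffer */
    return leader_process_buffers(&s);
}
```

With the old ordering (collect, then check), the same scenario could return one buffer short; here both buffers are processed before the loop observes `all_done && num_ready == 0`.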
The MS-MARCO and Wikipedia benchmark validation scripts (`validate_queries.sql`) were never invoked by the CI workflow: the workflow ran `load.sql` and `queries.sql` directly, bypassing `run_benchmark.sh`, which contains the validation logic. Add explicit validation steps that run after each benchmark's query phase, checking BM25 scores against precomputed ground truth to 4 decimal places. Validation is gated on the dataset size matching the size used to precompute the ground truth (MS-MARCO: full, Wikipedia: 100K). Both validations run automatically on the nightly schedule, which uses exactly these sizes by default.
The `precompute_ground_truth.sql` scripts for both MS-MARCO and Wikipedia were using `bm25_summarize_index()` to get `total_docs` and `avg_doc_len`, creating an unnecessary dependency on the index being present and correct. This could produce wrong ground truth if the index had incorrect statistics (e.g., due to build bugs). Compute corpus statistics directly from the raw table data by scanning all documents and counting tokens via `to_tsvector()`. Documents with empty tsvectors are excluded to match the index behavior. This makes the ground-truth computation fully independent of the index, ensuring it serves as a true reference for validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
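The statistics rule above (count documents and average their token counts, skipping empty tsvectors) can be sketched as follows. The real computation is SQL over `to_tsvector()` output; this C shape, including the `corpus_stats` name, is purely illustrative.

```c
#include <stddef.h>

typedef struct
{
    long   total_docs;
    double avg_doc_len;
} CorpusStats;

/* doc_token_counts[i] plays the role of the per-row token count the
 * SQL script derives from to_tsvector().  Documents with zero tokens
 * (empty tsvectors) are excluded, matching the index behavior. */
CorpusStats
corpus_stats(const int *doc_token_counts, size_t ndocs)
{
    CorpusStats stats = {0, 0.0};
    long        total_tokens = 0;

    for (size_t i = 0; i < ndocs; i++)
    {
        if (doc_token_counts[i] == 0)
            continue;           /* empty tsvector: excluded */
        stats.total_docs++;
        total_tokens += doc_token_counts[i];
    }
    if (stats.total_docs > 0)
        stats.avg_doc_len = (double) total_tokens / stats.total_docs;
    return stats;
}
```

Because the scan never consults the index, a build bug that corrupts index statistics cannot leak into the ground truth.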
Force-pushed from f2ab0dd to 5684855
PG18 updated the English Snowball stemmer: "adding"/"added" now stem to "add" instead of "ad" (PG17). Since CI runs PG17, the ground truth must use PG17 df values. This affected query 158 (the only validation query containing "add"). Also improve validation tolerance: use absolute threshold (0.001) instead of 4-decimal-place rounding, and handle tied-score rank swaps at the top-10 boundary.
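One way to model the relaxed check described above is the following sketch. The real logic lives in the benchmark's SQL/shell validation, so the `scores_match` API and the exact tie rule (accepting a match against the ground-truth score one rank away when those two ground-truth scores are themselves tied within tolerance) are assumptions for illustration.

```c
#include <math.h>
#include <stdbool.h>

#define SCORE_TOLERANCE 0.001   /* absolute, per the commit message */

static bool
close_enough(double a, double b)
{
    return fabs(a - b) <= SCORE_TOLERANCE;
}

/* actual/expected: top-k BM25 scores, best first.  A score passes if
 * it matches its own rank within tolerance, or matches an adjacent
 * rank whose ground-truth score is tied with its own -- a tied-score
 * rank swap. */
bool
scores_match(const double *actual, const double *expected, int k)
{
    for (int i = 0; i < k; i++)
    {
        if (close_enough(actual[i], expected[i]))
            continue;
        /* Swap with a tied neighbor in the ground truth. */
        if (i + 1 < k && close_enough(expected[i], expected[i + 1]) &&
            close_enough(actual[i], expected[i + 1]))
            continue;
        if (i > 0 && close_enough(expected[i], expected[i - 1]) &&
            close_enough(actual[i], expected[i - 1]))
            continue;
        return false;
    }
    return true;
}
```

An absolute threshold avoids the edge cases of decimal-place rounding, where two values 0.0001 apart can still round to different 4-decimal strings across a rounding boundary.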
PG17 and PG18 have different Snowball stemmers ("adding"/"added" stem to "ad" on PG17, "add" on PG18), which changes df values and BM25 scores for queries containing "add". Store separate ground truth files per PG version and select the right one in CI based on pg_config.
CI and `run_benchmark.sh` now select `ground_truth_pgNN.tsv` based on the Postgres major version and copy it to `ground_truth.tsv` at runtime. The checked-in generic file was always identical to one of the versioned files, so remove it and add it to `.gitignore`. Also fail CI explicitly if no versioned ground truth exists for the detected Postgres version, rather than silently proceeding.
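The selection rule amounts to a lookup with no generic fallback. A minimal sketch, assuming a hypothetical `ground_truth_filename` helper and treating the set of known versions (17, 18) as illustrative:

```c
#include <stdio.h>

/* Map the detected Postgres major version (as reported by pg_config)
 * to its versioned ground-truth file.  An unknown version returns
 * NULL so the caller can fail explicitly instead of silently
 * proceeding, mirroring the CI behavior described above. */
const char *
ground_truth_filename(int pg_major)
{
    static char buf[sizeof "ground_truth_pg9999.tsv"];

    if (pg_major != 17 && pg_major != 18)
        return NULL;            /* no versioned ground truth: error */

    snprintf(buf, sizeof(buf), "ground_truth_pg%d.tsv", pg_major);
    return buf;
}
```

In the workflow itself the same decision is a shell check on `pg_config --version` followed by a `cp` to `ground_truth.tsv`.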
Summary
- Add BM25 validation steps to the benchmark CI workflow for the MS-MARCO and Wikipedia datasets. The validation scripts existed but were never invoked.
- Compute corpus statistics (total_docs, avg_doc_len) from raw table data instead of bm25_summarize_index(), removing the index dependency from ground truth generation.
- Store per-version ground truth files (ground_truth_pg17.tsv, ground_truth_pg18.tsv). PG18 updated the English Snowball stemmer so "adding"/"added" stem to "add" instead of "ad" (PG17), changing df values for affected queries. The CI workflow selects the correct file via pg_config --version.
- Use an absolute tolerance (0.001) instead of 4-decimal-place rounding to account for float4/float8 precision differences. Handle tied-score rank swaps at the top-10 boundary.
Testing
Two successful benchmark runs on PG17 with versioned ground truth selection (22414808747, 22415250080).