Skip to content

fix: run BM25 validation in benchmark CI workflow#239

Merged
tjgreen42 merged 6 commits intomainfrom
fix/benchmark-validation-in-ci
Feb 25, 2026
Merged

fix: run BM25 validation in benchmark CI workflow#239
tjgreen42 merged 6 commits intomainfrom
fix/benchmark-validation-in-ci

Conversation

@tjgreen42
Copy link
Copy Markdown
Collaborator

@tjgreen42 tjgreen42 commented Feb 24, 2026

Summary

  • Wire up BM25 score validation in the benchmark CI workflow for MS
    MARCO and Wikipedia datasets. The validation scripts existed but were
    never invoked.
  • Fix precompute_ground_truth.sql scripts to compute corpus statistics
    (total_docs, avg_doc_len) from raw table data instead of
    bm25_summarize_index(), removing the index dependency from ground
    truth generation.
  • Add per-Postgres-version ground truth files (ground_truth_pg17.tsv,
    ground_truth_pg18.tsv). PG18 updated the English Snowball stemmer so
    "adding"/"added" stem to "add" instead of "ad" (PG17), changing df
    values for affected queries. The CI workflow selects the correct file
    via pg_config --version.
  • Use absolute tolerance (0.001) for score comparison instead of
    4-decimal-place rounding to account for float4/float8 precision
    differences. Handle tied-score rank swaps at the top-10 boundary.

Testing

Two successful benchmark runs on PG17 with versioned ground truth
selection (22414808747,
22415250080).

@tjgreen42 tjgreen42 force-pushed the fix/benchmark-validation-in-ci branch from faef665 to 918a24f Compare February 25, 2026 04:55
tjgreen42 added a commit that referenced this pull request Feb 25, 2026
## Summary

- Fix a TOCTOU race condition in `tp_leader_process_buffers()` that can
silently lose documents during parallel index builds. The leader
collected ready buffers *before* checking `workers_done`, allowing a
window where a worker marks its final buffer READY and increments
`workers_done` between the two reads — causing the leader to exit
without processing the final buffer.
- Fix a latent deadlock where `shared->nworkers` was set to the
*requested* worker count but never updated to the *actually launched*
count. If Postgres launched fewer workers, `workers_done` could never
reach `nworkers`.
- Update expected test output for the build-completed NOTICE format
change from #214.

### The race

```
Leader:                              Worker (finishing):
                                     1. Mark final buffer READY
                                     2. SpinLock → workers_done++
1. Collect buffers → num_ready=0
   (missed the READY buffer)
2. Check all_done → true
3. num_ready(0) > 0 → false → skip
4. while(!all_done) → EXIT
   *** FINAL BUFFER NEVER PROCESSED ***
```

### The fix

Check `workers_done` *first* (under SpinLock, providing the memory
barrier), *then* collect buffers. Workers set buffer READY before the
SpinLock acquire around `workers_done++`, so after the leader's SpinLock
release, all final buffers are guaranteed visible. The loop now exits
only when `all_done && num_ready == 0`.

### Impact

On MS-MARCO (8.8M passages), this race non-deterministically lost ~4M
documents (~45% of the corpus). One CI run indexed only 4,834,444
documents while the same data loaded correctly in another run (8,841,770
documents). The root cause was discovered while investigating benchmark
validation failures in #239.

## Testing

Regression tests pass (47/47). The race is non-deterministic so it
cannot be reliably reproduced in unit tests — validation should be done
via the MS-MARCO benchmark pipeline which exercises parallel builds at
scale.
@tjgreen42 tjgreen42 marked this pull request as ready for review February 25, 2026 17:01
tjgreen42 and others added 2 commits February 25, 2026 09:13
The MS-MARCO and Wikipedia benchmark validation scripts
(validate_queries.sql) were never invoked by the CI workflow. The
workflow ran load.sql and queries.sql directly, bypassing
run_benchmark.sh which contains the validation logic.

Add explicit validation steps that run after each benchmark's query
phase, checking BM25 scores against precomputed ground truth to 4
decimal places. Validation is gated on the dataset size matching what
was used to precompute ground truth (MS-MARCO: full, Wikipedia: 100K).

Both validations run automatically on the nightly schedule since it
uses these exact sizes by default.
The precompute_ground_truth.sql scripts for both MS-MARCO and Wikipedia
were using bm25_summarize_index() to get total_docs and avg_doc_len,
creating an unnecessary dependency on the index being present and
correct. This could produce wrong ground truth if the index had
incorrect statistics (e.g., due to build bugs).

Compute corpus statistics directly from the raw table data by scanning
all documents and counting tokens via to_tsvector(). Documents with
empty tsvectors are excluded to match the index behavior.

This makes the ground truth computation fully independent of the index,
ensuring it serves as a true reference for validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tjgreen42 tjgreen42 force-pushed the fix/benchmark-validation-in-ci branch from f2ab0dd to 5684855 Compare February 25, 2026 18:49
PG18 updated the English Snowball stemmer: "adding"/"added" now stem
to "add" instead of "ad" (PG17). Since CI runs PG17, the ground truth
must use PG17 df values. This affected query 158 (the only validation
query containing "add").

Also improve validation tolerance: use absolute threshold (0.001)
instead of 4-decimal-place rounding, and handle tied-score rank swaps
at the top-10 boundary.
PG17 and PG18 have different Snowball stemmers ("adding"/"added" stem
to "ad" on PG17, "add" on PG18), which changes df values and BM25
scores for queries containing "add". Store separate ground truth files
per PG version and select the right one in CI based on pg_config.
CI and run_benchmark.sh now select ground_truth_pgNN.tsv based on the
Postgres major version and copy it to ground_truth.tsv at runtime.
The checked-in generic file was always identical to one of the versioned
files, so remove it and add it to .gitignore.

Also fail CI explicitly if no versioned ground truth exists for the
detected Postgres version, rather than silently proceeding.
@tjgreen42 tjgreen42 merged commit 9ed758b into main Feb 25, 2026
1 check passed
@tjgreen42 tjgreen42 deleted the fix/benchmark-validation-in-ci branch February 25, 2026 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant