fix: TOCTOU race in parallel build loses documents by tjgreen42 · Pull Request #240 · timescale/pg_textsearch

tjgreen42 · 2026-02-25T01:48:52Z

Summary

Fix a TOCTOU race condition in tp_leader_process_buffers() that can silently lose documents during parallel index builds. The leader collected ready buffers before checking workers_done, allowing a window where a worker marks its final buffer READY and increments workers_done between the two reads — causing the leader to exit without processing the final buffer.
Fix a latent deadlock where shared->nworkers was set to the requested worker count but never updated to the actually launched count. If Postgres launched fewer workers, workers_done could never reach nworkers.
Update expected test output for the build-completed NOTICE format change from feat: reduce build chatter for partitioned tables #214.

The race

Leader:                              Worker (finishing):
                                     1. Mark final buffer READY
                                     2. SpinLock → workers_done++
1. Collect buffers → num_ready=0
   (missed the READY buffer)
2. Check all_done → true
3. num_ready(0) > 0 → false → skip
4. while(!all_done) → EXIT
   *** FINAL BUFFER NEVER PROCESSED ***

The fix

Check workers_done first (under SpinLock, providing the memory barrier), then collect buffers. Workers set buffer READY before the SpinLock acquire around workers_done++, so after the leader's SpinLock release, all final buffers are guaranteed visible. The loop now exits only when all_done && num_ready == 0.

Impact

On MS-MARCO (8.8M passages), this race non-deterministically lost ~4M documents (~45% of the corpus). One CI run indexed only 4,834,444 documents while the same data loaded correctly in another run (8,841,770 documents). The root cause was discovered while investigating benchmark validation failures in #239.

Testing

Regression tests pass (47/47). The race is non-deterministic so it cannot be reliably reproduced in unit tests — validation should be done via the MS-MARCO benchmark pipeline which exercises parallel builds at scale.

The leader's buffer processing loop collected ready buffers BEFORE checking workers_done. A worker marking its final buffer READY and incrementing workers_done between these two reads caused the leader to see all_done=true with num_ready=0, exiting without processing the final buffer. With large datasets this could silently lose millions of documents -- we observed ~4M of 8.8M documents lost on MS-MARCO. Fix by checking workers_done FIRST (under SpinLock, providing the necessary memory barrier), THEN collecting buffers. Workers set buffer READY before the SpinLock acquire around workers_done++, so after the leader's SpinLock release, all final buffers are guaranteed visible. The loop now exits only when all_done AND num_ready == 0. Also fix a latent deadlock: shared->nworkers was set to the requested count but never updated to the actual launched count. If Postgres launched fewer workers, workers_done could never reach nworkers.

Port applicable fixes from PR #240 to the rewritten parallel build: 1. Error out if no parallel workers could be launched, preventing a division-by-zero in the TID range block assignment. 2. Use nworkers_launched (actual count) instead of nworkers (requested count) in worker_execute_merge_group() and worker memory budget calculation. When Postgres launches fewer workers than requested, the old code over-divided the memory budget and iterated over non-existent worker results. The TOCTOU race fix from PR #240 does not apply — the rewritten two-phase barrier design has no leader buffer processing loop.

tjgreen42 force-pushed the fix/parallel-build-toctou branch from e225608 to d6b27ea Compare February 25, 2026 01:59

tjgreen42 mentioned this pull request Feb 25, 2026

fix: run BM25 validation in benchmark CI workflow #239

Merged

tjgreen42 marked this pull request as ready for review February 25, 2026 17:00

tjgreen42 merged commit b75e7e3 into main Feb 25, 2026
15 checks passed

tjgreen42 deleted the fix/parallel-build-toctou branch February 25, 2026 17:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: TOCTOU race in parallel build loses documents#240

fix: TOCTOU race in parallel build loses documents#240
tjgreen42 merged 1 commit intomainfrom
fix/parallel-build-toctou

tjgreen42 commented Feb 25, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tjgreen42 commented Feb 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The race

The fix

Impact

Testing

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tjgreen42 commented Feb 25, 2026 •

edited

Loading