fix: TOCTOU race in parallel build loses documents#240
Merged
Conversation
The leader's buffer processing loop collected ready buffers BEFORE checking workers_done. A worker marking its final buffer READY and incrementing workers_done between these two reads caused the leader to see all_done=true with num_ready=0, exiting without processing the final buffer. With large datasets this could silently lose millions of documents -- we observed ~4M of 8.8M documents lost on MS-MARCO. Fix by checking workers_done FIRST (under SpinLock, providing the necessary memory barrier), THEN collecting buffers. Workers set buffer READY before the SpinLock acquire around workers_done++, so after the leader's SpinLock release, all final buffers are guaranteed visible. The loop now exits only when all_done AND num_ready == 0. Also fix a latent deadlock: shared->nworkers was set to the requested count but never updated to the actual launched count. If Postgres launched fewer workers, workers_done could never reach nworkers.
e225608 to
d6b27ea
Compare
tjgreen42
added a commit
that referenced
this pull request
Feb 25, 2026
Port applicable fixes from PR #240 to the rewritten parallel build: 1. Error out if no parallel workers could be launched, preventing a division-by-zero in the TID range block assignment. 2. Use nworkers_launched (actual count) instead of nworkers (requested count) in worker_execute_merge_group() and worker memory budget calculation. When Postgres launches fewer workers than requested, the old code over-divided the memory budget and iterated over non-existent worker results. The TOCTOU race fix from PR #240 does not apply — the rewritten two-phase barrier design has no leader buffer processing loop.
tjgreen42
added a commit
that referenced
this pull request
Feb 26, 2026
Port applicable fixes from PR #240 to the rewritten parallel build: 1. Error out if no parallel workers could be launched, preventing a division-by-zero in the TID range block assignment. 2. Use nworkers_launched (actual count) instead of nworkers (requested count) in worker_execute_merge_group() and worker memory budget calculation. When Postgres launches fewer workers than requested, the old code over-divided the memory budget and iterated over non-existent worker results. The TOCTOU race fix from PR #240 does not apply — the rewritten two-phase barrier design has no leader buffer processing loop.
tjgreen42
added a commit
that referenced
this pull request
Feb 26, 2026
Port applicable fixes from PR #240 to the rewritten parallel build: 1. Error out if no parallel workers could be launched, preventing a division-by-zero in the TID range block assignment. 2. Use nworkers_launched (actual count) instead of nworkers (requested count) in worker_execute_merge_group() and worker memory budget calculation. When Postgres launches fewer workers than requested, the old code over-divided the memory budget and iterated over non-existent worker results. The TOCTOU race fix from PR #240 does not apply — the rewritten two-phase barrier design has no leader buffer processing loop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
tp_leader_process_buffers()that can silently lose documents during parallel index builds. The leader collected ready buffers before checkingworkers_done, allowing a window where a worker marks its final buffer READY and incrementsworkers_donebetween the two reads — causing the leader to exit without processing the final buffer.shared->nworkerswas set to the requested worker count but never updated to the actually launched count. If Postgres launched fewer workers,workers_donecould never reachnworkers.The race
The fix
Check
workers_donefirst (under SpinLock, providing the memory barrier), then collect buffers. Workers set buffer READY before the SpinLock acquire aroundworkers_done++, so after the leader's SpinLock release, all final buffers are guaranteed visible. The loop now exits only whenall_done && num_ready == 0.Impact
On MS-MARCO (8.8M passages), this race non-deterministically lost ~4M documents (~45% of the corpus). One CI run indexed only 4,834,444 documents while the same data loaded correctly in another run (8,841,770 documents). The root cause was discovered while investigating benchmark validation failures in #239.
Testing
Regression tests pass (47/47). The race is non-deterministic so it cannot be reliably reproduced in unit tests — validation should be done via the MS-MARCO benchmark pipeline which exercises parallel builds at scale.