feat: leader-only merge for parallel index build #244

Merged: tjgreen42 merged 5 commits into main from experiment/leader-only-merge on Mar 2, 2026

tjgreen42 commented Mar 1, 2026

Summary

Simplify parallel index build by replacing the complex two-phase architecture
(worker compaction + work-stealing merge groups) with a single leader-only
N-way merge.

How it works now:

  1. Phase 1: Workers scan heap partitions, flush L0 segments to BufFiles (no compaction)
  2. Leader merge: Opens all worker BufFiles, performs single N-way merge directly to paged storage
  3. Workers exit: Leader signals completion, workers wake and exit
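The leader's merge step can be pictured as a classic heap-based N-way merge over sorted runs. The following is a minimal sketch under stated assumptions, not the extension's actual code: `Run`, `head_less`, `sift_down`, and `nway_merge` are hypothetical names, plain `int` keys stand in for terms and their postings, and in-memory arrays stand in for the workers' BufFiles and the paged output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one worker's sorted run (e.g. one L0 segment). */
typedef struct {
    const int *keys; /* sorted ascending */
    size_t     len;  /* number of keys in the run */
    size_t     pos;  /* index of the next unread key */
} Run;

/* Order two non-exhausted runs by their current head key. */
static int head_less(const Run *runs, size_t a, size_t b)
{
    int ka = runs[a].keys[runs[a].pos];
    int kb = runs[b].keys[runs[b].pos];
    if (ka != kb)
        return ka < kb;
    return a < b; /* tie-break by run index so output order is deterministic */
}

/* Restore the min-heap property below position i. */
static void sift_down(size_t *heap, size_t n, size_t i, const Run *runs)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < n && head_less(runs, heap[l], heap[m])) m = l;
        if (r < n && head_less(runs, heap[r], heap[m])) m = r;
        if (m == i)
            return;
        size_t tmp = heap[i];
        heap[i] = heap[m];
        heap[m] = tmp;
        i = m;
    }
}

/*
 * Merge nruns sorted runs into out in a single pass.
 * heap needs room for nruns entries; out needs room for every key.
 * Returns the number of keys written.
 */
static size_t nway_merge(Run *runs, size_t nruns, size_t *heap, int *out)
{
    size_t n = 0, written = 0;

    assert(runs != NULL && heap != NULL && out != NULL);

    /* Only non-empty runs enter the heap. */
    for (size_t i = 0; i < nruns; i++) {
        runs[i].pos = 0;
        if (runs[i].len > 0)
            heap[n++] = i;
    }
    /* Bottom-up heapify over the parents n/2-1 .. 0. */
    for (size_t i = n / 2; i-- > 0;)
        sift_down(heap, n, i, runs);

    while (n > 0) {
        Run *r = &runs[heap[0]];
        out[written++] = r->keys[r->pos++];
        if (r->pos == r->len)
            heap[0] = heap[--n]; /* run exhausted: drop it from the heap */
        sift_down(heap, n, 0, runs);
    }
    return written;
}
```

Given the runs {1, 4, 9}, {2, 3, 10}, and {5}, `nway_merge` writes 1, 2, 3, 4, 5, 9, 10. The heap keeps each output step at O(log N) comparisons for N runs, which is one common way to merge many inputs in one pass.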

What was removed (~800 lines):

  • Worker-side compaction (worker_maybe_compact_level)
  • Worker Phase 2 (COPY segments to pages, work-steal merge groups)
  • Leader merge-group planning (plan_merge_groups, compute_total_pages_needed)
  • Cross-worker merge execution (worker_execute_merge_group)
  • Page pool pre-allocation

Benchmark Results

GitHub CI Benchmarks

| Dataset | Metric | main | This PR | Change |
|---|---|---|---|---|
| MS MARCO (8.8M) | Build time | 225 s | 218 s | -3% |
| MS MARCO (8.8M) | Index size | 1,360 MB | 1,215 MB | -11% |
| MS MARCO (8.8M) | Query avg | 7.49 ms | 7.92 ms | +6% |
| Wikipedia (100K) | Build time | 12,430 ms | 10,085 ms | -19% |
| Wikipedia (100K) | Index size | 39 MB | 39 MB | same |
| Wikipedia (100K) | Query avg | 0.33 ms | 0.32 ms | -3% |
| Cranfield (1.4K) | Build time | 253 ms | 237 ms | -6% |

All validations PASSED.
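The Change column above reads as the relative difference between the two measurements, (This PR - main) / main, rounded to a whole percent. A small sketch of that arithmetic follows; `percent_change` is a hypothetical helper written for illustration, not part of the benchmark tooling.

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent.
 * Negative values mean the new measurement is smaller. */
static double percent_change(double before, double after)
{
    assert(before != 0.0); /* the baseline must be non-zero */
    return (after - before) / before * 100.0;
}
```

For the MS MARCO build time, `percent_change(225.0, 218.0)` is about -3.1, which rounds to the -3% shown in the table.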

Local MS-MARCO v2 (138M rows, 4 workers)

| Metric | Previous | This PR | Change |
|---|---|---|---|
| Total build time | ~60 min | 27 min | -54% |
| Index size | 17 GB | 17 GB | same |
| Segments produced | 2 L1 | 1 L0 | simpler |

Test plan

  • make installcheck — all 48 SQL tests pass
  • CI green (PG17, PG18, sanitizers, formatting, coverage)
  • GitHub benchmark workflow passes with validation
  • Local MS-MARCO v2 (138M rows) builds successfully
  • Self code-review: fixed FlushRelationBuffers ordering, merge_source_close usage, removed dead field
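The cleanup fixes from the self code-review (closing through merge_source_close() and setting a readers[] slot to NULL once a source takes ownership of it) follow a common ownership-transfer pattern. Here is a minimal sketch of that pattern, assuming hypothetical stand-in types and functions (`Reader`, `Source`, `reader_open`, `reader_close`, `source_init`, `source_close`, `cleanup`); none of these are the extension's real definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; close_count lets a caller observe every close. */
typedef struct { int id; int *close_count; } Reader;
typedef struct { Reader *reader; } Source;

static Reader *reader_open(int id, int *close_count)
{
    Reader *r = malloc(sizeof *r);
    if (r == NULL)
        return NULL;
    r->id = id;
    r->close_count = close_count;
    return r;
}

static void reader_close(Reader *r)
{
    if (r == NULL)
        return; /* closing an empty slot is a no-op */
    (*r->close_count)++;
    free(r);
}

/* On success the source owns the reader and the caller's slot is cleared,
 * so a later sweep over the caller's array cannot close it a second time. */
static int source_init(Source *src, Reader **slot)
{
    if (*slot == NULL)
        return -1;
    src->reader = *slot;
    *slot = NULL; /* ownership moved to the source */
    return 0;
}

static void source_close(Source *src)
{
    reader_close(src->reader);
    src->reader = NULL;
}

/* Close every source, then any readers that were never handed off. */
static void cleanup(Source *sources, size_t nsources,
                    Reader **readers, size_t nreaders)
{
    for (size_t i = 0; i < nsources; i++)
        source_close(&sources[i]);
    for (size_t i = 0; i < nreaders; i++) {
        reader_close(readers[i]);
        readers[i] = NULL;
    }
}
```

With three readers opened and two of them handed to sources, `cleanup` closes each reader exactly once. Without the `*slot = NULL` assignment, the second loop would free the two transferred readers again.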

Replace the two-phase parallel build (worker compaction + work-stealing
merge groups) with a simpler architecture: workers flush L0 segments to
BufFiles without compaction, then the leader performs a single N-way
merge of all segments directly to paged storage.

This produces a single segment per index build, which is optimal for
query performance (no multi-segment scanning needed).

Removes ~800 lines of complexity: worker_maybe_compact_level,
plan_merge_groups, compute_total_pages_needed,
worker_execute_merge_group, write_temp_segment_to_index_parallel,
and all Phase 2 worker code (COPY + work-steal).

- Move FlushRelationBuffers before metapage update for consistency
  with merge.c pattern (segment data durable before metadata)
- Use merge_source_close() instead of manual pfree for cleanup
- NULL readers[] slot when source takes ownership to prevent
  double-close
- Remove dead segment_count field from TpParallelWorkerResult

Remove code rendered dead by the leader-only merge refactor:

- Remove BufFile write path from merge sink (is_buffile branches,
  buffile_write_at, merge_sink_init_buffile, merge_sink_init_pages_parallel)
- Remove atomic page counter from TpSegmentWriter (page_counter field,
  tp_segment_writer_init_parallel, write_page_index_with_counter,
  tp_page_index_entries_per_page)
- Remove dead includes and extern from build_parallel.c
- Simplify TpMergeSink struct to pages-only fields

5 files changed, -209 lines net.

The "merging N segments from M workers" message is noisy and
not useful for end users. The launched-workers message already
indicates parallel build is active.

tjgreen42 force-pushed the experiment/leader-only-merge branch from 68177d8 to ee9c738 on March 2, 2026 04:51
tjgreen42 marked this pull request as ready for review on March 2, 2026 04:56
tjgreen42 changed the title from "experiment: leader-only merge for parallel index build" to "feat: leader-only merge for parallel index build" on Mar 2, 2026
tjgreen42 merged commit 44897e4 into main on Mar 2, 2026
15 checks passed
tjgreen42 deleted the experiment/leader-only-merge branch on March 2, 2026 16:35
tjgreen42 added a commit that referenced this pull request Mar 3, 2026
## Summary
- Update comparison page with results from benchmark run
[22642807624](https://github.com/timescale/pg_textsearch/actions/runs/22642807624)
- Overall throughput improved from 2.8x to 3.2x faster than System X
- Build time gap narrowed from 2.0x to 1.6x (270s → 234s)
- Key improvements since Feb 9: SIMD bitpack decoding (#250),
stack-allocated decode buffers (#253), BMW term state pointer
indirection (#249), arena allocator rewrite (#231), leader-only merge
(#244)

## Testing
- Numbers extracted from benchmark run on commit 1b09cc9
- gh-pages branch also needs updating (will push after merge)