perf: cache skip entries and compressed buffer in BMW inner loop#274
Merged
Three optimizations to reduce per-block overhead in the WAND traversal:

1. Pre-load all skip entries into an array during init_segment_term_states. load_block uses the cache instead of reading from the buffer pool (avoiding pin/unpin/lock per block load).
2. Allocate a single reusable compressed-data buffer per term iterator, eliminating palloc/pfree of TP_MAX_COMPRESSED_BLOCK_SIZE per block.
3. Cache current doc_id in TpTermState.cur_doc_id, updated inline by advance_term_iterator and seek_term_to_doc, replacing triple indirection through iter→block_postings→[offset].doc_id.

MS-MARCO v2 (138M docs) p50 latency improvement by token bucket:

| Bucket | Before | After | Improvement | vs Parade |
|--------|--------|-------|-------------|-----------|
| 1-token | 5.71ms | 5.48ms | 4% | 10.9x faster |
| 2-token | 11.70ms | 10.03ms | 14% | 5.9x faster |
| 3-token | 26.36ms | 20.48ms | 22% | 3.8x faster |
| 4-token | 56.35ms | 42.38ms | 25% | 2.3x faster |
| 5-token | 90.51ms | 68.16ms | 25% | 1.8x faster |
| 6-token | 132.80ms | 103.54ms | 22% | 1.4x faster |
| 7-token | 201.40ms | 157.07ms | 22% | 1.1x faster |
| 8-token | 234.13ms | 178.51ms | 24% | 1.1x faster |

Ground truth validation: 400/400 queries pass.
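The three optimizations can be illustrated with a small standalone sketch. This is not the pg_textsearch code: every `Sketch*` / `sketch_*` name, the block size, and the uncompressed in-memory postings array are assumptions made for illustration, and `malloc`/`memcpy` stand in for the buffer-pool read and block decompression. It shows the skip entries loaded once at init, one block buffer reused for every load, and a `cur_doc_id` field kept up to date by advance and seek.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_BLOCK_SIZE 4          /* postings per block (hypothetical) */
#define SKETCH_EXHAUSTED UINT32_MAX  /* sentinel: iterator has no more docs */

typedef struct { uint32_t last_doc_id; uint32_t offset; } SketchSkipEntry;

typedef struct {
    const uint32_t  *postings;    /* sorted doc ids; stands in for stored blocks */
    uint32_t         n_postings;
    SketchSkipEntry *skip_cache;  /* (1) all skip entries, loaded once at init */
    uint32_t         n_blocks;
    uint32_t        *block_buf;   /* (2) single buffer reused for every block */
    uint32_t         cur_block, block_len, pos;
    uint32_t         cur_doc_id;  /* (3) cached current doc id */
} SketchTermState;

/* Load block b using only the cached skip entries and the reusable buffer. */
static void sketch_load_block(SketchTermState *s, uint32_t b)
{
    uint32_t start = s->skip_cache[b].offset;
    uint32_t end = (b + 1 < s->n_blocks) ? s->skip_cache[b + 1].offset
                                         : s->n_postings;
    s->block_len = end - start;
    memcpy(s->block_buf, s->postings + start, s->block_len * sizeof(uint32_t));
    s->cur_block = b;
    s->pos = 0;
    s->cur_doc_id = s->block_buf[0];
}

/* Build the skip cache, allocate the buffer once, and load the first block. */
static int sketch_init(SketchTermState *s, const uint32_t *postings, uint32_t n)
{
    assert(n > 0);
    memset(s, 0, sizeof(*s));
    s->postings = postings;
    s->n_postings = n;
    s->n_blocks = (n + SKETCH_BLOCK_SIZE - 1) / SKETCH_BLOCK_SIZE;
    s->skip_cache = malloc(s->n_blocks * sizeof(SketchSkipEntry));
    s->block_buf = malloc(SKETCH_BLOCK_SIZE * sizeof(uint32_t));
    if (s->skip_cache == NULL || s->block_buf == NULL) {
        free(s->skip_cache);
        free(s->block_buf);
        return -1;
    }
    for (uint32_t b = 0; b < s->n_blocks; b++) {
        uint32_t start = b * SKETCH_BLOCK_SIZE;
        uint32_t end = start + SKETCH_BLOCK_SIZE;
        if (end > n)
            end = n;
        s->skip_cache[b].offset = start;
        s->skip_cache[b].last_doc_id = postings[end - 1];
    }
    sketch_load_block(s, 0);
    return 0;
}

/* Step to the next posting, updating cur_doc_id inline. */
static void sketch_advance(SketchTermState *s)
{
    if (s->cur_doc_id == SKETCH_EXHAUSTED)
        return;
    if (++s->pos < s->block_len)
        s->cur_doc_id = s->block_buf[s->pos];
    else if (s->cur_block + 1 < s->n_blocks)
        sketch_load_block(s, s->cur_block + 1);
    else
        s->cur_doc_id = SKETCH_EXHAUSTED;
}

/* Move to the first doc id >= target, skipping whole blocks via the cache. */
static void sketch_seek(SketchTermState *s, uint32_t target)
{
    uint32_t b = s->cur_block;
    if (s->cur_doc_id == SKETCH_EXHAUSTED || s->cur_doc_id >= target)
        return;
    while (b < s->n_blocks && s->skip_cache[b].last_doc_id < target)
        b++;
    if (b == s->n_blocks) {
        s->cur_doc_id = SKETCH_EXHAUSTED;
        return;
    }
    if (b != s->cur_block)
        sketch_load_block(s, b);
    while (s->block_buf[s->pos] < target)
        s->pos++;
    s->cur_doc_id = s->block_buf[s->pos];
}

/* Release the per-term allocations made in sketch_init. */
static void sketch_free(SketchTermState *s)
{
    free(s->skip_cache);
    free(s->block_buf);
    s->skip_cache = NULL;
    s->block_buf = NULL;
}
```

In a WAND-style loop the caller would compare `s.cur_doc_id` across terms directly, without touching the block buffer, and call `sketch_seek` to jump ahead; blocks whose cached `last_doc_id` is below the target are never loaded.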
- Move cache lifetime ownership to BMW (cleanup_segment_term_states frees cached_skip_entries and compressed_buf_cache). Iterator treats them as borrowed pointers and only NULLs them on free. Fixes a memory leak of cached_skip_entries.
- Add UINT32_MAX-1 guard in build_context.c to match the existing check in docmap.c, since UINT32_MAX is used as a sentinel value for exhausted iterators.
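A minimal sketch of both points, under stated assumptions: the `Sketch*` types and `sketch_*` functions are hypothetical stand-ins, `malloc`/`free` replace PostgreSQL's allocator, and the exact boundary of the doc-id guard in build_context.c is assumed rather than taken from the source. The owner allocates and frees the caches, the iterator only clears its borrowed pointers, and doc-id assignment stops before UINT32_MAX can be handed out as a real id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_EXHAUSTED UINT32_MAX  /* reserved sentinel, never a real doc id */

/* Owner side: allocates the caches and is the only place that frees them. */
typedef struct { uint32_t *skip_cache; uint8_t *compressed_buf; } SketchOwner;

/* Iterator side: holds borrowed pointers and must never free them. */
typedef struct { uint32_t *skip_cache; uint8_t *compressed_buf; } SketchIter;

static bool sketch_owner_init(SketchOwner *o, size_t n_skip, size_t buf_size)
{
    o->skip_cache = malloc(n_skip * sizeof(uint32_t));
    o->compressed_buf = malloc(buf_size);
    if (o->skip_cache == NULL || o->compressed_buf == NULL) {
        free(o->skip_cache);
        free(o->compressed_buf);
        o->skip_cache = NULL;
        o->compressed_buf = NULL;
        return false;
    }
    return true;
}

/* Lend the owner's caches to an iterator. */
static void sketch_iter_attach(SketchIter *it, const SketchOwner *o)
{
    it->skip_cache = o->skip_cache;
    it->compressed_buf = o->compressed_buf;
}

/* Iterator teardown: drop the borrowed pointers, free nothing. */
static void sketch_iter_free(SketchIter *it)
{
    it->skip_cache = NULL;
    it->compressed_buf = NULL;
}

/* Owner teardown: the single place where the caches are released. */
static void sketch_owner_cleanup(SketchOwner *o)
{
    free(o->skip_cache);
    free(o->compressed_buf);
    o->skip_cache = NULL;
    o->compressed_buf = NULL;
}

/* Assign the next doc id, refusing once the space below the sentinel is
 * used up (assumed boundary: ids stop at UINT32_MAX - 2). */
static bool sketch_assign_doc_id(uint32_t *next_id, uint32_t *out)
{
    if (*next_id >= UINT32_MAX - 1)
        return false;
    *out = (*next_id)++;
    return true;
}
```

Keeping a single owner means freeing an iterator early, or freeing it twice, cannot leak or double-free the caches, because the iterator never held ownership in the first place.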
Handle theoretical load_block failure (data corruption) instead of dereferencing a NULL block_postings pointer. Matches the defensive pattern already used in seek_term_to_doc.
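The defensive pattern might look like the sketch below. The types, the `simulate_corruption` test hook, and the choice to mark the iterator exhausted on failure are all assumptions for illustration; the source only states that a failed load_block is handled rather than dereferencing a NULL block_postings pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EXHAUSTED UINT32_MAX  /* sentinel for an exhausted iterator */

typedef struct { uint32_t doc_id; } SketchPosting;

typedef struct {
    SketchPosting *block_postings;      /* NULL when no block is loaded */
    uint32_t       block_len;
    uint32_t       pos;
    uint32_t       cur_doc_id;
    bool           simulate_corruption; /* hypothetical hook to force a failure */
} SketchIter;

/* Fixed data standing in for a decoded block. */
static SketchPosting sketch_block[] = { {7}, {11} };

/* Returns false, leaving block_postings NULL, when the block cannot be read. */
static bool sketch_load_block(SketchIter *it)
{
    if (it->simulate_corruption) {
        it->block_postings = NULL;
        it->block_len = 0;
        return false;
    }
    it->block_postings = sketch_block;
    it->block_len = 2;
    it->pos = 0;
    return true;
}

/* Check the load result before touching block_postings; on failure, mark
 * the iterator exhausted instead of dereferencing a NULL pointer. */
static bool sketch_advance_to_next_block(SketchIter *it)
{
    if (!sketch_load_block(it) || it->block_postings == NULL) {
        it->cur_doc_id = SKETCH_EXHAUSTED;
        return false;
    }
    it->cur_doc_id = it->block_postings[0].doc_id;
    return true;
}
```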
tjgreen42 added a commit that referenced this pull request Mar 10, 2026
BMW cache optimizations (PR #274) improved multi-token query latency by 20-25%, making pg_textsearch faster than System X across all 8 token buckets at p50.

Key changes:
- Weighted p50: 47.62ms -> 40.61ms (2.0x -> 2.3x vs System X)
- 8+ tokens p50: 212ms -> 178ms (was 0.9x, now 1.1x vs System X)
- Single-client throughput: 70ms/q -> 63ms/q (1.5x -> 1.7x)
- p95 improved on 1-5 token buckets; 6-8+ still mixed
tjgreen42 added a commit that referenced this pull request Mar 10, 2026
## Summary

- Update comparison page and summary.md with post-PR #274 benchmark numbers
- pg_textsearch now **faster across all 8 token buckets** at p50 (was losing on bucket 8+)
- Weighted p50 improved from 2.0x to **2.3x** vs System X

## Key number changes

| Metric | Before | After |
|--------|--------|-------|
| Weighted p50 | 47.62ms (2.0x) | 40.61ms (2.3x) |
| 7-token p50 | 163ms (1.0x) | 159ms (1.1x) |
| 8+ token p50 | 212ms (0.9x) | 178ms (1.1x) |
| Throughput | 70ms/q (1.5x) | 63ms/q (1.7x) |

## Test plan

- [x] Benchmark run twice for consistency on same hardware/config as original
- [x] System X numbers unchanged (same hardware, not re-run)
tjgreen42 added a commit that referenced this pull request Mar 10, 2026
BMW cache optimizations (PR #274) improved multi-token query latency by 20-25%. pg_textsearch now faster than System X across all 8 token buckets at p50. Weighted p50: 40.61ms vs 94.36ms (2.3x faster).
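For readers checking the headline figures, the arithmetic can be reproduced with a few helper functions. The function names are hypothetical and the formulas are an assumption about how the figures were derived: percentage improvement as the relative drop in latency, and the "x faster" figure as the ratio of the baseline latency to pg_textsearch's latency.

```c
#include <assert.h>

/* Relative latency reduction, in percent. */
static double improvement_pct(double before_ms, double after_ms)
{
    assert(before_ms > 0.0);
    return 100.0 * (before_ms - after_ms) / before_ms;
}

/* How many times faster "ours" is than the baseline. */
static double speedup(double baseline_ms, double ours_ms)
{
    assert(ours_ms > 0.0);
    return baseline_ms / ours_ms;
}

/* Round a non-negative value to the nearest integer. */
static int round_nearest(double x)
{
    return (int)(x + 0.5);
}
```

With the reported weighted p50 values, `speedup(94.36, 40.61)` is about 2.32, which rounds to the quoted 2.3x, and `improvement_pct(47.62, 40.61)` is about 14.7%.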
Closed
Summary
- Pre-load skip entries into a cache: load_block reads from the cache instead of hitting the buffer pool (avoids pin/unpin/lock per block)
- Cache the current doc_id in TpTermState.cur_doc_id, updated inline by advance/seek, replacing triple indirection through iter→block_postings→[offset].doc_id

Benchmark (MS-MARCO v2, 138M docs)
Test plan