
perf: cache skip entries and compressed buffer in BMW inner loop#274

Merged
tjgreen42 merged 3 commits into main from perf/bmw-cache-optimizations
Mar 10, 2026

Conversation

@tjgreen42
Collaborator

Summary

  • Pre-load all skip entries into a per-term array during BMW init; load_block reads from the cache instead of hitting the buffer pool (avoids pin/unpin/lock per block)
  • Allocate a single reusable compressed-data buffer per term, eliminating palloc/pfree of 898 bytes on every block load
  • Cache current doc_id in TpTermState.cur_doc_id, updated inline by advance/seek, replacing triple indirection through iter→block_postings→[offset].doc_id
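The first two bullets can be sketched together. This is a minimal, self-contained illustration only: the real pg_textsearch struct layouts are not shown in this PR, so the types and field sizes here are assumptions, and `malloc`/`memcpy` stand in for the PostgreSQL allocator and buffer-pool reads.

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Hypothetical, simplified types -- real layouts are not in the PR. */
typedef struct {
    uint32_t first_doc_id;   /* lowest doc_id in the block */
    uint32_t block_offset;   /* where the block's data lives */
} TpSkipEntry;

typedef struct {
    TpSkipEntry *cached_skip_entries;  /* pre-loaded copy, filled at init */
    int          n_blocks;
    uint8_t     *compressed_buf_cache; /* single reusable decode buffer */
} TpTermState;

/* Init: copy every skip entry out of the (simulated) on-disk page once,
 * so the per-block hot path never touches the buffer pool again. */
static void init_term_cache(TpTermState *ts, const TpSkipEntry *on_disk,
                            int n_blocks, size_t max_compressed)
{
    ts->n_blocks = n_blocks;
    ts->cached_skip_entries = malloc(n_blocks * sizeof(TpSkipEntry));
    memcpy(ts->cached_skip_entries, on_disk,
           n_blocks * sizeof(TpSkipEntry));
    /* One reusable buffer instead of an alloc/free per block load. */
    ts->compressed_buf_cache = malloc(max_compressed);
}

/* load_block-style lookup: reads the cache, no pin/unpin/lock. */
static const TpSkipEntry *skip_entry_for_block(const TpTermState *ts,
                                               int blk)
{
    return &ts->cached_skip_entries[blk];
}
```

The trade-off is a one-time copy and a small amount of per-term memory in exchange for removing buffer-manager traffic from the inner loop.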

Benchmark (MS-MARCO v2, 138M docs)

| Bucket | Before | After | Improvement | vs ParadeDB |
|--------|--------|-------|-------------|-------------|
| 1-token | 5.71ms | 5.48ms | 4% | 10.9x faster |
| 2-token | 11.70ms | 10.03ms | 14% | 5.9x faster |
| 3-token | 26.36ms | 20.48ms | 22% | 3.8x faster |
| 4-token | 56.35ms | 42.38ms | 25% | 2.3x faster |
| 5-token | 90.51ms | 68.16ms | 25% | 1.8x faster |
| 6-token | 132.80ms | 103.54ms | 22% | 1.4x faster |
| 7-token | 201.40ms | 157.07ms | 22% | 1.1x faster |
| 8-token | 234.13ms | 178.51ms | 24% | 1.1x faster |

Test plan

  • Ground truth validation: 400/400 queries pass (scores match within tolerance)
  • All 50 SQL regression tests pass
  • CI

Three optimizations to reduce per-block overhead in the WAND traversal:

1. Pre-load all skip entries into an array during init_segment_term_states.
   load_block uses the cache instead of reading from the buffer pool
   (avoiding pin/unpin/lock per block load).

2. Allocate a single reusable compressed-data buffer per term iterator,
   eliminating palloc/pfree of TP_MAX_COMPRESSED_BLOCK_SIZE per block.

3. Cache current doc_id in TpTermState.cur_doc_id, updated inline by
   advance_term_iterator and seek_term_to_doc, replacing triple
   indirection through iter→block_postings→[offset].doc_id.
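Item 3 can be shown in miniature. Only the names from the text (`cur_doc_id`, `block_postings`, `advance_term_iterator`) are taken from the PR; the struct shapes are assumptions for the sketch.

```c
#include <stdint.h>

/* Hypothetical layouts for illustration. */
typedef struct { uint32_t doc_id; } TpPosting;

typedef struct {
    TpPosting *block_postings;
    int        offset;
} TpTermIter;

typedef struct {
    TpTermIter *iter;
    uint32_t    cur_doc_id;  /* cached copy, updated on advance/seek */
} TpTermState;

/* Before: the hot loop read
 *     ts->iter->block_postings[ts->iter->offset].doc_id
 * (three dependent loads). After: advance updates the cached copy
 * inline, and the hot loop reads a single field of TpTermState. */
static void advance_term_iterator(TpTermState *ts)
{
    ts->iter->offset++;
    ts->cur_doc_id = ts->iter->block_postings[ts->iter->offset].doc_id;
}
```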

MS-MARCO v2 (138M docs) p50 latency improvement by token bucket:

  Bucket  | Before   | After    | Improvement | vs Parade
  1-token | 5.71ms   | 5.48ms   | 4%          | 10.9x faster
  2-token | 11.70ms  | 10.03ms  | 14%         | 5.9x faster
  3-token | 26.36ms  | 20.48ms  | 22%         | 3.8x faster
  4-token | 56.35ms  | 42.38ms  | 25%         | 2.3x faster
  5-token | 90.51ms  | 68.16ms  | 25%         | 1.8x faster
  6-token | 132.80ms | 103.54ms | 22%         | 1.4x faster
  7-token | 201.40ms | 157.07ms | 22%         | 1.1x faster
  8-token | 234.13ms | 178.51ms | 24%         | 1.1x faster

Ground truth validation: 400/400 queries pass.
- Move cache lifetime ownership to BMW (cleanup_segment_term_states
  frees cached_skip_entries and compressed_buf_cache). Iterator treats
  them as borrowed pointers and only NULLs them on free.
  Fixes a memory leak of cached_skip_entries.
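The ownership split described above could look roughly like this. Type and field names beyond `cached_skip_entries`, `compressed_buf_cache`, and `cleanup_segment_term_states` are hypothetical, and `free` stands in for `pfree`.

```c
#include <stdlib.h>

/* BMW-owned state: the caches live and die with it. */
typedef struct {
    void *cached_skip_entries;
    void *compressed_buf_cache;
} TpTermState;

/* Iterator holds borrowed references into the BMW state. */
typedef struct {
    void *skip_entries;    /* borrowed, never freed here */
    void *compressed_buf;  /* borrowed, never freed here */
} TpTermIter;

/* Owner side: BMW frees the caches exactly once. */
static void cleanup_segment_term_states(TpTermState *states, int nterms)
{
    for (int i = 0; i < nterms; i++) {
        free(states[i].cached_skip_entries);
        free(states[i].compressed_buf_cache);
        states[i].cached_skip_entries = NULL;
        states[i].compressed_buf_cache = NULL;
    }
}

/* Borrower side: on free the iterator only NULLs its references.
 * The owner still frees, so the old leak is fixed without
 * introducing a double free. */
static void free_term_iterator(TpTermIter *it)
{
    it->skip_entries = NULL;
    it->compressed_buf = NULL;
}
```

Having exactly one owner is what makes the leak fix safe: whichever side tears down first, the memory is freed once and only once.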

- Add UINT32_MAX-1 guard in build_context.c to match the existing
  check in docmap.c, since UINT32_MAX is used as a sentinel value
  for exhausted iterators.
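The sentinel guard amounts to rejecting any doc_id that would collide with the "exhausted" marker. A minimal sketch, assuming the sentinel is `UINT32_MAX` as stated (the function and macro names are hypothetical; the PR adds the check to build_context.c to match docmap.c):

```c
#include <stdint.h>
#include <stdbool.h>

/* UINT32_MAX is reserved as the "iterator exhausted" sentinel. */
#define TP_DOCID_EXHAUSTED UINT32_MAX

/* A doc_id is assignable only up to UINT32_MAX - 1; allowing
 * UINT32_MAX itself would make a real document indistinguishable
 * from an exhausted iterator during WAND traversal. */
static bool doc_id_assignable(uint32_t next_doc_id)
{
    return next_doc_id < TP_DOCID_EXHAUSTED;
}
```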
Handle theoretical load_block failure (data corruption) instead of
dereferencing a NULL block_postings pointer.  Matches the defensive
pattern already used in seek_term_to_doc.
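That defensive pattern can be sketched as follows. This is an illustration under assumptions (simplified types, a simulated load result passed in as a parameter), not the actual advance path:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { uint32_t doc_id; } TpPosting;

typedef struct {
    TpPosting *block_postings;
    uint32_t   cur_doc_id;
} TpTermState;

#define TP_DOCID_EXHAUSTED UINT32_MAX

/* If load_block fails (e.g. data corruption), mark the term exhausted
 * instead of dereferencing a NULL block_postings pointer -- the same
 * defensive pattern the commit says seek_term_to_doc already uses. */
static bool advance_with_block_check(TpTermState *ts, TpPosting *loaded)
{
    ts->block_postings = loaded;  /* result of the (simulated) load */
    if (ts->block_postings == NULL) {
        ts->cur_doc_id = TP_DOCID_EXHAUSTED;  /* fail safe, no deref */
        return false;
    }
    ts->cur_doc_id = ts->block_postings[0].doc_id;
    return true;
}
```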
@tjgreen42 tjgreen42 marked this pull request as ready for review March 10, 2026 18:03
@tjgreen42 tjgreen42 merged commit fb3b3b1 into main Mar 10, 2026
15 checks passed
@tjgreen42 tjgreen42 deleted the perf/bmw-cache-optimizations branch March 10, 2026 18:03
tjgreen42 added a commit that referenced this pull request Mar 10, 2026
BMW cache optimizations (PR #274) improved multi-token query latency
by 20-25%, making pg_textsearch faster than System X across all 8
token buckets at p50.

Key changes:
- Weighted p50: 47.62ms -> 40.61ms (2.0x -> 2.3x vs System X)
- 8+ tokens p50: 212ms -> 178ms (was 0.9x, now 1.1x vs System X)
- Single-client throughput: 70ms/q -> 63ms/q (1.5x -> 1.7x)
- p95 improved on 1-5 token buckets; 6-8+ still mixed
tjgreen42 added a commit that referenced this pull request Mar 10, 2026
## Summary

- Update comparison page and summary.md with post-PR #274 benchmark
numbers
- pg_textsearch now **faster across all 8 token buckets** at p50 (was
losing on bucket 8+)
- Weighted p50 improved from 2.0x to **2.3x** vs System X

## Key number changes

| Metric | Before | After |
|--------|--------|-------|
| Weighted p50 | 47.62ms (2.0x) | 40.61ms (2.3x) |
| 7-token p50 | 163ms (1.0x) | 159ms (1.1x) |
| 8+ token p50 | 212ms (0.9x) | 178ms (1.1x) |
| Throughput | 70ms/q (1.5x) | 63ms/q (1.7x) |

## Test plan

- [x] Benchmark run twice for consistency on same hardware/config as
original
- [x] System X numbers unchanged (same hardware, not re-run)
tjgreen42 added a commit that referenced this pull request Mar 10, 2026
BMW cache optimizations (PR #274) improved multi-token query latency
by 20-25%. pg_textsearch now faster than System X across all 8 token
buckets at p50. Weighted p50: 40.61ms vs 94.36ms (2.3x faster).
@tjgreen42 tjgreen42 mentioned this pull request Mar 12, 2026
