
perf: cache skip entries and compressed buffer in BMW inner loop#274

Merged
tjgreen42 merged 3 commits into main from perf/bmw-cache-optimizations
Mar 10, 2026

Conversation

@tjgreen42
Collaborator

Summary

  • Pre-load all skip entries into a per-term array during BMW init; load_block reads from the cache instead of hitting the buffer pool (avoids pin/unpin/lock per block)
  • Allocate a single reusable compressed-data buffer per term, eliminating palloc/pfree of 898 bytes on every block load
  • Cache current doc_id in TpTermState.cur_doc_id, updated inline by advance/seek, replacing triple indirection through iter→block_postings→[offset].doc_id
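The first two bullets can be sketched together. This is a minimal, self-contained illustration only: the real pg_textsearch struct layouts are not shown in this PR, so the types and field sizes here are assumptions, and `malloc`/`memcpy` stand in for the PostgreSQL allocator and buffer-pool reads.

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Hypothetical, simplified types -- real layouts are not in the PR. */
typedef struct {
    uint32_t first_doc_id;   /* lowest doc_id in the block */
    uint32_t block_offset;   /* where the block's data lives */
} TpSkipEntry;

typedef struct {
    TpSkipEntry *cached_skip_entries;  /* pre-loaded copy, filled at init */
    int          n_blocks;
    uint8_t     *compressed_buf_cache; /* single reusable decode buffer */
} TpTermState;

/* Init: copy every skip entry out of the (simulated) on-disk page once,
 * so the per-block hot path never touches the buffer pool again. */
static void init_term_cache(TpTermState *ts, const TpSkipEntry *on_disk,
                            int n_blocks, size_t max_compressed)
{
    ts->n_blocks = n_blocks;
    ts->cached_skip_entries = malloc(n_blocks * sizeof(TpSkipEntry));
    memcpy(ts->cached_skip_entries, on_disk,
           n_blocks * sizeof(TpSkipEntry));
    /* One reusable buffer instead of an alloc/free per block load. */
    ts->compressed_buf_cache = malloc(max_compressed);
}

/* load_block-style lookup: reads the cache, no pin/unpin/lock. */
static const TpSkipEntry *skip_entry_for_block(const TpTermState *ts,
                                               int blk)
{
    return &ts->cached_skip_entries[blk];
}
```

The trade-off is a one-time copy and a small amount of per-term memory in exchange for removing buffer-manager traffic from the inner loop.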

Benchmark (MS-MARCO v2, 138M docs)

| Bucket | Before | After | Improvement | vs ParadeDB |
|--------|--------|-------|-------------|-------------|
| 1-token | 5.71ms | 5.48ms | 4% | 10.9x faster |
| 2-token | 11.70ms | 10.03ms | 14% | 5.9x faster |
| 3-token | 26.36ms | 20.48ms | 22% | 3.8x faster |
| 4-token | 56.35ms | 42.38ms | 25% | 2.3x faster |
| 5-token | 90.51ms | 68.16ms | 25% | 1.8x faster |
| 6-token | 132.80ms | 103.54ms | 22% | 1.4x faster |
| 7-token | 201.40ms | 157.07ms | 22% | 1.1x faster |
| 8-token | 234.13ms | 178.51ms | 24% | 1.1x faster |

Test plan

  • Ground truth validation: 400/400 queries pass (scores match within tolerance)
  • All 50 SQL regression tests pass
  • CI

Three optimizations to reduce per-block overhead in the WAND traversal:

1. Pre-load all skip entries into an array during init_segment_term_states.
   load_block uses the cache instead of reading from the buffer pool
   (avoiding pin/unpin/lock per block load).

2. Allocate a single reusable compressed-data buffer per term iterator,
   eliminating palloc/pfree of TP_MAX_COMPRESSED_BLOCK_SIZE per block.

3. Cache current doc_id in TpTermState.cur_doc_id, updated inline by
   advance_term_iterator and seek_term_to_doc, replacing triple
   indirection through iter→block_postings→[offset].doc_id.
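Item 3 can be shown in miniature. Only the names from the text (`cur_doc_id`, `block_postings`, `advance_term_iterator`) are taken from the PR; the struct shapes are assumptions for the sketch.

```c
#include <stdint.h>

/* Hypothetical layouts for illustration. */
typedef struct { uint32_t doc_id; } TpPosting;

typedef struct {
    TpPosting *block_postings;
    int        offset;
} TpTermIter;

typedef struct {
    TpTermIter *iter;
    uint32_t    cur_doc_id;  /* cached copy, updated on advance/seek */
} TpTermState;

/* Before: the hot loop read
 *     ts->iter->block_postings[ts->iter->offset].doc_id
 * (three dependent loads). After: advance updates the cached copy
 * inline, and the hot loop reads a single field of TpTermState. */
static void advance_term_iterator(TpTermState *ts)
{
    ts->iter->offset++;
    ts->cur_doc_id = ts->iter->block_postings[ts->iter->offset].doc_id;
}
```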

MS-MARCO v2 (138M docs) p50 latency improvement by token bucket:

  Bucket  | Before   | After    | Improvement | vs Parade
  1-token | 5.71ms   | 5.48ms   | 4%          | 10.9x faster
  2-token | 11.70ms  | 10.03ms  | 14%         | 5.9x faster
  3-token | 26.36ms  | 20.48ms  | 22%         | 3.8x faster
  4-token | 56.35ms  | 42.38ms  | 25%         | 2.3x faster
  5-token | 90.51ms  | 68.16ms  | 25%         | 1.8x faster
  6-token | 132.80ms | 103.54ms | 22%         | 1.4x faster
  7-token | 201.40ms | 157.07ms | 22%         | 1.1x faster
  8-token | 234.13ms | 178.51ms | 24%         | 1.1x faster

Ground truth validation: 400/400 queries pass.
- Move cache lifetime ownership to BMW (cleanup_segment_term_states
  frees cached_skip_entries and compressed_buf_cache). Iterator treats
  them as borrowed pointers and only NULLs them on free.
  Fixes a memory leak of cached_skip_entries.
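The ownership split described above could look roughly like this. Type and field names beyond `cached_skip_entries`, `compressed_buf_cache`, and `cleanup_segment_term_states` are hypothetical, and `free` stands in for `pfree`.

```c
#include <stdlib.h>

/* BMW-owned state: the caches live and die with it. */
typedef struct {
    void *cached_skip_entries;
    void *compressed_buf_cache;
} TpTermState;

/* Iterator holds borrowed references into the BMW state. */
typedef struct {
    void *skip_entries;    /* borrowed, never freed here */
    void *compressed_buf;  /* borrowed, never freed here */
} TpTermIter;

/* Owner side: BMW frees the caches exactly once. */
static void cleanup_segment_term_states(TpTermState *states, int nterms)
{
    for (int i = 0; i < nterms; i++) {
        free(states[i].cached_skip_entries);
        free(states[i].compressed_buf_cache);
        states[i].cached_skip_entries = NULL;
        states[i].compressed_buf_cache = NULL;
    }
}

/* Borrower side: on free the iterator only NULLs its references.
 * The owner still frees, so the old leak is fixed without
 * introducing a double free. */
static void free_term_iterator(TpTermIter *it)
{
    it->skip_entries = NULL;
    it->compressed_buf = NULL;
}
```

Having exactly one owner is what makes the leak fix safe: whichever side tears down first, the memory is freed once and only once.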

- Add UINT32_MAX-1 guard in build_context.c to match the existing
  check in docmap.c, since UINT32_MAX is used as a sentinel value
  for exhausted iterators.
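The sentinel guard amounts to rejecting any doc_id that would collide with the "exhausted" marker. A minimal sketch, assuming the sentinel is `UINT32_MAX` as stated (the function and macro names are hypothetical; the PR adds the check to build_context.c to match docmap.c):

```c
#include <stdint.h>
#include <stdbool.h>

/* UINT32_MAX is reserved as the "iterator exhausted" sentinel. */
#define TP_DOCID_EXHAUSTED UINT32_MAX

/* A doc_id is assignable only up to UINT32_MAX - 1; allowing
 * UINT32_MAX itself would make a real document indistinguishable
 * from an exhausted iterator during WAND traversal. */
static bool doc_id_assignable(uint32_t next_doc_id)
{
    return next_doc_id < TP_DOCID_EXHAUSTED;
}
```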
Handle theoretical load_block failure (data corruption) instead of
dereferencing a NULL block_postings pointer.  Matches the defensive
pattern already used in seek_term_to_doc.
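That defensive pattern can be sketched as follows. This is an illustration under assumptions (simplified types, a simulated load result passed in as a parameter), not the actual advance path:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { uint32_t doc_id; } TpPosting;

typedef struct {
    TpPosting *block_postings;
    uint32_t   cur_doc_id;
} TpTermState;

#define TP_DOCID_EXHAUSTED UINT32_MAX

/* If load_block fails (e.g. data corruption), mark the term exhausted
 * instead of dereferencing a NULL block_postings pointer -- the same
 * defensive pattern the commit says seek_term_to_doc already uses. */
static bool advance_with_block_check(TpTermState *ts, TpPosting *loaded)
{
    ts->block_postings = loaded;  /* result of the (simulated) load */
    if (ts->block_postings == NULL) {
        ts->cur_doc_id = TP_DOCID_EXHAUSTED;  /* fail safe, no deref */
        return false;
    }
    ts->cur_doc_id = ts->block_postings[0].doc_id;
    return true;
}
```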
@tjgreen42 tjgreen42 marked this pull request as ready for review March 10, 2026 18:03
@tjgreen42 tjgreen42 merged commit fb3b3b1 into main Mar 10, 2026
15 checks passed
@tjgreen42 tjgreen42 deleted the perf/bmw-cache-optimizations branch March 10, 2026 18:03
tjgreen42 added a commit that referenced this pull request Mar 10, 2026
BMW cache optimizations (PR #274) improved multi-token query latency
by 20-25%, making pg_textsearch faster than System X across all 8
token buckets at p50.

Key changes:
- Weighted p50: 47.62ms -> 40.61ms (2.0x -> 2.3x vs System X)
- 8+ tokens p50: 212ms -> 178ms (was 0.9x, now 1.1x vs System X)
- Single-client throughput: 70ms/q -> 63ms/q (1.5x -> 1.7x)
- p95 improved on 1-5 token buckets; 6-8+ still mixed
tjgreen42 added a commit that referenced this pull request Mar 10, 2026
## Summary

- Update comparison page and summary.md with post-PR #274 benchmark
numbers
- pg_textsearch now **faster across all 8 token buckets** at p50 (was
losing on bucket 8+)
- Weighted p50 improved from 2.0x to **2.3x** vs System X

## Key number changes

| Metric | Before | After |
|--------|--------|-------|
| Weighted p50 | 47.62ms (2.0x) | 40.61ms (2.3x) |
| 7-token p50 | 163ms (1.0x) | 159ms (1.1x) |
| 8+ token p50 | 212ms (0.9x) | 178ms (1.1x) |
| Throughput | 70ms/q (1.5x) | 63ms/q (1.7x) |

## Test plan

- [x] Benchmark run twice for consistency on same hardware/config as
original
- [x] System X numbers unchanged (same hardware, not re-run)
tjgreen42 added a commit that referenced this pull request Mar 10, 2026
BMW cache optimizations (PR #274) improved multi-token query latency
by 20-25%. pg_textsearch now faster than System X across all 8 token
buckets at p50. Weighted p50: 40.61ms vs 94.36ms (2.3x faster).
@tjgreen42 tjgreen42 mentioned this pull request Mar 12, 2026
