perf: stack-allocate decode buffers in tp_decompress_block #253

Merged
tjgreen42 merged 1 commit into main from perf/stack-alloc-decompress-buffers
Mar 3, 2026

@tjgreen42 commented Mar 3, 2026

Summary

  • Replace palloc/pfree of two temporary uint32 arrays (doc_deltas, frequencies) in tp_decompress_block with fixed-size stack arrays of TP_BLOCK_SIZE (128) elements
  • Each array is 512 bytes (1 KiB total on the stack) and was being heap-allocated on every block decompression, a hot path at 6.5% of CPU in profiling
  • TP_BLOCK_SIZE is a #define constant, so these are NOT VLAs
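
For illustration, the change described above can be sketched as the before/after pair below. This is a minimal standalone sketch, not the actual pg_textsearch code: `malloc`/`free` stand in for `palloc`/`pfree` so it compiles outside PostgreSQL, and `posting_t`, `unpack_u32`, `emit_postings`, and both function signatures are hypothetical names invented for the example. Only `TP_BLOCK_SIZE`, `doc_deltas`, and `frequencies` come from the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TP_BLOCK_SIZE 128 /* compile-time constant, so the arrays below are not VLAs */

/* Hypothetical output record; the real struct layout is not shown in the PR. */
typedef struct {
    uint32_t doc_id;
    uint32_t frequency;
} posting_t;

/* Hypothetical stand-in for the real unpacking step: a plain copy. */
static void unpack_u32(const uint32_t *src, uint32_t *dst, size_t count)
{
    memcpy(dst, src, count * sizeof(uint32_t));
}

/* Turn delta-encoded doc IDs into absolute doc IDs and attach frequencies. */
static void emit_postings(const uint32_t *doc_deltas, const uint32_t *frequencies,
                          size_t count, uint32_t base_doc_id, posting_t *out)
{
    uint32_t doc_id = base_doc_id;
    for (size_t i = 0; i < count; i++) {
        doc_id += doc_deltas[i];
        out[i].doc_id = doc_id;
        out[i].frequency = frequencies[i];
    }
}

/* Before: two heap allocations and two frees on every call
 * (malloc/free used here in place of palloc/pfree). */
size_t decompress_block_heap(const uint32_t *packed_deltas, const uint32_t *packed_freqs,
                             size_t count, uint32_t base_doc_id, posting_t *out)
{
    assert(count <= TP_BLOCK_SIZE);
    uint32_t *doc_deltas = malloc(TP_BLOCK_SIZE * sizeof(uint32_t));
    uint32_t *frequencies = malloc(TP_BLOCK_SIZE * sizeof(uint32_t));
    if (doc_deltas == NULL || frequencies == NULL) {
        free(doc_deltas);
        free(frequencies);
        return 0;
    }
    unpack_u32(packed_deltas, doc_deltas, count);
    unpack_u32(packed_freqs, frequencies, count);
    emit_postings(doc_deltas, frequencies, count, base_doc_id, out);
    free(doc_deltas);
    free(frequencies);
    return count;
}

/* After: the same scratch space lives on the stack (2 * 128 * 4 = 1024 bytes). */
size_t decompress_block_stack(const uint32_t *packed_deltas, const uint32_t *packed_freqs,
                              size_t count, uint32_t base_doc_id, posting_t *out)
{
    assert(count <= TP_BLOCK_SIZE);
    uint32_t doc_deltas[TP_BLOCK_SIZE];
    uint32_t frequencies[TP_BLOCK_SIZE];
    unpack_u32(packed_deltas, doc_deltas, count);
    unpack_u32(packed_freqs, frequencies, count);
    emit_postings(doc_deltas, frequencies, count, base_doc_id, out);
    return count;
}
```

Both variants produce identical output; the stack version simply skips the allocator round trip. Because the array bound is a `#define`, the compiler knows the frame size up front, which is what distinguishes these arrays from VLAs.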

Test plan

  • make clean && make compiles with zero new warnings
  • All 49 SQL regression tests pass (make installcheck)
  • make format-check passes
  • MS-MARCO v2 benchmark to measure latency improvement

Replace palloc/pfree of two temporary uint32 arrays (doc_deltas,
frequencies) with fixed-size stack arrays of TP_BLOCK_SIZE (128)
elements. These 512-byte arrays (1 KiB total) are allocated on every
block decompression call, and since TP_BLOCK_SIZE is a compile-time
constant, they are safe VLA-free stack allocations.

Eliminates allocator overhead on the hot path where profiling shows
tp_decompress_block at 6.5% of CPU.
@tjgreen42

Benchmark Results — MS-MARCO v2 (138M passages, 691 queries, LIMIT 10)

Back-to-back runs on the same machine (16 cores, 123 GB RAM, PG17), identical query set.

Per-Bucket Latency (p50 ms)

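| Tokens | Baseline (main) | Patched | Delta |
|--------|-----------------|---------|-------|
| 1      | 5.43            | 5.37    | -1.1% |
| 2      | 11.00           | 11.14   | +1.3% |
| 3      | 25.78           | 25.43   | -1.4% |
| 4      | 53.12           | 52.55   | -1.1% |
| 5      | 91.11           | 84.85   | -6.9% |
| 6      | 102.76          | 102.80  | ~0%   |
| 7      | 164.79          | 164.18  | -0.4% |
| 8+     | 211.46          | 212.49  | +0.5% |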

Summary

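| Metric                                  | Baseline  | Patched   | Delta |
|-----------------------------------------|-----------|-----------|-------|
| Weighted p50                            | 49.53 ms  | 48.41 ms  | -2.3% |
| Weighted avg                            | 56.46 ms  | 56.24 ms  | -0.4% |
| Throughput (avg/query)                  | 69.62 ms  | 69.42 ms  | -0.3% |
| Throughput (median batch, 691 queries)  | 48,107 ms | 47,969 ms | -0.3% |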

The end-to-end improvement is modest, as expected: tp_decompress_block is 6.5% of CPU, and this change only eliminates the palloc/pfree overhead within it. The main value is removing unnecessary allocator traffic on the hottest path, which reduces memory context churn under concurrent workloads.
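
The arithmetic behind the reasoning above can be sketched as follows. The two helpers are hypothetical names written for this note, not part of the benchmark tooling. The first reproduces the Delta column. The second applies an Amdahl-style bound, under the simplifying assumption that the 6.5% CPU share translates directly into a 6.5% share of query time.

```c
#include <assert.h>

/* Relative change of patched vs. baseline, in percent (negative means faster). */
double delta_percent(double baseline, double patched)
{
    return (patched - baseline) / baseline * 100.0;
}

/* Amdahl-style estimate: percent reduction in total time when a component
 * taking `fraction` of the total is made `local_speedup` times faster.
 * Assumes the rest of the workload is unchanged. */
double amdahl_reduction_percent(double fraction, double local_speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0 && local_speedup > 0.0);
    double new_time = (1.0 - fraction) + fraction / local_speedup;
    return (1.0 - new_time) * 100.0;
}
```

For example, `delta_percent(49.53, 48.41)` is about -2.26, which matches the -2.3% weighted p50 delta in the table. Under the stated assumption, `amdahl_reduction_percent(0.065, 2.0)` is 3.25: even doubling the speed of tp_decompress_block would cut total time by only about 3.25%, and no speedup of that function alone could exceed 6.5%.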

@tjgreen42 tjgreen42 merged commit 1b09cc9 into main Mar 3, 2026
15 checks passed
@tjgreen42 tjgreen42 deleted the perf/stack-alloc-decompress-buffers branch March 3, 2026 20:54
tjgreen42 added a commit that referenced this pull request Mar 3, 2026
## Summary
- Update comparison page with results from benchmark run [22642807624](https://github.com/timescale/pg_textsearch/actions/runs/22642807624)
- Overall throughput improved from 2.8x to 3.2x faster than System X
- Build time gap narrowed from 2.0x to 1.6x (270s → 234s)
- Key improvements since Feb 9: SIMD bitpack decoding (#250), stack-allocated decode buffers (#253), BMW term state pointer indirection (#249), arena allocator rewrite (#231), leader-only merge (#244)

## Testing
- Numbers extracted from benchmark run on commit 1b09cc9
- gh-pages branch also needs updating (will push after merge)