perf: stack-allocate decode buffers in tp_decompress_block by tjgreen42 · Pull Request #253 · timescale/pg_textsearch

tjgreen42 · 2026-03-03T20:42:55Z

Summary

- Replace palloc/pfree of two temporary uint32 arrays (doc_deltas, frequencies) in tp_decompress_block with fixed-size stack arrays of TP_BLOCK_SIZE (128) elements
- Each array is 512 bytes (128 × 4-byte uint32), 1 KiB total on the stack; both were heap-allocated on every block decompression, a hot path at 6.5% of CPU in profiling
- TP_BLOCK_SIZE is a #define constant, so these are fixed-size arrays, not VLAs
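
To illustrate the shape of the change, here is a minimal standalone sketch rather than the actual patch. It assumes `malloc`/`free` as stand-ins for PostgreSQL's `palloc`/`pfree`, and `decode_values`, `sum_block_heap`, and `sum_block_stack` are hypothetical names invented for the example; only `TP_BLOCK_SIZE`, `doc_deltas`, and `frequencies` come from the PR text.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TP_BLOCK_SIZE 128 /* value stated in the PR summary */

/* Each buffer is 128 * 4 = 512 bytes, so two of them use 1 KiB of stack. */
_Static_assert(sizeof(uint32_t) * TP_BLOCK_SIZE == 512,
               "one decode buffer should be 512 bytes");

/* Hypothetical stand-in for the real decoder: writes seed, seed+1, ... */
static void decode_values(uint32_t *out, uint32_t count, uint32_t seed)
{
    for (uint32_t i = 0; i < count; i++)
        out[i] = seed + i;
}

/* Before: two heap allocations and two frees on every block.
 * malloc/free are assumptions standing in for palloc/pfree. */
uint64_t sum_block_heap(uint32_t count)
{
    uint32_t *doc_deltas = malloc(TP_BLOCK_SIZE * sizeof(uint32_t));
    uint32_t *frequencies = malloc(TP_BLOCK_SIZE * sizeof(uint32_t));
    uint64_t sum = 0;

    assert(count <= TP_BLOCK_SIZE);
    if (doc_deltas == NULL || frequencies == NULL)
    {
        free(doc_deltas);
        free(frequencies);
        return 0;
    }

    decode_values(doc_deltas, count, 1);
    decode_values(frequencies, count, 10);
    for (uint32_t i = 0; i < count; i++)
        sum += (uint64_t) doc_deltas[i] + frequencies[i];

    free(doc_deltas);
    free(frequencies);
    return sum;
}

/* After: the size is a compile-time constant, so these are ordinary
 * fixed-size arrays (not VLAs) and no allocator call is made. */
uint64_t sum_block_stack(uint32_t count)
{
    uint32_t doc_deltas[TP_BLOCK_SIZE];
    uint32_t frequencies[TP_BLOCK_SIZE];
    uint64_t sum = 0;

    assert(count <= TP_BLOCK_SIZE);
    decode_values(doc_deltas, count, 1);
    decode_values(frequencies, count, 10);
    for (uint32_t i = 0; i < count; i++)
        sum += (uint64_t) doc_deltas[i] + frequencies[i];

    return sum;
}
```

Both variants produce the same result for any `count` up to `TP_BLOCK_SIZE`; the stack version simply skips the allocate/free pair on each call.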

Test plan

- `make clean && make` compiles with zero new warnings
- All 49 SQL regression tests pass (`make installcheck`)
- `make format-check` passes
- MS-MARCO v2 benchmark to measure latency improvement

Replace palloc/pfree of two temporary uint32 arrays (doc_deltas, frequencies) with fixed-size stack arrays of TP_BLOCK_SIZE (128) elements. These 512-byte arrays (1 KiB total) were heap-allocated on every block decompression call. Since TP_BLOCK_SIZE is a compile-time constant, the replacements are ordinary fixed-size stack arrays, not VLAs. This eliminates allocator overhead on the hot path, where profiling shows tp_decompress_block at 6.5% of CPU.

tjgreen42 · 2026-03-03T20:53:26Z

Benchmark Results — MS-MARCO v2 (138M passages, 691 queries, LIMIT 10)

Back-to-back runs on the same machine (16 cores, 123 GB RAM, PG17), identical query set.

Per-Bucket Latency (p50 ms)

Tokens	Baseline (main)	Patched	Delta
1	5.43	5.37	-1.1%
2	11.00	11.14	+1.3%
3	25.78	25.43	-1.4%
4	53.12	52.55	-1.1%
5	91.11	84.85	-6.9%
6	102.76	102.80	~0%
7	164.79	164.18	-0.4%
8+	211.46	212.49	+0.5%

Summary

Metric	Baseline	Patched	Delta
Weighted p50	49.53 ms	48.41 ms	-2.3%
Weighted avg	56.46 ms	56.24 ms	-0.4%
Throughput (avg/query)	69.62 ms	69.42 ms	-0.3%
Throughput (median batch, 691 queries)	48,107 ms	47,969 ms	-0.3%

Modest end-to-end improvement, as expected: tp_decompress_block accounts for 6.5% of CPU, and this change eliminates only the palloc/pfree overhead within it. The main value is removing unnecessary allocator traffic on the hottest path, which reduces memory context churn under concurrent workloads.
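
The arithmetic behind that expectation can be sketched as follows. This is an illustrative Amdahl-style bound, not part of the benchmark tooling; `pct_delta` and `max_time_saved` are hypothetical helper names, and the inputs are the figures quoted above.

```c
#include <assert.h>

/* Relative change in percent from a baseline to a patched measurement.
 * Negative values mean the patched run was faster. */
double pct_delta(double baseline, double patched)
{
    return (patched - baseline) / baseline * 100.0;
}

/* Upper bound on the fraction of total time that can be saved when
 * `removed_share` of a function's work is eliminated and that function
 * accounts for `hot_fraction` of total CPU time. */
double max_time_saved(double hot_fraction, double removed_share)
{
    return hot_fraction * removed_share;
}
```

With the reported numbers, `pct_delta(49.53, 48.41)` is about -2.3 (the weighted p50 row), and `max_time_saved(0.065, 1.0)` is 0.065: even removing tp_decompress_block entirely could save at most 6.5% of CPU time, so removing only its allocator calls has to save less than that.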

## Summary

- Update comparison page with results from benchmark run [22642807624](https://github.com/timescale/pg_textsearch/actions/runs/22642807624)
- Overall throughput improved from 2.8x to 3.2x faster than System X
- Build time gap narrowed from 2.0x to 1.6x (270s → 234s)
- Key improvements since Feb 9: SIMD bitpack decoding (#250), stack-allocated decode buffers (#253), BMW term state pointer indirection (#249), arena allocator rewrite (#231), leader-only merge (#244)

## Testing

- Numbers extracted from benchmark run on commit 1b09cc9
- gh-pages branch also needs updating (will push after merge)

tjgreen42 merged commit 1b09cc9 into main Mar 3, 2026
15 checks passed

tjgreen42 deleted the perf/stack-alloc-decompress-buffers branch March 3, 2026 20:54

tjgreen42 mentioned this pull request Mar 3, 2026

docs: update benchmark comparison with March 3 results #255

Merged
