
perf: SIMD-accelerated bitpack decoding #250

Merged
tjgreen42 merged 1 commit into main from optimize/simd-bitpack-decode
Mar 3, 2026

Conversation

Collaborator

@tjgreen42 tjgreen42 commented Mar 3, 2026

Summary

  • Replaces the byte-at-a-time accumulator loop in bitpack_decode with branchless direct-indexed uint64 loads
  • SSE2 (x86-64) and NEON (ARM64) paths for vectorized mask+store of 4 values
  • Scalar fallback uses the same branchless approach on unsupported platforms

Motivation

Profiling on the 138M-passage MS-MARCO v2 corpus showed bitpack_decode at 19.7% of CPU time. The original code used a branchy byte-at-a-time accumulator. The new code computes each value's bit offset and does a single branchless uint64 load+shift+mask.

Closes #140

Test plan

  • All regression tests pass
  • CI passes (gcc, clang, sanitizer)
  • Benchmark on MS-MARCO v2

Replace the byte-at-a-time accumulator loop in bitpack_decode with
branchless direct-indexed uint64 loads. Each value is extracted by
computing its bit offset, loading 8 bytes from that position, and
applying a shift+mask -- eliminating all branches from the hot loop.

SIMD support (SSE2 on x86-64, NEON on ARM64) adds vectorized
mask+store for groups of 4 values. Unsupported platforms use the
same branchless scalar code.

Profiling on 138M MS-MARCO v2 passages showed bitpack_decode at
19.7% of CPU time for multi-token queries.

Closes #140
@tjgreen42 tjgreen42 merged commit ecad404 into main Mar 3, 2026
15 checks passed
@tjgreen42 tjgreen42 deleted the optimize/simd-bitpack-decode branch March 3, 2026 20:10
tjgreen42 added a commit that referenced this pull request Mar 3, 2026
## Summary
- Update comparison page with results from benchmark run
[22642807624](https://github.com/timescale/pg_textsearch/actions/runs/22642807624)
- Overall throughput improved from 2.8x to 3.2x faster than System X
- Build time gap narrowed from 2.0x to 1.6x (270s → 234s)
- Key improvements since Feb 9: SIMD bitpack decoding (#250),
stack-allocated decode buffers (#253), BMW term state pointer
indirection (#249), arena allocator rewrite (#231), leader-only merge
(#244)

## Testing
- Numbers extracted from benchmark run on commit 1b09cc9
- gh-pages branch also needs updating (will push after merge)
