
perf: SIMD-accelerated bitpack decoding #250

Merged
tjgreen42 merged 1 commit into main from optimize/simd-bitpack-decode
Mar 3, 2026

Conversation

Collaborator

@tjgreen42 tjgreen42 commented Mar 3, 2026

Summary

  • Replaces the byte-at-a-time accumulator loop in bitpack_decode with branchless direct-indexed uint64 loads
  • SSE2 (x86-64) and NEON (ARM64) paths for vectorized mask+store of 4 values
  • Scalar fallback uses the same branchless approach on unsupported platforms

Motivation

Profiling on the 138M-passage MS-MARCO v2 corpus showed bitpack_decode at 19.7% of CPU time. The original code used a branchy byte-at-a-time accumulator. The new code computes each value's bit offset and does a single branchless uint64 load+shift+mask.

Closes #140

Test plan

  • All regression tests pass
  • CI passes (gcc, clang, sanitizer)
  • Benchmark on MS-MARCO v2

Replace the byte-at-a-time accumulator loop in bitpack_decode with
branchless direct-indexed uint64 loads. Each value is extracted by
computing its bit offset, loading 8 bytes from that position, and
applying a shift+mask -- eliminating all branches from the hot loop.

SIMD support (SSE2 on x86-64, NEON on ARM64) adds vectorized
mask+store for groups of 4 values. Unsupported platforms use the
same branchless scalar code.

Profiling on 138M MS-MARCO v2 passages showed bitpack_decode at
19.7% of CPU time for multi-token queries.

Closes #140
@tjgreen42 tjgreen42 merged commit ecad404 into main Mar 3, 2026
15 checks passed
@tjgreen42 tjgreen42 deleted the optimize/simd-bitpack-decode branch March 3, 2026 20:10
tjgreen42 added a commit that referenced this pull request Mar 3, 2026
## Summary
- Update comparison page with results from benchmark run
[22642807624](https://github.com/timescale/pg_textsearch/actions/runs/22642807624)
- Overall throughput improved from 2.8x to 3.2x faster than System X
- Build time gap narrowed from 2.0x to 1.6x (270s → 234s)
- Key improvements since Feb 9: SIMD bitpack decoding (#250),
stack-allocated decode buffers (#253), BMW term state pointer
indirection (#249), arena allocator rewrite (#231), leader-only merge
(#244)

## Testing
- Numbers extracted from benchmark run on commit 1b09cc9
- gh-pages branch also needs updating (will push after merge)
