perf: use pointer indirection for BMW term state ordering#249

Merged
tjgreen42 merged 2 commits into main from optimize/pointer-indirection-bmw
Mar 3, 2026

Conversation

@tjgreen42
Collaborator

Summary

  • Changes `TpTermState *terms` (contiguous struct array) to `TpTermState **terms` (array of pointers) in the BMW scoring engine
  • `restore_ordering` now moves 8-byte pointers via memmove instead of ~200-byte `TpTermState` structs (~25x reduction in bytes moved)
  • All 13 internal functions updated with mechanical `&terms[i]` → `terms[i]` and `.field` → `->field` changes
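The shape of the change can be sketched as follows. This is an illustrative mock-up, not the actual pg_textsearch definitions: the field names, struct contents, and `BmwState` container are invented stand-ins to show the indirection pattern.

```c
#include <stdlib.h>

/* Illustrative stand-in for the embedded iterator that makes the
 * real TpTermState ~200 bytes (actual fields differ). */
typedef struct { char payload[192]; } TpSegmentPostingIterator;

typedef struct {
    TpSegmentPostingIterator iter;  /* large embedded state */
    float max_score;
} TpTermState;

typedef struct {
    /* Before: TpTermState *terms;   contiguous struct array */
    TpTermState **terms;          /* After: array of pointers */
    int nterms;
} BmwState;

/* Allocate the pointer array plus one backing allocation per term,
 * so reordering later only touches the 8-byte pointers. */
static void bmw_init(BmwState *s, int nterms)
{
    s->nterms = nterms;
    s->terms = malloc(nterms * sizeof(TpTermState *));
    for (int i = 0; i < nterms; i++)
        s->terms[i] = calloc(1, sizeof(TpTermState));
}

/* Call sites then change mechanically:
 *   &state->terms[i]       becomes  state->terms[i]
 *   state->terms[i].field  becomes  state->terms[i]->field
 */
```

The trade-off is one extra pointer dereference on every field access in exchange for cheap reordering; the PR's profiling numbers suggest the reordering cost dominated.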

Motivation

Profiling multi-token queries (5-8 tokens) on 138M MS-MARCO v2 passages showed restore_ordering consuming 21.7% of CPU time. The TpTermState struct is ~200 bytes due to the embedded TpSegmentPostingIterator (which contains TpDictEntry, TpSkipEntry, TpSegmentDirectAccess, etc.). Every time a term advances in the WAND traversal, the sorted order is restored by memmove-ing these large structs.

Test plan

  • All 48 regression tests pass (make installcheck)
  • CI passes (compile, format, sanitizer)
  • Benchmark on MS-MARCO v2 to measure latency improvement on 5-8 token queries

Change the TpTermState array from contiguous structs to an array of
pointers. This makes restore_ordering swap 8-byte pointers via memmove
instead of ~200-byte TpTermState structs, reducing CPU overhead for
the sorted-order maintenance in the WAND traversal hot loop.

Profiling on 138M MS-MARCO v2 passages showed restore_ordering at
21.7% of CPU time for multi-token queries (5-8 tokens). The large
struct size (~200 bytes due to embedded TpSegmentPostingIterator)
made each memmove expensive.
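The hot-loop pattern the commit describes can be sketched like this. It is a minimal mock-up under assumed names (`cur_doc`, the two-field `TpTermState`), not the real `restore_ordering` implementation: after the head term advances, its new slot in the docid order is found and the intervening entries are shifted with a single memmove that now moves pointers rather than structs.

```c
#include <string.h>

typedef struct {
    long cur_doc;     /* illustrative fields, not the real layout */
    float max_score;
} TpTermState;

/* Re-insert terms[0] (the term that just advanced) into an array kept
 * sorted ascending by cur_doc. The memmove shifts
 * pos * sizeof(TpTermState *) bytes, versus pos * sizeof(TpTermState)
 * bytes in the old contiguous-struct layout. */
static void restore_ordering(TpTermState **terms, int nterms)
{
    TpTermState *moved = terms[0];
    int pos = 0;

    /* Find the first slot whose successor is not behind the moved term. */
    while (pos + 1 < nterms && terms[pos + 1]->cur_doc < moved->cur_doc)
        pos++;

    /* Shift the pointers in between down one slot, then drop the
     * moved term's pointer into its new position. */
    memmove(&terms[0], &terms[1], pos * sizeof(TpTermState *));
    terms[pos] = moved;
}
```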
@tjgreen42 tjgreen42 merged commit 01a7044 into main Mar 3, 2026
15 checks passed
@tjgreen42 tjgreen42 deleted the optimize/pointer-indirection-bmw branch March 3, 2026 18:44
tjgreen42 added a commit that referenced this pull request Mar 3, 2026
## Summary
- Update comparison page with results from benchmark run
[22642807624](https://github.com/timescale/pg_textsearch/actions/runs/22642807624)
- Overall throughput improved from 2.8x to 3.2x faster than System X
- Build time gap narrowed from 2.0x to 1.6x (270s → 234s)
- Key improvements since Feb 9: SIMD bitpack decoding (#250),
stack-allocated decode buffers (#253), BMW term state pointer
indirection (#249), arena allocator rewrite (#231), leader-only merge
(#244)

## Testing
- Numbers extracted from benchmark run on commit 1b09cc9
- gh-pages branch also needs updating (will push after merge)