[Native] int4 x86 SIMD optimizations#144649
Merged
ldematte merged 16 commits into elastic:main from native/int4-simd-optimizations on Mar 25, 2026
Conversation
Accumulate maddubs results in 16-bit and widen to 32-bit only after each chunk, removing 2 vpmaddwd per inner loop iteration. Made-with: Cursor
Widen AVX2 implementation to 512-bit registers with deferred 16-to-32 bit widening. Bulk path uses batches=4 (vs 2 for AVX2) leveraging the 32 ZMM register file. Masked loads replace the scalar tail for clean handling of non-aligned dimensions. Made-with: Cursor
The target pragma for icelake-client prevents the compiler from inlining std::min, generating a PLT call with vzeroupper and register spills on every outer loop iteration. Replace with inline conditionals. Also replace std::copy_n for the same reason. Made-with: Cursor
Most CPUs have a single port for 512-bit integer multiply (vpmaddubsw zmm). With batches=4, 8 multiplies per inner iteration saturate this port without improving per-doc throughput. batches=2 gives identical data throughput with fewer instructions and better IPC. Made-with: Cursor
Collaborator
Hi @ldematte, I've created a changelog YAML for you.
Collaborator
Pinging @elastic/es-search-relevance (Team:Search Relevance)
…elasticsearch into native/int4-simd-optimizations
thecoop
reviewed
Mar 23, 2026
PR elastic#144634 refactored mappers to return pointers directly instead of indices. Update vec_i4_2.cpp to use the new init_pointers, sequential_mapper, and offsets_mapper. Made-with: Cursor
Widen the bulk loop guard from 2*batches to batches remaining docs, and conditionally skip prefetch on the last iteration. This avoids falling through to the scalar tail for the final batch. Made-with: Cursor
thecoop
reviewed
Mar 24, 2026
thecoop
approved these changes
Mar 24, 2026
Member
thecoop
left a comment
Couple of tweaks, but all good
seanzatzdev
pushed a commit
to seanzatzdev/elasticsearch
that referenced
this pull request
Mar 27, 2026
This was referenced Mar 27, 2026
This PR introduces some smaller optimizations to the x64 int4 implementations.
Now that #144429 is merged, I resumed #109238 and the detailed analysis I did there, and discovered that we were not using the optimal set of instructions.
The older PR used an inner loop that ran at the theoretical maximum for most processors, with a throughput of 32 elements per CPU cycle. I applied the same scheme to the new implementations introduced in the previous PR; the bulk scoring paths show significant gains: +19% to +25% on the Bulk variants, and +9% to +19% on the non-bulk variants.
Also, I implemented an AVX-512 variant; this should give us an additional theoretical 2x speedup in the inner calculation loop (over the AVX2 implementation), which should translate into a 12-50% throughput increase depending on vector dimensions (higher dimensions mean more time spent in the inner loop). Benchmarks on c8a and c8i confirm the speedup:
Single-vector score (ns/op, lower is better)
Unfortunately, the bulk variants do not show the same speedup; we still see some improvement, but less than 10%.
The likely explanation was already hinted at in #109238, and it's linked to the "difficult" hardware implementation of AVX-512: doing the math, it seems that (at least on Zen 5 and Sapphire Rapids, though very likely on other processors too) the processor cannot issue 2 vpmaddubsw per clock cycle (as it can with AVX2), and that's the bottleneck. These CPUs likely have a single 512-bit integer multiply port. Both 512-bit SIMD pipes can handle adds, logic, and shifts (which is why the single-vector path, with only 2 maddubs per iteration, achieves 2x), but only one can do integer multiplies. So: double the data per instruction, but half the processing rate.
Bulk scorer (ops/s, dims=1024, numVectors=1500, bulkSize=32)
(Benchmark Results: AVX2 vs AVX-512 (AMD Zen 5, c8a))