
[Native] int4 x86 SIMD optimizations#144649

Merged
ldematte merged 16 commits into elastic:main from ldematte:native/int4-simd-optimizations
Mar 25, 2026

Conversation

@ldematte
Contributor

@ldematte ldematte commented Mar 20, 2026

This PR introduces some smaller optimizations to the x64 int4 implementations.

Now that #144429 is merged, I resumed #109238 and the detailed analysis I did there, and discovered that we were not using the optimal set of instructions.

The older PR used an inner loop that was at the theoretical maximum for most processors, with a throughput of 32 elements per CPU cycle. I applied the same scheme to the new implementations introduced in the previous PR; the bulk scoring paths show significant gains: +19% to +25% on the bulk variants, and +9% to +19% on the non-bulk variants.

Also, I implemented an AVX-512 variant; this should give us an additional theoretical 2x speedup in the inner calculation loop (over the AVX2 implementation), which should translate to a 12-50% throughput increase depending on vector dimensions (higher dimensions mean more time spent in the inner loop). Benchmarks on c8a and c8i confirm the speedup:

Single-vector score (ns/op, lower is better)

| dims | packed_len | AVX2 iters | AVX-512 iters | AVX2 (ns) | AVX-512 (ns) | Speedup |
|------|------------|------------|---------------|-----------|--------------|---------|
| 1024 | 512        | 16         | 8             | 20.83     | 18.56        | 1.12x   |
| 2048 | 1024       | 32         | 16            | 28.24     | 23.89        | 1.18x   |
| 4096 | 2048       | 64         | 32            | 45.85     | 30.58        | 1.50x   |

Unfortunately, the bulk variants do not show the same speedup; we still see a gain, but under 10%.

The likely explanation was already hinted at in #109238, and it is linked to the "difficult" hardware implementation of AVX-512: doing the math, it seems that (at least on Zen 5 and Sapphire Rapids, but very likely on other processors too) the processor cannot issue 2 vpmaddubsw per clock cycle at 512-bit width (as it can with AVX2), and that is the bottleneck. Likely, these CPUs have a single 512-bit integer multiply port. Both 512-bit SIMD pipes can handle adds, logic, and shifts (which is why the single-vector path, with only 2 maddubs per iteration, achieves 2x), but only one can do integer multiply. So double the data, but half the processing rate.
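Doing that math explicitly (the per-cycle issue widths below are assumptions taken from the analysis above, not measured numbers):

```cpp
// Back-of-envelope port model of the multiply bottleneck described above.
// Assumption: two 256-bit vpmaddubsw issue per cycle under AVX2, but only
// one 512-bit vpmaddubsw per cycle on these AVX-512 parts.
constexpr int avx2_byte_macs_per_cycle   = 2 * 32;  // 2 issues x 32 bytes
constexpr int avx512_byte_macs_per_cycle = 1 * 64;  // 1 issue  x 64 bytes

// Identical multiply throughput: a multiply-bound bulk loop cannot see
// the 2x that the add/shift-bound single-vector path gets.
static_assert(avx2_byte_macs_per_cycle == avx512_byte_macs_per_cycle,
              "multiply-bound throughput is equal under this model");
```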

Bulk scorer (µs/op, lower is better; dims=1024, numVectors=1500, bulkSize=32)

| Benchmark                    | AVX2 (µs/op) | AVX-512 (µs/op) | Change |
|------------------------------|--------------|-----------------|--------|
| scoreMultipleSequentialBulk  | 46.5         | 41.9            | +9.8%  |
| scoreMultipleRandomBulk      | 48.5         | 42.5            | +12.3% |
| scoreQueryMultipleRandomBulk | 52.6         | 47.8            | +9.1%  |

(Benchmark Results: AVX2 vs AVX-512 (AMD Zen 5, c8a))

Accumulate maddubs results in 16-bit and widen to 32-bit
only after each chunk, removing 2 vpmaddwd per inner loop
iteration.

Made-with: Cursor
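The deferred-widening idea can be sketched in scalar form (a model of the vpmaddubsw semantics, not the actual intrinsics code; the function names are illustrative): pair products accumulate in 16 bits within a chunk, and only the per-chunk sum is widened to 32 bits.

```cpp
#include <algorithm>
#include <cstdint>

// Scalar model of vpmaddubsw: multiply an unsigned byte by a signed
// byte, add adjacent pairs, and saturate the sum to int16.
static int16_t maddubs_pair(uint8_t a0, int8_t b0, uint8_t a1, int8_t b1) {
    int32_t s = int32_t(a0) * b0 + int32_t(a1) * b1;
    return int16_t(std::clamp(s, -32768, 32767));
}

// Dot product with deferred widening: 16-bit accumulation inside each
// chunk (safe for int4 operands, whose pair sums stay small), and one
// 16-to-32-bit widen per chunk instead of per inner-loop iteration.
// Assumes n and chunk are even.
int32_t dot_deferred_widen(const uint8_t* a, const int8_t* b, int n, int chunk) {
    int32_t total = 0;
    for (int c = 0; c < n; c += chunk) {
        int16_t acc = 0;  // 16-bit accumulator, valid within one chunk
        int end = std::min(c + chunk, n);
        for (int i = c; i < end; i += 2)
            acc = int16_t(acc + maddubs_pair(a[i], b[i], a[i + 1], b[i + 1]));
        total += acc;  // widen once per chunk
    }
    return total;
}
```

The chunk bound is what makes the 16-bit accumulator safe: with int4 operands each pair contributes at most 450, so a modest chunk size cannot overflow int16.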
Widen AVX2 implementation to 512-bit registers with deferred
16-to-32 bit widening. Bulk path uses batches=4 (vs 2 for AVX2)
leveraging the 32 ZMM register file. Masked loads replace the
scalar tail for clean handling of non-aligned dimensions.

Made-with: Cursor
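A scalar model of the zeroing masked load that replaces the scalar tail (the AVX-512 form would be a maskz byte load with mask `(1 << n) - 1`; this sketch only models the semantics):

```cpp
#include <cstdint>
#include <cstring>

// Scalar model of an AVX-512 masked (zeroing) load: only the first n
// lanes are read from memory, the remaining lanes come back as zero,
// so a non-aligned tail needs no scalar fallback loop.
void maskz_load64(uint8_t dst[64], const uint8_t* src, unsigned n) {
    std::memset(dst, 0, 64);   // masked-off lanes are zeroed
    std::memcpy(dst, src, n);  // only the n valid bytes are touched
}
```

Zeroed lanes contribute nothing to the dot product, so the tail can go through the same SIMD accumulation as full chunks.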
The target pragma for icelake-client prevents the compiler from
inlining std::min, generating a PLT call with vzeroupper and
register spills on every outer loop iteration. Replace with
inline conditionals. Also replace std::copy_n for the same
reason.

Made-with: Cursor
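The replacement is trivial but worth showing; a minimal sketch (function name hypothetical):

```cpp
#include <cstddef>

// Under the per-function target pragma the compiler refused to inline
// std::min, emitting a PLT call (plus vzeroupper and register spills)
// on every outer-loop iteration. A plain conditional stays inline.
static inline size_t chunk_limit(size_t i, size_t chunk, size_t len) {
    size_t end = i + chunk;
    return end < len ? end : len;  // replaces std::min(i + chunk, len)
}
```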
Most CPUs have a single port for 512-bit integer multiply
(vpmaddubsw zmm). With batches=4, 8 multiplies per inner
iteration saturate this port without improving per-doc
throughput. batches=2 gives identical data throughput with
fewer instructions and better IPC.

Made-with: Cursor
@elasticsearchmachine
Collaborator

Hi @ldematte, I've created a changelog YAML for you.

@ldematte ldematte marked this pull request as ready for review March 20, 2026 14:14
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Mar 20, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

Contributor

@ChrisHegarty ChrisHegarty left a comment


LGTM

@ldematte ldematte requested a review from a team as a code owner March 20, 2026 15:04
PR elastic#144634 refactored mappers to return pointers directly
instead of indices. Update vec_i4_2.cpp to use the new
init_pointers, sequential_mapper, and offsets_mapper.

Made-with: Cursor
Widen the bulk loop guard from 2*batches to batches
remaining docs, and conditionally skip prefetch on
the last iteration. This avoids falling through to
the scalar tail for the final batch.

Made-with: Cursor
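The guard change can be sketched like this (a simplified model; the real loop scores a batch of vectors and issues prefetches, and the names here are illustrative):

```cpp
#include <cstddef>

// Bulk loop guard: requiring only one full batch remaining (instead of
// two) keeps the final batch on the SIMD path rather than falling
// through to the scalar tail. Prefetch is skipped on the last pass,
// since there is no next batch to fetch.
size_t batched_docs(size_t ndocs, size_t batch) {
    size_t i = 0;
    while (ndocs - i >= batch) {               // was: >= 2 * batch
        bool last = (ndocs - i) < 2 * batch;   // conditionally skip prefetch
        if (!last) { /* prefetch next batch here */ }
        i += batch;                            // process one batch of docs
    }
    return i;  // number of docs handled without the scalar tail
}
```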
Member

@thecoop thecoop left a comment


Couple of tweaks, but all good

@ldematte ldematte enabled auto-merge (squash) March 25, 2026 13:17
@ldematte ldematte merged commit dee07c7 into elastic:main Mar 25, 2026
36 of 37 checks passed
@ldematte ldematte deleted the native/int4-simd-optimizations branch March 25, 2026 14:12
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 27, 2026

Labels

>enhancement · :Search Relevance/Vectors (Vector search) · Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) · v9.4.0


4 participants