[Native] int4 x86 SIMD optimizations#144649
Merged
ldematte merged 16 commits into elastic:main from native/int4-simd-optimizations on Mar 25, 2026
Conversation
Accumulate maddubs results in 16-bit and widen to 32-bit only after each chunk, removing 2 vpmaddwd per inner loop iteration. Made-with: Cursor
Widen AVX2 implementation to 512-bit registers with deferred 16-to-32 bit widening. Bulk path uses batches=4 (vs 2 for AVX2) leveraging the 32 ZMM register file. Masked loads replace the scalar tail for clean handling of non-aligned dimensions. Made-with: Cursor
The target pragma for icelake-client prevents the compiler from inlining std::min, generating a PLT call with vzeroupper and register spills on every outer loop iteration. Replace with inline conditionals. Also replace std::copy_n for the same reason. Made-with: Cursor
Most CPUs have a single port for 512-bit integer multiply (vpmaddubsw zmm). With batches=4, 8 multiplies per inner iteration saturate this port without improving per-doc throughput. batches=2 gives identical data throughput with fewer instructions and better IPC. Made-with: Cursor
Collaborator
Hi @ldematte, I've created a changelog YAML for you.
Collaborator
Pinging @elastic/es-search-relevance (Team:Search Relevance)
…elasticsearch into native/int4-simd-optimizations
thecoop
reviewed
Mar 23, 2026
PR elastic#144634 refactored mappers to return pointers directly instead of indices. Update vec_i4_2.cpp to use the new init_pointers, sequential_mapper, and offsets_mapper. Made-with: Cursor
Widen the bulk loop guard from 2*batches to batches remaining docs, and conditionally skip prefetch on the last iteration. This avoids falling through to the scalar tail for the final batch. Made-with: Cursor
thecoop
reviewed
Mar 24, 2026
thecoop
approved these changes
Mar 24, 2026
Member
thecoop
left a comment
Couple of tweaks, but all good
seanzatzdev
pushed a commit
to seanzatzdev/elasticsearch
that referenced
this pull request
Mar 27, 2026
This was referenced Mar 27, 2026
This PR introduces some smaller optimizations to the x64 int4 implementations.
Now that #144429 is merged, I resumed #109238 and the detailed analysis I did there, and discovered that we were not using the optimal set of instructions.
The older PR used an inner loop that ran at the theoretical maximum for most processors, with a throughput of 32 elements per CPU cycle. I applied the same scheme to the new implementations introduced in the previous PR; the bulk scoring paths show significant gains: +19% to +25% on the Bulk variants, and +9% to +19% on the non-bulk variants.
Also, I implemented an AVX-512 variant; this should give us an additional theoretical 2x speedup in the inner calculation loop (over the AVX2 implementation), which should translate into a 12-50% throughput increase depending on vector dimensions (higher dimensions mean more time spent in the inner loop). Benchmarks on c8a and c8i confirm the speedup:
Single-vector score (ns/op, lower is better)
Unfortunately, the bulk variants do not show the same speedup; we still see some improvement, but less than 10%.
The likely explanation was already hinted at in #109238, and it's linked to the "difficult" hardware implementation of AVX-512: doing the math, it seems that (at least on Zen 5 and Sapphire Rapids, though very likely on other processors too) the processor cannot issue 2 vpmaddubsw per clock cycle (as it can with AVX2), and that's the bottleneck. These CPUs likely have a single 512-bit integer multiply port. Both 512-bit SIMD pipes can handle adds, logic, and shifts (which is why the single-vector path, with only 2 maddubs per iteration, achieves 2x), but only one can do integer multiplies. So: double the data per instruction, but half the processing rate.
Bulk scorer (ops/s, dims=1024, numVectors=1500, bulkSize=32)
(Benchmark Results: AVX2 vs AVX-512 (AMD Zen 5, c8a))