AVX-512 int8 kernels with cascade unrolling #145683
Conversation
…ntation with wider registers and masked operations for tail processing.
…2 and AVX-512). Minor renames for uniformity.
```c
// Init accumulator(s) with 0
__m256i acc1 = _mm256_setzero_si256();

static inline int32_t doti7u_inner(const int8_t* a, const int8_t* b, const int32_t dims) {
```
Self note / note to reviewers: the `_inner` functions all follow the same pattern, even across AVX2 and AVX-512. The only differences are the stride (register size / half register size) and the "kernel" (e.g. `fmai8`, etc.).
We saw that before, and we tried to unify, but the problem is that you cannot have "templates of templates": you cannot pass `fmai7u`, `fmai8`, etc. as template parameters, because each of them is a template itself. We should spend some time to see whether there is a good alternative, since the duplication is real and it really is the same pattern. However, I do not want to do it here and now; IMO this is best done as a follow-up.
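One possible direction (a hypothetical sketch, not the PR's code — the kernel and function names below are illustrative toys): a function template cannot be passed as a template parameter, but a struct wrapping the kernel in a static member function can, which would let the shared `_inner` loop be written once.

```cpp
#include <cassert>
#include <cstdint>

// Toy "kernels" standing in for fmai8 / sqri8 etc. The real ones operate on
// SIMD registers; these only model the call shape.
struct AddKernel {
    static int32_t apply(int32_t acc, int8_t a, int8_t b) {
        return acc + a + b;  // toy accumulate
    }
};
struct MulAddKernel {
    static int32_t apply(int32_t acc, int8_t a, int8_t b) {
        return acc + int32_t(a) * b;  // models a multiply-accumulate step
    }
};

// One shared "_inner" pattern, parameterized on the kernel type, so the
// doti*/sqri* loops would not need to be duplicated per kernel.
template <typename Kernel>
int32_t inner(const int8_t* a, const int8_t* b, int32_t dims) {
    int32_t acc = 0;
    for (int32_t i = 0; i < dims; ++i) acc = Kernel::apply(acc, a[i], b[i]);
    return acc;
}
```

`inner<MulAddKernel>` then behaves like a scalar dot product; the AVX2/AVX-512 stride difference could become a second (non-type) template parameter. Whether this pays off for the real SIMD kernels is exactly the open question in the comment above.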
|
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Hi @ldematte, I've created a changelog YAML for you. |
This PR adds AVX-512 implementations for int8 (signed, full -128..127 range) operations: dot product, squared euclidean, and cosine.

This PR also applies consistent cascade unrolling (4/2/1 pattern) across all kernel types (i7u and i8) on both AVX2 and AVX-512. This "cascade unrolling" pattern proved to be 11-13% faster across functions and CPUs (AMD/Intel) than `#pragma unroll`. The gain comes from breaking the serial accumulator dependency chain.
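The dependency-chain point can be seen in a scalar model of the 4/2/1 cascade (a hypothetical sketch — the real kernels do this with SIMD registers, and `dot_cascade` is an illustrative name): four independent accumulators let the adds of the hot loop overlap instead of forming one serial chain, and the 2-wide and 1-wide stages handle what remains.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of 4/2/1 cascade unrolling. Four independent accumulators
// break the serial add-after-add dependency chain in the 4-wide stage; the
// 2-wide and 1-wide stages mop up the remaining elements.
static int32_t dot_cascade(const int8_t* a, const int8_t* b, int32_t n) {
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    int32_t i = 0;
    for (; i + 4 <= n; i += 4) {  // 4-wide stage: four independent chains
        acc0 += int32_t(a[i])     * b[i];
        acc1 += int32_t(a[i + 1]) * b[i + 1];
        acc2 += int32_t(a[i + 2]) * b[i + 2];
        acc3 += int32_t(a[i + 3]) * b[i + 3];
    }
    for (; i + 2 <= n; i += 2) {  // 2-wide stage
        acc0 += int32_t(a[i])     * b[i];
        acc1 += int32_t(a[i + 1]) * b[i + 1];
    }
    for (; i < n; ++i)            // 1-wide tail
        acc0 += int32_t(a[i]) * b[i];
    return acc0 + acc1 + acc2 + acc3;
}
```

With `#pragma unroll` the compiler may still keep a single accumulator, so every add waits on the previous one; the explicit cascade forces the independent chains.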
### New AVX-512 int8 kernels
Unlike i7u, which uses `maddubs` (unsigned × signed, 64 bytes/iter), i8 requires sign-extension to 16-bit before multiply (`cvtepi8_epi16` from `__m256i` to `__m512i`, 32 bytes/iter).
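Why the `maddubs` path cannot be reused for full-range i8 can be shown with a scalar model (hypothetical helper names, plain loops standing in for the 512-bit kernels): `vpmaddubsw` treats its first operand as *unsigned*, so a byte of -1 is read as 255.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the maddubs-style multiply: first operand reinterpreted
// as unsigned, second kept signed. Correct for i7u inputs, wrong for
// negative bytes.
static int32_t dot_maddubs_model(const int8_t* a, const int8_t* b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += int32_t(uint8_t(a[i])) * b[i];  // -1 becomes 255 here
    return acc;
}

// Scalar model of the i8 path: sign-extend both sides to 16-bit first
// (what cvtepi8_epi16 does in the SIMD kernel), then multiply.
static int32_t dot_i8(const int8_t* a, const int8_t* b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += int32_t(int16_t(a[i])) * int16_t(b[i]);
    return acc;
}
```

For `a = {-1, 2}`, `b = {3, 4}` the sign-extended path gives 5, while the unsigned reinterpretation gives 773 — hence the separate i8 kernels.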
- `vec_doti8_2`/`vec_sqri8_2`/`vec_cosi8_2` — single-pair operations
- `_mm256_maskz_loadu_epi8` + sign-extend for the tail (no scalar loop)

### Cascade unrolling (4/2/1)
Applied consistently to all kernel types:
- `fmai8`/`sqri8` templates
- i7u: `#pragma unroll` → 4/2/1 cascade with `fmai7u`/`sqri7u` templates
- i8: `#pragma unroll` → 4/2/1 cascade with `fmai8`/`sqri8` templates

### Other improvements
- `cosi8_inner`: merged the separate SIMD kernel + scalar tail into a single function (consistent with the `doti8_inner`/`sqri8_inner` pattern). Removed the `cosine_results_t` struct.
- `cosi8_inner` instead of EXPORT `vec_cosi8`.
- Consistent naming of `fmai7u`/`sqri7u`/`fmai8`/`sqri8` across AVX2 and AVX-512.

### Benchmark results (GCC 14, to be re-run with Clang 21)
AVX-512 i8 vs AVX2 baseline, dot product:
Clang 21 should add another 8-12% on top of these numbers for AVX-512 (to be verified).
Relates to #145411
### Test plan

- `publish_vec_binaries.sh --local`
- `JDKVectorLibraryInt8Tests` pass on AMD c8a and Intel c8i