
AVX-512 int8 kernels with cascade unrolling #145683

Merged
ldematte merged 9 commits into elastic:main from ldematte:native/avx512-int8
Apr 8, 2026

Conversation


@ldematte ldematte commented Apr 3, 2026

This PR adds AVX-512 implementations for int8 (signed, full -128..127 range) operations: dot product, squared euclidean, and cosine.

This PR also applies consistent cascade unrolling (a 4/2/1 pattern) across all kernel types (i7u and i8) on both AVX2 and AVX-512. This cascade pattern proved 11-13% faster than #pragma unroll across functions and CPUs (AMD and Intel); the gain comes from breaking the serial dependency chain through the accumulator.
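To make the dependency-chain point concrete, here is a minimal scalar sketch (not the PR's SIMD code; function names are hypothetical): with a single accumulator every addition must wait for the previous one, while independent accumulators give the CPU parallel chains that are only combined at the end.

```cpp
#include <cstdint>

// Single accumulator: each += depends on the previous one (serial chain).
int32_t dot_serial(const int8_t* a, const int8_t* b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += int32_t(a[i]) * int32_t(b[i]);
    return acc;
}

// Four independent accumulators: four chains the CPU can overlap.
int32_t dot_unrolled(const int8_t* a, const int8_t* b, int n) {
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += int32_t(a[i + 0]) * int32_t(b[i + 0]);
        acc1 += int32_t(a[i + 1]) * int32_t(b[i + 1]);
        acc2 += int32_t(a[i + 2]) * int32_t(b[i + 2]);
        acc3 += int32_t(a[i + 3]) * int32_t(b[i + 3]);
    }
    for (; i < n; ++i)  // leftover elements
        acc0 += int32_t(a[i]) * int32_t(b[i]);
    return (acc0 + acc1) + (acc2 + acc3);  // reduce the chains at the end
}
```

In the real kernels each accumulator is a SIMD register, so the same idea applies with vector FMAs instead of scalar multiply-adds.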

New AVX-512 int8 kernels

Unlike i7u, which uses maddubs (unsigned × signed, 64 bytes/iter), i8 requires sign-extension to 16-bit before the multiply (cvtepi8_epi16 from __m256i to __m512i, 32 bytes/iter).

  • vec_doti8_2 / vec_sqri8_2 / vec_cosi8_2 — single-pair operations
  • Bulk variants for all three (sequential, offsets, sparse)
  • Cosine includes full bulk with b_norm precomputation and vectorized finalization
  • Masked tail via _mm256_maskz_loadu_epi8 + sign-extend (no scalar loop)
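The masked tail can be modeled in scalar form as follows (hypothetical helpers; the real kernel uses _mm256_maskz_loadu_epi8): lane i is read only if bit i of the mask is set, otherwise it stays zero, so the last partial block goes through the same SIMD path with no scalar cleanup loop.

```cpp
#include <cstdint>

// Zeroing masked load: copy src[i] where mask bit i is set, else write 0.
void maskz_load_i8(int8_t* dst, const int8_t* src, uint32_t mask, int lanes) {
    for (int i = 0; i < lanes; ++i)
        dst[i] = ((mask >> i) & 1) ? src[i] : int8_t(0);
}

// Mask covering the first `rem` lanes, e.g. rem=3 -> 0b111.
// The rem >= 32 guard avoids the undefined shift 1u << 32.
uint32_t tail_mask(int rem) {
    return (rem >= 32) ? 0xFFFFFFFFu : ((1u << rem) - 1u);
}
```

Because the masked-off lanes are zero, they contribute nothing to the dot product or squared-distance accumulators.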

Cascade unrolling (4/2/1)

Applied consistently to all kernel types:

  • AVX-512 i7u: 8/4/1 → 4/2/1 (less register pressure, same or better perf)
  • AVX-512 i8: 4/2/1 cascade with fmai8/sqri8 templates
  • AVX2 i7u: #pragma unroll → 4/2/1 cascade with fmai7u/sqri7u templates
  • AVX2 i8: #pragma unroll → 4/2/1 cascade with fmai8/sqri8 templates
  • All template functions use pass-by-reference for accumulator
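The 4/2/1 cascade shape, sketched in scalar form (names hypothetical): drain blocks of four strides with four independent accumulators, then blocks of two, then single strides, and reduce at the end. In the real kernels each accumulator is a SIMD register and the per-stride kernel is the fmai8/fmai7u template, which, as noted above, takes its accumulator by reference.

```cpp
#include <cstdint>

// Stand-in for the fmai8/fmai7u kernels: accumulator passed by reference.
static inline void fma1(int32_t& acc, int8_t a, int8_t b) {
    acc += int32_t(a) * int32_t(b);
}

int32_t dot_cascade(const int8_t* a, const int8_t* b, int n) {
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {  // 4-wide stage
        fma1(acc0, a[i + 0], b[i + 0]);
        fma1(acc1, a[i + 1], b[i + 1]);
        fma1(acc2, a[i + 2], b[i + 2]);
        fma1(acc3, a[i + 3], b[i + 3]);
    }
    for (; i + 2 <= n; i += 2) {  // 2-wide stage
        fma1(acc0, a[i + 0], b[i + 0]);
        fma1(acc1, a[i + 1], b[i + 1]);
    }
    for (; i < n; ++i)            // 1-wide stage
        fma1(acc0, a[i], b[i]);
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Compared with a single 4-wide loop plus a long scalar tail, the 2-wide and 1-wide stages keep the remainder in the wide path as long as possible.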

Other improvements

  • AVX2 cosi8_inner: merged separate SIMD kernel + scalar tail into single function
    (consistent with doti8_inner/sqri8_inner pattern). Removed cosine_results_t struct.
  • Cosine bulk tail uses inlinable cosi8_inner instead of EXPORT vec_cosi8.
  • Consistent naming: fmai7u/sqri7u/fmai8/sqri8 across AVX2 and AVX-512.

Benchmark results (GCC 14, to be re-run with Clang 21)

AVX-512 i8 vs AVX2 baseline, dot product:

Dims   AMD c8a (Zen 5)   Intel c8i (Sapphire Rapids)
 384   1.33x             1.28x
 768   1.71x             1.47x
1024   1.85x             1.35x
1536   2.02x             1.42x

Clang 21 should add another 8-12% on top of these numbers for AVX-512 (to be verified).

Relates to #145411

Test plan

  • Cross-compiles for all 3 platforms (publish_vec_binaries.sh --local)
  • JDKVectorLibraryInt8Tests pass on AMD c8a and Intel c8i
  • Re-run benchmarks with Clang 21

```c
// Init accumulator(s) with 0
__m256i acc1 = _mm256_setzero_si256();

static inline int32_t doti7u_inner(const int8_t* a, const int8_t* b, const int32_t dims) {
```
Contributor Author

Self note / note to reviewers: the _inner functions all follow the same pattern, even across AVX2 and AVX-512. The only differences are the stride (register size / half register size) and the "kernel" (e.g. fmai8). We saw that before and tried to unify, but the problem is that you cannot have "templates of templates": you cannot pass fmai7u, fmai8, etc. to a template function, because each is itself a template. We should spend some time looking for a good alternative, since the duplication is real and it is the same pattern everywhere; however, I do not want to do it here and now. IMO this is best done as a follow-up.
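One possible direction for that follow-up (a sketch, not the PR's code; all names hypothetical): wrap the kernel in a functor whose operator() is templated. The functor itself is an ordinary type, so it can be passed to the shared _inner template as a normal deduced parameter, sidestepping the "template of templates" problem.

```cpp
#include <cstdint>

// Stand-in for fmai8/fmai7u: a functor with a templated call operator.
// The functor is a plain type even though its operator() is a template.
struct FmaKernel {
    template <typename Acc, typename Elem>
    void operator()(Acc& acc, Elem a, Elem b) const {
        acc += static_cast<Acc>(a) * static_cast<Acc>(b);
    }
};

// One shared _inner skeleton; the kernel is a deduced parameter.
template <typename Kernel>
int32_t generic_inner(const int8_t* a, const int8_t* b, int n, Kernel k) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        k(acc, a[i], b[i]);  // same loop shape for any kernel
    return acc;
}
```

A generic lambda would work the same way; whether this stays zero-overhead with the SIMD register types and strides involved would need to be verified with the actual kernels.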

@ldematte ldematte changed the title WIP: AVX-512 int8 kernels with cascade unrolling AVX-512 int8 kernels with cascade unrolling Apr 3, 2026
@ldematte ldematte removed the WIP label Apr 3, 2026
@ldematte ldematte marked this pull request as ready for review April 3, 2026 15:26
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine
Collaborator

Hi @ldematte, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Apr 3, 2026
@ldematte ldematte requested a review from ChrisHegarty April 8, 2026 06:12
Contributor

@ChrisHegarty ChrisHegarty left a comment


LGTM

@ldematte ldematte enabled auto-merge (squash) April 8, 2026 08:04
@ldematte ldematte merged commit 9449e47 into elastic:main Apr 8, 2026
36 checks passed
@ldematte ldematte deleted the native/avx512-int8 branch April 8, 2026 08:38
mromaios pushed a commit to mromaios/elasticsearch that referenced this pull request Apr 9, 2026

Labels

>enhancement · :Search Relevance/Vectors (Vector search) · Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) · v9.4.0


3 participants