AVX-512 int8 kernels with cascade unrolling #145683
Conversation
…ntation with wider registers and masked operations for tail processing.
…2 and AVX-512). Minor renames for uniformity.
```c
// Init accumulator(s) with 0
__m256i acc1 = _mm256_setzero_si256();

static inline int32_t doti7u_inner(const int8_t* a, const int8_t* b, const int32_t dims) {
```
Self note / note to reviewers: the `_inner` functions all follow the same pattern, even across AVX2 and AVX-512. The only differences are the stride (register size / half register size) and the "kernel" (e.g. `fmai8`, etc.).
We saw that before, and we tried to unify, but the problem is that you cannot have "templates of templates": you cannot pass `fmai7u`, `fmai8`, etc. as template parameters, because each of them is a template itself. We should spend some time to see whether there is a good alternative, since the duplication is real and it really is the same pattern. However, I do not want to do it here and now; IMO this is best done as a follow-up.
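One possible direction (a hypothetical sketch, not the PR's code — the kernel and function names below are illustrative toys): a function template cannot be passed as a template parameter, but a struct wrapping the kernel in a static member function can, which would let the shared `_inner` loop be written once.

```cpp
#include <cassert>
#include <cstdint>

// Toy "kernels" standing in for fmai8 / sqri8 etc. The real ones operate on
// SIMD registers; these only model the call shape.
struct AddKernel {
    static int32_t apply(int32_t acc, int8_t a, int8_t b) {
        return acc + a + b;  // toy accumulate
    }
};
struct MulAddKernel {
    static int32_t apply(int32_t acc, int8_t a, int8_t b) {
        return acc + int32_t(a) * b;  // models a multiply-accumulate step
    }
};

// One shared "_inner" pattern, parameterized on the kernel type, so the
// doti*/sqri* loops would not need to be duplicated per kernel.
template <typename Kernel>
int32_t inner(const int8_t* a, const int8_t* b, int32_t dims) {
    int32_t acc = 0;
    for (int32_t i = 0; i < dims; ++i) acc = Kernel::apply(acc, a[i], b[i]);
    return acc;
}
```

`inner<MulAddKernel>` then behaves like a scalar dot product; the AVX2/AVX-512 stride difference could become a second (non-type) template parameter. Whether this pays off for the real SIMD kernels is exactly the open question in the comment above.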
|
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Hi @ldematte, I've created a changelog YAML for you. |
This PR adds AVX-512 implementations for int8 (signed, full -128..127 range) operations: dot product, squared euclidean, and cosine.

This PR also applies consistent cascade unrolling (4/2/1 pattern) across all kernel types (i7u and i8) on both AVX2 and AVX-512. This "cascade unrolling" pattern proved to be 11-13% faster across functions and CPUs (AMD/Intel) than `#pragma unroll`. The gain comes from breaking the serial accumulator dependency chain.
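The dependency-chain point can be seen in a scalar model of the 4/2/1 cascade (a hypothetical sketch — the real kernels do this with SIMD registers, and `dot_cascade` is an illustrative name): four independent accumulators let the adds of the hot loop overlap instead of forming one serial chain, and the 2-wide and 1-wide stages handle what remains.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of 4/2/1 cascade unrolling. Four independent accumulators
// break the serial add-after-add dependency chain in the 4-wide stage; the
// 2-wide and 1-wide stages mop up the remaining elements.
static int32_t dot_cascade(const int8_t* a, const int8_t* b, int32_t n) {
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    int32_t i = 0;
    for (; i + 4 <= n; i += 4) {  // 4-wide stage: four independent chains
        acc0 += int32_t(a[i])     * b[i];
        acc1 += int32_t(a[i + 1]) * b[i + 1];
        acc2 += int32_t(a[i + 2]) * b[i + 2];
        acc3 += int32_t(a[i + 3]) * b[i + 3];
    }
    for (; i + 2 <= n; i += 2) {  // 2-wide stage
        acc0 += int32_t(a[i])     * b[i];
        acc1 += int32_t(a[i + 1]) * b[i + 1];
    }
    for (; i < n; ++i)            // 1-wide tail
        acc0 += int32_t(a[i]) * b[i];
    return acc0 + acc1 + acc2 + acc3;
}
```

With `#pragma unroll` the compiler may still keep a single accumulator, so every add waits on the previous one; the explicit cascade forces the independent chains.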
### New AVX-512 int8 kernels
Unlike i7u, which uses `maddubs` (unsigned × signed, 64 bytes/iter), i8 requires sign-extension to 16-bit before multiply (`cvtepi8_epi16` from `__m256i` to `__m512i`, 32 bytes/iter).
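Why the `maddubs` path cannot be reused for full-range i8 can be shown with a scalar model (hypothetical helper names, plain loops standing in for the 512-bit kernels): `vpmaddubsw` treats its first operand as *unsigned*, so a byte of -1 is read as 255.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the maddubs-style multiply: first operand reinterpreted
// as unsigned, second kept signed. Correct for i7u inputs, wrong for
// negative bytes.
static int32_t dot_maddubs_model(const int8_t* a, const int8_t* b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += int32_t(uint8_t(a[i])) * b[i];  // -1 becomes 255 here
    return acc;
}

// Scalar model of the i8 path: sign-extend both sides to 16-bit first
// (what cvtepi8_epi16 does in the SIMD kernel), then multiply.
static int32_t dot_i8(const int8_t* a, const int8_t* b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += int32_t(int16_t(a[i])) * int16_t(b[i]);
    return acc;
}
```

For `a = {-1, 2}`, `b = {3, 4}` the sign-extended path gives 5, while the unsigned reinterpretation gives 773 — hence the separate i8 kernels.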
- `vec_doti8_2`/`vec_sqri8_2`/`vec_cosi8_2` — single-pair operations
- `_mm256_maskz_loadu_epi8` + sign-extend for the tail (no scalar loop)

### Cascade unrolling (4/2/1)
Applied consistently to all kernel types:
- `fmai8`/`sqri8` templates
- i7u: `#pragma unroll` → 4/2/1 cascade with `fmai7u`/`sqri7u` templates
- i8: `#pragma unroll` → 4/2/1 cascade with `fmai8`/`sqri8` templates

### Other improvements
- `cosi8_inner`: merged the separate SIMD kernel + scalar tail into a single function (consistent with the `doti8_inner`/`sqri8_inner` pattern). Removed the `cosine_results_t` struct.
- `cosi8_inner` instead of EXPORT `vec_cosi8`.
- Consistent naming of `fmai7u`/`sqri7u`/`fmai8`/`sqri8` across AVX2 and AVX-512.

### Benchmark results (GCC 14, to be re-run with Clang 21)
AVX-512 i8 vs AVX2 baseline, dot product:
Clang 21 should add another 8-12% on top of these numbers for AVX-512 (to be verified).
Relates to #145411
### Test plan

- `publish_vec_binaries.sh --local`
- `JDKVectorLibraryInt8Tests` pass on AMD c8a and Intel c8i