Refactor x64 bulk scoring: shared template, masked AVX-512 tail, inlinable inner ops (#145310)

Merged: ldematte merged 11 commits into `elastic:main`.
Conversation
Pinging @elastic/es-search-relevance (Team:Search Relevance)
```cpp
int i = 0;
const int blk = dims & ~(STRIDE_BYTES_LEN - 1);
#pragma GCC unroll 4
```
We don't use `#pragma unroll` anywhere else; worth changing this to templates?
It's a leftover from the early days, and I did not want to touch it in this PR. I want to investigate this vs. using templates vs. nothing at all, but as a separate task.
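For reference, a compile-time alternative to `#pragma GCC unroll` can be sketched with a recursive template. This is a hypothetical illustration of the "templates" option mentioned above; the `unroll` helper is not from the Elasticsearch source, and the PR deliberately leaves this question open:

```cpp
#include <cassert>

// Hypothetical sketch: unrolling a loop body N times at compile time,
// as an alternative to `#pragma GCC unroll N`. The compiler sees N
// distinct calls it can schedule independently, just like the pragma.
template <int N, typename F>
inline void unroll(F&& body) {
    if constexpr (N > 0) {
        unroll<N - 1>(body);
        body(N - 1);  // invoke the body with the iteration index
    }
}
```

Unlike the pragma, this is portable across compilers, but it forces the unroll factor into the type system and makes the inner op a template parameter rather than a plain loop body, which is part of the trade-off worth investigating separately.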
Are there method-size considerations with inlining the scalar tail? Might the larger inlined method be too big to fit in the CPU's instruction cache all at once? (Although 64 KB is a lot of instructions...)
TL;DR: I don't have concerns.
This PR refactors the x64 bulk scoring infrastructure for int8 vectors:

- Shared `call_i8_bulk` template moved to `amd64_vec_common.h`, used by both AVX2 (`vec_1.cpp`) and AVX-512 (`vec_2.cpp`). Simplified: `inner_op` handles the full `dims` including the tail, so no `scalar_op`, `bulk_tail`, or `stride` parameters are needed.
- Inner functions handle the full `dims`, including a scalar tail (AVX2) or a masked SIMD tail (AVX-512). EXPORT functions become thin wrappers over `static inline` inners, ensuring the bulk template can inline the full operation.
- AVX-512 masked tail: replaces the scalar loop with a single masked `_mm512_maskz_loadu_epi8` + SIMD operation for the remaining elements (< 64 bytes).
- `>` → `>=` fix for SIMD stride checks across all files (amd64 and aarch64). Previously, when `dims == stride_len`, the SIMD path was skipped and all elements went through the scalar tail.

Follows #144845 (needs the new build process to avoid a regression due to the GCC `#pragma`/inline bug).
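The bulk-template shape and the `>` → `>=` stride fix can be sketched in portable C++ (no intrinsics). All names here (`STRIDE`, `dot_i8_inner`, the `call_i8_bulk` signature) are illustrative assumptions, with a scalar inner loop standing in for the real SIMD body:

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>

// Illustrative sketch, not the actual Elasticsearch code.
constexpr size_t STRIDE = 8;  // stand-in for STRIDE_BYTES_LEN

// The inner op handles the full dims, tail included. Note the `>=`
// comparison: with `>`, dims == STRIDE would skip the strided path
// entirely and push every element through the scalar tail (the bug
// the PR fixes).
static inline int32_t dot_i8_inner(const int8_t* a, const int8_t* b, size_t dims) {
    int32_t acc = 0;
    size_t i = 0;
    if (dims >= STRIDE) {                       // was `>` before the fix
        const size_t blk = dims & ~(STRIDE - 1);
        for (; i < blk; i += STRIDE)            // "vector" body, stride of 8
            for (size_t j = 0; j < STRIDE; ++j)
                acc += int32_t(a[i + j]) * int32_t(b[i + j]);
    }
    for (; i < dims; ++i)                       // scalar tail
        acc += int32_t(a[i]) * int32_t(b[i]);
    return acc;
}

// Bulk template: scores one query vector against `count` data vectors.
// Because inner_op is a static inline function passed as a template
// argument (not a function pointer), the compiler can inline the full
// operation into the loop.
template <typename InnerOp>
static void call_i8_bulk(const int8_t* q, const int8_t* data, size_t dims,
                         size_t count, int32_t* out, InnerOp inner_op) {
    for (size_t v = 0; v < count; ++v)
        out[v] = inner_op(q, data + v * dims, dims);
}
```

With `dims == STRIDE` exactly, the strided body runs once and the tail loop does nothing, which is the boundary case the `>=` change repairs.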
Relates to #145411
Test plan

- `JDKVectorLibraryInt8Tests` pass on AMD c8a (x64, AVX-512)
- `JDKVectorLibrary*Tests` pass locally on Apple Silicon (aarch64)