Refactor x64 bulk scoring: shared template, masked AVX-512 tail, inlinable inner ops#145310

Merged
ldematte merged 11 commits into elastic:main from ldematte:native/refactor-avx512-bulk
Apr 1, 2026

Conversation

@ldematte
Contributor

@ldematte ldematte commented Mar 31, 2026

This PR refactors the x64 bulk scoring infrastructure for int8 vectors:

  • Shared call_i8_bulk template moved to amd64_vec_common.h, used by both AVX2 (vec_1.cpp) and AVX-512 (vec_2.cpp). Simplified: inner_op handles full dims including tail, so no scalar_op, bulk_tail, or stride parameters needed.

  • Inner functions handle full dims including scalar tail (AVX2) or masked SIMD tail (AVX-512). EXPORT functions become thin wrappers over static inline inners, ensuring the bulk template can inline the full operation.

  • AVX-512 masked tail: replaces scalar loop with a single masked _mm512_maskz_loadu_epi8 + SIMD operation for remaining elements (< 64 bytes).

  • `>` → `>=` fix for SIMD stride checks across all files (amd64 and aarch64). Previously, when dims == stride_len, the SIMD path was skipped and all elements went through the scalar tail.

Follows #144845 (needs new build process to avoid regression due to the GCC #pragma/inline bug)

Relates to #145411

Test plan

  • JDKVectorLibraryInt8Tests pass on AMD c8a (x64 AVX-512)
  • JDKVectorLibrary*Tests pass locally on Apple Silicon (aarch64)
  • No performance regression on AMD c8a (bulk random 130k: 995 ops/s vs 988 baseline)

@ldematte ldematte marked this pull request as ready for review March 31, 2026 11:40
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance label Mar 31, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)


int i = 0;
const int blk = dims & ~(STRIDE_BYTES_LEN - 1);
#pragma GCC unroll 4
Member


We don't use #pragma unroll anywhere else, worth changing this to templates?

Contributor Author


It's a leftover from early days, and I did not want to touch it in this PR. I want to investigate this VS using templates VS nothing at all, but as a separate task.

@thecoop
Member

thecoop commented Mar 31, 2026

Are there method size considerations with inlining the scalar tail? Might the larger inlined method be too big to fit in the CPU all at once? (although 64k is a lot of instructions...)

@ldematte
Contributor Author

> Are there method size considerations with inlining the scalar tail? Might the larger inlined method be too big to fit in the CPU all at once? (although 64k is a lot of instructions...)

TL;DR: I don't have concerns.
The scalar tail (or masked tail) adds very few instructions; it's negligible with respect to the total instruction count. The more relevant part is the SIMD body -- when call_i8_bulk inlines inner_op, up to 4 copies of the full inner function body end up in the batched loop, plus one more in the tail loop.
But even fully inlined, the inner function is around 200-300 bytes of machine code, so 4 inlined copies come to between 1 and 2 KB. L1i is 32-64 KB, so we're well within budget.
Plus, compilers have their own inlining heuristics -- as you have discovered, the compiler can choose not to inline. It currently does inline, even without tuning any parameters, so we are good so far. If the compiler stops inlining, we can review and see what we can do.

Contributor

@ChrisHegarty ChrisHegarty left a comment


LGTM

@ldematte ldematte requested a review from a team as a code owner April 1, 2026 16:16
@ldematte ldematte enabled auto-merge (squash) April 1, 2026 16:18
@ldematte ldematte merged commit 27f6080 into elastic:main Apr 1, 2026
23 of 24 checks passed
@ldematte ldematte deleted the native/refactor-avx512-bulk branch April 1, 2026 20:13
mromaios pushed a commit to mromaios/elasticsearch that referenced this pull request Apr 9, 2026
…nable inner ops (elastic#145310)


Labels

>refactoring · :Search Relevance/Vectors · Team:Search Relevance · v9.4.0

4 participants