Refactor x64 bulk scoring: shared template, masked AVX-512 tail, inlinable inner ops#145310

Merged
ldematte merged 11 commits into elastic:main from ldematte:native/refactor-avx512-bulk
Apr 1, 2026

Conversation

@ldematte
Contributor

@ldematte ldematte commented Mar 31, 2026

This PR refactors the x64 bulk scoring infrastructure for int8 vectors:

  • Shared call_i8_bulk template moved to amd64_vec_common.h, used by both AVX2 (vec_1.cpp) and AVX-512 (vec_2.cpp). Simplified: inner_op handles full dims including tail, so no scalar_op, bulk_tail, or stride parameters needed.

  • Inner functions handle full dims including scalar tail (AVX2) or masked SIMD tail (AVX-512). EXPORT functions become thin wrappers over static inline inners, ensuring the bulk template can inline the full operation.

  • AVX-512 masked tail: replaces scalar loop with a single masked _mm512_maskz_loadu_epi8 + SIMD operation for remaining elements (< 64 bytes).

  • `>` → `>=` fix for SIMD stride checks across all files (amd64 and aarch64). Previously, when dims == stride_len, the SIMD path was skipped and all elements went through the scalar tail.

Follows #144845 (needs new build process to avoid regression due to the GCC #pragma/inline bug)

Relates to #145411

Test plan

  • JDKVectorLibraryInt8Tests pass on AMD c8a (x64 AVX-512)
  • JDKVectorLibrary*Tests pass locally on Apple Silicon (aarch64)
  • No performance regression on AMD c8a (bulk random 130k: 995 ops/s vs 988 baseline)

@ldematte ldematte marked this pull request as ready for review March 31, 2026 11:40
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance label Mar 31, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)


int i = 0;
const int blk = dims & ~(STRIDE_BYTES_LEN - 1);
#pragma GCC unroll 4
Member


We don't use #pragma unroll anywhere else, worth changing this to templates?

Contributor Author


It's a leftover from early days, and I did not want to touch it in this PR. I want to investigate this VS using templates VS nothing at all, but as a separate task.

@thecoop
Member

thecoop commented Mar 31, 2026

Are there method size considerations with inlining the scalar tail? Might the larger inlined method be too big to fit in the CPU all at once? (although 64k is a lot of instructions...)

@ldematte
Contributor Author

> Are there method size considerations with inlining the scalar tail? Might the larger inlined method be too big to fit in the CPU all at once? (although 64k is a lot of instructions...)

TL;DR: I don't have concerns.
The scalar tail (or masked tail) adds very few instructions; it's negligible with respect to the total instruction count. The more relevant part is the SIMD body -- when call_i8_bulk inlines inner_op, up to 4 copies of the full inner function body end up in the batched loop, plus one more in the tail loop.
But even fully inlined, the inner function is around 200-300 bytes of machine code, so 4 inlined copies come to between 1 and 2 KB. L1i is 32-64 KB, so we're well within budget.
Plus, compilers have their own inlining heuristics -- as you have discovered, the compiler can choose not to inline. It currently does inline, even without tuning any parameters, so we are good so far. If the compiler stops inlining, we can review and see what we can do.

Contributor

@ChrisHegarty ChrisHegarty left a comment


LGTM

@ldematte ldematte requested a review from a team as a code owner April 1, 2026 16:16
@ldematte ldematte enabled auto-merge (squash) April 1, 2026 16:18
@ldematte ldematte merged commit 27f6080 into elastic:main Apr 1, 2026
23 of 24 checks passed
@ldematte ldematte deleted the native/refactor-avx512-bulk branch April 1, 2026 20:13
mromaios pushed a commit to mromaios/elasticsearch that referenced this pull request Apr 9, 2026
…nable inner ops (elastic#145310)


Labels

>refactoring · :Search Relevance/Vectors · Team:Search Relevance · v9.4.0

4 participants