Optimize ARM sqri8/sqri7u kernels using vabdq + vdotq (#145116)
Merged
ldematte merged 6 commits into elastic:main on Mar 30, 2026
Conversation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator: Hi @ldematte, I've created a changelog YAML for you.
Collaborator: Pinging @elastic/es-search-relevance (Team:Search Relevance)
ChrisHegarty approved these changes on Mar 30, 2026.
felixbarny pushed a commit to felixbarny/elasticsearch that referenced this pull request on Mar 30, 2026.
mamazzol pushed a commit to mamazzol/elasticsearch that referenced this pull request on Mar 30, 2026.
Summary

Optimize the aarch64 `sqri8_inner` kernel and bulk path by replacing the slow widen-subtract-multiply approach with `vabdq_s8` + `vdotq_u32`.

Before: `vsubl_s8` (widen i8→i16 subtract) → `vmlal_s16` (widening multiply-accumulate), processing 16 bytes/iteration with `int32x4x2_t` accumulators.

After: `vabdq_s8` (absolute difference, stays in u8) → `vdotq_u32` (dot product of abs diff with itself), processing 32 bytes/iteration with `uint32x4_t` accumulators — matching dot product throughput.

Key insight: `|a-b|` for i8 fits in u8 (max |(-128)-127| = 255), so we can stay in 8-bit and use the dedicated dot product instruction.

This also simplifies the code: removes `sqri8_vector_acc`, `sqri8_vector_combine`, `sqri8_vector`, and the `int32x4x2_t` accumulator specialization and helpers (`apply`, `combine`). Net -28 lines.

Benchmarks (1024 dims, single-pair)
Apple M4 Pro (NEON+SDOT):
AWS Graviton 4 (c8gd.xlarge, NEON+SDOT):
Test plan

`JDKVectorLibraryInt8Tests` passes locally