
Optimize ARM sqri8/sqri7u kernels using vabdq + vdotq #145116

Merged
ldematte merged 6 commits into elastic:main from ldematte:native/better-int8-sqr
Mar 30, 2026

Conversation

Contributor

@ldematte ldematte commented Mar 27, 2026

Summary

Optimize the aarch64 sqri8_inner kernel and bulk path by replacing the slow widen-subtract-multiply approach with vabdq_s8 + vdotq_u32.

Before: vsubl_s8 (widen i8→i16 subtract) → vmlal_s16 (widening multiply-accumulate), processing 16 bytes/iteration with int32x4x2_t accumulators.

After: vabdq_s8 (absolute difference, stays in u8) → vdotq_u32 (dot product of abs diff with itself), processing 32 bytes/iteration with uint32x4_t accumulators — matching dot product throughput.
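The new loop shape can be sketched as below. This is an illustrative reconstruction from the description above, not the Elasticsearch kernel itself: the function name `sqri8_sketch` is made up, and a scalar fallback keeps it runnable off-ARM. On aarch64 with the dot-product extension it follows the described pipeline: `vabdq_s8` for the absolute difference, `vdotq_u32` to accumulate `d·d`, 32 bytes per iteration into `uint32x4_t` accumulators.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>

static uint32_t sqri8_sketch(const int8_t *a, const int8_t *b, size_t n) {
    uint32x4_t acc0 = vdupq_n_u32(0);
    uint32x4_t acc1 = vdupq_n_u32(0);
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {          /* 32 bytes per iteration */
        /* |a-b| fits in u8, so reinterpret the vabdq_s8 result as u8 lanes */
        uint8x16_t d0 = vreinterpretq_u8_s8(
            vabdq_s8(vld1q_s8(a + i), vld1q_s8(b + i)));
        uint8x16_t d1 = vreinterpretq_u8_s8(
            vabdq_s8(vld1q_s8(a + i + 16), vld1q_s8(b + i + 16)));
        acc0 = vdotq_u32(acc0, d0, d0);     /* UDOT: per-lane sums of d*d */
        acc1 = vdotq_u32(acc1, d1, d1);
    }
    uint32_t acc = vaddvq_u32(vaddq_u32(acc0, acc1));
    for (; i < n; i++) {                    /* scalar tail */
        int32_t d = (int32_t)a[i] - (int32_t)b[i];
        acc += (uint32_t)(d * d);
    }
    return acc;
}
#else
/* Portable scalar reference with the same result. */
static uint32_t sqri8_sketch(const int8_t *a, const int8_t *b, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t d = (int32_t)a[i] - (int32_t)b[i];
        acc += (uint32_t)(d * d);
    }
    return acc;
}
#endif
```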

Key insight: |a-b| for i8 fits in u8 (max |(-128)-127| = 255), so we can stay in 8-bit and use the dedicated dot product instruction.
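The insight is small enough to verify exhaustively. The standalone check below (not kernel code; the helper name is illustrative) walks every i8 pair and confirms that the absolute difference never exceeds 255 and that squaring it in unsigned arithmetic reproduces `(a-b)^2` exactly:

```c
#include <stdint.h>
#include <stdlib.h>

/* Exhaustive check: for every pair of i8 values, |a - b| fits in an
 * unsigned 8-bit lane, and squaring that u8 value equals (a - b)^2. */
static int check_absdiff_fits_u8(void) {
    for (int a = -128; a <= 127; a++) {
        for (int b = -128; b <= 127; b++) {
            int d = a - b;
            unsigned ad = (unsigned)abs(d);   /* what vabdq_s8 produces */
            if (ad > 255) return 0;           /* would overflow a u8 lane */
            if (ad * ad != (unsigned)(d * d)) return 0;
        }
    }
    return 1;                                 /* insight holds for all pairs */
}
```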

This also simplifies the code: removes sqri8_vector_acc, sqri8_vector_combine, sqri8_vector, and the int32x4x2_t accumulator specialization and helpers (apply, combine). Net -28 lines.

Benchmarks (1024 dims, single-pair)

Apple M4 Pro (NEON+SDOT):

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| dot i8 (control) | 25.7 ns | 25.7 ns | |
| sqeuclidean i8 | 46.0 ns | 26.6 ns | 1.73x |

AWS Graviton 4 (c8gd.xlarge, NEON+SDOT):

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| dot i8 (control) | 26.6 ns | 26.6 ns | |
| sqeuclidean i8 | 67.1 ns | 30.2 ns | 2.22x |

Test plan

  • JDKVectorLibraryInt8Tests passes locally

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@elasticsearchmachine
Collaborator

Hi @ldematte, I've created a changelog YAML for you.

@ldematte ldematte marked this pull request as ready for review March 27, 2026 17:34
@ldematte ldematte requested a review from a team as a code owner March 27, 2026 17:34
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Mar 27, 2026
@ldematte ldematte merged commit efeb009 into elastic:main Mar 30, 2026
36 checks passed
@ldematte ldematte deleted the native/better-int8-sqr branch March 30, 2026 09:07
felixbarny pushed a commit to felixbarny/elasticsearch that referenced this pull request Mar 30, 2026
mamazzol pushed a commit to mamazzol/elasticsearch that referenced this pull request Mar 30, 2026

Labels

  • >enhancement
  • :Search Relevance/Vectors (Vector search)
  • Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)
  • v9.4.0
