Optimize ARM sqri8/sqri7u kernels using vabdq + vdotq (#145116)
Merged
ldematte merged 6 commits into elastic:main on Mar 30, 2026
Conversation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator: Hi @ldematte, I've created a changelog YAML for you.
Collaborator: Pinging @elastic/es-search-relevance (Team:Search Relevance)
ChrisHegarty approved these changes on Mar 30, 2026.
felixbarny pushed a commit to felixbarny/elasticsearch that referenced this pull request on Mar 30, 2026.
mamazzol pushed a commit to mamazzol/elasticsearch that referenced this pull request on Mar 30, 2026.
Summary

Optimize the aarch64 `sqri8_inner` kernel and bulk path by replacing the slow widen-subtract-multiply approach with `vabdq_s8` + `vdotq_u32`.

Before: `vsubl_s8` (widen i8→i16 subtract) → `vmlal_s16` (widening multiply-accumulate), processing 16 bytes/iteration with `int32x4x2_t` accumulators.

After: `vabdq_s8` (absolute difference, stays in u8) → `vdotq_u32` (dot product of abs diff with itself), processing 32 bytes/iteration with `uint32x4_t` accumulators — matching dot product throughput.

Key insight: `|a-b|` for i8 fits in u8 (max |(-128)-127| = 255), so we can stay in 8-bit and use the dedicated dot product instruction.

This also simplifies the code: removes `sqri8_vector_acc`, `sqri8_vector_combine`, `sqri8_vector`, and the `int32x4x2_t` accumulator specialization and helpers (`apply`, `combine`). Net -28 lines.

Benchmarks (1024 dims, single-pair)
Apple M4 Pro (NEON+SDOT):
AWS Graviton 4 (c8gd.xlarge, NEON+SDOT):
Test plan

`JDKVectorLibraryInt8Tests` passes locally