
Add JDK22+ heap-backed native vector scorer suppliers#142812

Open
arup-chauhan wants to merge 2 commits into elastic:main from arup-chauhan:native-scorers-hnsw-build

Conversation

@arup-chauhan

Description

This PR implements an Elasticsearch-first fix for #142379 by enabling native vector scorer suppliers during the array-backed phase of HNSW graph building (JDK 22+), while preserving existing off-heap paths and Lucene fallback behavior.

Context from issue discussion:

  • During initial HNSW build, vectors may come from heap arrays (vectorValue), so the existing index-slice/off-heap native supplier path is not always used.
  • We add a JDK22+ heap-backed MemorySegment supplier path in Elasticsearch first, as requested in the issue thread.
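The heap-backed idea above rests on the JDK 22+ ability to wrap an on-heap array in a `MemorySegment` without copying. A minimal sketch (the scorer wiring itself is not shown; the native-call detail is an assumption based on the JDK 22 FFM API, where heap segments may be passed to downcalls declared with `Linker.Option.critical(true)`):

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class HeapSegmentSketch {
    public static void main(String[] args) {
        // Wrap a heap float[] (e.g. one returned by FloatVectorValues#vectorValue)
        // in a MemorySegment without copying.
        float[] vector = new float[] {1.0f, 2.0f, 3.0f};
        MemorySegment seg = MemorySegment.ofArray(vector);

        // The segment views the array directly: 3 floats * 4 bytes.
        System.out.println(seg.byteSize());                              // 12
        System.out.println(seg.getAtIndex(ValueLayout.JAVA_FLOAT, 1));   // 2.0
    }
}
```

Such a heap segment can then be handed to the native distance functions for the array-backed phase of the HNSW build, avoiding a copy into off-heap memory.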

Changes

  1. Extended the VectorScorerFactory API with array-backed supplier methods:
  • getFloatVectorScorerSupplier(VectorSimilarityType, FloatVectorValues)
  • getByteVectorScorerSupplier(VectorSimilarityType, ByteVectorValues)
  2. Implemented array-backed native suppliers in simdvec:
  • heap float supplier
  • heap byte supplier
  3. Wired the factory implementation:
  • VectorScorerFactoryImpl now returns heap-backed suppliers for array-backed values.
  4. Updated the ES scorer selection path:
  • ES93FlatVectorScorer now tries, in order:
    • the index-slice native supplier (existing behavior)
    • the array-backed native supplier (new behavior)
    • the Lucene fallback supplier
  5. Added test coverage for the new array-backed path:
  • FloatVectorScorerFactoryTests.testArrayBackedRandomSupplier
  • ByteVectorScorerFactoryTests.testArrayBackedRandomSupplier
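The three-step selection in item 4 can be sketched as a simple fallback chain. All names below are hypothetical stand-ins for the real supplier lookups, which return empty when a path is unsupported:

```java
import java.util.Optional;

public class ScorerSelectionSketch {
    // Hypothetical stand-ins: each lookup yields a supplier, or empty if
    // that path is unavailable for the given vector values.
    static Optional<String> indexSliceNativeSupplier() { return Optional.empty(); }
    static Optional<String> arrayBackedNativeSupplier() { return Optional.of("native-heap"); }
    static String luceneFallbackSupplier() { return "lucene"; }

    static String selectSupplier() {
        // Order mirrors the PR description: off-heap native first,
        // then the new heap/array-backed native path, then Lucene.
        return indexSliceNativeSupplier()
            .or(ScorerSelectionSketch::arrayBackedNativeSupplier)
            .orElseGet(ScorerSelectionSketch::luceneFallbackSupplier);
    }

    public static void main(String[] args) {
        System.out.println(selectSupplier()); // "native-heap" in this sketch
    }
}
```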

Behavior / Safety

  • New array-backed native supplier path is explicitly gated to JDK 22+ (Runtime.version().feature() >= 22).
  • If unsupported/incompatible, behavior falls back to existing Lucene scorer path.
  • Existing off-heap/index-slice path remains unchanged.
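The gate described above is the standard runtime feature-version check; a minimal sketch:

```java
public class Jdk22Gate {
    // The array-backed native supplier path is only taken on JDK 22+,
    // per the gating condition quoted above.
    static boolean heapSegmentPathSupported() {
        return Runtime.version().feature() >= 22;
    }

    public static void main(String[] args) {
        System.out.println(heapSegmentPathSupported()
            ? "array-backed native suppliers enabled"
            : "falling back to Lucene scorers");
    }
}
```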

Validation

Ran with runtime JDK 25 (JDK22+ path active):

./gradlew :libs:simdvec:compileMain21Java :server:compileJava
./gradlew :libs:simdvec:test \
  --tests org.elasticsearch.simdvec.FloatVectorScorerFactoryTests.testArrayBackedRandomSupplier \
  --tests org.elasticsearch.simdvec.ByteVectorScorerFactoryTests.testArrayBackedRandomSupplier
./gradlew :server:test \
  --tests org.elasticsearch.index.mapper.vectors.DenseVectorFieldMapperTests.testKnnQuantizedFlatVectorsFormat \
  --tests org.elasticsearch.index.mapper.vectors.DenseVectorFieldMapperTests.testKnnQuantizedHNSWVectorsFormat
./gradlew :libs:simdvec:spotlessApply
./gradlew :libs:simdvec:spotlessJavaCheck :server:spotlessJavaCheck

All above commands completed successfully.

@elasticsearchmachine added the v9.4.0, needs:triage (Requires assignment of a team area label), and external-contributor (Pull request authored by a developer outside the Elasticsearch team) labels on Feb 22, 2026
Member

@benwtrent benwtrent left a comment


  1. please benchmark
  2. bulk scoring actually needs to be bulk scoring

}

@Override
HeapByteVectorScorerSupplier copyInternal() {
Member


why doesn't this just override copy directly?

Author


@thecoop I removed the copyInternal() indirection; each concrete heap supplier now overrides copy() directly.

@thecoop added the :Search Relevance/Vectors (Vector search) label and removed the needs:triage (Requires assignment of a team area label) label on Feb 23, 2026
@elasticsearchmachine added the Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) label on Feb 23, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@arup-chauhan
Author

  1. please benchmark
  2. bulk scoring actually needs to be bulk scoring

@benwtrent thanks, this is addressed now.

Bulk scoring is now truly bulk. I updated the heap-backed path so bulkScore(...) no longer loops over score(...) one-by-one.

It now packs the selected vectors into a contiguous buffer, calls the native bulk functions, and then applies the similarity-specific normalization step.
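A minimal sketch of that packing step, with all names hypothetical (the real code operates on the supplier's vector values and calls into the simdvec native layer, which is not shown here):

```java
import java.util.Arrays;

public class BulkPackSketch {
    // Gather the selected ordinals into one contiguous buffer so a single
    // native bulk call can score them, instead of looping score(...) per vector.
    static float[] packSelected(float[][] vectors, int[] ordinals, int dims) {
        float[] packed = new float[ordinals.length * dims];
        for (int i = 0; i < ordinals.length; i++) {
            System.arraycopy(vectors[ordinals[i]], 0, packed, i * dims, dims);
        }
        return packed;
    }

    public static void main(String[] args) {
        float[][] vectors = { {1f, 2f}, {3f, 4f}, {5f, 6f} };
        // Score ordinals 2 and 0 against a query: pack them back-to-back first.
        float[] packed = packSelected(vectors, new int[] {2, 0}, 2);
        System.out.println(Arrays.toString(packed)); // [5.0, 6.0, 1.0, 2.0]
    }
}
```

The contiguous buffer is what lets the native side process all candidates in one call; the similarity-specific normalization is then applied to the returned raw scores.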

I also ran a focused indexing benchmark with qa/vector (HNSW, byte vectors, 128 dims, 100k docs):

  • JDK21: 2336 ms
  • JDK25: 1668 ms
  • about 28.6% faster

While running this, I also found a bug in ordinal handling during incremental HNSW build (we were effectively treating values.size() as fixed). I fixed that by checking
ordinals against the current values.size() at runtime.
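A sketch of that fix, with hypothetical names: validate each ordinal against the current size of the growing vector store at call time, rather than a size captured when the supplier was created.

```java
public class OrdinalCheckSketch {
    // During an incremental HNSW build the store grows, so the bound must be
    // re-read on every access instead of being treated as fixed.
    static void checkOrdinal(int ord, int currentSize) {
        if (ord < 0 || ord >= currentSize) {
            throw new IllegalArgumentException(
                "ordinal " + ord + " out of bounds for current size " + currentSize);
        }
    }

    public static void main(String[] args) {
        checkOrdinal(3, 5);              // in range: fine
        try {
            checkOrdinal(5, 5);          // stale/oversized ordinal: rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```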

Signed-off-by: Arup Chauhan <arupchauhan.connect@gmail.com>
Signed-off-by: Arup Chauhan <arupchauhan.connect@gmail.com>
@arup-chauhan force-pushed the native-scorers-hnsw-build branch from ae0440b to 50955d4 on March 2, 2026 at 11:15
@ldematte
Contributor

ldematte commented Mar 2, 2026

Hello @arup-chauhan, thanks for the benchmarks. Can you add some more details on how you run them?
I do not question your numbers, but I got the opposite result when I tried an approach similar to yours: indexing times got much worse (a 50% or more increase relative to the default Lucene implementation).
My benchmarks were different, though. I used float32 and higher dimensions, so the vectors were definitely larger (probably 16x larger than the ones you used), which might explain the big difference. Also, I ran my benchmarks on ARM.

@arup-chauhan
Author

arup-chauhan commented Mar 2, 2026

Hey @ldematte, thanks for checking this.

You’re right that my run is not directly comparable to yours. Here is exactly what I ran:

Command:
./gradlew :qa:vector:checkVec --args="qa/vector/configs/my-config.json"

Config:
{
  "doc_vectors": ["target/knn_data/docs-128d-120k.bvec"],
  "num_docs": 100000,
  "index_type": "hnsw",
  "hnsw_m": 16,
  "hnsw_ef_construction": 200,
  "vector_encoding": "byte",
  "dimensions": -1,
  "reindex": true
}

Results (indexing only):

  • previous run: doc_add_time=1635ms, total_index_time=3753ms
  • JDK 25 run: doc_add_time=351ms, total_index_time=2268ms

Here is my hardware:

  • CPU: Apple M4
  • Arch: arm64
  • Cores: 10 (4P + 6E)
  • Memory: 16 GB
  • Java: 25.0.2 (LTS) and Java 21
  • Branch: native-scorers-hnsw-build

These were byte vectors, 128 dims, 100k docs, and no search/query was executed in this config (search metrics are zero). So this does not cover float32 + higher dimensions, where behavior can differ significantly.

I agree ARM + larger float vectors may change the outcome materially.
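The rough arithmetic behind the "16x larger" estimate in the thread, assuming a 512-dim float32 configuration on the other side (that dimension count is an illustrative assumption, not stated in the discussion):

```java
public class VectorSizeSketch {
    public static void main(String[] args) {
        int byteVec = 128 * Byte.BYTES;    // 128 bytes/vector in this benchmark
        int floatVec = 512 * Float.BYTES;  // 2048 bytes/vector, a plausible float32 config
        System.out.println(floatVec / byteVec); // 16: per-vector size ratio
    }
}
```

Larger per-vector footprints change cache behavior and the cost of any packing/copying step, which could plausibly flip the benchmark outcome.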

@benwtrent
Member

@arup-chauhan isn't it obvious that you need to benchmark with float and byte? you only did byte.


Labels

>enhancement · external-contributor (Pull request authored by a developer outside the Elasticsearch team) · :Search Relevance/Vectors (Vector search) · Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) · v9.5.0
