
Use native vector scorers for on-heap data during indexing#143040

Draft
ldematte wants to merge 14 commits into elastic:main from ldematte:segments-adapter

Conversation

@ldematte
Contributor

@ldematte ldematte commented Feb 25, 2026

Summary

Enable native vector scoring during HNSW graph construction by storing vectors off-heap in MemorySegments and providing direct native access to them, bypassing Lucene's opaque wrappers. This yields a ~28% reduction in indexing time in benchmarks (see below).

Motivation

Previously, native vector scoring (via VectorScorerFactoryImpl) was only available for the search path, where vectors are memory-mapped from disk via MemorySegmentAccessInput. During indexing, HNSW graph
construction used Lucene's DefaultFlatVectorScorer with on-heap float[] arrays -- missing out on our optimized SIMD implementations.
The challenge: Lucene's Lucene99HnswVectorsWriter calls FloatVectorValues.fromFloats(flatFieldVectorsWriter.getVectors()), creating an anonymous wrapper that hides the underlying vector storage. We needed to
(a) control the storage, (b) make it off-heap so native code can access it, and (c) bridge through Lucene's opaque wrappers.

Approach

The work progressed through several layered steps:
Step 0: prepare FloatVectorScorerSupplier to work without MemorySegmentAccessInput -- Introduce a MemorySegmentAccessor abstraction in FloatVectorScorerSupplier to decouple it from MemorySegmentAccessInput, allowing native vector scoring for both memory-mapped and heap-resident data, with adapters for concrete cases. Update ES93FlatVectorScorer to attempt native scoring even when HasIndexSlice is not available (i.e., on-heap vectors during initial HNSW graph construction), and delegate raw float scoring to ES93FlatVectorScorer instead of Lucene's default scorer.
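The accessor abstraction in Step 0 can be sketched roughly as follows. This is a minimal illustration, not the PR's actual API: the interface and adapter names here are placeholders, and the real MemorySegmentAccessor lives next to FloatVectorScorerSupplier.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.List;

// Hypothetical sketch of the accessor abstraction: scorers ask for a vector's
// MemorySegment without knowing whether it comes from a memory-mapped file,
// an off-heap store, or a heap float[].
interface VectorSegmentAccessor {
    /** Returns the segment holding vector `ord`, or null if unavailable. */
    MemorySegment segmentFor(int ord);
}

// Fallback adapter for on-heap vectors: wraps each float[] as a heap segment.
final class HeapFloatAdapter implements VectorSegmentAccessor {
    private final List<float[]> vectors;

    HeapFloatAdapter(List<float[]> vectors) {
        this.vectors = vectors;
    }

    @Override
    public MemorySegment segmentFor(int ord) {
        return MemorySegment.ofArray(vectors.get(ord));
    }
}

public class AccessorDemo {
    public static void main(String[] args) {
        VectorSegmentAccessor acc = new HeapFloatAdapter(List.of(new float[] {1f, 2f, 3f}));
        System.out.println(acc.segmentFor(0).getAtIndex(ValueLayout.JAVA_FLOAT, 2)); // prints 3.0
    }
}
```

Concrete adapters for memory-mapped input and the off-heap store would implement the same interface, which is what lets the scorer code stay agnostic of the storage tier.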
Step 1: Own the indexing path -- Rather than forking Lucene99FlatVectorsWriter entirely, we created ES93FlatVectorsWriter as a delegating wrapper that uses VarHandles to access the delegate's private
meta/vectorData IndexOutput fields. This lets us override addField() and flush() while keeping mergeOneField/mergeOneFieldToIndex delegated. A writer instance either does the indexing path or the
merge path, never both.
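The VarHandle technique from Step 1 can be sketched with a toy delegate class. The field names below are placeholders; the real code targets the private meta/vectorData IndexOutput fields of Lucene99FlatVectorsWriter.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Toy stand-in for a delegate writer with private output fields.
class Delegate {
    private final String metaOut = "meta.vemf";
    private final String vectorDataOut = "data.vec";
}

// Delegating wrapper that reaches the delegate's private fields via a
// VarHandle obtained from a private lookup, instead of forking the class.
public class VarHandleBridge {
    private static final VarHandle META;

    static {
        try {
            var lookup = MethodHandles.privateLookupIn(Delegate.class, MethodHandles.lookup());
            META = lookup.findVarHandle(Delegate.class, "metaOut", String.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Reads the delegate's private field without modifying the delegate. */
    public static String readMeta(Delegate d) {
        return (String) META.get(d);
    }

    public static void main(String[] args) {
        System.out.println(readMeta(new Delegate())); // prints meta.vemf
    }
}
```

This only needs the field name and type to stay stable across Lucene upgrades, which is why it is less maintenance than owning a full copy of the writer.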
Step 2: Refactor BQ writer hierarchy -- Converted BinaryQuantizedVectorsFieldWriter from an interface to an abstract class extending FlatFieldVectorsWriter<float[]>, adding a createFieldWriter factory
method to ES818BinaryQuantizedVectorsWriter for subclass overrides (used by ES93BinaryQuantizedVectorsWriter).
Step 3: Off-heap vector storage -- Created OffHeapFloatVectorStore and OffHeapByteVectorStore using the MRJar pattern (src/main stubs, src/main21 MemorySegment-backed implementations in simdvec).
Each vector is individually allocated in a shared Arena. ES93FlatFieldVectorsWriter uses these stores, with in-place normalizeByMagnitudes operating directly on off-heap segments.
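Step 3's per-vector allocation in a shared Arena can be sketched like this (illustrative names, assuming the JDK 22+ `java.lang.foreign` API; the real stores live in simdvec behind MRJar stubs):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of an off-heap float vector store: each vector gets its own
// MemorySegment, all allocated from one shared Arena whose lifetime bounds
// them all. Class and method names are placeholders, not the PR's API.
public class OffHeapFloatStoreSketch implements AutoCloseable {
    private final Arena arena = Arena.ofShared();
    private final List<MemorySegment> segments = new ArrayList<>();
    private final int dims;

    public OffHeapFloatStoreSketch(int dims) {
        this.dims = dims;
    }

    /** Copies one vector off-heap; returns its ordinal. */
    public int add(float[] vector) {
        MemorySegment seg = arena.allocate(ValueLayout.JAVA_FLOAT, dims);
        MemorySegment.copy(vector, 0, seg, ValueLayout.JAVA_FLOAT, 0, dims);
        segments.add(seg);
        return segments.size() - 1;
    }

    public MemorySegment get(int ord) {
        return segments.get(ord);
    }

    @Override
    public void close() {
        arena.close(); // frees all vectors at once
    }

    public static void main(String[] args) {
        try (var store = new OffHeapFloatStoreSketch(3)) {
            int ord = store.add(new float[] {1f, 2f, 3f});
            System.out.println(store.get(ord).getAtIndex(ValueLayout.JAVA_FLOAT, 1)); // prints 2.0
        }
    }
}
```

In-place operations such as normalizeByMagnitudes can then read and write the returned segments directly, with no heap copy.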
Step 4: Bridge off-heap store to scorer -- The key insight: getVectors() returns named AbstractList wrappers (WrappedNativeFloatVectors, WrappedNativeByteVectors in simdvec) that expose the underlying
store via getStore(). VectorScorerFactoryImpl uses reflection to look through Lucene's anonymous FloatVectorValues wrapper, find the captured list, and extract the OffHeapFloatVectorStore directly --
eliminating the round-trip of materializing float[] from off-heap just to wrap them back into MemorySegment.
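The reflection bridge in Step 4 amounts to scanning the wrapper's fields for the captured list. A toy illustration follows: `Wrapper` here stands in for Lucene's anonymous FloatVectorValues, whose captured list is a compiler-generated field in the real case.

```java
import java.lang.reflect.Field;
import java.util.List;

// Stand-in for an opaque anonymous wrapper that captures a vector list.
class Wrapper {
    private final List<float[]> vectors;

    Wrapper(List<float[]> vectors) {
        this.vectors = vectors;
    }
}

public class ReflectionBridge {
    /** Scans the wrapper's declared fields for a captured List, or returns null. */
    @SuppressWarnings("unchecked")
    static List<float[]> extractList(Object wrapper) {
        for (Field f : wrapper.getClass().getDeclaredFields()) {
            if (List.class.isAssignableFrom(f.getType())) {
                f.setAccessible(true);
                try {
                    return (List<float[]>) f.get(wrapper);
                } catch (IllegalAccessException e) {
                    return null; // fall back to the default scoring path
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<float[]> captured = List.of(new float[] {1f});
        System.out.println(extractList(new Wrapper(captured)) == captured); // prints true
    }
}
```

In the PR, the extracted list is then type-checked against the named WrappedNativeFloatVectors/WrappedNativeByteVectors wrappers, so the bridge degrades gracefully when anything else is captured.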
Step 5: Shared MemorySegmentAccessor -- Extracted from FloatVectorScorerSupplier's nested interface to a top-level interface shared by both float and byte scorer suppliers. Refactored
ByteVectorScorerSupplier from MemorySegmentAccessInput to MemorySegmentAccessor, enabling the same three-tier resolution (IndexInput -> off-heap store -> heap fallback) for byte vectors.
Step 6: Native sparse bulk scoring -- Added vec_sqrf32_bulk_sparse, a native function taking f32_t* const* (array of pointers to individually-allocated vectors) instead of a contiguous block. This is the
"scatter" pattern: templatized sqrf32_inner_bulk on TData and a new mapper signature. The heap adapter can't use this path (JDK docs explicitly state MemorySegment.ofArray addresses can't populate a native
pointer struct), but OffHeapFloatStoreAdapter can because its vectors are truly off-heap. MemorySegmentAccessor.segmentForEntriesOrNull(int[], Arena) builds the pointer array; the Arena parameter is
caller-managed for clean lifecycle control.
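The pointer-array construction behind segmentForEntriesOrNull can be sketched as below (illustrative names; assumes the JDK 22+ foreign API). The essential point is that each slot of a native ADDRESS array holds the native address of one individually-allocated vector, matching the f32_t* const* signature on the C side.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Sketch of the "scatter" pointer-array pattern: given individually-allocated
// off-heap vectors, build one native array whose i-th entry is the address of
// the i-th vector. The Arena is caller-managed for lifecycle control.
public class PointerArraySketch {
    static MemorySegment pointerArray(MemorySegment[] vectors, Arena arena) {
        MemorySegment ptrs = arena.allocate(ValueLayout.ADDRESS, vectors.length);
        for (int i = 0; i < vectors.length; i++) {
            // Only valid for native segments; a heap segment from
            // MemorySegment.ofArray has no stable address to store here.
            ptrs.setAtIndex(ValueLayout.ADDRESS, i, vectors[i]);
        }
        return ptrs;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment v0 = arena.allocate(ValueLayout.JAVA_FLOAT, 4);
            MemorySegment v1 = arena.allocate(ValueLayout.JAVA_FLOAT, 4);
            MemorySegment ptrs = pointerArray(new MemorySegment[] {v0, v1}, arena);
            System.out.println(ptrs.getAtIndex(ValueLayout.ADDRESS, 1).address() == v1.address()); // prints true
        }
    }
}
```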

Design decisions

  • Delegation + VarHandle over full copy: We initially considered forking Lucene99FlatVectorsWriter entirely, but the VarHandle approach is less code and easier to maintain across Lucene upgrades -- we only
    own the indexing path.
  • Reflection bridge over Lucene fork: Using reflection to unwrap Lucene's anonymous FloatVectorValues is pragmatic. A cleaner long-term solution (e.g., custom HnswVectorsWriter) may replace it later.
  • MRJar for MemorySegment APIs: java.lang.foreign is preview in JDK 21, finalized in JDK 22. OffHeapFloatVectorStore, OffHeapByteVectorStore, and ArenaUtil use src/main stubs plus src/main21 and
    src/main22 implementations.
  • Individual segment allocation: Each vector gets its own MemorySegment (vs one large contiguous segment) because vectors arrive one at a time during indexing. This naturally leads to the sparse bulk scoring
    approach.

Benchmark results

GIST-1M, 200k docs, float32, euclidean, HNSW(m=16, efC=200), BQ:

  • Before (heap, non-native scoring): 225s indexing
  • After (off-heap, native sparse bulk scoring): 162s indexing
  • Indexing time reduction: ~28%

Known limitations

  • Sparse bulk scoring currently implemented only for Euclidean distance; DotProduct and MaxInnerProduct fall back to per-pair scoring.
  • Reflection-based store extraction is a pragmatic bridge; a cleaner approach may replace it later.
  • Further work may be needed for other implementations (e.g., Int7SQVectorScorerSupplier), pending a check on whether it is actually required.

Test plan

  • All libs:simdvec tests pass
  • All libs:native tests pass
  • ES93 vector format tests pass (updated testToString expectations)
  • Verified via async-profiler flamegraph that native sparse bulk scoring path is used during HNSW graph construction
  • Benchmarked with KnnIndexTester on GIST-1M (gives a ~30% indexing speedup)

Co-authored with Cursor


@Override
public MemorySegment entireSegmentOrNull() throws IOException {
    return null;
Member

Yeah, this is tricky. FloatVectorValues could be a discontinuous List<float[]>, which makes mapping it difficult.

Contributor Author

Yep. The last commit implements one idea we discussed, which copies vector data to make it contiguous, but copy dominates (it is actually more expensive than computing the distance, which is not surprising).
The other idea (changing native code to accept sparse arrays) works on the native side, but fails on the Java side; Java does not support "indirect" on-heap MemorySegments, so you cannot build an "array of pointers" from heap-backed segments.
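This limitation can be seen directly with a minimal sketch (JDK 22+ foreign API): storing a heap segment obtained from MemorySegment.ofArray into an ADDRESS slot is rejected at runtime, while a native segment works.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Demonstrates why on-heap arrays cannot populate a native pointer array:
// a heap segment has no stable native address, so the runtime rejects
// storing it as a pointer value.
public class HeapPointerDemo {
    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment ptrs = arena.allocate(ValueLayout.ADDRESS, 1);

            // Native segments work: their address can be stored.
            MemorySegment nativeVec = arena.allocate(ValueLayout.JAVA_FLOAT, 4);
            ptrs.setAtIndex(ValueLayout.ADDRESS, 0, nativeVec);

            // Heap segments do not: the write throws.
            try {
                ptrs.setAtIndex(ValueLayout.ADDRESS, 0, MemorySegment.ofArray(new float[4]));
                System.out.println("heap write accepted");
            } catch (RuntimeException e) {
                System.out.println("heap write rejected");
            }
        }
    }
}
```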

Contributor Author

The final version, after a lot of wrestling with Lucene, is to have a List of off-heap MemorySegments during indexing, and use that, so we can build an array of pointers.

@ldematte
Contributor Author

This is partially duplicated by #142812; we can choose which one we want to pursue, but both will need further and deeper work (see next comment).

@ldematte
Contributor Author

Just enabling native operations with a simple wrapper for MemorySegment.ofArray makes the code fall back to the scoreIndividually path.
[image]

It works, but there is no performance benefit (times are roughly the same as with the default Lucene bulk scorer).

@ldematte
Contributor Author

ldematte commented Feb 26, 2026

The last commit tries out one idea we discussed: copy on-heap data to off-heap, compacting it (making it contiguous). The scoring part becomes measurably faster, but the overhead is huge, so in total we lose compared to single scoring.

[flamegraph image]

We can get rid of part of it by re-using the off-heap buffer, but the great majority of the overhead comes from actually copying the bytes (the massive chunk in the middle of the flamegraph). Not surprising.

@thecoop
Member

thecoop commented Feb 26, 2026

Eeeeeh, that sucks. If we can't use bulk scoring, and the single scoring is kinda the same speed as original Lucene, then is this worth it?

Interpose an ES-owned FlatVectorsWriter between
ES93GenericFlatVectorsWriter and Lucene99FlatVectorsWriter for the
default (FLOAT/BYTE) vector format. All methods delegate to the
inner Lucene writer, enabling future customization of addField
without copying the entire Lucene implementation.

Made-with: Cursor
Replace delegation of addField/flush with ES-owned
implementations. ES93FlatFieldVectorsWriter manages its own
vector storage, and ES93FlatVectorsWriter uses VarHandles to
access the delegate's IndexOutputs for flush. Merge, finish,
and close remain delegated to Lucene99FlatVectorsWriter.

Made-with: Cursor
@ldematte
Contributor Author

> Eeeeeh, that sucks. If we can't use bulk scoring, and the single scoring is kinda the same speed as original Lucene, then is this worth it?

IMO no. I'm exploring one last option, but for now I think it's better to leave this as-is.

Convert from interface to abstract class extending
FlatFieldVectorsWriter<float[]> and add createFieldWriter
factory method to ES818BinaryQuantizedVectorsWriter so a
future es93 subclass can override field writer creation.

Made-with: Cursor
Introduce ES93BinaryQuantizedVectorsWriter and
ES93BinaryQuantizedFieldWriter so the BQ flush path
is wired through ES93FlatFieldVectorsWriter. When
that class migrates to MemorySegment storage, only
ES93BinaryQuantizedFieldWriter needs to adapt.

Made-with: Cursor
Introduce OffHeapFloatVectorStore and OffHeapByteVectorStore in
simdvec using the MRJar pattern (stub in src/main, MemorySegment
implementation in src/main21). ES93FlatFieldVectorsWriter now
delegates vector storage to these stores, keeping data out of
the Java heap. BQ normalization is performed in-place on the
off-heap segments.

Made-with: Cursor
Move WrappedNativeFloatVectors and WrappedNativeByteVectors to
simdvec so the scorer can detect them. VectorScorerFactoryImpl
now uses reflection to extract the captured list from Lucene's
anonymous FloatVectorValues and, when it wraps an off-heap store,
creates an OffHeapStoreAdapter that returns MemorySegments
directly -- avoiding the MemorySegment-to-float[]-to-MemorySegment
round-trip during HNSW graph construction.

Made-with: Cursor
Move MemorySegmentAccessor from a nested interface in
FloatVectorScorerSupplier to a top-level interface shared by both
float and byte scorer suppliers. Rename float-specific adapters
(OffHeapFloatStoreAdapter, FloatMemorySegmentHeapAdapter, etc.)
and add equivalent byte adapters. Refactor ByteVectorScorerSupplier
to use MemorySegmentAccessor, enabling off-heap and heap fallback
paths matching the float scorer pattern.

Made-with: Cursor
Introduce vec_sqrf32_bulk_sparse, a native function that takes an
array of pointers to individually-allocated vectors instead of a
contiguous block. This enables bulk scoring directly from off-heap
MemorySegments without materializing a contiguous buffer.

Templatize sqrf32_inner_bulk in both aarch64 and amd64 to support
the new sparse mapper alongside existing identity and array mappers.
Wire through JdkVectorLibrary, Similarities, and into
FloatVectorScorerSupplier via MemorySegmentAccessor's new
segmentForEntriesOrNull method. OffHeapFloatStoreAdapter builds the
native pointer array from its off-heap vector segments.

Made-with: Cursor
@ldematte
Contributor Author

Well, I tried the "last option". It's... extensive :D
It does open a path towards native bulk scoring though!

[image]

@arup-chauhan

@ldematte @thecoop

I have just updated #142812 based on review feedback (including true bulk scoring, benchmark numbers, and a fix found during benchmarking for incremental ordinal handling).

Since there’s overlap between the two efforts, I would love to collaborate and help converge on whichever path the team prefers.

If the team is going with this as the main path, I’m happy to help port over relevant pieces from #142812.

