
Use native vector scorers for on-heap data during indexing#143040

Draft
ldematte wants to merge 14 commits into elastic:main from ldematte:segments-adapter

Conversation

@ldematte
Contributor

@ldematte ldematte commented Feb 25, 2026

Summary

Enable native vector scoring during HNSW graph construction by storing vectors off-heap in MemorySegments and providing direct native access to them, bypassing Lucene's opaque wrappers. This yields a ~28% reduction in indexing time in benchmarks (see below).

Motivation

Previously, native vector scoring (via VectorScorerFactoryImpl) was only available for the search path, where vectors are memory-mapped from disk via MemorySegmentAccessInput. During indexing, HNSW graph
construction used Lucene's DefaultFlatVectorScorer with on-heap float[] arrays -- missing out on our optimized SIMD implementations.
The challenge: Lucene's Lucene99HnswVectorsWriter calls FloatVectorValues.fromFloats(flatFieldVectorsWriter.getVectors()), creating an anonymous wrapper that hides the underlying vector storage. We needed to
(a) control the storage, (b) make it off-heap so native code can access it, and (c) bridge through Lucene's opaque wrappers.

Approach

The work progressed through several layered steps:
Step 0: prepare FloatVectorScorerSupplier to work without MemorySegmentAccessInput -- Introduce a MemorySegmentAccessor abstraction in FloatVectorScorerSupplier to decouple it from MemorySegmentAccessInput, allowing native vector scoring for both memory-mapped and heap-resident data, with adapters for concrete cases. Update ES93FlatVectorScorer to attempt native scoring even when HasIndexSlice is not available (i.e., on-heap vectors during initial HNSW graph construction), and delegate raw float scoring to ES93FlatVectorScorer instead of Lucene's default scorer.
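The accessor abstraction in Step 0 can be sketched roughly as follows. This is a minimal illustration, not the PR's actual API: the interface and adapter names here are placeholders, and the real MemorySegmentAccessor lives next to FloatVectorScorerSupplier.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.List;

// Hypothetical sketch of the accessor abstraction: scorers ask for a vector's
// MemorySegment without knowing whether it comes from a memory-mapped file,
// an off-heap store, or a heap float[].
interface VectorSegmentAccessor {
    /** Returns the segment holding vector `ord`, or null if unavailable. */
    MemorySegment segmentFor(int ord);
}

// Fallback adapter for on-heap vectors: wraps each float[] as a heap segment.
final class HeapFloatAdapter implements VectorSegmentAccessor {
    private final List<float[]> vectors;

    HeapFloatAdapter(List<float[]> vectors) {
        this.vectors = vectors;
    }

    @Override
    public MemorySegment segmentFor(int ord) {
        return MemorySegment.ofArray(vectors.get(ord));
    }
}

public class AccessorDemo {
    public static void main(String[] args) {
        VectorSegmentAccessor acc = new HeapFloatAdapter(List.of(new float[] {1f, 2f, 3f}));
        System.out.println(acc.segmentFor(0).getAtIndex(ValueLayout.JAVA_FLOAT, 2)); // prints 3.0
    }
}
```

Concrete adapters for memory-mapped input and the off-heap store would implement the same interface, which is what lets the scorer code stay agnostic of the storage tier.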
Step 1: Own the indexing path -- Rather than forking Lucene99FlatVectorsWriter entirely, we created ES93FlatVectorsWriter as a delegating wrapper that uses VarHandles to access the delegate's private
meta/vectorData IndexOutput fields. This lets us override addField() and flush() while keeping mergeOneField/mergeOneFieldToIndex delegated. A writer instance either does the indexing path or the
merge path, never both.
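The VarHandle technique from Step 1 can be sketched with a toy delegate class. The field names below are placeholders; the real code targets the private meta/vectorData IndexOutput fields of Lucene99FlatVectorsWriter.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Toy stand-in for a delegate writer with private output fields.
class Delegate {
    private final String metaOut = "meta.vemf";
    private final String vectorDataOut = "data.vec";
}

// Delegating wrapper that reaches the delegate's private fields via a
// VarHandle obtained from a private lookup, instead of forking the class.
public class VarHandleBridge {
    private static final VarHandle META;

    static {
        try {
            var lookup = MethodHandles.privateLookupIn(Delegate.class, MethodHandles.lookup());
            META = lookup.findVarHandle(Delegate.class, "metaOut", String.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Reads the delegate's private field without modifying the delegate. */
    public static String readMeta(Delegate d) {
        return (String) META.get(d);
    }

    public static void main(String[] args) {
        System.out.println(readMeta(new Delegate())); // prints meta.vemf
    }
}
```

This only needs the field name and type to stay stable across Lucene upgrades, which is why it is less maintenance than owning a full copy of the writer.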
Step 2: Refactor BQ writer hierarchy -- Converted BinaryQuantizedVectorsFieldWriter from an interface to an abstract class extending FlatFieldVectorsWriter<float[]>, adding a createFieldWriter factory
method to ES818BinaryQuantizedVectorsWriter for subclass overrides (used by ES93BinaryQuantizedVectorsWriter).
Step 3: Off-heap vector storage -- Created OffHeapFloatVectorStore and OffHeapByteVectorStore using the MRJar pattern (src/main stubs, src/main21 MemorySegment-backed implementations in simdvec).
Each vector is individually allocated in a shared Arena. ES93FlatFieldVectorsWriter uses these stores, with in-place normalizeByMagnitudes operating directly on off-heap segments.
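Step 3's per-vector allocation in a shared Arena can be sketched like this (illustrative names, assuming the JDK 22+ `java.lang.foreign` API; the real stores live in simdvec behind MRJar stubs):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of an off-heap float vector store: each vector gets its own
// MemorySegment, all allocated from one shared Arena whose lifetime bounds
// them all. Class and method names are placeholders, not the PR's API.
public class OffHeapFloatStoreSketch implements AutoCloseable {
    private final Arena arena = Arena.ofShared();
    private final List<MemorySegment> segments = new ArrayList<>();
    private final int dims;

    public OffHeapFloatStoreSketch(int dims) {
        this.dims = dims;
    }

    /** Copies one vector off-heap; returns its ordinal. */
    public int add(float[] vector) {
        MemorySegment seg = arena.allocate(ValueLayout.JAVA_FLOAT, dims);
        MemorySegment.copy(vector, 0, seg, ValueLayout.JAVA_FLOAT, 0, dims);
        segments.add(seg);
        return segments.size() - 1;
    }

    public MemorySegment get(int ord) {
        return segments.get(ord);
    }

    @Override
    public void close() {
        arena.close(); // frees all vectors at once
    }

    public static void main(String[] args) {
        try (var store = new OffHeapFloatStoreSketch(3)) {
            int ord = store.add(new float[] {1f, 2f, 3f});
            System.out.println(store.get(ord).getAtIndex(ValueLayout.JAVA_FLOAT, 1)); // prints 2.0
        }
    }
}
```

In-place operations such as normalizeByMagnitudes can then read and write the returned segments directly, with no heap copy.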
Step 4: Bridge off-heap store to scorer -- The key insight: getVectors() returns named AbstractList wrappers (WrappedNativeFloatVectors, WrappedNativeByteVectors in simdvec) that expose the underlying
store via getStore(). VectorScorerFactoryImpl uses reflection to look through Lucene's anonymous FloatVectorValues wrapper, find the captured list, and extract the OffHeapFloatVectorStore directly --
eliminating the round-trip of materializing float[] from off-heap just to wrap them back into MemorySegment.
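The reflection bridge in Step 4 amounts to scanning the wrapper's fields for the captured list. A toy illustration follows: `Wrapper` here stands in for Lucene's anonymous FloatVectorValues, whose captured list is a compiler-generated field in the real case.

```java
import java.lang.reflect.Field;
import java.util.List;

// Stand-in for an opaque anonymous wrapper that captures a vector list.
class Wrapper {
    private final List<float[]> vectors;

    Wrapper(List<float[]> vectors) {
        this.vectors = vectors;
    }
}

public class ReflectionBridge {
    /** Scans the wrapper's declared fields for a captured List, or returns null. */
    @SuppressWarnings("unchecked")
    static List<float[]> extractList(Object wrapper) {
        for (Field f : wrapper.getClass().getDeclaredFields()) {
            if (List.class.isAssignableFrom(f.getType())) {
                f.setAccessible(true);
                try {
                    return (List<float[]>) f.get(wrapper);
                } catch (IllegalAccessException e) {
                    return null; // fall back to the default scoring path
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<float[]> captured = List.of(new float[] {1f});
        System.out.println(extractList(new Wrapper(captured)) == captured); // prints true
    }
}
```

In the PR, the extracted list is then type-checked against the named WrappedNativeFloatVectors/WrappedNativeByteVectors wrappers, so the bridge degrades gracefully when anything else is captured.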
Step 5: Shared MemorySegmentAccessor -- Extracted from FloatVectorScorerSupplier's nested interface to a top-level interface shared by both float and byte scorer suppliers. Refactored
ByteVectorScorerSupplier from MemorySegmentAccessInput to MemorySegmentAccessor, enabling the same three-tier resolution (IndexInput -> off-heap store -> heap fallback) for byte vectors.
Step 6: Native sparse bulk scoring -- Added vec_sqrf32_bulk_sparse, a native function taking f32_t* const* (array of pointers to individually-allocated vectors) instead of a contiguous block. This is the
"scatter" pattern: templatized sqrf32_inner_bulk on TData and a new mapper signature. The heap adapter can't use this path (JDK docs explicitly state MemorySegment.ofArray addresses can't populate a native
pointer struct), but OffHeapFloatStoreAdapter can because its vectors are truly off-heap. MemorySegmentAccessor.segmentForEntriesOrNull(int[], Arena) builds the pointer array; the Arena parameter is
caller-managed for clean lifecycle control.
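The pointer-array construction behind segmentForEntriesOrNull can be sketched as below (illustrative names; assumes the JDK 22+ foreign API). The essential point is that each slot of a native ADDRESS array holds the native address of one individually-allocated vector, matching the f32_t* const* signature on the C side.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Sketch of the "scatter" pointer-array pattern: given individually-allocated
// off-heap vectors, build one native array whose i-th entry is the address of
// the i-th vector. The Arena is caller-managed for lifecycle control.
public class PointerArraySketch {
    static MemorySegment pointerArray(MemorySegment[] vectors, Arena arena) {
        MemorySegment ptrs = arena.allocate(ValueLayout.ADDRESS, vectors.length);
        for (int i = 0; i < vectors.length; i++) {
            // Only valid for native segments; a heap segment from
            // MemorySegment.ofArray has no stable address to store here.
            ptrs.setAtIndex(ValueLayout.ADDRESS, i, vectors[i]);
        }
        return ptrs;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment v0 = arena.allocate(ValueLayout.JAVA_FLOAT, 4);
            MemorySegment v1 = arena.allocate(ValueLayout.JAVA_FLOAT, 4);
            MemorySegment ptrs = pointerArray(new MemorySegment[] {v0, v1}, arena);
            System.out.println(ptrs.getAtIndex(ValueLayout.ADDRESS, 1).address() == v1.address()); // prints true
        }
    }
}
```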

Design decisions

  • Delegation + VarHandle over full copy: We initially considered forking Lucene99FlatVectorsWriter entirely, but the VarHandle approach is less code and easier to maintain across Lucene upgrades -- we only
    own the indexing path.
  • Reflection bridge over Lucene fork: Using reflection to unwrap Lucene's anonymous FloatVectorValues is pragmatic. A cleaner long-term solution (e.g., custom HnswVectorsWriter) may replace it later.
  • MRJar for MemorySegment APIs: java.lang.foreign is preview in JDK 21, finalized in JDK 22. OffHeapFloatVectorStore, OffHeapByteVectorStore, and ArenaUtil use src/main stubs plus src/main21 and
    src/main22 implementations.
  • Individual segment allocation: Each vector gets its own MemorySegment (vs one large contiguous segment) because vectors arrive one at a time during indexing. This naturally leads to the sparse bulk scoring
    approach.

Benchmark results

GIST-1M, 200k docs, float32, euclidean, HNSW(m=16, efC=200), BQ:

  • Before (heap, non-native scoring): 225s indexing
  • After (off-heap, native sparse bulk scoring): 162s indexing
  • Indexing time reduction: ~28%

Known limitations

  • Sparse bulk scoring currently implemented only for Euclidean distance; DotProduct and MaxInnerProduct fall back to per-pair scoring.
  • Reflection-based store extraction is a pragmatic bridge; a cleaner approach may replace it later.
  • Further work may be needed for other implementations (e.g., Int7SQVectorScorerSupplier), pending a check on whether it is actually required.

Test plan

  • All libs:simdvec tests pass
  • All libs:native tests pass
  • ES93 vector format tests pass (updated testToString expectations)
  • Verified via async-profiler flamegraph that native sparse bulk scoring path is used during HNSW graph construction
  • Benchmarked with KnnIndexTester on GIST-1M (gives a ~30% indexing speedup)

Co-authored with Cursor


@Override
public MemorySegment entireSegmentOrNull() throws IOException {
    return null;
Member

Yeah, this is tricky. FloatVectorValues could be a discontinuous List<float[]>, which makes mapping it difficult.

Contributor Author

Yep. The last commit implements one idea we discussed, which copies vector data to make it contiguous, but copy dominates (it is actually more expensive than computing the distance, which is not surprising).
The other idea (changing native code to accept sparse arrays) works on the native side, but fails on the Java side; Java does not support "indirect" on-heap MemorySegments, so you cannot build an "array of pointers" from heap-backed segments.
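This limitation can be seen directly with a minimal sketch (JDK 22+ foreign API): storing a heap segment obtained from MemorySegment.ofArray into an ADDRESS slot is rejected at runtime, while a native segment works.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Demonstrates why on-heap arrays cannot populate a native pointer array:
// a heap segment has no stable native address, so the runtime rejects
// storing it as a pointer value.
public class HeapPointerDemo {
    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment ptrs = arena.allocate(ValueLayout.ADDRESS, 1);

            // Native segments work: their address can be stored.
            MemorySegment nativeVec = arena.allocate(ValueLayout.JAVA_FLOAT, 4);
            ptrs.setAtIndex(ValueLayout.ADDRESS, 0, nativeVec);

            // Heap segments do not: the write throws.
            try {
                ptrs.setAtIndex(ValueLayout.ADDRESS, 0, MemorySegment.ofArray(new float[4]));
                System.out.println("heap write accepted");
            } catch (RuntimeException e) {
                System.out.println("heap write rejected");
            }
        }
    }
}
```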

Contributor Author

The final version, after a lot of wrestling with Lucene, is to have a List of off-heap MemorySegments during indexing, and use that, so we can build an array of pointers.

@ldematte
Contributor Author

This is partially duplicated by #142812; we can choose which one we want to pursue, but both will need further and deeper work (see next comment).

@ldematte
Contributor Author

Just enabling native operations with a simple wrapper for MemorySegment.ofArray makes the code fall back to the scoreIndividually path.
[image]

It works, but there is no performance benefit (times are roughly the same as with the default Lucene bulk scorer).

@ldematte
Contributor Author

ldematte commented Feb 26, 2026

The last commit tries out one idea we discussed: copy on-heap data to off-heap, compacting it (making it contiguous). The scoring part becomes measurably faster, but the overhead is huge, so in total we lose compared to single scoring.

[flamegraph image]

We can get rid of part of it by re-using the off-heap buffer, but the great majority of the overhead comes from actually copying the bytes (the massive chunk in the middle of the flamegraph). Not surprising.

@thecoop
Member

thecoop commented Feb 26, 2026

Eeeeeh, that sucks. If we can't use bulk scoring, and the single scoring is kinda the same speed as original Lucene, then is this worth it?

Interpose an ES-owned FlatVectorsWriter between
ES93GenericFlatVectorsWriter and Lucene99FlatVectorsWriter for the
default (FLOAT/BYTE) vector format. All methods delegate to the
inner Lucene writer, enabling future customization of addField
without copying the entire Lucene implementation.

Made-with: Cursor
Replace delegation of addField/flush with ES-owned
implementations. ES93FlatFieldVectorsWriter manages its own
vector storage, and ES93FlatVectorsWriter uses VarHandles to
access the delegate's IndexOutputs for flush. Merge, finish,
and close remain delegated to Lucene99FlatVectorsWriter.

Made-with: Cursor
@ldematte
Contributor Author

> Eeeeeh, that sucks. If we can't use bulk scoring, and the single scoring is kinda the same speed as original Lucene, then is this worth it?

IMO no. I'm exploring one last option, but for now I think it's better to leave this as-is.

Convert from interface to abstract class extending
FlatFieldVectorsWriter<float[]> and add createFieldWriter
factory method to ES818BinaryQuantizedVectorsWriter so a
future es93 subclass can override field writer creation.

Made-with: Cursor
Introduce ES93BinaryQuantizedVectorsWriter and
ES93BinaryQuantizedFieldWriter so the BQ flush path
is wired through ES93FlatFieldVectorsWriter. When
that class migrates to MemorySegment storage, only
ES93BinaryQuantizedFieldWriter needs to adapt.

Made-with: Cursor
Introduce OffHeapFloatVectorStore and OffHeapByteVectorStore in
simdvec using the MRJar pattern (stub in src/main, MemorySegment
implementation in src/main21). ES93FlatFieldVectorsWriter now
delegates vector storage to these stores, keeping data out of
the Java heap. BQ normalization is performed in-place on the
off-heap segments.

Made-with: Cursor
Move WrappedNativeFloatVectors and WrappedNativeByteVectors to
simdvec so the scorer can detect them. VectorScorerFactoryImpl
now uses reflection to extract the captured list from Lucene's
anonymous FloatVectorValues and, when it wraps an off-heap store,
creates an OffHeapStoreAdapter that returns MemorySegments
directly -- avoiding the MemorySegment-to-float[]-to-MemorySegment
round-trip during HNSW graph construction.

Made-with: Cursor
Move MemorySegmentAccessor from a nested interface in
FloatVectorScorerSupplier to a top-level interface shared by both
float and byte scorer suppliers. Rename float-specific adapters
(OffHeapFloatStoreAdapter, FloatMemorySegmentHeapAdapter, etc.)
and add equivalent byte adapters. Refactor ByteVectorScorerSupplier
to use MemorySegmentAccessor, enabling off-heap and heap fallback
paths matching the float scorer pattern.

Made-with: Cursor
Introduce vec_sqrf32_bulk_sparse, a native function that takes an
array of pointers to individually-allocated vectors instead of a
contiguous block. This enables bulk scoring directly from off-heap
MemorySegments without materializing a contiguous buffer.

Templatize sqrf32_inner_bulk in both aarch64 and amd64 to support
the new sparse mapper alongside existing identity and array mappers.
Wire through JdkVectorLibrary, Similarities, and into
FloatVectorScorerSupplier via MemorySegmentAccessor's new
segmentForEntriesOrNull method. OffHeapFloatStoreAdapter builds the
native pointer array from its off-heap vector segments.

Made-with: Cursor
@ldematte
Contributor Author

Well, I tried the "last option". It's... extensive :D
It does open a path towards native bulk scoring though!

[image]

@arup-chauhan

@ldematte @thecoop

I have just updated #142812 based on review feedback (including true bulk scoring, benchmark numbers, and a fix found during benchmarking for incremental ordinal handling).

Since there’s overlap between the two efforts, I would love to collaborate and help converge on whichever path the team prefers.

If the team is going with this as the main path, I’m happy to help port over relevant pieces from #142812.

