Use native vector scorers for on-heap data during indexing #143040

ldematte wants to merge 14 commits into elastic:main
Conversation
```java
@Override
public MemorySegment entireSegmentOrNull() throws IOException {
    return null;
}
```
Yeah, this is tricky. FloatVectorValues could be a discontinuous List<float[]>, which makes mapping it difficult.
Yep. The last commit implements one idea we discussed, which copies vector data to make it contiguous, but copy dominates (it is actually more expensive than computing the distance, which is not surprising).
The other idea (changing native code to accept sparse arrays) works on the native side, but fails on the Java side; Java does not support "indirect" on-heap MemorySegments, so you cannot build an "array of pointers" to them.
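For illustration, a small sketch (JDK 22 FFM API) of why the Java side fails: a segment created with `MemorySegment.ofArray` is a heap segment with no stable native address, so it cannot be stored into a native array of pointers, while an `Arena`-allocated segment can. Class name and output strings here are illustrative, not from the PR.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class HeapVsNativeSegments {
    public static void main(String[] args) {
        // A heap segment wraps a Java array: the GC may move the array,
        // so the segment has no stable native address to hand to C code.
        MemorySegment heap = MemorySegment.ofArray(new float[] {1f, 2f, 3f, 4f});
        System.out.println("heap.isNative() = " + heap.isNative()); // false

        try (Arena arena = Arena.ofConfined()) {
            MemorySegment nativeVec = arena.allocate(ValueLayout.JAVA_FLOAT, 4);
            MemorySegment ptrArray = arena.allocate(ValueLayout.ADDRESS, 1);

            // A pointer slot happily stores a native segment's address...
            ptrArray.setAtIndex(ValueLayout.ADDRESS, 0, nativeVec);

            // ...but a heap segment is rejected, since it has no such address.
            try {
                ptrArray.setAtIndex(ValueLayout.ADDRESS, 0, heap);
            } catch (RuntimeException e) {
                System.out.println("heap segment rejected: " + e.getClass().getSimpleName());
            }
        }
    }
}
```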
The final version, after a lot of wrestling with Lucene, keeps a List of off-heap MemorySegments during indexing and uses that, so we can build an array of pointers.
libs/simdvec/src/main21/java/org/elasticsearch/simdvec/internal/FloatVectorScorerSupplier.java
This is partially duplicated by #142812; we can choose which one we want to pursue, but both will need further and deeper work (see next comment).
Eeeeeh, that sucks. If we can't use bulk scoring, and the single scoring is kinda the same speed as original Lucene, then is this worth it?
Interpose an ES-owned FlatVectorsWriter between ES93GenericFlatVectorsWriter and Lucene99FlatVectorsWriter for the default (FLOAT/BYTE) vector format. All methods delegate to the inner Lucene writer, enabling future customization of addField without copying the entire Lucene implementation. Made-with: Cursor
Replace delegation of addField/flush with ES-owned implementations. ES93FlatFieldVectorsWriter manages its own vector storage, and ES93FlatVectorsWriter uses VarHandles to access the delegate's IndexOutputs for flush. Merge, finish, and close remain delegated to Lucene99FlatVectorsWriter. Made-with: Cursor
IMO no. I'm exploring one last option, but for now I think it's better to leave this as-is.
Convert from interface to abstract class extending FlatFieldVectorsWriter<float[]> and add createFieldWriter factory method to ES818BinaryQuantizedVectorsWriter so a future es93 subclass can override field writer creation. Made-with: Cursor
Introduce ES93BinaryQuantizedVectorsWriter and ES93BinaryQuantizedFieldWriter so the BQ flush path is wired through ES93FlatFieldVectorsWriter. When that class migrates to MemorySegment storage, only ES93BinaryQuantizedFieldWriter needs to adapt. Made-with: Cursor
Introduce OffHeapFloatVectorStore and OffHeapByteVectorStore in simdvec using the MRJar pattern (stub in src/main, MemorySegment implementation in src/main21). ES93FlatFieldVectorsWriter now delegates vector storage to these stores, keeping data out of the Java heap. BQ normalization is performed in-place on the off-heap segments. Made-with: Cursor
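As a rough sketch of the per-vector off-heap storage idea described in this commit (class and method names below are illustrative, not the PR's actual API): each incoming vector is copied into its own segment inside one shared `Arena`, and normalization-style mutation happens in place on the segment.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: one shared Arena, one segment per vector,
// mirroring the per-vector allocation the commit describes.
final class SimpleOffHeapFloatStore implements AutoCloseable {
    private final Arena arena = Arena.ofShared();
    private final List<MemorySegment> vectors = new ArrayList<>();

    // Copy an incoming float[] into its own off-heap segment.
    MemorySegment addVector(float[] v) {
        MemorySegment seg = arena.allocate(ValueLayout.JAVA_FLOAT, v.length);
        MemorySegment.copy(v, 0, seg, ValueLayout.JAVA_FLOAT, 0, v.length);
        vectors.add(seg);
        return seg;
    }

    // In-place mutation on the segment itself, e.g. for magnitude scaling.
    void scale(int ord, float factor) {
        MemorySegment seg = vectors.get(ord);
        long n = seg.byteSize() / Float.BYTES;
        for (long i = 0; i < n; i++) {
            float x = seg.getAtIndex(ValueLayout.JAVA_FLOAT, i);
            seg.setAtIndex(ValueLayout.JAVA_FLOAT, i, x * factor);
        }
    }

    // Materialize a heap copy (the expensive round-trip the PR avoids).
    float[] copyOut(int ord) {
        return vectors.get(ord).toArray(ValueLayout.JAVA_FLOAT);
    }

    @Override public void close() { arena.close(); }
}
```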
Move WrappedNativeFloatVectors and WrappedNativeByteVectors to simdvec so the scorer can detect them. VectorScorerFactoryImpl now uses reflection to extract the captured list from Lucene's anonymous FloatVectorValues and, when it wraps an off-heap store, creates an OffHeapStoreAdapter that returns MemorySegments directly -- avoiding the MemorySegment-to-float[]-to-MemorySegment round-trip during HNSW graph construction. Made-with: Cursor
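The reflection step can be illustrated with a minimal, hypothetical extractor: scan the anonymous instance's declared fields for a captured `List` (javac stores captured locals in synthetic fields such as `val$vectors`). The extractor and the stand-in wrapper below are illustrative, not the PR's code.

```java
import java.lang.reflect.Field;
import java.util.List;

public class CapturedListExtractor {
    // Scan an anonymous class instance for a captured List-typed field.
    static List<?> findCapturedList(Object anonymous) {
        for (Field f : anonymous.getClass().getDeclaredFields()) {
            if (List.class.isAssignableFrom(f.getType())) {
                try {
                    f.setAccessible(true);
                    return (List<?>) f.get(anonymous);
                } catch (ReflectiveOperationException e) {
                    return null;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<float[]> vectors = List.of(new float[] {1f, 2f});
        // Stand-in for Lucene's anonymous FloatVectorValues wrapper:
        // the anonymous class captures `vectors` in a synthetic field.
        Runnable wrapper = new Runnable() {
            @Override public void run() { vectors.size(); }
        };
        System.out.println(findCapturedList(wrapper) == vectors); // true
    }
}
```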
Move MemorySegmentAccessor from a nested interface in FloatVectorScorerSupplier to a top-level interface shared by both float and byte scorer suppliers. Rename float-specific adapters (OffHeapFloatStoreAdapter, FloatMemorySegmentHeapAdapter, etc.) and add equivalent byte adapters. Refactor ByteVectorScorerSupplier to use MemorySegmentAccessor, enabling off-heap and heap fallback paths matching the float scorer pattern. Made-with: Cursor
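In miniature, the accessor pattern this commit describes looks something like the following (identifiers are illustrative, not the PR's actual names): the scorer asks for a segment per ordinal and does not care whether it lives on or off heap.

```java
import java.lang.foreign.MemorySegment;
import java.util.List;

// Sketch only: one interface, one adapter per storage flavor.
interface SegmentAccessor {
    MemorySegment segmentFor(int ord);
}

// Off-heap case: segments already exist, hand them out directly.
record OffHeapAdapter(List<MemorySegment> segments) implements SegmentAccessor {
    public MemorySegment segmentFor(int ord) { return segments.get(ord); }
}

// Heap fallback: wrap the float[] in a heap segment on demand.
record HeapAdapter(List<float[]> vectors) implements SegmentAccessor {
    public MemorySegment segmentFor(int ord) {
        return MemorySegment.ofArray(vectors.get(ord));
    }
}
```

The same shape works for byte vectors by swapping `float[]` for `byte[]`, which is what makes sharing the interface between the float and byte scorer suppliers attractive.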
Introduce vec_sqrf32_bulk_sparse, a native function that takes an array of pointers to individually-allocated vectors instead of a contiguous block. This enables bulk scoring directly from off-heap MemorySegments without materializing a contiguous buffer. Templatize sqrf32_inner_bulk in both aarch64 and amd64 to support the new sparse mapper alongside existing identity and array mappers. Wire through JdkVectorLibrary, Similarities, and into FloatVectorScorerSupplier via MemorySegmentAccessor's new segmentForEntriesOrNull method. OffHeapFloatStoreAdapter builds the native pointer array from its off-heap vector segments. Made-with: Cursor
I have just updated #142812 based on review feedback (including true bulk scoring, benchmark numbers, and a fix found during benchmarking for incremental ordinal handling). Since there’s overlap between the two efforts, I would love to collaborate and help converge on whichever path the team prefers. If the team is going with this as the main path, I’m happy to help port over relevant pieces from #142812.



Summary
Enable native vector scoring during HNSW graph construction by storing vectors off-heap in `MemorySegment`s and providing direct native access to them, bypassing Lucene's opaque wrappers. This yields a reduction in indexing time.

Motivation

Previously, native vector scoring (via `VectorScorerFactoryImpl`) was only available for the search path, where vectors are memory-mapped from disk via `MemorySegmentAccessInput`. During indexing, HNSW graph construction used Lucene's `DefaultFlatVectorScorer` with on-heap `float[]` arrays -- missing out on our optimized SIMD implementations.

The challenge: Lucene's `Lucene99HnswVectorsWriter` calls `FloatVectorValues.fromFloats(flatFieldVectorsWriter.getVectors())`, creating an anonymous wrapper that hides the underlying vector storage. We needed to (a) control the storage, (b) make it off-heap so native code can access it, and (c) bridge through Lucene's opaque wrappers.
Approach
The work progressed through several layered steps:
Step 0: prepare `FloatVectorScorerSupplier` to work without `MemorySegmentAccessInput` -- Introduce a `MemorySegmentAccessor` abstraction in `FloatVectorScorerSupplier` to decouple it from `MemorySegmentAccessInput`, allowing native vector scoring for both memory-mapped and heap-resident data, with adapters for concrete cases. Update `ES93FlatVectorScorer` to attempt native scoring even when `HasIndexSlice` is not available (i.e., on-heap vectors during initial HNSW graph construction), and delegate raw float scoring to `ES93FlatVectorScorer` instead of Lucene's default scorer.

Step 1: Own the indexing path -- Rather than forking `Lucene99FlatVectorsWriter` entirely, we created `ES93FlatVectorsWriter` as a delegating wrapper that uses VarHandles to access the delegate's private `meta`/`vectorData` `IndexOutput` fields. This lets us override `addField()` and `flush()` while keeping `mergeOneField`/`mergeOneFieldToIndex` delegated. A writer instance either does the indexing path or the merge path, never both.
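The VarHandle technique itself is standard JDK API. A self-contained sketch with a stand-in delegate class (the real code targets `Lucene99FlatVectorsWriter`'s private `meta`/`vectorData` fields; the class and field here are illustrative):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class PrivateFieldAccess {
    // Stand-in for the Lucene delegate with a private field.
    static class Delegate {
        private String vectorData = "out.vec";
    }

    // privateLookupIn grants access to Delegate's private members;
    // the resulting VarHandle reads the field on any instance.
    static String readVectorData(Delegate d) throws Exception {
        VarHandle vh = MethodHandles
            .privateLookupIn(Delegate.class, MethodHandles.lookup())
            .findVarHandle(Delegate.class, "vectorData", String.class);
        return (String) vh.get(d);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readVectorData(new Delegate())); // prints "out.vec"
    }
}
```

The VarHandle is resolved once and reused, so the per-call cost is a plain field read; the fragile part is the field name, which must be re-checked on Lucene upgrades.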
Step 2: Refactor BQ writer hierarchy -- Converted `BinaryQuantizedVectorsFieldWriter` from an interface to an abstract class extending `FlatFieldVectorsWriter<float[]>`, adding a `createFieldWriter` factory method to `ES818BinaryQuantizedVectorsWriter` for subclass overrides (used by `ES93BinaryQuantizedVectorsWriter`).

Step 3: Off-heap vector storage -- Created `OffHeapFloatVectorStore` and `OffHeapByteVectorStore` using the MRJar pattern (`src/main` stubs, `src/main21` MemorySegment-backed implementations in `simdvec`). Each vector is individually allocated in a shared `Arena`. `ES93FlatFieldVectorsWriter` uses these stores, with in-place `normalizeByMagnitudes` operating directly on off-heap segments.

Step 4: Bridge off-heap store to scorer -- The key insight: `getVectors()` returns named `AbstractList` wrappers (`WrappedNativeFloatVectors`, `WrappedNativeByteVectors` in `simdvec`) that expose the underlying store via `getStore()`. `VectorScorerFactoryImpl` uses reflection to look through Lucene's anonymous `FloatVectorValues` wrapper, find the captured list, and extract the `OffHeapFloatVectorStore` directly -- eliminating the round-trip of materializing `float[]` from off-heap just to wrap them back into `MemorySegment`.

Step 5: Shared `MemorySegmentAccessor` -- Extracted from `FloatVectorScorerSupplier`'s nested interface to a top-level interface shared by both float and byte scorer suppliers. Refactored `ByteVectorScorerSupplier` from `MemorySegmentAccessInput` to `MemorySegmentAccessor`, enabling the same three-tier resolution (IndexInput -> off-heap store -> heap fallback) for byte vectors.

Step 6: Native sparse bulk scoring -- Added `vec_sqrf32_bulk_sparse`, a native function taking `f32_t* const*` (an array of pointers to individually-allocated vectors) instead of a contiguous block. This is the "scatter" pattern: templatized `sqrf32_inner_bulk` on `TData` with a new mapper signature. The heap adapter can't use this path (the JDK docs state that `MemorySegment.ofArray` addresses can't populate a native pointer struct), but `OffHeapFloatStoreAdapter` can because its vectors are truly off-heap. `MemorySegmentAccessor.segmentForEntriesOrNull(int[], Arena)` builds the pointer array; the `Arena` parameter is caller-managed for clean lifecycle control.
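A sketch of the Java side of the pointer-array construction (JDK 22 FFM API): allocate an `ADDRESS` array in the caller-managed `Arena` and store each selected vector's address into it. The class and method names below are illustrative; only `segmentForEntriesOrNull`'s shape is taken from the PR.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.List;

public class SparsePointerArray {
    // Build a native f32_t* const* style array: one pointer slot per
    // selected ordinal, pointing at that vector's off-heap segment.
    // The Arena is caller-managed, so the pointer array's lifetime is
    // controlled by whoever drives the bulk-scoring call.
    static MemorySegment pointerArray(List<MemorySegment> vectors, int[] ords, Arena arena) {
        MemorySegment ptrs = arena.allocate(ValueLayout.ADDRESS, ords.length);
        for (int i = 0; i < ords.length; i++) {
            ptrs.setAtIndex(ValueLayout.ADDRESS, i, vectors.get(ords[i]));
        }
        return ptrs;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment v0 = arena.allocate(ValueLayout.JAVA_FLOAT, 4);
            MemorySegment v1 = arena.allocate(ValueLayout.JAVA_FLOAT, 4);
            MemorySegment ptrs = pointerArray(List.of(v0, v1), new int[] {1, 0}, arena);
            // Slot 0 holds the address of vector ordinal 1.
            System.out.println(ptrs.getAtIndex(ValueLayout.ADDRESS, 0).address() == v1.address());
        }
    }
}
```

This only works because each vector segment is natively allocated; the same loop over `MemorySegment.ofArray`-backed segments would throw, which is exactly why the heap adapter cannot take this path.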
Design decisions
- VarHandle delegation vs forking: we could have copied `Lucene99FlatVectorsWriter` entirely, but the VarHandle approach is less code and easier to maintain across Lucene upgrades -- we only own the indexing path.
- Reflection through Lucene's anonymous `FloatVectorValues` is pragmatic. A cleaner long-term solution (e.g., a custom `HnswVectorsWriter`) may replace it later.
- MRJar for `MemorySegment` APIs: `java.lang.foreign` is preview in JDK 21, finalized in JDK 22. `OffHeapFloatVectorStore`, `OffHeapByteVectorStore`, and `ArenaUtil` use `src/main` stubs + `src/main21`/`src/main22` implementations.
- One `MemorySegment` per vector (vs one large contiguous segment), because vectors arrive one at a time during indexing. This naturally leads to the sparse bulk scoring approach.
Benchmark results
GIST-1M, 200k docs, float32, euclidean, HNSW(m=16, efC=200), BQ:
Known limitations
- `Int7SQVectorScorerSupplier` is not covered (but we need to check whether we need it).

Test plan
- `libs:simdvec` tests pass
- `libs:native` tests pass
- (`testToString` expectations)

Co-authored with Cursor