Move max vector dims limit to Codec #12436
Conversation
Move vector max dimension limits enforcement into the default Codec's KnnVectorsFormat implementation. This allows different implementations of knn search algorithms to define their own limits on the maximum vector dimensions that they can handle. Closes apache#12309
    pf.fieldName,
    s.vectorDimension,
    indexWriterConfig.getCodec().knnVectorsFormat().getMaxDimensions());
}
This is probably not going to do what we want when a PerFieldKnnVectorsFormat is used, as this would check the limit on PerFieldKnnVectorsFormat, rather than on the actual format that is used for pf.fieldName. Maybe getMaxDimensions should be on KnnVectorsWriter and we could forward to pf.knnVectorsWriter here for checking?
Actually my suggestion wouldn't work, as the writer would already be created with the number of dimensions of the field type when we run the check. So I guess we either need to add the field name to getMaxDimensions or make the codec responsible for performing the check rather than IndexingChain.
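To illustrate the per-field concern, here is a standalone sketch of why the limit has to be resolved for the format actually used by each field; the class and method names below are made-up stand-ins for KnnVectorsFormat and PerFieldKnnVectorsFormat, not Lucene's real API:

```java
import java.util.Map;

// Hypothetical stand-in for a vectors format that knows its dimension cap.
interface FormatSketch {
    int getMaxDimensions(String fieldName);
}

// Hypothetical stand-in for a per-field wrapper: it must forward the
// question to the format configured for that field, instead of answering
// with a single wrapper-level limit.
final class PerFieldFormatSketch implements FormatSketch {
    private final Map<String, FormatSketch> perField;
    private final FormatSketch fallback;

    PerFieldFormatSketch(Map<String, FormatSketch> perField, FormatSketch fallback) {
        this.perField = perField;
        this.fallback = fallback;
    }

    @Override
    public int getMaxDimensions(String fieldName) {
        // Delegate to the per-field format if one is configured.
        return perField.getOrDefault(fieldName, fallback).getMaxDimensions(fieldName);
    }
}
```

This is also why the field name eventually becomes a parameter of getMaxDimensions: without it, a wrapper format has no way to pick the right delegate.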
I worry that this adds a hashtable lookup on a hot code path. Maybe it's not that bad for vectors, which are slow to index anyway, but I'd rather avoid it. What about making the codec responsible for checking the limit? Something like below:
```diff
diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
index cb3e5ef8b10..6c365e53528 100644
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
@@ -108,6 +108,9 @@ public final class Lucene95HnswVectorsFormat extends KnnVectorsFormat {
   public static final int VERSION_START = 0;
   public static final int VERSION_CURRENT = VERSION_START;
 
+  /** A maximum number of vector dimensions supported by this codec */
+  public static final int MAX_DIMENSIONS = 1024;
+
   /**
    * A maximum configurable maximum max conn.
    *
@@ -177,7 +180,7 @@ public final class Lucene95HnswVectorsFormat extends KnnVectorsFormat {
   @Override
   public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
-    return new Lucene95HnswVectorsWriter(state, maxConn, beamWidth);
+    return new Lucene95HnswVectorsWriter(state, maxConn, beamWidth, MAX_DIMENSIONS);
   }
 
   @Override
diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
index 5358d66f16e..196f12a21ad 100644
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
@@ -60,13 +60,15 @@ public final class Lucene95HnswVectorsWriter extends KnnVectorsWriter {
   private final IndexOutput meta, vectorData, vectorIndex;
   private final int M;
   private final int beamWidth;
+  private final int maxDimension;
   private final List<FieldWriter<?>> fields = new ArrayList<>();
   private boolean finished;
 
-  Lucene95HnswVectorsWriter(SegmentWriteState state, int M, int beamWidth) throws IOException {
+  Lucene95HnswVectorsWriter(SegmentWriteState state, int M, int beamWidth, int maxDimension) throws IOException {
     this.M = M;
     this.beamWidth = beamWidth;
+    this.maxDimension = maxDimension;
     segmentWriteState = state;
     String metaFileName =
         IndexFileNames.segmentFileName(
@@ -117,6 +119,9 @@ public final class Lucene95HnswVectorsWriter extends KnnVectorsWriter {
   @Override
   public KnnFieldVectorsWriter<?> addField(FieldInfo fieldInfo) throws IOException {
+    if (fieldInfo.getVectorDimension() > maxDimension) {
+      throw new IllegalArgumentException("Number of dimensions " + fieldInfo.getVectorDimension() + " for field " + fieldInfo.name + " exceeds the limit of " + maxDimension);
+    }
     FieldWriter<?> newField =
         FieldWriter.create(fieldInfo, M, beamWidth, segmentWriteState.infoStream);
     fields.add(newField);
```
@jpountz Thank you for the additional feedback.
> I worry that this adds a hashtable lookup on a hot code path. Maybe it's not that bad for vectors, which are slow to index anyway, but I'd rather avoid it.
This is not really a hot code path. We ask for getCodec().knnVectorsFormat().getMaxDimensions in the initializeFieldInfo function, which happens only once per new field per segment.
> What about making the codec responsible for checking the limit?
Thanks for the suggestion, I experimented with this idea, and encountered the following difficulty with it:
- We need to create a new `FieldInfo` before passing it to `KnnFieldVectorsWriter<?> addField(FieldInfo fieldInfo)`.
- The way we create it is `FieldInfo fi = fieldInfos.add(...)`, by adding to the global fieldInfos. This means that if a `FieldInfo` contains an incorrect number of dimensions, it will be stored like this in the global fieldInfos, and we can't change it (for example with a second document with a correct number of dims).
Maybe as an alternative we can do validation as a separate method of KnnVectorsWriter: `public void validateFieldDims(int dims)`. What do you think?
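As a standalone model of that alternative — `validateFieldDims` is the method name proposed in the comment, but the surrounding class is a made-up stand-in, not Lucene code:

```java
// Hypothetical stand-in for KnnVectorsWriter with the proposed validation hook.
abstract class WriterSketch {
    // Per-format dimension cap; 1024 matches the default discussed for Lucene95.
    protected int maxDimensions() {
        return 1024;
    }

    // Would be called by the indexing chain BEFORE the FieldInfo is created,
    // so an over-limit field never reaches the global field infos.
    public void validateFieldDims(int dims) {
        if (dims > maxDimensions()) {
            throw new IllegalArgumentException(
                "Number of dimensions " + dims + " exceeds the limit of " + maxDimensions());
        }
    }
}
```

The point of the hook is timing: validation runs against the raw dimension count before any state is mutated, which keeps the check transactional.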
Ohhh thanks for explaining, I had not fully understood how your change worked. I like that we're retaining the property that the field info doesn't even get created if its number of dimensions is above the limit.
Yes this looks fine. The hashtable lookup is no issue at all.
As getMaxDimensions() is a public codec method, it lets us do the check here. A separate validateFieldDims() is not needed.
I only don't like the long call chain to actually get the vectors format from the indexWriterConfig. But I can live with that.
Add a field name to the getMaxDimensions function
jpountz left a comment
Thanks @mayya-sharipova, it took me time to understand how your change works but I think it's good, in particular the fact that it's transactional: a field will not make it to the IndexWriter's field infos if it has a number of dimensions above the limit.
@uschindler I'm curious what you think of this change since you seemed to support the idea of moving the limit to the codec in the past.
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
 * @param fieldName the field name
 * @return the maximum number of vector dimensions.
 */
public int getMaxDimensions(String fieldName) {
In the main branch this should be abstract; only in 9.x should we have a default impl.
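Sketched out, the two variants would look roughly like this (illustrative stand-in classes, not the actual KnnVectorsFormat):

```java
// main branch: no default, so every format must declare its own limit.
abstract class KnnVectorsFormatMainSketch {
    public abstract int getMaxDimensions(String fieldName);
}

// 9.x branch: keep a default so existing 9.x formats stay source-compatible.
abstract class KnnVectorsFormat9xSketch {
    static final int DEFAULT_MAX_DIMENSIONS = 1024; // the historical limit

    public int getMaxDimensions(String fieldName) {
        return DEFAULT_MAX_DIMENSIONS;
    }
}
```

The split keeps backwards compatibility in 9.x while forcing new formats on main to make an explicit choice.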
  return new FieldsReader(state);
}

@Override
Is this the only subclass of KnnVectorsFormat that we have?
In addition, we should also add an explicit number of dimensions in the backwards codecs, because at the time when they were implemented 1024 was their default. In the backwards codecs the method should be final, IMHO.
We have SimpleTextKnnVectorsFormat too.
        "f",
        new float[getVectorsMaxDimensions("f") + 1],
        VectorSimilarityFunction.DOT_PRODUCT));
Exception exc = expectThrows(IllegalArgumentException.class, () -> w.addDocument(doc));
So basically the check is now delayed to addDocument()? Great!
I was afraid that the check would come delayed while indexing or flushing is happening. So to me this looks good.
This looks fine to me. I am happy that we do not have stupid system properties. Somebody who wants to raise the limit (or lower it to 32 like in the test) can simply implement their own codec. One question I have: what happens if you open an index with a higher limit in its field infos and you use the default codec? I think this is unsupported, but in that case the implementor of the codec should possibly use their own codec name.
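As a standalone model of the "implement your own codec" route — all names below are illustrative stand-ins (in real code one would wrap the default codec and override its vectors format):

```java
// Hypothetical stand-in for a Codec exposing a per-field dimension cap.
interface CodecSketch {
    int getMaxDimensions(String fieldName);
}

// Default: the 1024 cap discussed above.
final class DefaultCodecSketch implements CodecSketch {
    @Override
    public int getMaxDimensions(String fieldName) {
        return 1024;
    }
}

// A custom codec that lowers the cap to 32, like the test codec mentioned.
final class TinyVectorsCodecSketch implements CodecSketch {
    @Override
    public int getMaxDimensions(String fieldName) {
        return 32;
    }
}

// Mirrors the PR's behavior: the check happens when the field is first seen
// during addDocument(), before any field info would be created.
final class IndexingChainSketch {
    private final CodecSketch codec;

    IndexingChainSketch(CodecSketch codec) {
        this.codec = codec;
    }

    void addVectorField(String name, float[] vector) {
        int max = codec.getMaxDimensions(name);
        if (vector.length > max) {
            throw new IllegalArgumentException(
                "Field " + name + " has " + vector.length + " dimensions, limit is " + max);
        }
        // ... hand off to the vectors writer ...
    }
}
```

Swapping the codec is the only knob: the indexing chain itself never hard-codes a limit.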
Thanks @jpountz and @uschindler for the reviews. I will do the following:
…ensionTooLarge Depending on whether a document with dimensions > maxDims is created on a new segment or an already existing one, we may get different error messages. This fix adds another possible error message we may get. Relates to apache#12436
This change doesn't touch the read logic, it only adds write-time validation. So if you first index vectors above the default limit using a custom codec that reuses the same codec name as the default codec and then open this index with the default codec, you will be able to perform reads, but writes will fail. Using a different codec name would certainly make things simpler, to force Lucene to use the same codec at read time instead of the default codec.
That's what I expected. Thanks.
+ "]"
+ "vector's dimensions must be <= ["
minor: #12605 proposes to have a space before the "vector's dimensions" words here.