Move max vector dims limit to Codec #12436
Conversation
Move vector max dimension limits enforcement into the default Codec's KnnVectorsFormat implementation. This allows different implementations of knn search algorithms to define their own limits on the maximum vector dimensions that they can handle. Closes apache#12309
    pf.fieldName,
    s.vectorDimension,
    indexWriterConfig.getCodec().knnVectorsFormat().getMaxDimensions());
}
This is probably not going to do what we want when a PerFieldKnnVectorsFormat is used, as this would check the limit on PerFieldKnnVectorsFormat, rather than on the actual format that is used for pf.fieldName. Maybe getMaxDimensions should be on KnnVectorsWriter and we could forward to pf.knnVectorsWriter here for checking?
Actually my suggestion wouldn't work, as the writer would already be created with the number of dimensions of the field type when we run the check. So I guess we either need to add the field name to getMaxDimensions or make the codec responsible for performing the check rather than IndexingChain.
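To illustrate the per-field concern, here is a standalone sketch of why the limit has to be resolved for the format actually used by each field; the class and method names below are made-up stand-ins for KnnVectorsFormat and PerFieldKnnVectorsFormat, not Lucene's real API:

```java
import java.util.Map;

// Hypothetical stand-in for a vectors format that knows its dimension cap.
interface FormatSketch {
    int getMaxDimensions(String fieldName);
}

// Hypothetical stand-in for a per-field wrapper: it must forward the
// question to the format configured for that field, instead of answering
// with a single wrapper-level limit.
final class PerFieldFormatSketch implements FormatSketch {
    private final Map<String, FormatSketch> perField;
    private final FormatSketch fallback;

    PerFieldFormatSketch(Map<String, FormatSketch> perField, FormatSketch fallback) {
        this.perField = perField;
        this.fallback = fallback;
    }

    @Override
    public int getMaxDimensions(String fieldName) {
        // Delegate to the per-field format if one is configured.
        return perField.getOrDefault(fieldName, fallback).getMaxDimensions(fieldName);
    }
}
```

This is also why the field name eventually becomes a parameter of getMaxDimensions: without it, a wrapper format has no way to pick the right delegate.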
I worry that this adds a hashtable lookup on a hot code path. Maybe it's not that bad for vectors, which are slow to index anyway, but I'd rather avoid it. What about making the codec responsible for checking the limit? Something like below:
```diff
diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
index cb3e5ef8b10..6c365e53528 100644
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
@@ -108,6 +108,9 @@ public final class Lucene95HnswVectorsFormat extends KnnVectorsFormat {
   public static final int VERSION_START = 0;
   public static final int VERSION_CURRENT = VERSION_START;
 
+  /** A maximum number of vector dimensions supported by this codec */
+  public static final int MAX_DIMENSIONS = 1024;
+
   /**
    * A maximum configurable maximum max conn.
    *
@@ -177,7 +180,7 @@ public final class Lucene95HnswVectorsFormat extends KnnVectorsFormat {
   @Override
   public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
-    return new Lucene95HnswVectorsWriter(state, maxConn, beamWidth);
+    return new Lucene95HnswVectorsWriter(state, maxConn, beamWidth, MAX_DIMENSIONS);
   }
 
   @Override
diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
index 5358d66f16e..196f12a21ad 100644
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
@@ -60,13 +60,15 @@ public final class Lucene95HnswVectorsWriter extends KnnVectorsWriter {
   private final IndexOutput meta, vectorData, vectorIndex;
   private final int M;
   private final int beamWidth;
+  private final int maxDimension;
   private final List<FieldWriter<?>> fields = new ArrayList<>();
   private boolean finished;
 
-  Lucene95HnswVectorsWriter(SegmentWriteState state, int M, int beamWidth) throws IOException {
+  Lucene95HnswVectorsWriter(SegmentWriteState state, int M, int beamWidth, int maxDimension) throws IOException {
     this.M = M;
     this.beamWidth = beamWidth;
+    this.maxDimension = maxDimension;
     segmentWriteState = state;
     String metaFileName =
         IndexFileNames.segmentFileName(
@@ -117,6 +119,9 @@ public final class Lucene95HnswVectorsWriter extends KnnVectorsWriter {
   @Override
   public KnnFieldVectorsWriter<?> addField(FieldInfo fieldInfo) throws IOException {
+    if (fieldInfo.getVectorDimension() > maxDimension) {
+      throw new IllegalArgumentException("Number of dimensions " + fieldInfo.getVectorDimension() + " for field " + fieldInfo.name + " exceeds the limit of " + maxDimension);
+    }
     FieldWriter<?> newField =
         FieldWriter.create(fieldInfo, M, beamWidth, segmentWriteState.infoStream);
     fields.add(newField);
```
@jpountz Thank you for the additional feedback.
> I worry that this adds a hashtable lookup on a hot code path. Maybe it's not that bad for vectors, which are slow to index anyway, but I'd rather avoid it.
This is not really a hot code path. We ask for getCodec().knnVectorsFormat().getMaxDimensions in the initializeFieldInfo function, which happens only once per new field per segment.
> What about making the codec responsible for checking the limit?
Thanks for the suggestion, I experimented with this idea, and encountered the following difficulty with it:
- We need to create a new `FieldInfo` before passing it to `KnnFieldVectorsWriter<?> addField(FieldInfo fieldInfo)`.
- The way we create it is `FieldInfo fi = fieldInfos.add(...)`, by adding to the global fieldInfos. This means that if a `FieldInfo` contains an incorrect number of dimensions, it will be stored like this in the global fieldInfos, and we can't change it (for example with a second document with a correct number of dims).
Maybe as an alternative we can do validation as a separate method of KnnVectorsWriter: `public void validateFieldDims(int dims)`. What do you think?
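As a standalone model of that alternative — `validateFieldDims` is the method name proposed in the comment, but the surrounding class is a made-up stand-in, not Lucene code:

```java
// Hypothetical stand-in for KnnVectorsWriter with the proposed validation hook.
abstract class WriterSketch {
    // Per-format dimension cap; 1024 matches the default discussed for Lucene95.
    protected int maxDimensions() {
        return 1024;
    }

    // Would be called by the indexing chain BEFORE the FieldInfo is created,
    // so an over-limit field never reaches the global field infos.
    public void validateFieldDims(int dims) {
        if (dims > maxDimensions()) {
            throw new IllegalArgumentException(
                "Number of dimensions " + dims + " exceeds the limit of " + maxDimensions());
        }
    }
}
```

The point of the hook is timing: validation runs against the raw dimension count before any state is mutated, which keeps the check transactional.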
Ohhh thanks for explaining, I had not fully understood how your change worked. I like that we're retaining the property that the field info doesn't even get created if its number of dimensions is above the limit.
Yes this looks fine. The hashtable lookup is no issue at all.
As getMaxDimensions() is a public codec method, it lets us do the check here. A separate validateFieldDims() is not needed.
I only don't like the long call chain to actually get the vectors format from the indexWriterConfig. But I can live with that.
Add a field name to the getMaxDimensions function
jpountz left a comment
Thanks @mayya-sharipova, it took me time to understand how your change works but I think it's good, in particular the fact that it's transactional: a field will not make it to the IndexWriter's field infos if it has a number of dimensions above the limit.
@uschindler I'm curious what you think of this change since you seemed to support the idea of moving the limit to the codec in the past.
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
 * @param fieldName the field name
 * @return the maximum number of vector dimensions.
 */
public int getMaxDimensions(String fieldName) {
In the main branch this should be abstract; only in 9.x should we have a default impl.
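Sketched out, the two variants would look roughly like this (illustrative stand-in classes, not the actual KnnVectorsFormat):

```java
// main branch: no default, so every format must declare its own limit.
abstract class KnnVectorsFormatMainSketch {
    public abstract int getMaxDimensions(String fieldName);
}

// 9.x branch: keep a default so existing 9.x formats stay source-compatible.
abstract class KnnVectorsFormat9xSketch {
    static final int DEFAULT_MAX_DIMENSIONS = 1024; // the historical limit

    public int getMaxDimensions(String fieldName) {
        return DEFAULT_MAX_DIMENSIONS;
    }
}
```

The split keeps backwards compatibility in 9.x while forcing new formats on main to make an explicit choice.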
  return new FieldsReader(state);
}

@Override
Is this the only subclass of KnnVectorsFormat that we have?
In addition, we should also add an explicit number of dimensions in the backwards codecs, because at the time when they were implemented 1024 was their default. In the backwards codecs the method should be final, IMHO.
We have SimpleTextKnnVectorsFormat too.
        "f",
        new float[getVectorsMaxDimensions("f") + 1],
        VectorSimilarityFunction.DOT_PRODUCT));
Exception exc = expectThrows(IllegalArgumentException.class, () -> w.addDocument(doc));
So basically the check is now delayed to addDocument()? Great!
I was afraid that the check would come delayed while indexing or flushing is happening. So to me this looks good.
This looks fine to me. I am happy that we do not have stupid system properties. Somebody who wants to raise the limit (or lower it to 32 like in the test) can simply implement their own codec. One question I have: what happens if you open an index with a higher limit in its field infos and you use the default codec? I think this is unsupported, but in that case the implementor of the codec should possibly use their own codec name.
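As a standalone model of the "implement your own codec" route — all names below are illustrative stand-ins (in real code one would wrap the default codec and override its vectors format):

```java
// Hypothetical stand-in for a Codec exposing a per-field dimension cap.
interface CodecSketch {
    int getMaxDimensions(String fieldName);
}

// Default: the 1024 cap discussed above.
final class DefaultCodecSketch implements CodecSketch {
    @Override
    public int getMaxDimensions(String fieldName) {
        return 1024;
    }
}

// A custom codec that lowers the cap to 32, like the test codec mentioned.
final class TinyVectorsCodecSketch implements CodecSketch {
    @Override
    public int getMaxDimensions(String fieldName) {
        return 32;
    }
}

// Mirrors the PR's behavior: the check happens when the field is first seen
// during addDocument(), before any field info would be created.
final class IndexingChainSketch {
    private final CodecSketch codec;

    IndexingChainSketch(CodecSketch codec) {
        this.codec = codec;
    }

    void addVectorField(String name, float[] vector) {
        int max = codec.getMaxDimensions(name);
        if (vector.length > max) {
            throw new IllegalArgumentException(
                "Field " + name + " has " + vector.length + " dimensions, limit is " + max);
        }
        // ... hand off to the vectors writer ...
    }
}
```

Swapping the codec is the only knob: the indexing chain itself never hard-codes a limit.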
Thanks @jpountz and @uschindler for the reviews. I will do the following:
…ensionTooLarge Depending on whether a document with dimensions > maxDims is created on a new segment or an already existing one, we may get different error messages. This fix adds another possible error message we may get. Relates to apache#12436
This change doesn't touch the read logic, it only adds write-time validation. So if you first index vectors above the default limit using a custom codec that reuses the same codec name as the default codec and then open this index with the default codec, you will be able to perform reads, but writes will fail. Using a different codec name would certainly make things simpler, to force Lucene to use the same codec at read time instead of the default codec.
That's what I expected. Thanks.
+ "]"
+ "vector's dimensions must be <= ["
minor: #12605 proposes to have a space before the "vector's dimensions" words here.