Extend dense_vector to support indexing vectors#78491
Extend dense_vector to support indexing vectors#78491jtibshirani merged 7 commits intoelastic:masterfrom
dense_vector to support indexing vectors#78491Conversation
|
Some notes on follow-ups:
I also ran a quick benchmark and noticed that script scoring with indexed vectors is significantly slower than vectors based on doc values. For example on a small dataset QPS went from |
|
Pinging @elastic/es-search (Team:Search) |
| } | ||
|
|
||
| private Field parseKnnVector(DocumentParserContext context) throws IOException { | ||
| float[] vector = new float[dims]; |
There was a problem hiding this comment.
it would be nice (may be for future) to create float[] vector = new float[dims] once in DenseVectorFieldMapper constructor and they reuse it instead of creating a new array each time during parsing. Especially considering that Lucene's VectorValuesWriter would anyway make a separate copy: vectors.add(ArrayUtil.copyOfSubArray(vectorValue, 0, vectorValue.length)). But I don't know how if all this is done within a single thread.
There was a problem hiding this comment.
I guess we can keep an eye on this in indexing benchmarks. Like you said, the indexing logic often copies field values, so I don't expect this array creation to be a bottleneck.
mayya-sharipova
left a comment
There was a problem hiding this comment.
@jtibshirani Thanks Julie, great work! This looks almost perfect to me. I've just left a couple of questions to clarify things.
One thing about API that is a little bothering me is how we define similarity in a mapping for a vector field, but at the same time can apply different similarity functions in scripts for it. This looks to be confusing, but may be good documentation should be enough to resolve this confusion.
"index": true,
"similarity": "l2_norm"
|
|
||
| @Override | ||
| public void setNextDocId(int docId) throws IOException { | ||
| int currentDoc = in.docID(); |
There was a problem hiding this comment.
very nice implementation of advanceExact!
There was a problem hiding this comment.
This was a little tricky to get right, I wonder if it'd be helpful to add it to VectorValues in Lucene.
| return new DenseVectorScriptDocValues(values, indexVersion, dims); | ||
| if (indexed) { | ||
| VectorValues values = reader.getVectorValues(field); | ||
| if (values == null || values == VectorValues.EMPTY) { |
There was a problem hiding this comment.
Should values be always not null and always be not empty if a vector field is indexed?
There was a problem hiding this comment.
Even if there's a vector field in the mapping, the segment could have no documents with that field. Depending on the exact context, Lucene could return either null or an empty instance in this case.
There was a problem hiding this comment.
Thanks for explanation! I can see how we can get null. For this may be later, we can do something similar to DocValues.getBinary that never returns null.
But I think we should never get VectorValues.EMPTY ? No?
There was a problem hiding this comment.
Lucene might return VectorValues.EMPTY e.g. if a field has been indexed with vectors and later all docs that had a vector indexed got merged away. Since the field exists in FieldInfos and has vectors enabled, it is illegal for Lucene to return null in that case.
.../vectors/src/main/java/org/elasticsearch/xpack/vectors/query/DenseVectorScriptDocValues.java
Show resolved
Hide resolved
x-pack/plugin/vectors/src/main/java/org/elasticsearch/xpack/vectors/query/ScoreScriptUtils.java
Show resolved
Hide resolved
| import java.io.IOException; | ||
| import java.util.Objects; | ||
|
|
||
| public class KnnVectorFieldExistsQuery extends Query { |
There was a problem hiding this comment.
I was also thinking of moving this to Lucene eventually.
There was a problem hiding this comment.
Now that we enforce schema consistency, we should hopefully be able to have a single FieldExistsQuery that would look at doc values / norms / vectors depending on the FieldInfo to know how to iterate over documents that have a value.
There was a problem hiding this comment.
Good idea. I'll revisit this at a later time.
|
Thanks @mayya-sharipova for reviewing, I tried to address all your comments.
I agree it could be confusing. Another option is to rename |
mayya-sharipova
left a comment
There was a problem hiding this comment.
@jtibshirani This overall LGTM!
Just one thing, I think in VectorDVLeafFieldData we should never get VectorValues.EMPTY? NO?
We can indeed receive |
|
@jtibshirani Thanks for the explanation, Julie, makes sense! All is good from my side! |
This PR extends the
dense_vectortype to allow vectors to be added to an ANN index:A description of the parameters:
index. Setting this totrueindicates the field should be added to the ANN index. The values will be parsed as aKnnVectorFieldinstead of a doc values field. By defaultindex: falseto provide a smooth transition from 7.x, where vectors are not indexed.similarity. Whenindex: true, it's required to specify what similarity to use when indexing the vectors. Right now the accepted values arel2_normanddot_product, which matches the Lucene options. (We decided to requiresimilarityto be set since there's no default choice that works in general, and it's easy to overlook and accidentally get poor results.)Indexed vectors still support the same functionality as vectors based on doc values -- they work with vector script functions and
existsqueries.Relates to #78473.