Extend `dense_vector` to support indexing vectors by jtibshirani · Pull Request #78491 · elastic/elasticsearch

jtibshirani · 2021-09-29T23:49:35Z

This PR extends the dense_vector type to allow vectors to be added to an ANN index:

"mappings": {
  "properties": {
    "my_vector": {
      "type": "dense_vector",
      "dims": 128,
      "index": true,
      "similarity": "l2_norm"
    }
  }
}

A description of the parameters:

index. Setting this to true indicates the field should be added to the ANN index. The values will be parsed as a KnnVectorField instead of a doc values field. By default index: false to provide a smooth transition from 7.x, where vectors are not indexed.
similarity. When index: true, it's required to specify what similarity to use when indexing the vectors. Right now the accepted values are l2_norm and dot_product, which matches the Lucene options. (We decided to require similarity to be set since there's no default choice that works in general, and it's easy to overlook and accidentally get poor results.)

Indexed vectors still support the same functionality as vectors based on doc values -- they work with vector script functions and exists queries.

Relates to #78473.

jtibshirani · 2021-09-29T23:59:45Z

Some notes on follow-ups:

Right now similarity can be l2_norm or dot_product, matching the Lucene options. I'd like to investigate whether it'd be better to support cosine similarity. I think the idea in Lucene is that dot product should be used to implement a fast cosine similarity, by normalizing all vectors to unit length beforehand. My concern is that dot product is not a true metric, and users can receive confusing results if they forget to normalize the vectors.
Text fields also have an option called similarity. I think this is fine, and even provides a nice analogy. But we'll need to adjust the docs to clarify that similarity doesn't just apply to text types.
It proved tricky to allow the HNSW parameters to be configured through the mapping. I'll tackle that in a future PR.

I also ran a quick benchmark and noticed that script scoring with indexed vectors is significantly slower than vectors based on doc values. For example on a small dataset QPS went from 455.143 to 273.596 when scoring all vectors. I don't think this is a big deal, since this feature is focused on ANN and not exact-match kNN. But it did raise some ideas for speeding up the distance computations used in Lucene.

elasticmachine · 2021-09-30T11:47:10Z

Pinging @elastic/es-search (Team:Search)

mayya-sharipova · 2021-10-01T21:33:37Z

...gin/vectors/src/main/java/org/elasticsearch/xpack/vectors/mapper/DenseVectorFieldMapper.java

+    }
+
+    private Field parseKnnVector(DocumentParserContext context) throws IOException {
+        float[] vector = new float[dims];


it would be nice (may be for future) to create float[] vector = new float[dims] once in DenseVectorFieldMapper constructor and they reuse it instead of creating a new array each time during parsing. Especially considering that Lucene's VectorValuesWriter would anyway make a separate copy: vectors.add(ArrayUtil.copyOfSubArray(vectorValue, 0, vectorValue.length)). But I don't know how if all this is done within a single thread.

I guess we can keep an eye on this in indexing benchmarks. Like you said, the indexing logic often copies field values, so I don't expect this array creation to be a bottleneck.

mayya-sharipova

@jtibshirani Thanks Julie, great work! This looks almost perfect to me. I've just left a couple of questions to clarify things.

One thing about API that is a little bothering me is how we define similarity in a mapping for a vector field, but at the same time can apply different similarity functions in scripts for it. This looks to be confusing, but may be good documentation should be enough to resolve this confusion.

      "index": true,
      "similarity": "l2_norm"

mayya-sharipova · 2021-10-02T22:23:45Z

...ctors/src/main/java/org/elasticsearch/xpack/vectors/query/KnnDenseVectorScriptDocValues.java

+
+    @Override
+    public void setNextDocId(int docId) throws IOException {
+        int currentDoc = in.docID();


very nice implementation of advanceExact!

This was a little tricky to get right, I wonder if it'd be helpful to add it to VectorValues in Lucene.

mayya-sharipova · 2021-10-02T22:46:47Z

...lugin/vectors/src/main/java/org/elasticsearch/xpack/vectors/query/VectorDVLeafFieldData.java

-            return new DenseVectorScriptDocValues(values, indexVersion, dims);
+            if (indexed) {
+                VectorValues values = reader.getVectorValues(field);
+                if (values == null || values == VectorValues.EMPTY) {


Should values be always not null and always be not empty if a vector field is indexed?

Even if there's a vector field in the mapping, the segment could have no documents with that field. Depending on the exact context, Lucene could return either null or an empty instance in this case.

Thanks for explanation! I can see how we can get null. For this may be later, we can do something similar to DocValues.getBinary that never returns null.

But I think we should never get VectorValues.EMPTY ? No?

Lucene might return VectorValues.EMPTY e.g. if a field has been indexed with vectors and later all docs that had a vector indexed got merged away. Since the field exists in FieldInfos and has vectors enabled, it is illegal for Lucene to return null in that case.

.../vectors/src/main/java/org/elasticsearch/xpack/vectors/query/DenseVectorScriptDocValues.java

x-pack/plugin/vectors/src/main/java/org/elasticsearch/xpack/vectors/query/ScoreScriptUtils.java

jtibshirani · 2021-10-04T18:03:31Z

...n/vectors/src/main/java/org/elasticsearch/xpack/vectors/query/KnnVectorFieldExistsQuery.java

+import java.io.IOException;
+import java.util.Objects;
+
+public class KnnVectorFieldExistsQuery extends Query {


I was also thinking of moving this to Lucene eventually.

Now that we enforce schema consistency, we should hopefully be able to have a single FieldExistsQuery that would look at doc values / norms / vectors depending on the FieldInfo to know how to iterate over documents that have a value.

Good idea. I'll revisit this at a later time.

jtibshirani · 2021-10-04T18:19:12Z

Thanks @mayya-sharipova for reviewing, I tried to address all your comments.

One thing about API that is a little bothering me is how we define similarity in a mapping for a vector field, but at the same time can apply different similarity functions in scripts for it.

I agree it could be confusing. Another option is to rename similarity to something like index_similarity, or move it under a new index_options section. But I like the current naming, it is simple and corresponds nicely to similarity for text fields. Maybe we can just clearly document this? Going forward it may be less common to use vector script functions, in favor of ANN plus built-in ways for accessing vector similarity scores... so it could become less of a concern.

mayya-sharipova

@jtibshirani This overall LGTM!

Just one thing, I think in VectorDVLeafFieldData we should never get VectorValues.EMPTY? NO?

jtibshirani · 2021-10-04T23:43:11Z

Just one thing, I think in VectorDVLeafFieldData we should never get VectorValues.EMPTY? NO?

We can indeed receive VectorValues.EMPTY (it popped up in tests). My understanding is that if a segment contains no documents with a field, then LeafReader#getVectorValues is allowed to return either null or VectorValues.EMPTY. This applies to other data structures too, the accessor methods can return either null or an empty instance.

mayya-sharipova · 2021-10-04T23:58:53Z

@jtibshirani Thanks for the explanation, Julie, makes sense! All is good from my side!

jtibshirani added 6 commits September 29, 2021 15:18

Add missing .vex file extension

afad19a

Add an 'index' option to enable KNN indexing

88a7e95

Move vector function logic to doc values class

2863fde

Implement vector script functions

76adc9f

Add similarity parameter

97795f4

Support exists queries

326427b

elasticsearchmachine added the v8.0.0 label Sep 29, 2021

jtibshirani mentioned this pull request Sep 30, 2021

Integrate ANN search #78473

Closed

17 tasks

jtibshirani marked this pull request as ready for review September 30, 2021 01:02

pgomulka added the :Search Foundations/Mapping Index mappings, including merging and defining field types label Sep 30, 2021

elasticmachine added the Team:Search Meta label for search team label Sep 30, 2021

Merge remote-tracking branch 'upstream/master' into dense-vector

7871394

jtibshirani requested review from mayya-sharipova and romseygeek October 1, 2021 17:42

mayya-sharipova reviewed Oct 1, 2021

View reviewed changes

mayya-sharipova reviewed Oct 2, 2021

View reviewed changes

jtibshirani commented Oct 4, 2021

View reviewed changes

mayya-sharipova approved these changes Oct 4, 2021

View reviewed changes

jtibshirani merged commit 9ce1a29 into elastic:master Oct 5, 2021

jtibshirani deleted the dense-vector branch October 5, 2021 00:11

jakelandis added v8.0.0-beta1 >enhancement and removed v8.0.0 labels Oct 27, 2021

Conversation

jtibshirani commented Sep 29, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jtibshirani commented Sep 29, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elasticmachine commented Sep 30, 2021

Uh oh!

mayya-sharipova Oct 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mayya-sharipova left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jpountz Oct 12, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jtibshirani commented Oct 4, 2021

Uh oh!

mayya-sharipova left a comment

Choose a reason for hiding this comment

Uh oh!

jtibshirani commented Oct 4, 2021

Uh oh!

mayya-sharipova commented Oct 4, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

jtibshirani commented Sep 29, 2021 •

edited

Loading

jtibshirani commented Sep 29, 2021 •

edited

Loading

mayya-sharipova Oct 1, 2021 •

edited

Loading

jpountz Oct 12, 2021 •

edited

Loading

mayya-sharipova commented Oct 4, 2021 •

edited

Loading