Conversation
}

@Override
public void search(String field, float[] target, KnnCollector knnCollector, Bits acceptDocs) throws IOException {
Hmm, as I work with this I am realizing this search interface is not really what I'm after... basically I would like to be able to combine a DocIdSet from a Query with a FlatVectorScorer so that I can score the documents using a quantized dot-product. So I think what I need is a RandomVectorScorer returned from the FlatVectorsReader, plus some way to go from docid to ord for my vector field. But it seems we only support ordToDoc now and not the other way?
@msokolov getByteVectorValues and getFloatVectorValues already do this. They provide an iterator that will return a VectorScorer, which can score and forward-iterate via doc ids (it supports advance, next, etc.).
Also, with the "flat codecs" it doesn't make much sense to gather up the accepted docs, place the results in a collector, etc. You want to score while you iterate. It would likely be way cheaper than doing it all up front like knn is doing.
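For illustration, a minimal sketch of "score while you iterate", assuming the 9.x-style API in which FloatVectorValues is itself a DocIdSetIterator; the field name "vec" and the use of the raw DOT_PRODUCT similarity (rather than the quantized scorer the flat reader would expose) are placeholders:

```java
import java.io.IOException;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.DocIdSetIterator;

final class IterateAndScore {
  /** Scores only the docs produced by {@code matches}, advancing the vector values alongside. */
  static void scoreMatches(LeafReader reader, DocIdSetIterator matches, float[] target)
      throws IOException {
    FloatVectorValues values = reader.getFloatVectorValues("vec"); // placeholder field name
    if (values == null) {
      return; // no vectors indexed for this field in this segment
    }
    for (int doc = matches.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = matches.nextDoc()) {
      int vectorDoc = values.docID();
      if (vectorDoc < doc) {
        vectorDoc = values.advance(doc);
      }
      if (vectorDoc == doc) {
        // raw float similarity; a quantized scorer would plug in here instead
        float score = VectorSimilarityFunction.DOT_PRODUCT.compare(target, values.vectorValue());
        System.out.println(doc + " -> " + score);
      }
    }
  }
}
```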
Yes, of course, OK. So I think this will work like this: I leave the search() methods throwing UnsupportedOperationException, then use PerFieldKnnVectorsFormat to tie a field to a Lucene99ScalarQuantizedVectorsFormat, at which point I can call getFloatVectorValues and use the returned scorer's score() method. Or perhaps call getQuantizedVectorValues directly and use its score method. I'll give it a try.
I'm OK with all that. If this becomes popular we might want to find a way that doesn't require using codec-specific classes, but we can let that sit for now I think. Thank you!
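To make the per-field wiring concrete, here is a rough sketch assuming a Lucene99Codec base; the field name "my_vector" is a placeholder, and the flat format instance passed in stands for whatever KnnVectorsFormat wrapper this PR ends up exposing:

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99Codec;
import org.apache.lucene.index.IndexWriterConfig;

final class PerFieldFlatCodec {
  /** Routes one field to the supplied (flat) vectors format; everything else uses the default. */
  static IndexWriterConfig configFor(KnnVectorsFormat flatFormat) {
    Codec codec =
        new Lucene99Codec() {
          @Override
          public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
            return "my_vector".equals(field) ? flatFormat : super.getKnnVectorsFormatForField(field);
          }
        };
    return new IndexWriterConfig().setCodec(codec);
  }
}
```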
@msokolov I suppose that is ok for folks that understand the depths of codec formats. But we are seriously complicating somebody's life that wants to use kNN and scalar quantized flat codec. Now they cannot use the knn query, but instead have to use some specialized query they come up with themselves.
I think we at least should have a "flat vector query"? or something? Something that always does exact search so that typical users don't have to do this iterating & casting stuff?
> @msokolov I suppose that is ok for folks that understand the depths of codec formats. But we are seriously complicating somebody's life that wants to use kNN and scalar quantized flat codec. Now they cannot use the knn query, but instead have to use some specialized query they come up with themselves.

+1 on this comment.

> I think we at least should have a "flat vector query"? or something? Something that always does exact search so that typical users don't have to do this iterating & casting stuff?

Rather than having a new query, can we just reuse the old query, where for a field we get the right KnnVectorsReader and hit the search interface of the reader, which does the exact search or HNSW-based search depending on whatever implementation the KnnVectorsReader has used?
I hear you all, this does seem a little weird. One thing is... you cannot get access to this stuff at all unless you are willing to create a custom Codec, so by definition we have weeded out people who don't want to mess around with Codecs. Another thing is... we don't really want to create a trap for people who are trying to find the top-N vector-distance documents matching some constraint. They might think, by blindly following the API trail/trap, that they should create a BitSet representing their constraint and pass it in to a Knn*VectorQuery, when instead they would be better off executing their constraint as the primary Query and using the VectorValues API to do the scoring. I'm struggling to see the use case for this brute-force query; it just seems like a trap to me.
Maybe we can make a KnnVectorScoreQuery that wraps another Query, supplying the vector score. Actually I think you can already do this with FunctionScoreQuery + a DoubleValuesSource wrapping a KnnVectorScorer? Perhaps we should provide that as a single Query?
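Roughly what that could look like, as a sketch only: the field name is a placeholder, the raw float similarity stands in for a real (possibly quantized) vector scorer, and none of this is part of the PR:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Objects;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.queries.function.FunctionScoreQuery;
import org.apache.lucene.search.DoubleValues;
import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

final class VectorSimilarityValuesSource extends DoubleValuesSource {
  private final String field;
  private final float[] target;

  VectorSimilarityValuesSource(String field, float[] target) {
    this.field = field;
    this.target = target;
  }

  /** Wraps {@code filter} so its matches are scored by vector similarity. */
  static Query scoreBy(Query filter, String field, float[] target) {
    return new FunctionScoreQuery(filter, new VectorSimilarityValuesSource(field, target));
  }

  @Override
  public DoubleValues getValues(LeafReaderContext ctx, DoubleValues scores) throws IOException {
    FloatVectorValues values = ctx.reader().getFloatVectorValues(field);
    return new DoubleValues() {
      @Override
      public double doubleValue() throws IOException {
        // raw float similarity; a quantized scorer from the flat reader could plug in here
        return VectorSimilarityFunction.DOT_PRODUCT.compare(target, values.vectorValue());
      }

      @Override
      public boolean advanceExact(int doc) throws IOException {
        if (values == null) {
          return false;
        }
        int current = values.docID();
        if (current < doc) {
          current = values.advance(doc);
        }
        return current == doc;
      }
    };
  }

  @Override
  public boolean needsScores() {
    return false;
  }

  @Override
  public boolean isCacheable(LeafReaderContext ctx) {
    return true;
  }

  @Override
  public DoubleValuesSource rewrite(IndexSearcher searcher) {
    return this;
  }

  @Override
  public int hashCode() {
    return Objects.hash(field, Arrays.hashCode(target));
  }

  @Override
  public boolean equals(Object obj) {
    if (!(obj instanceof VectorSimilarityValuesSource)) {
      return false;
    }
    VectorSimilarityValuesSource other = (VectorSimilarityValuesSource) obj;
    return field.equals(other.field) && Arrays.equals(target, other.target);
  }

  @Override
  public String toString() {
    return "vectorSimilarity(" + field + ")";
  }
}
```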
  return rawVectorsReader.getByteVectorValues(field);
}

@Override
So, doing this means the KnnFloatVectorQuery will throw an error if it's used against a field whose codec uses this reader.
Traditionally, queries were fairly ignorant of the underlying codec used to store the field.
It does seem weird for a codec that the knn queries could be executed against to throw an exception.
I think either:
- The knn queries should automatically do the correct things (e.g. not eagerly rewrite themselves, etc.)
- We don't throw on `search` like this.
> The knn queries should automatically do the correct things (e.g. not eagerly rewrite themselves, etc.)
I am not sure how to do this other than having something on the leaf readers that says "supports approximate search". Or a try{}catch() in the query, but then we are using exceptions for query control flow, which seems trappy and bad.
I did toy with adding a "supportsSearch" flag when this problem surfaced due to BaseKnnVectorsFormatTestCase. Potentially we could return no results. I feel like this is the moral equivalent of a stored Field with IndexOptions.NONE (Like StoredField.TYPE)? But we have tended to not want to make such vector-representation choices a part of the Fields API, so it looks different.
I changed this to return zero results from search() rather than throwing. I also proposed a query (VectorScoreQuery?) that could be used to wrap another Query and provide the score using a vector scorer, but I think we can handle that as a separate issue.
benwtrent left a comment:
Some concerns. But with the prospect of a better brute-force query, I am coming around to the idea :)
@Override
public void search(String field, float[] target, KnnCollector knnCollector, Bits acceptDocs)
    throws IOException {
  // don't scan stored field data. If we didn't index it, produce no search results
}
Since this is the case for all FlatVectorsReader objects, think we should put this override there?
public void search(String field, float[] target, KnnCollector knnCollector, Bits acceptDocs)
    throws IOException {
  // don't scan stored field data. If we didn't index it, produce no search results
}

@Override
public void search(String field, byte[] target, KnnCollector knnCollector, Bits acceptDocs)
    throws IOException {
  // don't scan stored field data. If we didn't index it, produce no search results
}
Maybe this should be in FlatVectorsReader
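A sketch of that suggestion, assuming FlatVectorsReader extends KnnVectorsReader as in this PR; the real base class has more to it (constructors, scorer plumbing), so this only shows the no-op overrides every flat reader would inherit:

```java
import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.search.KnnCollector;
import org.apache.lucene.util.Bits;

public abstract class FlatVectorsReader extends KnnVectorsReader {

  @Override
  public void search(String field, float[] target, KnnCollector knnCollector, Bits acceptDocs)
      throws IOException {
    // flat storage has no index structure to search; produce no approximate-search results
  }

  @Override
  public void search(String field, byte[] target, KnnCollector knnCollector, Bits acceptDocs)
      throws IOException {
    // same for byte vectors
  }
}
```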
public Lucene99FlatVectorsFormat(String name, FlatVectorsScorer vectorsScorer) {
  super(name);
I wonder why for Lucene99FlatVectorsFormat we allow an external name to be provided. Is this because it isn't really ever loaded via the SPI?
Seems like Lucene99FlatVectorsFormat shouldn't accept a name as a parameter and simply provide super with the correct name.
Makes sense. TBH I kind of just filled in the blanks with this one without thinking much about it, but I agree the name is really only useful for the SPI lookup.
  this.mergeExec = null;
}
- this.flatVectorsFormat = new Lucene99FlatVectorsFormat(new FlatBitVectorsScorer());
+ this.flatVectorsFormat = new Lucene99FlatVectorsFormat(NAME_FLAT, new FlatBitVectorsScorer());
I wonder why a unique name here is required.
The top level format name is HnswBitVectorsFormat, and it doesn't really do anything to the inner format, it simply overrides the scoring mechanism.
On the subject of the brute-force query: I started to look at adding one and found we already have o.a.l.queries.function.valuesource.VectorSimilarityFunction -- I think maybe all we need here is a static convenience method that produces a FunctionQuery wrapping one of these things, although I'll confess I'm not entirely conversant with this API. It seems like it might want to be rewritten in terms of VectorScorer.
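For reference, roughly what such a convenience method might look like, if I'm remembering the queries-module value sources correctly (FloatKnnVectorFieldSource, ConstKnnFloatValueSource, and the FloatVectorSimilarityFunction value source); treat the exact class names and constructors as unverified assumptions:

```java
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.queries.function.FunctionQuery;
import org.apache.lucene.queries.function.valuesource.ConstKnnFloatValueSource;
import org.apache.lucene.queries.function.valuesource.FloatKnnVectorFieldSource;
import org.apache.lucene.queries.function.valuesource.FloatVectorSimilarityFunction;
import org.apache.lucene.search.Query;

final class BruteForceVectorQueries {
  /** Scores every document that has a vector by its similarity to {@code target}. */
  static Query exactSimilarity(String field, float[] target, VectorSimilarityFunction sim) {
    return new FunctionQuery(
        new FloatVectorSimilarityFunction(
            sim, new ConstKnnFloatValueSource(target), new FloatKnnVectorFieldSource(field)));
  }
}
```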
For the failed test here I'll add a test assumption that the randomly-selected vector format is one that supports search (i.e. not a FlatVectorsFormat), since the test requires that.
I will add a CHANGES entry and backport to 9.x.
@msokolov and @benwtrent I see that we exposed the FlatVectorsFormat, but I am not seeing an entry for this format in this file: https://github.com/apache/lucene/blob/main/lucene/core/src/resources/META-INF/services/org.apache.lucene.codecs.KnnVectorsFormat which is used by SPI to load the format. Am I missing something here?
Hi @navneet1v, I think you may be right, although TBH I find the way Lucene handles formats with SPI to be somewhat confusing. We're using this in our own service with a custom Codec that seems not to require the META-INF/SPI lookup, but maybe there is some shortcoming to that approach.
Hi @msokolov, thanks for sharing your thoughts.
+1, though my experience with SPI is limited, so that may be coloring my view.
I would like to know how you are doing it, if there is some reference, because I am also using a custom codec. The one way I figured out to use this is to create my own KnnVectorsFormat and then use the flat writers and readers directly. Let me know if there is a better way, but this is the best I can think of.
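A rough sketch of that workaround: a custom KnnVectorsFormat that simply delegates its writer and reader to the flat format. The flat-format constructor and the DefaultFlatVectorScorer class reflect my reading of this PR's API and may not match it exactly:

```java
import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer;
import org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

public final class MyFlatVectorsFormat extends KnnVectorsFormat {
  private final Lucene99FlatVectorsFormat flatFormat =
      new Lucene99FlatVectorsFormat(new DefaultFlatVectorScorer());

  public MyFlatVectorsFormat() {
    // this name is what SPI resolves at read time, so the class also needs an entry in
    // META-INF/services/org.apache.lucene.codecs.KnnVectorsFormat
    super("MyFlatVectorsFormat");
  }

  @Override
  public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
    return flatFormat.fieldsWriter(state);
  }

  @Override
  public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
    return flatFormat.fieldsReader(state);
  }
}
```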
I think we need to add the format to SPI. That way we will be able to evolve it going forward and let the index-reading code pick the proper implementation corresponding to what was written to the index. Making the Codec figure this out is probably just a short-term hack that won't scale well.
Just posting this here for discussion purposes; the API is up for discussion.