Conversation
}

@Override
public void search(String field, float[] target, KnnCollector knnCollector, Bits acceptDocs) throws IOException {
Hmm, as I work with this I am realizing this search interface is not really what I'm after... basically I would like to be able to combine a DocIdSet from a Query with a FlatVectorScorer so that I can score the documents using a quantized dot-product. So I think what I need is a RandomVectorScorer returned from the FlatVectorsReader, plus some way to go from docid to ord for my vector field. But it seems we only support ordToDoc now and not the other way?
@msokolov getByteVectorValues and getFloatVectorValues already do this. They provide an iterator that will return a VectorScorer, which can score and forward-iterate via doc ids (it supports advance, next, etc.).
Also, with the "flat codecs" it doesn't make much sense to gather up the accepted docs, place the results in a collector, etc. You want to score while you iterate. It would likely be way cheaper than doing it all up front like knn is doing.
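For illustration, a minimal sketch of "score while you iterate", assuming the 9.x-style API in which FloatVectorValues is itself a DocIdSetIterator; the field name "vec" and the use of the raw DOT_PRODUCT similarity (rather than the quantized scorer the flat reader would expose) are placeholders:

```java
import java.io.IOException;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.DocIdSetIterator;

final class IterateAndScore {
  /** Scores only the docs produced by {@code matches}, advancing the vector values alongside. */
  static void scoreMatches(LeafReader reader, DocIdSetIterator matches, float[] target)
      throws IOException {
    FloatVectorValues values = reader.getFloatVectorValues("vec"); // placeholder field name
    if (values == null) {
      return; // no vectors indexed for this field in this segment
    }
    for (int doc = matches.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = matches.nextDoc()) {
      int vectorDoc = values.docID();
      if (vectorDoc < doc) {
        vectorDoc = values.advance(doc);
      }
      if (vectorDoc == doc) {
        // raw float similarity; a quantized scorer would plug in here instead
        float score = VectorSimilarityFunction.DOT_PRODUCT.compare(target, values.vectorValue());
        System.out.println(doc + " -> " + score);
      }
    }
  }
}
```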
Yes, of course, OK. So I think this will work like this: I leave the search() methods throwing UnsupportedOperationException, then use PerFieldKnnVectorsFormat to tie a field to a Lucene99ScalarQuantizedVectorsFormat, at which point I can call getFloatVectorValues and use the returned scorer's score() method. Or perhaps call getQuantizedVectorValues directly and use its score method. I'll give it a try.
I'm OK with all that. If this becomes popular we might want to find a way that doesn't require using codec-specific classes, but we can let that sit for now I think. Thank you!
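To make the per-field wiring concrete, here is a rough sketch assuming a Lucene99Codec base; the field name "my_vector" is a placeholder, and the flat format instance passed in stands for whatever KnnVectorsFormat wrapper this PR ends up exposing:

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99Codec;
import org.apache.lucene.index.IndexWriterConfig;

final class PerFieldFlatCodec {
  /** Routes one field to the supplied (flat) vectors format; everything else uses the default. */
  static IndexWriterConfig configFor(KnnVectorsFormat flatFormat) {
    Codec codec =
        new Lucene99Codec() {
          @Override
          public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
            return "my_vector".equals(field) ? flatFormat : super.getKnnVectorsFormatForField(field);
          }
        };
    return new IndexWriterConfig().setCodec(codec);
  }
}
```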
@msokolov I suppose that is ok for folks that understand the depths of codec formats. But we are seriously complicating somebody's life that wants to use kNN and scalar quantized flat codec. Now they cannot use the knn query, but instead have to use some specialized query they come up with themselves.
I think we at least should have a "flat vector query"? or something? Something that always does exact search so that typical users don't have to do this iterating & casting stuff?
> @msokolov I suppose that is ok for folks that understand the depths of codec formats. But we are seriously complicating somebody's life that wants to use kNN and scalar quantized flat codec. Now they cannot use the knn query, but instead have to use some specialized query they come up with themselves.

+1 on this comment.

> I think we at least should have a "flat vector query"? or something? Something that always does exact search so that typical users don't have to do this iterating & casting stuff?

Rather than having a new query, can we just reuse the old query, where for a field we get the right KnnVectorsReader and hit the search interface of the reader, which does the exact search or HNSW-based search depending on whatever implementation the KnnVectorsReader has used?
I hear you all, this does seem a little weird. One thing is... you cannot get access to this stuff at all unless you are willing to create a custom Codec, so by definition we have weeded out people who don't want to mess around with Codecs. Another thing is... we don't really want to create a trap for people who are trying to find the top-N vector-distance documents matching some constraint. They might think, by blindly following the API trail/trap, that they should create a BitSet representing their constraint and pass it in to a Knn*VectorQuery, when instead they would be better off executing their constraint as the primary Query and using the VectorValues API to do the scoring. I'm struggling to see the use case for this brute-force query; it just seems like a trap to me.
Maybe we can make a KnnVectorScoreQuery that wraps another Query, supplying the vector score. Actually I think you can already do this with FunctionScoreQuery + a DoubleValuesSource wrapping a KnnVectorScorer? Perhaps we should provide that as a single Query?
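Roughly what that could look like, as a sketch only: the field name is a placeholder, the raw float similarity stands in for a real (possibly quantized) vector scorer, and none of this is part of the PR:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Objects;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.queries.function.FunctionScoreQuery;
import org.apache.lucene.search.DoubleValues;
import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

final class VectorSimilarityValuesSource extends DoubleValuesSource {
  private final String field;
  private final float[] target;

  VectorSimilarityValuesSource(String field, float[] target) {
    this.field = field;
    this.target = target;
  }

  /** Wraps {@code filter} so its matches are scored by vector similarity. */
  static Query scoreBy(Query filter, String field, float[] target) {
    return new FunctionScoreQuery(filter, new VectorSimilarityValuesSource(field, target));
  }

  @Override
  public DoubleValues getValues(LeafReaderContext ctx, DoubleValues scores) throws IOException {
    FloatVectorValues values = ctx.reader().getFloatVectorValues(field);
    return new DoubleValues() {
      @Override
      public double doubleValue() throws IOException {
        // raw float similarity; a quantized scorer from the flat reader could plug in here
        return VectorSimilarityFunction.DOT_PRODUCT.compare(target, values.vectorValue());
      }

      @Override
      public boolean advanceExact(int doc) throws IOException {
        if (values == null) {
          return false;
        }
        int current = values.docID();
        if (current < doc) {
          current = values.advance(doc);
        }
        return current == doc;
      }
    };
  }

  @Override
  public boolean needsScores() {
    return false;
  }

  @Override
  public boolean isCacheable(LeafReaderContext ctx) {
    return true;
  }

  @Override
  public DoubleValuesSource rewrite(IndexSearcher searcher) {
    return this;
  }

  @Override
  public int hashCode() {
    return Objects.hash(field, Arrays.hashCode(target));
  }

  @Override
  public boolean equals(Object obj) {
    if (!(obj instanceof VectorSimilarityValuesSource)) {
      return false;
    }
    VectorSimilarityValuesSource other = (VectorSimilarityValuesSource) obj;
    return field.equals(other.field) && Arrays.equals(target, other.target);
  }

  @Override
  public String toString() {
    return "vectorSimilarity(" + field + ")";
  }
}
```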
  return rawVectorsReader.getByteVectorValues(field);
}

@Override
So, doing this means the KnnFloatVectorQuery will throw an error if it's used against a field whose codec uses this reader.
Traditionally, queries were fairly ignorant of the underlying codec used to store the field.
It does seem weird for a codec that the knn queries could be executed against to throw an exception.
I think either:
- The knn queries should automatically do the correct things (e.g. not eagerly rewrite themselves, etc.)
- We don't throw on `search` like this.
> The knn queries should automatically do the correct things (e.g. not eagerly rewrite themselves, etc.)
I am not sure how to do this other than having something on the leaf readers that says "supports approximate search". Or a try{}catch() in the query, but then we are using exceptions for query control flow, which seems trappy and bad.
I did toy with adding a "supportsSearch" flag when this problem surfaced due to BaseKnnVectorsFormatTestCase. Potentially we could return no results. I feel like this is the moral equivalent of a stored Field with IndexOptions.NONE (Like StoredField.TYPE)? But we have tended to not want to make such vector-representation choices a part of the Fields API, so it looks different.
I changed this to return zero results from search() rather than throwing. I also proposed a query (VectorScoreQuery?) that could be used to wrap another Query and provide the score using a vector scorer, but I think we can handle that as a separate issue.
benwtrent left a comment:
Some concerns. But with the prospect of a better brute-force query, I am coming around to the idea :)
@Override
public void search(String field, float[] target, KnnCollector knnCollector, Bits acceptDocs)
    throws IOException {
  // don't scan stored field data. If we didn't index it, produce no search results
}
Since this is the case for all FlatVectorsReader objects, think we should put this override there?
public void search(String field, float[] target, KnnCollector knnCollector, Bits acceptDocs)
    throws IOException {
  // don't scan stored field data. If we didn't index it, produce no search results
}

@Override
public void search(String field, byte[] target, KnnCollector knnCollector, Bits acceptDocs)
    throws IOException {
  // don't scan stored field data. If we didn't index it, produce no search results
}
Maybe this should be in FlatVectorsReader
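A sketch of that suggestion, assuming FlatVectorsReader extends KnnVectorsReader as in this PR; the real base class has more to it (constructors, scorer plumbing), so this only shows the no-op overrides every flat reader would inherit:

```java
import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.search.KnnCollector;
import org.apache.lucene.util.Bits;

public abstract class FlatVectorsReader extends KnnVectorsReader {

  @Override
  public void search(String field, float[] target, KnnCollector knnCollector, Bits acceptDocs)
      throws IOException {
    // flat storage has no index structure to search; produce no approximate-search results
  }

  @Override
  public void search(String field, byte[] target, KnnCollector knnCollector, Bits acceptDocs)
      throws IOException {
    // same for byte vectors
  }
}
```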
public Lucene99FlatVectorsFormat(String name, FlatVectorsScorer vectorsScorer) {
  super(name);
I wonder why for Lucene99FlatVectorsFormat we allow an external name to be provided. Is this because it isn't really ever loaded via the SPI?
Seems like Lucene99FlatVectorsFormat shouldn't accept a name as a parameter and simply provide super with the correct name.
Makes sense. TBH I kind of just filled in the blanks with this one without thinking much about it, but I agree the name is really only useful for the SPI lookup.
  this.mergeExec = null;
}
- this.flatVectorsFormat = new Lucene99FlatVectorsFormat(new FlatBitVectorsScorer());
+ this.flatVectorsFormat = new Lucene99FlatVectorsFormat(NAME_FLAT, new FlatBitVectorsScorer());
I wonder why a unique name here is required.
The top level format name is HnswBitVectorsFormat, and it doesn't really do anything to the inner format, it simply overrides the scoring mechanism.
On the subject of the brute-force query: I started to look at adding one and found we already have o.a.l.queries.function.valuesource.VectorSimilarityFunction -- I think maybe all we need here is a static convenience method that produces a FunctionQuery wrapping one of these things, although I'll confess I'm not entirely conversant with this API. It seems like it might want to be rewritten in terms of VectorScorer.
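For reference, roughly what such a convenience method might look like, if I'm remembering the queries-module value sources correctly (FloatKnnVectorFieldSource, ConstKnnFloatValueSource, and the FloatVectorSimilarityFunction value source); treat the exact class names and constructors as unverified assumptions:

```java
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.queries.function.FunctionQuery;
import org.apache.lucene.queries.function.valuesource.ConstKnnFloatValueSource;
import org.apache.lucene.queries.function.valuesource.FloatKnnVectorFieldSource;
import org.apache.lucene.queries.function.valuesource.FloatVectorSimilarityFunction;
import org.apache.lucene.search.Query;

final class BruteForceVectorQueries {
  /** Scores every document that has a vector by its similarity to {@code target}. */
  static Query exactSimilarity(String field, float[] target, VectorSimilarityFunction sim) {
    return new FunctionQuery(
        new FloatVectorSimilarityFunction(
            sim, new ConstKnnFloatValueSource(target), new FloatKnnVectorFieldSource(field)));
  }
}
```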
For the failed test here I'll add a test assumption that the randomly-selected vector format is one that supports search (i.e. not a FlatVectorsFormat), since the test requires that.
I will add a CHANGES entry and backport to 9.x.
@msokolov and @benwtrent I see that we exposed the FlatVectorsFormat, but I am not seeing an entry for this format in this file: https://github.com/apache/lucene/blob/main/lucene/core/src/resources/META-INF/services/org.apache.lucene.codecs.KnnVectorsFormat which is used by SPI to load the format. Am I missing something here?
Hi @navneet1v, I think you may be right, although TBH I find the way Lucene handles formats with SPI to be somewhat confusing. We're using this in our own service with a custom Codec that seems not to require the META-INF/SPI lookup, but maybe there is some shortcoming to that approach.
Hi @msokolov, thanks for sharing your thoughts.
+1, though my experience with SPI is limited, so that may be coloring my view.
I would like to know how you are doing it, if there is some reference, because I am also using a custom codec. The one way I figured out to use this is to create my own KnnVectorsFormat and then use the flat writers and readers directly. Let me know if there is a better way, but this is the best I can think of.
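A rough sketch of that workaround: a custom KnnVectorsFormat that simply delegates its writer and reader to the flat format. The flat-format constructor and the DefaultFlatVectorScorer class reflect my reading of this PR's API and may not match it exactly:

```java
import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer;
import org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

public final class MyFlatVectorsFormat extends KnnVectorsFormat {
  private final Lucene99FlatVectorsFormat flatFormat =
      new Lucene99FlatVectorsFormat(new DefaultFlatVectorScorer());

  public MyFlatVectorsFormat() {
    // this name is what SPI resolves at read time, so the class also needs an entry in
    // META-INF/services/org.apache.lucene.codecs.KnnVectorsFormat
    super("MyFlatVectorsFormat");
  }

  @Override
  public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
    return flatFormat.fieldsWriter(state);
  }

  @Override
  public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
    return flatFormat.fieldsReader(state);
  }
}
```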
I think we need to add the format to SPI. That way we will be able to evolve it going forward and let the index-reading code pick the proper implementation corresponding to what was written to the index. Making the Codec figure this out is probably just a short-term hack that won't scale well.
Just posting this here for discussion purposes; the API is up for discussion.