Conversation
 () ->
     leaf.searchNearestVectors(
-        "vector", new float[dimension], 5, leaf.getLiveDocs(), Integer.MAX_VALUE));
+        "vector", target, 5, leaf.getLiveDocs(), Integer.MAX_VALUE));
|
There is a problem here: cosine similarity is not defined for zero vectors, so we get a NaN score. I thought it would be better not to handle that special case and instead drop docs whose score equals NaN. We had the same behavior earlier, except for the case of the start point:
We check acceptOrds.get here and raise an exception. But we actually don't need to check acceptOrds in the case of a zero vector with cosine similarity. So I think it would be better not to cover that case at all if we want to test ExitingReaderException, and just define the target as a random float vector.
How common is this use case? This change is fairly invasive... adding method signatures to e.g. LeafReader.
|
I'm also concerned about committing to providing this API for the future. Eventually we'll move away from HNSW to something that actually scales, and it may not support this thresholding?
It is difficult for me to judge in general, but I face such tasks quite often. Here is the start of the discussion about this functionality: https://lists.apache.org/list?dev@lucene.apache.org:lte=1M:HNSW%20search%20with%20threshold. The typical case: suppose we have a recommendation system with a huge collection of items, and we want to recommend items that would suit the user. Ranking models that can provide high quality can be quite complex and resource-consuming, so we can build several layers of models. The most complex ranking model is the last level; each earlier level is simpler than the one after it and selects candidates for the next level. If we have good embeddings for items, then we can build the first layer in the following way: we calculate the similarity between some embedding of the user and the embeddings of the items, and compare the similarity value with a threshold. If the similarity value exceeds the threshold, we consider that item a candidate for the next level. This approach can be very productive in practice, but its complexity is a problem, because we have to calculate the cosine between the user's embedding and all item embeddings. I think the proposed functionality would help with this kind of task.
It is a very good point, thanks! But I can't think of a popular ANN algorithm in which it would be impossible to support this functionality. Nevertheless, given that concern, it may be worth moving this functionality into another class rather than KnnVectorQuery... But I'm not sure; it seems that could make the API too complicated.
msokolov
left a comment
What I have in mind would be to implement this entirely in KnnVectorQuery. Since results are sorted by score, they can easily be post-filtered there: no need to implement anything at the codec layer, I think. Am I missing something?
|
If we use only a post-filter in KnnVectorQuery, then we have to set k = Integer.MAX_VALUE (or another very big value) and calculate similarity with all vectors, so the complexity would be O(n). I had another idea: we can check the similarity while we are traversing the graph. If the similarity is less than the threshold, we can discard that node and stop exploring this path. In that case we set k = Integer.MAX_VALUE and set the similarityThreshold value, but the time complexity would be between O(log(n)) and O(n) (it depends on the number of vectors with similarity greater than the threshold). I hope this allows us to solve tasks like the ones I described above (#11946 (comment)) more efficiently.
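The traversal-time pruning described in this comment can be sketched on a toy graph. This is illustrative only: the class, method names, graph layout, and similarity values are all invented here, not Lucene's HnswGraphSearcher. Note that pruning can also cut off reachable high-similarity nodes hidden behind low-similarity ones, which is the recall concern raised later in the thread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ThresholdTraversalSketch {
  // Walk the graph from an entry point; once a node's similarity to the
  // target falls below the threshold, drop it and stop exploring past it.
  static List<Integer> collect(
      Map<Integer, List<Integer>> neighbors, float[] similarity, int entry, float threshold) {
    List<Integer> results = new ArrayList<>();
    Set<Integer> visited = new HashSet<>();
    Deque<Integer> frontier = new ArrayDeque<>();
    frontier.push(entry);
    visited.add(entry);
    while (!frontier.isEmpty()) {
      int node = frontier.pop();
      if (similarity[node] < threshold) {
        continue; // prune this path: neither collect the node nor expand it
      }
      results.add(node);
      for (int nb : neighbors.getOrDefault(node, List.of())) {
        if (visited.add(nb)) {
          frontier.push(nb);
        }
      }
    }
    return results;
  }

  public static void main(String[] args) {
    // 0 -> {1, 3}, 1 -> {2}, 3 -> {4}; node 3 is below the threshold, so
    // node 4 (similarity 0.95!) is never reached: pruning can hurt recall.
    Map<Integer, List<Integer>> neighbors =
        Map.of(0, List.of(1, 3), 1, List.of(2), 3, List.of(4));
    float[] similarity = {0.9f, 0.8f, 0.85f, 0.5f, 0.95f};
    System.out.println(collect(neighbors, similarity, 0, 0.76f)); // prints [0, 1, 2]
  }
}
```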
No, we don't have to do that. We can simply post-filter. Think of it like this - we want K matches with score > T. So we get the K top-scoring matches. If any have score less than T, we drop them. It's the same result as if we did the thresholding while collecting. |
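The post-filtering described here fits in a few lines. This is a minimal sketch; ScoredDoc and postFilter are invented stand-ins, not Lucene's ScoreDoc/TopDocs API.

```java
import java.util.ArrayList;
import java.util.List;

public class PostFilterSketch {
  // Stand-in for Lucene's ScoreDoc: a doc id plus its score.
  record ScoredDoc(int doc, float score) {}

  // Keep only the top-K hits whose score meets the threshold. Because KNN
  // results arrive sorted by score (descending), we can stop at the first miss.
  static List<ScoredDoc> postFilter(List<ScoredDoc> topK, float threshold) {
    List<ScoredDoc> kept = new ArrayList<>();
    for (ScoredDoc hit : topK) {
      if (hit.score() < threshold) {
        break; // everything after this hit scores lower still
      }
      kept.add(hit);
    }
    return kept;
  }

  public static void main(String[] args) {
    List<ScoredDoc> topK =
        List.of(new ScoredDoc(3, 0.92f), new ScoredDoc(7, 0.81f), new ScoredDoc(1, 0.60f));
    System.out.println(postFilter(topK, 0.76f).size()); // prints 2
  }
}
```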
|
But we don't know K - that's the problem. The task I want to solve sounds like this: find documents with similarity >= 0.76 (for example). We don't know the number of such documents in advance.
|
OK, can we start by providing post-filter? I think this will be a more
common use case. I want to find the best docs, and ensure that none of them
are terrible. It is less disruptive, doesn't require changes to the codec.
Can you explain why you want the "find all docs with score > T"? That is
going to be a scary thing. What if someone asks for T==0? Then the
computation and memory requirements are unbounded. I don't think this is a
search use case - it's some kind of analytics thing that you should do in
Spark or some kind of off-line computation system.
…On Fri, Nov 18, 2022 at 2:01 PM Alexey Gorlenko wrote:
But we don't know K - that's the problem. The task which I want to solve sounds like this: find documents with similarity >= 0.76 (for example). We don't have the number of such documents in advance.
|
For example, we want to give the user only documents suitable for them. We have a custom scorer (based on an ML model, for example) which calculates a score; we then compare that score with a threshold to determine whether a document is suitable for the user or not. But usually that scorer is too computationally expensive to run on every document that passed the filters. To deal with this, we can build another, much simpler model that selects candidates for the heavy model. One of the basic approaches for building that light model is kNN: we have a vector (embedding) for the user or the user's query and a vector (embedding) for every document, so we just find the nearest documents and pass them to the heavy scorer. But we don't know K in that case; we know only the threshold, which is defined during the development of the ranking model. Such tasks arise naturally in recommendation systems and in ranking as well.
The same result can be achieved by setting K = 1000...00, so I think we don't add a new vulnerability here. Maybe it is worth adding a warning to the documentation (for K and for similarityThreshold). If you still think that it's a bad idea to support such functionality in Lucene, I will rewrite this PR for the post-filter case. But I think it can be useful for people who add ML ranking to search systems based on Lucene.
|
@msokolov looking forward to your decision |
|
Hi, I was taking time off for a few days, back now. Have you tried post-filtering? When we added support for pre-filtering (accepting a Query) there was some extensive testing to determine when it is better to pre-filter vs. post-filter, and the answer is not always clear-cut. If the filter is not very restrictive (matching > 90% of docs, say), you are probably better off post-filtering; if it is highly restrictive, pre-filtering will likely offer performance gains. If it's possible in your application to precompute the filter and cache it for some time (e.g. in a user session), then you can use the existing pre-filtering operation by creating a BitSet matching docs that meet the threshold criterion. So I would suggest trying a large K with post-filtering and seeing if you get reasonable results. In short, I think this is too risky/trappy for most users. Using a highly restrictive scoring threshold is really not the same as using a large K from a user's perspective: the cost is predictable with K (not very data-dependent), but not so with the score (as a user I don't know what the score distribution is a priori), so providing a score threshold is definitely more dangerous/trappy.
benwtrent
left a comment
A couple of thoughts:
- I do think something akin to this is necessary. Post-filtering may indeed work. If it doesn't a more fundamental change than this may be required.
- I also am not 100% convinced there is any academic research that indicates allowing "all k" but with a given radius actually works in HNSW. Is there any prior art to support this idea?
- There needs to be more thought and research on the effects on Recall and runtime.
I think we will find that if we allow this, but for some "unbounded K", that the searching mechanism will have to fundamentally change.
 * @param <T> the type of query vector
 */
public class HnswGraphSearcher<T> {
  private final int UNBOUNDED_QUEUE_INIT_SIZE = 10_000;
|
Any research to indicate why this number was chosen? It seems silly that if a user provides k = 10_001, the queue would be bigger than for k = Integer.MAX_VALUE.
Technically, the max value here should be something like ArrayUtil.MAX_ARRAY_LENGTH, but this eagerly allocates a new long[heapSize], which is VERY costly.
I would prefer a number with some significant reason behind it, or some better way of queueing neighbors.
|
I wanted to set a fairly big initial heap size in order to reduce the number of possible heap growths. But it seems that post-filtering would be better: #11946 (comment)
In this case we don't have to modify HnswGraphSearcher at all.
 * @param field the vector field to search
 * @param target the vector-valued query
 * @param k the number of docs to return (the upper bound)
 * @param similarityThreshold the minimum acceptable value of similarity
|
Would it be possible for this threshold to be an actual distance? My concern here is that for things like byteVectors, dot-product scores are insanely small (I think this is a design flaw in itself) and may be confusing to users who want a given "radius" but instead have to figure out a score related to their radius.
It would be prudent that, IF we provide some filtering on a threshold within the search, this threshold reflect vector distance directly.
if (topCandidateSimilarity < minAcceptedSimilarity && results.size() >= topK) {
  break;
}
|
I am not sure about this. It stops gathering results once the queue is filled, which defeats the purpose of exploring the graph.
Have you seen how this affects recall?
int k,
float similarityThreshold,
Bits acceptDocs,
int visitedLimit)
|
Please overload the method, and tag all the APIs experimental. I'm really concerned about us locking ourselves into HNSW, and we must...must get away from it (it's like 1000x slower than it should be).
The alternative is to feature-freeze vectors completely until they scale, so I think this is a reasonable compromise.
Is there any possibility other than adding all these LeafReader/IndexReader signatures? Currently I'm -1 on the change from an API perspective. It is too invasive.
No, I haven't seen such academic research. I've done some experiments with real data and it seems that it really doesn't work as I expected. If the number of docs exceeding the threshold is significant (for example, 20% or more of the previously accepted docs), the query is slow and it is better to perform an exact search. And unfortunately this happens quite often. So I agree with @msokolov and I think I should rewrite this PR with the post-filtering approach. It will allow us to preserve predictable performance and avoid modifying LeafReader/IndexReader (just filter the TopDocs in KnnVectorQuery).
force-pushed from 1243012 to 618843e
|
I've rewritten this PR with the post-filtering approach, sorry for the delay.
|
It seems there are conflicts due to a recent refactor of this query - would you mind merging from main and resolving those please? |
 * @param field a field that has been indexed as a {@link KnnVectorField}.
 * @param target the target of the search
 * @param k the number of documents to find (the upper bound)
 * @param similarityThreshold the minimum acceptable value of similarity
|
So, right now, similarityThreshold is really scoreThreshold. I would prefer it to be the actual similarity of the vectors and NOT how we translate it to scores (which for ByteVectors has some surprising behavior).
@msokolov What do you think here?
It's already tricky that "similarity implies score". But truly, similarity != score.
|
As it is, callers of this method need to know the inner nuances of how we calculate the score given the similarity. I would prefer them not having to know that.
|
Well, the scores we are talking about here are at least always in [0, 1]. I'm not sure what you mean by the actual similarity of vectors. We used to have a two-step process where we would compute the similarity and then convert to a query score, but I think it's unified today and they are the same? Aren't the scores being thresholded here the output of VectorSimilarityFunction.compare? I may have missed something along the way?
|
@msokolov you haven't missed anything. I am specifically talking about users providing similarityThreshold to the query. If they have calculated that they want a specific cosine or dotProduct similarity, they would then need to adjust that value to match Lucene's scoring transformation.
I think that similarityThreshold should mean vector similarities. We can transform it for the user to reflect the score that similarity represents (given vector encoding type and similarity function).
An example here is dotProduct. The user knows they want FLOAT32 vectors within a dotProduct of 0.7. With this API that ACTUALLY means they want to limit the scores to .85 ((1 + dotProduct)/2). How is the user supposed to know that?
This seems really weird to me.
This doesn't take into account the different scoring methods between vector types as well, which can get even more confusing.
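The transformation quoted in this comment can be made concrete with a small sketch. The (1 + dotProduct) / 2 formula is taken from the comment above, not verified against Lucene's sources, and the class and method names are invented for illustration.

```java
public class DotProductScoreSketch {
  // The float dot-product transformation quoted above: similarityThreshold is
  // compared against (1 + dotProduct) / 2, not against the raw dot product.
  static float dotProductToScore(float dotProduct) {
    return (1 + dotProduct) / 2;
  }

  public static void main(String[] args) {
    // A user who wants "dot product >= 0.7" must actually pass 0.85.
    System.out.println(dotProductToScore(0.7f)); // ~0.85
    // Orthogonal vectors (dot product 0) land exactly in the middle.
    System.out.println(dotProductToScore(0.0f)); // 0.5
  }
}
```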
|
OK, with the current CR, orthogonal vectors will have a DOT_PRODUCT "score" of 0.5, which could be surprising. However, this is similar to how result scores are treated elsewhere in Lucene - their value ranges are not well-defined; the only guarantee is that higher scores are "more relevant". I guess, practically speaking, as a user I am going to have to do empirical work to know what threshold to use; these thresholds are not likely to be motivated by some a priori knowledge of what a "good" dot product is, and given that, I'd like to just be able to work with some kind of abstracted score in a known range (0 = worst, 1 = best).
Conversely, if we were to switch to using vector similarities that correspond more directly to the underlying functions, we would have to clearly define them (today we don't actually explain this anywhere, so I guess we'd need to document them) and maybe provide methods for computing them. Also, they would be weird too, just in a different way. For example, how would we explain the 8-bit dot product? Would it be the 8-bit dot-product score normalized by 2^15?
|
Tl;dr
Thank you for bearing with me! I think this is a good change.
I would be happy with the JavaDocs, etc. clearly indicating that this threshold relates to the un-boosted vector score, not the raw similarity calculation. Dot product, cosine, and Euclidean distance are well-defined concepts outside of Lucene. Lucene mangles (for undoubtedly good reasons) the output of these similarities in undocumented ways to fit within boundaries.
with the current CR,
I don't know what CR means. Change request?
However, this is similar to how result scores are treated elsewhere in Lucene - their value ranges are not well-defined;
Agreed, ranges are usually predicated on term statistics, etc. and can potentially be considered "unbounded" as the corpus changes.
However, does Lucene require that all unboosted BM25 scores are between 0-1? It does seem like an "arbitrary" decision (to me, I don't know the full-breadth of Lucene optimizations, etc. when it comes to scores) to restrict vector similarity in this way. But that is a broader conversation. I have some learning to do.
I guess practically speaking, as a user, I think I am going to have to do empirical work to know what threshold to use; these are not likely going to be motivated by some a priori knowledge of what a "good" dot-product is
I would argue that a user could have a priori knowledge here. Think of the case where the user knows the model used to make the vectors. At that point, they know exactly what is considered relevant based on their loss function and training + test data, and could choose a dot-product or cosine threshold that fits within the 90th percentile or so of their test-data results.
I agree that this would be different if users were using an "off the shelf" model. In that case, they would probably require hybrid-search and combining with BM25 to get anything like relevant results (boosting various queries accordingly). Thus, learning what settings are required in an unfiltered case.
if we were to switch to using vector similarities that would correspond more directly to the underlying functions, we would have to clearly define them
Cosine, dot product, and Euclidean distance are all already well defined; the functions to calculate them are universally recognized. Where Lucene separates itself is in manipulating the similarity output to fit into the range [0, 1]. I guess this is the cost of doing business in Lucene.
I am not suggesting that all scoring of vector document searches changes. Simply that "similarity" and "score" are related, but are different things.
|
I don't know what CR means. Change request?
sorry, yes like a PR but from a parallel universe (code review actually)
So .. theoretical considerations aside, what's the alternative here -- we would treat the threshold as a "vector similarity" and internally convert it to a score. I mean that seems to make sense -- all the conversions are invertible, right? I think we'd want to add a normalize method to VectorSimilarity for this internal use.
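The "invertible conversions" point can be illustrated with a hypothetical normalize helper. The method names and the exact formulas ((1 + dot) / 2 for float dot product, 1 / (1 + squareDistance) for Euclidean) are assumptions based on this discussion, not Lucene API.

```java
public class NormalizeSketch {
  // Hypothetical "normalize" helpers: accept a raw vector similarity from the
  // user and convert it to the internal score (and back). Formulas are
  // assumptions based on this thread, not Lucene's authoritative code.
  static float dotToScore(float dot) { return (1 + dot) / 2; }
  static float scoreToDot(float score) { return 2 * score - 1; }

  static float squareDistToScore(float squareDist) { return 1 / (1 + squareDist); }
  static float scoreToSquareDist(float score) { return 1 / score - 1; }

  public static void main(String[] args) {
    // Both maps are monotonic and invertible, so a query could convert a
    // user-supplied similarity threshold to a score threshold transparently.
    System.out.println(scoreToDot(dotToScore(0.7f)));             // ~0.7
    System.out.println(scoreToSquareDist(squareDistToScore(4f))); // ~4.0
  }
}
```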
|
Ben, for normal scoring you can look at the tests for the similarities package. None of these have any 0-to-1 range or anything like that. Instead, the requirements are that the score increases semi-monotonically as term frequency increases, decreases with respect to document length, etc. These guarantees allow optimizations such as block-max WAND to be applied safely. But there's no defined range at all; instead, there are lots of crazy floating-point hacks so that we can safely get really good performance.
|
I still don't have any explanation here as to why we'd do this for a vector search query. We avoided any such thresholds or normalization in all of Lucene's scoring for decades: if we hadn't, we would never have been able to implement block-max WAND or other algorithms, because they'd be incompatible.
please see:
- https://cwiki.apache.org/confluence/display/LUCENE/LuceneFAQ#LuceneFAQ-CanIfilterbyscore?
- https://cwiki.apache.org/confluence/display/LUCENE/ScoresAsPercentages
I don't mind being the bad guy blocking this change, because it seems like it has not been thought through.
You must convince me.
force-pushed from 618843e to 211748e
Done |
benwtrent
left a comment
First, thank you for iterating on this and the contribution. It has proved to spur some fruitful conversation.
My overall thoughts on the change, for whatever they are worth.
- Since this is a very basic post filter, I am not sure it is even necessary in Lucene. It adds some complexity with not a ton of value.
- I agree that this kind of thing is valuable in KNN. KNN is unique when compared to sparse retrieval as you always retrieve K results (unless using a restrictive filter). In some cases, the K retrieved can be irrelevant, especially in the case when a filter is used. That said, it seems better fit outside of Lucene.
- As for the original implementation (filtering while searching the graph), this kind of thing will probably exist in the future. But we would need a new vector data structure. Additionally, it probably wouldn't be an add-on to the existing KNN queries, but a completely new query (VectorSimilarityQuery? VectorRadiusQuery???). HNSW doesn't work for this kind of thing. Once we have changed the underlying data structure, this should be revisited.
Not being a committer, and newer to Lucene, I can understand if my opinion is ignored :).
If this is committed, I would at least like similarityThreshold to be clearly documented as to what it means.
This isn't any different from BM25 search. Nothing special about KNN. Still no justification for filtering by score / treating scores as percentages.
|
OK, it seems that I should close this PR, shouldn't I? It is not difficult to implement such functionality in the code that uses Lucene if necessary (in contrast to the first implementation). @msokolov, what do you think? In any case, I thank you all for the discussion.
|
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
|
There is the new vector similarity query that handles this overall request. @agorlenko #12679 do you think this covers your use case? I am thinking we should close this PR in deference to that query. |
|
It is not exactly the same, but I don't mind ) |
|
Closed, because of the similar functionality.
Add similarity threshold for KnnVectorQuery. If that threshold is specified, only documents with similarity score greater than or equal to the threshold value will be added to the result.