
add similarity threshold for hnsw #11946

Closed
agorlenko wants to merge 1 commit into apache:main from agorlenko:knn-similarity-threshold

Conversation

@agorlenko

@agorlenko agorlenko commented Nov 17, 2022

Add similarity threshold for KnnVectorQuery. If that threshold is specified, only documents with similarity score greater than or equal to the threshold value will be added to the result.
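The intended filtering semantics can be sketched in plain Java (illustrative only, not the actual patch; the class and method names here are hypothetical):

```java
import java.util.Arrays;

// Minimal sketch of the proposed semantics: given top-k hits already sorted
// by descending score, keep only those whose score meets the threshold.
public class SimilarityThresholdSketch {
    static float[] applyThreshold(float[] scoresDescending, float threshold) {
        int cut = scoresDescending.length;
        for (int i = 0; i < scoresDescending.length; i++) {
            if (scoresDescending[i] < threshold) {
                cut = i; // scores are sorted, so everything after i also fails
                break;
            }
        }
        return Arrays.copyOfRange(scoresDescending, 0, cut);
    }

    public static void main(String[] args) {
        float[] scores = {0.95f, 0.88f, 0.81f, 0.74f, 0.60f};
        System.out.println(Arrays.toString(applyThreshold(scores, 0.80f)));
    }
}
```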

() ->
    leaf.searchNearestVectors(
-       "vector", new float[dimension], 5, leaf.getLiveDocs(), Integer.MAX_VALUE));
+       "vector", target, 5, leaf.getLiveDocs(), Integer.MAX_VALUE));
Author

There is a problem because cosine similarity is not defined for zero vectors, so we get a NaN score. I thought it would be better not to handle that special case and simply drop docs whose score is NaN. We had the same behavior earlier, except for the entry point:

if (acceptOrds == null || acceptOrds.get(ep)) {
results.add(ep, score);
}
We check acceptOrds.get here and raise an exception. But actually we don't need to check acceptOrds in the case of a zero vector with cosine similarity. So I think it would be better not to consider that case when testing ExitingReaderException, and just define the target as a random float vector.
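The NaN behavior described here is easy to reproduce outside Lucene (illustrative code, not from the patch): cosine similarity divides by the product of the vector norms, so a zero vector produces 0/0.

```java
// Demonstration of the NaN issue: cosine of a zero vector is 0/0 = NaN.
public class CosineNaNDemo {
    static float cosine(float[] a, float[] b) {
        float dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // for a zero vector, normA (or normB) is 0, so this is 0/0 -> NaN
        return (float) (dot / Math.sqrt((double) normA * normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine(new float[]{0f, 0f}, new float[]{1f, 0f})); // NaN
    }
}
```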

@rmuir
Member

rmuir commented Nov 17, 2022

how common is this use-case? This change is fairly invasive... adding method signatures to e.g. LeafReader.

@rmuir
Member

rmuir commented Nov 17, 2022

i'm also concerned about committing to providing this API for the future. eventually, we'll move away from HNSW to something that actually scales, and it may not support this thresholding?

@agorlenko
Author

how common is this use-case? This change is fairly invasive... adding method signatures to e.g. LeafReader.

It is difficult for me to judge in general, but I face such tasks quite often. Here is the start of the discussion about this functionality: https://lists.apache.org/list?dev@lucene.apache.org:lte=1M:HNSW%20search%20with%20threshold.

The typical case: suppose we have a recommendation system with a huge collection of items, and we want to recommend items that would suit a given user. Ranking models that provide high quality can be quite complex and resource-consuming, so we build several layers of models. The most complex ranking model is the last level; each earlier level is simpler than the one after it and selects candidates for the next level. If we have good embeddings for items, we can build the first layer as follows: calculate the similarity between a user embedding and the item embeddings and compare it with a threshold. If the similarity exceeds the threshold, we treat that item as a candidate for the next level. This approach can be very productive in practice, but its complexity is a problem, because we have to compute the cosine between the user's embedding and the embeddings of all items.

I think the proposed functionality would help with this kind of task.

@agorlenko
Author

i'm also concerned about committing to providing this API for the future. eventually, we'll move away from HNSW to something that actually scales, and it may not support this thresholding?

It is a very good point, thanks! But I can't think of a popular ANN algorithm in which it would be impossible to support that functionality. Nevertheless, given that concern, it may be worth moving this functionality into another class rather than KnnVectorQuery... But I'm not sure. It seems that it could make the API too complicated.

Contributor

@msokolov msokolov left a comment

What I have in mind would be to implement entirely in the KnnVectorQuery. Since results are sorted by score, they can easily be post-filtered there: no need to implement anything at the codec layer I think. Am I missing something?

@agorlenko
Author

If we use only a post-filter in KnnVectorQuery, then we have to set k = Integer.MAX_VALUE (or some other very big value) and calculate similarity with all vectors, so the complexity would be O(n).

I had another idea: we can check the similarity while we are traversing the graph. If the similarity is less than the threshold, we can discard that node and stop exploring that path. In that case we set k = Integer.MAX_VALUE and set the similarityThreshold value, but the time complexity would be between O(log(n)) and O(n) (depending on the number of vectors with similarity greater than the threshold). I hope that this allows us to solve tasks like the ones I described above (#11946 (comment)) more efficiently.
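A toy sketch of this pruning idea on a generic graph (illustrative only; the real HnswGraphSearcher uses priority queues, layered graphs, and visit limits): explore the graph, but refuse to collect or expand any node whose similarity to the query falls below the threshold, so low-similarity regions are never visited.

```java
import java.util.*;

// Toy threshold-pruned graph exploration. `sims[node]` stands in for the
// precomputed similarity between the query and node's vector.
public class PrunedGraphSearch {
    static List<Integer> search(Map<Integer, int[]> neighbors, float[] sims,
                                int entryNode, float threshold) {
        List<Integer> results = new ArrayList<>();
        Deque<Integer> toVisit = new ArrayDeque<>();
        Set<Integer> seen = new HashSet<>();
        toVisit.push(entryNode);
        seen.add(entryNode);
        while (!toVisit.isEmpty()) {
            int node = toVisit.pop();
            if (sims[node] < threshold) {
                continue; // prune: do not collect, do not expand this path
            }
            results.add(node);
            for (int nb : neighbors.getOrDefault(node, new int[0])) {
                if (seen.add(nb)) {
                    toVisit.push(nb);
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Map<Integer, int[]> graph = new HashMap<>();
        graph.put(0, new int[]{1, 2});
        graph.put(1, new int[]{3});
        // node 1 falls below the threshold, so node 3 is never reached
        System.out.println(search(graph, new float[]{0.9f, 0.5f, 0.8f, 0.95f}, 0, 0.7f));
    }
}
```

Note the trade-off agorlenko describes: if most of the graph clears the threshold, pruning saves little and the search degrades toward O(n).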

@msokolov
Contributor

If we use only post-filter in KnnVectorQuery, then we have to set k = Integer.MAX_VALUE (or another very big value) and calculate similarity with all vectors. So the complexity would be O(n).

No, we don't have to do that. We can simply post-filter. Think of it like this - we want K matches with score > T. So we get the K top-scoring matches. If any have score less than T, we drop them. It's the same result as if we did the thresholding while collecting.

@agorlenko
Author

But we don't know K - that's the problem. The task I want to solve sounds like this: find documents with similarity >= 0.76 (for example). We don't know the number of such documents in advance.

@msokolov
Contributor

msokolov commented Nov 18, 2022 via email

@agorlenko
Author

agorlenko commented Nov 18, 2022

Can you explain why you want the "find all docs with score > T"?

For example, we want to show the user only documents that suit them. We have a custom scorer (based on an ML model, for example) which calculates a score. Next, we compare that score with the threshold to determine whether a document suits the user or not. But usually that scorer is too computationally expensive to run on every document that passed the filters. To deal with this, we can build another, much simpler model that selects candidates for the heavy model. One of the basic approaches for building that light model is kNN: we have a vector (embedding) for the user or the user's query and a vector (embedding) for every document, so we just find the nearest documents and pass them to the heavy scorer. But we don't know K in that case; we know only the threshold, which is defined during development of the ranking model. Such tasks arise naturally in recommendation systems and ranking as well.

That is going to be a scary thing. What if someone asks for T==0? Then the computation and memory requirements are unbounded.

The same result can be achieved by setting K = 1000...00, so I don't think we add a new vulnerability here. Maybe it is worth adding a warning to the documentation (for K and for similarityThreshold).

If you still think it's a bad idea to support such functionality in Lucene, I will rewrite this PR as the post-filter case. But I think it can be useful for people who add ML ranking to search systems based on Lucene.

@agorlenko
Author

@msokolov looking forward to your decision

@msokolov
Contributor

Hi, I was taking time off for a few days, back now. Have you tried post-filtering? When we added support for the existing pre-filter (accepting a Query) there was some extensive testing to determine when it is better to pre-filter vs post-filter. The answer is not always so clear-cut. If the filter is not very restrictive (matches > 90% of docs, say), you are probably better off post-filtering. If it is highly restrictive, then pre-filtering will likely offer performance gains. If it's possible in your application to precompute the filter and cache it for some time (e.g. in a user session), then you can use the existing pre-filtering operation by creating a BitSet matching docs that meet the threshold criterion.

So I would suggest trying a large K and post-filtering and see if you get reasonable results?

In short, I think this is too risky/trappy for most users. Using a highly-restrictive scoring threshold is really not the same as using a large K from a user perspective since the cost is predictable with K (not very data dependent), but not so with the score (as a user I don't know what the score distribution is, a priori), so providing a score threshold is definitely more dangerous/trappy.
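The cached pre-filter idea msokolov mentions can be sketched with a plain java.util.BitSet (Lucene itself would use its own bitset types and a Query wrapper; the names here are purely illustrative): pay the per-doc criterion check once, cache the bitset, then accept or reject candidates cheaply at query time.

```java
import java.util.BitSet;

// Illustration of precomputing and caching an accept-list of docs that
// meet some criterion (e.g. a similarity threshold computed offline).
public class CachedPrefilterSketch {
    static BitSet buildFilter(float[] perDocScore, float threshold) {
        BitSet accepted = new BitSet(perDocScore.length);
        for (int doc = 0; doc < perDocScore.length; doc++) {
            if (perDocScore[doc] >= threshold) {
                accepted.set(doc); // cost paid once; reused across queries
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        BitSet filter = buildFilter(new float[]{0.9f, 0.1f, 0.8f}, 0.5f);
        System.out.println(filter); // docs 0 and 2 accepted
    }
}
```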

Member

@benwtrent benwtrent left a comment

A couple of thoughts:

  • I do think something akin to this is necessary. Post-filtering may indeed work. If it doesn't, a more fundamental change than this may be required.
  • I also am not 100% convinced there is any academic research that indicates allowing "all k" but with a given radius actually works in HNSW. Is there any prior art to support this idea?
  • There needs to be more thought and research on the effects on Recall and runtime.

I think we will find that if we allow this with some "unbounded K", the searching mechanism will have to fundamentally change.

* @param <T> the type of query vector
*/
public class HnswGraphSearcher<T> {
private final int UNBOUNDED_QUEUE_INIT_SIZE = 10_000;
Member

Any research to indicate why this number was chosen? It seems silly that if a user provides k = 10_001 it would have a queue bigger than k = Integer.MAX_VALUE.

Technically, the max value here should be something like ArrayUtil.MAX_ARRAY_LENGTH, but this eagerly allocates a new long[heapSize]. This is VERY costly.

I would prefer a number with some significant reason behind it or some better way of queueing neighbors.

Author

I wanted to set a fairly big initial heap size in order to reduce the number of possible heap resizes. But it seems that post-filtering would be better: #11946 (comment)

In this case we don't have to modify HnswGraphSearcher at all.

* @param field the vector field to search
* @param target the vector-valued query
* @param k the number of docs to return (the upper bound)
* @param similarityThreshold the minimum acceptable value of similarity
Member

Would it be possible for this threshold to be an actual distance? My concern here is that for things like byteVectors, dot-product scores are insanely small (I think this is a design flaw in itself) and may be confusing to users who want a given "radius" but instead have to figure out a score related to their radius.

It would be prudent that IF we provided some filtering on a threshold within the search, that this threshold reflects vector distance directly.

Comment on lines 315 to 248
if (topCandidateSimilarity < minAcceptedSimilarity && results.size() >= topK) {
break;
}
Member

I am not sure about this. It stops gathering results once the queue is filled, which defeats the purpose of exploring the graph.

Have you seen how this affects recall?

int k,
float similarityThreshold,
Bits acceptDocs,
int visitedLimit)
Member

please overload the method, and tag all the APIs experimental. I'm really concerned about us locking ourselves into HNSW, and we must...must get away from it (its like 1000x slower than it should be).

the alternative is to feature-freeze vectors completely until they scale. so i think this is a reasonable compromise.

@rmuir
Member

rmuir commented Dec 6, 2022

What I have in mind would be to implement entirely in the KnnVectorQuery. Since results are sorted by score, they can easily be post-filtered there: no need to implement anything at the codec layer I think. Am I missing something?

is there any possibility other than adding all these LeafReader/IndexReader signatures?

Currently I'm -1 to the change from an API perspective. It is too invasive.

@agorlenko
Author

agorlenko commented Dec 6, 2022

I also am not 100% convinced there is any academic research that indicates allowing "all k" but with a given radius actually works in HNSW. Is there any prior art to support this idea?

No, I haven't seen such academic research.

I've done some experiments with real data and it seems that it really doesn't work as I expected. If the number of docs that exceed the threshold is significant (for example, 20% or more of the accepted docs), the query is slow and it is better to perform an exact search. Unfortunately, that happens quite often.

So I agree with @msokolov and I think I should rewrite this PR with the post-filtering approach. That will let us preserve predictable performance and avoid modifying LeafReader/IndexReader (we just filter TopDocs in KnnVectorQuery).

@agorlenko agorlenko force-pushed the knn-similarity-threshold branch from 1243012 to 618843e Compare December 12, 2022 15:41
@agorlenko
Author

I've rewritten this PR with the post-filtering approach; sorry for the delay.

Contributor

@msokolov msokolov left a comment

Looks great! Thank you

@msokolov
Contributor

It seems there are conflicts due to a recent refactor of this query - would you mind merging from main and resolving those please?

* @param field a field that has been indexed as a {@link KnnVectorField}.
* @param target the target of the search
* @param k the number of documents to find (the upper bound)
* @param similarityThreshold the minimum acceptable value of similarity
Member

So, right now, similarityThreshold is really a scoreThreshold. I would prefer it to be the actual similarity of the vectors and NOT how we translate it to scores (which for ByteVectors has some surprising behavior).

@msokolov What do you think here?

It's already tricky that "similarity implies score". But truly, similarity != score.

Member

As it is, callers of this method need to know the inner nuances of how we calculate the score given the similarity. I would prefer them not having to know that.

Contributor

Well, the scores we are talking about here are at least always in [0, 1]. I'm not sure what you mean by the actual similarity of vectors. We used to have a two-step process where we would compute the similarity and then convert to a query score, but I think it's unified today and they are the same? Aren't the scores being thresholded here the output of VectorSimilarityFunction.compare? I may have missed something along the way?

Member

@msokolov you haven't missed anything. I am specifically talking about users providing similarityThreshold to the query. If they have calculated that they want a specific cosine or dotProduct similarity, they would then need to adjust that value to match Lucene's scoring transformation.

I think that similarityThreshold should mean vector similarities. We can transform it for the user to reflect the score that similarity represents (given vector encoding type and similarity function).

An example here is dotProduct. The user knows they want FLOAT32 vectors within a dotProduct of 0.7. With this API that ACTUALLY means they want to limit the scores to .85 ((1 + dotProduct)/2). How is the user supposed to know that?

This seems really weird to me.

This doesn't take into account the different scoring methods between vector types as well, which can get even more confusing.
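The transformation benwtrent describes can be made concrete. The (1 + dotProduct)/2 mapping is the one quoted in the comment above for FLOAT32 dot-product; the class and method names here are illustrative, not Lucene API:

```java
// The mapping quoted above for FLOAT32 dot-product: score = (1 + dot) / 2.
// A user who wants "raw dot-product >= 0.7" must therefore pass 0.85 as the
// score threshold. Other similarity functions use different mappings.
public class DotProductScore {
    static float dotProductToScore(float dotProduct) {
        return (1 + dotProduct) / 2;
    }

    public static void main(String[] args) {
        System.out.println(dotProductToScore(0.7f)); // 0.85
    }
}
```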

Contributor

OK, with the current CR, orthogonal vectors will have a DOT_PRODUCT "score" of 0.5, which could be surprising. However, this is similar to how result scores are treated elsewhere in Lucene - their value ranges are not well-defined; the only guarantee is that higher scores are "more relevant". I guess practically speaking, as a user, I think I am going to have to do empirical work to know what threshold to use; these are not likely going to be motivated by some a priori knowledge of what a "good" dot-product is, and given that I'd like to just be able to work with some kind of abstracted score in a known range (0 = worst, 1 = best).

Conversely, if we were to switch to using vector similarities that would correspond more directly to the underlying functions, we would have to clearly define them (today we don't actually explain this anywhere, I guess we'd need to document) and maybe provide methods for computing them. Also they would be weird too, just in a different way. For example, how would we explain 8-bit dot-product? Would it be the 8-bit dot-product score normalized by 2^15?

Member

Tl;dr

Thank you for bearing with me! I think this is a good change.

I would be happy with the JavaDocs, etc. clearly indicating that this threshold relates to the un-boosted vector score, not the raw similarity calculation. Dot-product, cosine, and euclidean are well-defined concepts outside of Lucene. Lucene mangles (for undoubtedly good reasons) the output of these similarities in undocumented ways to fit within boundaries.

with the current CR,

I don't know what CR means. Change request?

However, this is similar to how result scores are treated elsewhere in Lucene - their value ranges are not well-defined;

Agreed, ranges are usually predicated on term statistics, etc. and can potentially be considered "unbounded" as the corpus changes.

However, does Lucene require that all unboosted BM25 scores are between 0-1? It does seem like an "arbitrary" decision (to me; I don't know the full breadth of Lucene optimizations, etc. when it comes to scores) to restrict vector similarity in this way. But that is a broader conversation. I have some learning to do.

I guess practically speaking, as a user, I think I am going to have to do empirical work to know what threshold to use; these are not likely going to be motivated by some a priori knowledge of what a "good" dot-product is

I would argue that a user could have a priori knowledge here. Think of the use case where the user knows the model used to make the vectors. At that point, they 100% know what is considered relevant based on their loss function and training + test data, and could choose a dot-product or cosine threshold that fits within the 90th percentile or so of their test-data results.

I agree that this would be different if users were using an "off the shelf" model. In that case, they would probably require hybrid-search and combining with BM25 to get anything like relevant results (boosting various queries accordingly). Thus, learning what settings are required in an unfiltered case.

if we were to switch to using vector similarities that would correspond more directly to the underlying functions, we would have to clearly define them

Cosine, dot-product, and euclidean are all already well defined; the functions to calculate them are universally recognized. Where Lucene separates itself is in manipulating the similarity output to fit into the range [0, 1]. I guess this is the cost of doing business in Lucene.

I am not suggesting that all scoring of vector document searches changes. Simply that "similarity" and "score" are related, but are different things.

Contributor

I don't know what CR means. Change request?

sorry, yes like a PR but from a parallel universe (code review actually)

So .. theoretical considerations aside, what's the alternative here -- we would treat the threshold as a "vector similarity" and internally convert it to a score. I mean that seems to make sense -- all the conversions are invertible, right? I think we'd want to add a normalize method to VectorSimilarity for this internal use.
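Assuming the conversions are invertible as suggested, the normalize step could look like this sketch (hypothetical helper names, reusing the float dot-product mapping discussed earlier in the thread; this is not the VectorSimilarity API):

```java
// Hypothetical sketch of an invertible similarity <-> score conversion for
// float dot-product: score = (1 + sim) / 2, sim = 2 * score - 1. A query
// could accept a raw similarity threshold and normalize it internally.
public class SimilarityNormalization {
    static float similarityToScore(float sim) {
        return (1 + sim) / 2;
    }

    static float scoreToSimilarity(float score) {
        return 2 * score - 1; // exact algebraic inverse of the mapping above
    }

    public static void main(String[] args) {
        float sim = 0.7f;
        // round-trips through the score domain (up to float rounding)
        System.out.println(scoreToSimilarity(similarityToScore(sim)));
    }
}
```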

Member

ben for the normal scoring, you can look at tests for similarities package. none of these have any 0 to 1 range or anything like that. instead requirements are that score increases semimonotonically as term frequency increases, decreases wrt documents length, etc. these guarantees allow optimizations such as block max wand to be applied safely. but theres no defined range at all. instead lots of crazy floating point hacks so that we can safely get really good performance.

Member

still don't have any explanation here as to why we'd do this for vector search query. we avoided any such thresholds or normalization in any of lucene's scoring for decades: if we didn't do that, we would have never been able to implement block-max WAND or other algorithms because they'd be incompatible.

please see:

I don't mind being the bad guy blocking this change because it seems like it has not been thought thru.

You must convince me.

@agorlenko agorlenko force-pushed the knn-similarity-threshold branch from 618843e to 211748e Compare December 16, 2022 22:08
@agorlenko
Author

It seems there are conflicts due to a recent refactor of this query - would you mind merging from main and resolving those please?

Done

Member

@benwtrent benwtrent left a comment

First, thank you for iterating on this and for the contribution. It has spurred some fruitful conversation.

My overall thoughts on the change, for whatever they are worth.

  • Since this is a very basic post-filter, I am not sure it is even necessary in Lucene. It adds some complexity without a ton of value.
  • I agree that this kind of thing is valuable in KNN. KNN is unique compared to sparse retrieval in that you always retrieve K results (unless using a restrictive filter). In some cases, the K retrieved can be irrelevant, especially when a filter is used. That said, it seems a better fit outside of Lucene.
  • As for the original implementation (filtering while searching the graph), this kind of thing will probably exist in the future. But we would need a new vector data structure. Additionally, it probably wouldn't be an add-on to the existing KNN queries, but a completely new query (VectorSimilarityQuery? VectorRadiusQuery???). HNSW doesn't work for this kind of thing. Once we have changed the underlying data structure, this should be revisited.

Not being a committer, and newer to Lucene, I can understand if my opinion is ignored :).

If this is committed, I would at least like similarityThreshold to be clearly documented as to what it means.

@rmuir
Member

rmuir commented Dec 19, 2022

  • I agree that this kind of thing is valuable in KNN. KNN is unique compared to sparse retrieval in that you always retrieve K results (unless using a restrictive filter). In some cases, the K retrieved can be irrelevant, especially when a filter is used. That said, it seems a better fit outside of Lucene.

this isn't any different than BM25 search.

Nothing special about KNN.

Still no justification to filter by score / scores as percentages.

@agorlenko
Author

Ok, it seems that I should close this PR, shouldn't I? It is not difficult to implement such functionality in the code that uses Lucene, if necessary (in contrast to the first implementation).

@msokolov what do you think?

In any case, I thank you all for the discussion.

@github-actions
Contributor

github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 8, 2024
@benwtrent
Member

There is the new vector similarity query that handles this overall request.

@agorlenko #12679 do you think this covers your use case?

I am thinking we should close this PR in deference to that query.

@github-actions github-actions bot removed the Stale label Jun 19, 2024
@agorlenko
Author

It is not exactly the same, but I don't mind )

@agorlenko
Author

Closed, because similar functionality now exists.

@agorlenko agorlenko closed this Jun 19, 2024