-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Description
Context
Almost all vector search algorithms focus on getting the topK highest-scoring results for a given query vector. This however, may not be the best objective to pursue because:
- Some queries are more suited for vector search: Indexed vectors may not be uniformly distributed, with some parts of the vector space denser than others. If the query is present in a sparse part of the space, the
topKresults will be low scoring (and not relevant to the query). We may want some kind of scorethreshold, below which a vector is not considered as a "match"? - Conversely, if the query is present in a dense part of the vector space, we may not want to limit vector search to
topKwhen we have some more (slightly lower scoring) results, but still above thethreshold. Considering a query and a vector as a "match" if their similarity score is above thethreshold, we may want ALL possible matches?
I found the same thing being discussed in #11946 as well.. However, I'm not aware of any algorithms designed for this objective (or how common of a use-case this may be)
Proposal
IF this turns out to be a valid use-case, HNSW may not be the best algorithm for this new objective (as it was designed for getting the K-Nearest Neighbors). On a high-level, the current algorithm is:
- Maintain a queue of
topKresults (initially empty) and an unbounded queue of candidates (initially containing the entry node). Both of these are priority queues in descending order of similarity scores - Starting from the best (highest scoring) candidate, we insert it into the result queue if there are (1) fewer than
topKresults (2) it is better scoring than any existing result (also removing the least scoring result in this case) - If a node is added as a result, consider all its neighbors as further candidates
- The search stops when the best scoring candidate is worse than the least scoring result
For the new objective, the underlying graphs created by HNSW may still be useful - where each node is connected to its nearest (and diverse) neighbors, so we have some sense of direction / ordering on which node to explore further
Can we instead change the graph traversal (and result collection) criteria such that:
- Consider a node as a candidate if it exceeds a lower, graph-traversal specific score threshold
- Consider a candidate as a result if it exceeds the actual, query-result similarity score threshold
The lower graph-traversal threshold acts as a buffer, to retain results where all nodes along its path may not be above the actual threshold (and is treated as a search-time hyper-parameter, chosen according to the underlying vector space and queries)
The upper levels of search will remain the same (find the single-best entry point for the next layer) and this new criteria is only applicable to the lowest layer (with all nodes) to collect an unbounded number of results above the threshold
This will also eliminate the need to maintain results and candidates as priority queues, instead keeping them as simple lists (reducing the overhead of heapify calls that maintain the min-max heap property) and just sort them once at the end
Looking for inputs on both the use-case and tweaked algorithm..