
Re-use information from graph traversal during exact search#12820

Closed
kaivalnp wants to merge 2 commits into apache:main from kaivalnp:reuse-graph-search

Conversation

@kaivalnp
Contributor

Description

In KNN queries with a pre-filter, we first perform an approximate graph search and then fall back to exact search based on the cost of the filter (if we visit more nodes than the filter matches, it is cheaper to perform an exact search)

Graph traversal performs some work (like scoring nodes, maintaining the topK, etc.) which can be re-used during exact search, based on the fact that any node which is "visited but not collected" during graph traversal will not be collected during exact search either

If we start exact search from the previous state (the current topK vectors from graph search), we can skip vectors already rejected (i.e. visited) during graph traversal, saving their similarity computations (duplicate work), and re-use the existing min-max heap from the KnnCollector
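The core idea can be sketched with plain java.util.BitSet operations (a hypothetical illustration only; the PR itself works on Lucene bitsets over vector ordinals, and remainingCandidates is a made-up name):

```java
import java.util.BitSet;

public class ReuseVisitedSketch {
  // Exact search only needs to score candidates that the filter accepts
  // and that graph search has NOT already visited: a "visited but not
  // collected" node was already scored and rejected during traversal.
  static int[] remainingCandidates(BitSet filterAccepted, BitSet visitedByGraph) {
    BitSet remaining = (BitSet) filterAccepted.clone();
    remaining.andNot(visitedByGraph); // drop already-scored ordinals
    return remaining.stream().toArray();
  }

  public static void main(String[] args) {
    BitSet accepted = new BitSet();
    accepted.set(0, 8); // filter matches ordinals 0..7
    BitSet visited = new BitSet();
    visited.set(2);
    visited.set(5); // graph search already scored ordinals 2 and 5
    int[] toScore = remainingCandidates(accepted, visited);
    assert toScore.length == 6; // only 6 similarity computations remain
  }
}
```

The saved work is exactly the similarity computations for ordinals in the intersection of "accepted" and "visited".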

I performed some benchmarks using KnnGraphTester by indexing 100k vectors of 100 dimensions, and ran 10k queries with a topK of 1000 across a range of selectivity values (the proportion of all documents accepted by the filter):

baseline

recall latency   nDoc fanout maxConn beamWidth  topK selectivity     type
0.980	 4.84	100000	0	16	100	1000	1.000	  post-filter
0.988	11.94	100000	0	16	100	1000	0.300	  pre-filter
0.988	14.09	100000	0	16	100	1000	0.250	  pre-filter
0.995	20.01	100000	0	16	100	1000	0.200	  pre-filter
1.000	18.86	100000	0	16	100	1000	0.150	  pre-filter
1.000	12.94	100000	0	16	100	1000	0.100	  pre-filter

candidate

recall latency   nDoc fanout maxConn beamWidth  topK selectivity     type
0.980	 4.86	100000	0	16	100	1000	1.000	  post-filter
0.989	12.16	100000	0	16	100	1000	0.300	  pre-filter
0.989	13.64	100000	0	16	100	1000	0.250	  pre-filter
0.994	19.04	100000	0	16	100	1000	0.200	  pre-filter
1.000	17.24	100000	0	16	100	1000	0.150	  pre-filter
1.000	11.61	100000	0	16	100	1000	0.100	  pre-filter

The gains may only be noticeable when topK is large and the filter is restrictive (more nodes visited -> more chances of falling back to exact search -> more duplicate similarity computations saved)

@jpountz
Contributor

jpountz commented Nov 16, 2023

This is an interesting idea. Ideally we would figure out up-front whether it's best to use the graph or not, but I can also imagine that we can't always make the right decision there, so we need the ability to fall back. I wonder if we could make it look a bit nicer API-wise, e.g. could we more generally move the responsibility of tracking which doc IDs have already been collected from the codec to the collector, so that it wouldn't even need changes to the API? I guess that the downside is that it would force us to track this information in the doc ID space, while the codec can do this more efficiently right now by tracking a bit set of vector ordinals.

@kaivalnp
Contributor Author

Thanks @jpountz! I realised something from your comment:

My current implementation has a flaw, because it cannot handle the OrdinalTranslatedKnnCollector correctly: the setVisited call takes the BitSet visited as packed ordinals, but the getVisited call receives a docId (not a vector ordinal), so we would need a reverse IntToIntFunction docIdToVectorOrdinal to map it back to an ordinal

This is straightforward for DenseOffHeapVectorValues or EmptyOffHeapVectorValues (because there is a 1-1 mapping between a doc and an ordinal) but becomes a problem for SparseOffHeapVectorValues, which has vectorOrdinaltoDocId implemented as a DirectMonotonicReader - I believe this stores docIds one after another, so getting the docId for an ordinal is a simple lookup at that offset. However, computing the inverse can become costly (a binary search returning the index) as opposed to the current constant-time lookup
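As a hypothetical sketch of that cost asymmetry (ordToDoc and docIdToOrdinal are made-up names, not the actual Lucene API): the forward map is a single array read, while the inverse needs a binary search over the sorted docIds:

```java
import java.util.Arrays;

public class OrdinalInverse {
  // Forward map: ordToDoc[ordinal] -> docId, with docIds stored in
  // ascending order (the kind of constant-time lookup the monotonic
  // reader provides). The inverse, docId -> ordinal, has no direct
  // index and falls back to a binary search over the same data.
  static int docIdToOrdinal(int[] ordToDoc, int docId) {
    int idx = Arrays.binarySearch(ordToDoc, docId);
    return idx >= 0 ? idx : -1; // -1: this doc has no vector
  }

  public static void main(String[] args) {
    int[] ordToDoc = {3, 7, 10, 42}; // sparse: only 4 docs have vectors
    assert ordToDoc[2] == 10;                 // O(1) forward lookup
    assert docIdToOrdinal(ordToDoc, 10) == 2; // O(log n) inverse
    assert docIdToOrdinal(ordToDoc, 5) == -1; // doc 5 has no vector
  }
}
```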

I wonder how costly it would be to maintain the set of visited docs in the KnnCollector like you mentioned (perhaps using a SparseFixedBitSet)? We already create a BitSet of maxDoc length to hold the filtered docs

In the worst case, we would need another BitSet of the same length to store which docs were visited during graph search, then skip over those in #exactSearch. However, there may be a better opportunity here: since we want to iterate over docs that are "prefiltered but not visited", can we simply #clear the bits whenever we visit a node? We just need to find a way to do this cleanly
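A minimal sketch of that clear-on-visit idea with java.util.BitSet (hypothetical; the real code works on Lucene bitsets, and mutating a shared filter bitset may not be acceptable, which is the "do this cleanly" question):

```java
import java.util.BitSet;

public class ClearOnVisit {
  public static void main(String[] args) {
    BitSet acceptedDocs = new BitSet();
    acceptedDocs.set(0, 10); // filter accepts docs 0..9
    // Clearing a doc's bit when the graph search visits it means the set
    // that remains afterwards is exactly "accepted by the filter but not
    // visited" - the docs exact search still has to score. No second
    // bitset of maxDoc length is needed.
    int[] visitedDuringGraphSearch = {1, 4, 8};
    for (int doc : visitedDuringGraphSearch) {
      acceptedDocs.clear(doc);
    }
    assert acceptedDocs.cardinality() == 7; // 10 accepted - 3 visited
    assert !acceptedDocs.get(4) && acceptedDocs.get(5);
  }
}
```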

Contributor

@vigyasharma vigyasharma left a comment

Exciting change @kaivalnp! I agree with the general direction of reusing the work we've done in approximateSearch() when we fall back to exactSearch(). Would be great to avoid redoing some work.

There are subtleties, however, that might require a deeper change to the code structure, like working in ordinal vs. docId space that @jpountz pointed out, or how to cleanly share the collector between exact and approximate search. We need to think a bit more about those areas.

This PR is still a great first step. It is especially helpful for profiling the potential gains from this change.

Scorer scorer = filterWeight.scorer(ctx);
if (scorer == null) {
KnnCollector collector = getCollector(ctx, cost);
if (collector == null) {
Contributor

Can we get a null collector here?

Contributor Author

The null check comes from DiversifyingChildren{Byte,Float}KnnVectorQuery, where the parentBitSet may be null

In this case, we do not want to perform any searches, since there are no matching docs anyway

Comment on lines +112 to +116
  Bits acceptDocs;
  int cost;
  if (filterWeight == null) {
-   return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE);
+   acceptDocs = liveDocs;
+   cost = Integer.MAX_VALUE;
Contributor

I see you're trying to use acceptDocs across the different search calls here, but I wonder if this particular instance is worth changing. It requires us to use Bits acceptDocs instead of a Bitset, due to which you have to upcast it later in this method.

Since we'll always do approximateSearch when filterWeight is null (there can be no exact search), can we leave the return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE); as is?

Contributor Author

I agree, it looks a bit convoluted now. I'll change it back to the original flow

if ((cost <= k || collector.earlyTerminated()) && acceptDocs instanceof BitSet bitSet) {
// If there are <= k possible matches, or we stopped the kNN search because it visited too
// many nodes, fall back to exact search
return exactSearch(ctx, new BitSetIterator(bitSet, cost), collector);
Contributor

Nice! This should save us from recomputing similarity on docs that we know will get rejected later.

Contributor Author

Yes, this is based on the assumption that any doc "visited but not collected" should be outright rejected in #exactSearch


protected abstract TopDocs approximateSearch(
LeafReaderContext context, Bits acceptDocs, int visitedLimit) throws IOException;
protected KnnCollector getCollector(LeafReaderContext context, int visitLimit)
Contributor

Why do we need context here? We don't seem to be using it.

Contributor

Ah I see we have trouble with DiversifyingNearestChildrenKnnCollector which needs a parent filter bitset to initialize. So approximateSearch() tends to create its own collectors. But our problem here is we want to share the collector between two different types of searches. I get the workaround you have here, but it feels a little trappy to me. Sorry I don't have a good solution either, I need to think about this more deeply... It might be a bigger change.

Contributor Author

Agreed, perhaps we can modify the functions a bit to make it cleaner

topDoc.score = score;
topDoc.doc = doc;
topDoc = queue.updateTop();
if (score > collector.minCompetitiveSimilarity()) {
Contributor

I think using minCompetitiveSimilarity() here makes sense. Worth noting that once we have #12794, exactSearch() will start competing with global competitive scores, something that we don't currently do.
I don't think it should be a problem though; in fact, it seems like the right behavior.

Contributor Author

Interesting! Using all available information (like higher scores from some other segment's results) should be beneficial


  TotalHits totalHits = new TotalHits(acceptIterator.cost(), TotalHits.Relation.EQUAL_TO);
- return new TopDocs(totalHits, topScoreDocs);
+ return new TopDocs(totalHits, collector.topDocs().scoreDocs);
Contributor

Ah, so we cannot return collector.topDocs() like we do here in LeafReader, because the collector has actually visited more nodes than the filter. Good catch!

Contributor Author

Yes, we do not want the relation to be GREATER_THAN_OR_EQUAL_TO

return results != null ? results : NO_RESULTS;
protected void approximateSearch(
LeafReaderContext context, Bits acceptDocs, KnnCollector collector) throws IOException {
if (collector != null) {
Contributor

When would we get a null collector here?

Contributor Author

This is extended by DiversifyingChildrenFloatKnnVectorQuery, which can have null collectors. Perhaps we can remove the check here and instead override the method in the sub-class to include it?

I just added this because users may tend to extend KnnFloatVectorQuery for custom implementations, and need not worry about the collector being null

return result;
}

private static class ParentBlockJoinByteVectorScorer {
Contributor

Do we not need this code anymore?

Contributor Author

This code implements a special #exactSearch for DiversifyingChildren{Byte,Float}KnnVectorQuery where we:

  1. Go over parent docs
  2. Compute the best-scoring child for that parent
  3. Insert child into the queue if the best-scoring child is better than any collected result

Now that we're re-using the KnnCollector for #exactSearch, the existing DiversifyingNearestChildrenKnnCollector implicitly does all of this - via NodeIdCachingHeap
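For illustration only, the per-parent selection the removed scorer performed could be sketched as follows (all names and data here are made up; the real code scores child vectors against the query and maintains a heap):

```java
import java.util.HashMap;
import java.util.Map;

public class ParentBestChildSketch {
  // For each parent doc, only its best-scoring child competes for the
  // global topK - the diversification the removed #exactSearch enforced.
  static Map<Integer, Float> bestChildScorePerParent(Map<Integer, float[]> childScoresByParent) {
    Map<Integer, Float> best = new HashMap<>();
    for (Map.Entry<Integer, float[]> e : childScoresByParent.entrySet()) {
      float max = Float.NEGATIVE_INFINITY;
      for (float s : e.getValue()) {
        max = Math.max(max, s); // keep the best child score
      }
      best.put(e.getKey(), max);
    }
    return best;
  }

  public static void main(String[] args) {
    Map<Integer, float[]> scores = new HashMap<>();
    scores.put(10, new float[] {0.2f, 0.9f}); // parent 10: two children
    scores.put(20, new float[] {0.5f});       // parent 20: one child
    Map<Integer, Float> best = bestChildScorePerParent(scores);
    assert best.get(10) == 0.9f;
    assert best.get(20) == 0.5f;
  }
}
```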

@vigyasharma
Contributor

The gains may be beneficial only when topK is large and the filter is restrictive

Is this because a large topK will cause approximate search to fan out more and have potentially more nodes to visit, while the restrictive filter increases the odds of neighbors with high similarity getting rejected, causing us to visit more than cost(filter) and fall back to exact search?

So we see a 5-10% improvement in latency b/w baseline and candidate?

@kaivalnp
Contributor Author

Yes, the restrictive filter will cause more fallbacks to #exactSearch, and the high topK will mean more nodes visited, i.e. more duplicate work saved

So we see a 5-10% improvement in latency b/w baseline and candidate?

The benchmark is sort of a happy case (high topK), but yes, we see a 5-10% improvement in latency there. I'm not sure which topK / selectivity combinations users generally run; it may be helpful if someone else can replicate the benchmark with values close to their use-case

like working in ordinal v/s docId space that @jpountz pointed out

I took a shot at moving the tracking of visited nodes to the docId space (delegated to the KnnCollector), and the benchmark results are similar to the ones posted above (but the flaw mentioned earlier is addressed)

Member

@benwtrent benwtrent left a comment

I like the createCollector interface and then using the collector-focused leaf searcher function. Maybe extract that to another PR? That could make this one a bit cleaner and easier to reason about, as it's a pretty simple refactor.

However, I do not immediately like prepareScratchState. It seems like the collector is doing too much now.

I need to think a bit more; it seems like prepareScratchState shouldn't be attached to the collector, it's leaking information.

Comment on lines +88 to +91

void prepareScratchState(int maxDoc);

boolean visit(int docId);
Member

We are really spreading the API here. I will need to think more on this, I am not sure.

Comment on lines +73 to +81
@Override
public void prepareScratchState(int maxDoc) {
  if (visited == null) {
    visited = new FixedBitSet(maxDoc + 1);
  } else {
    visited = FixedBitSet.ensureCapacity(visited, maxDoc);
    visited.clear();
  }
}
Member

This is a bad idea on very large graphs; since the number of nodes we visit scales logarithmically, a sparse bitset would be much better.

See: #12789

@github-actions
Contributor

github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 8, 2024
@benwtrent
Member

I have done some more benchmarking and there isn't really a significant improvement. This is over 500k vectors of 1024 dimensions, retrieving the nearest 500 neighbors.

Baseline

latency	nDoc	filter	type
3.11	500000	0.0060	pre-filter
3.11	500000	0.0059	pre-filter
2.94	500000	0.0058	pre-filter
2.90	500000	0.0057	pre-filter
2.81	500000	0.0056	pre-filter
2.77	500000	0.0055	pre-filter
2.65	500000	0.0054	pre-filter
2.80	500000	0.0053	pre-filter

Candidate (this using a FixedBitSet to keep track of visited in a collector)

latency	nDoc	filter	type
2.94	500000	0.0060	pre-filter
2.90	500000	0.0059	pre-filter
2.87	500000	0.0058	pre-filter
2.94	500000	0.0057	pre-filter
2.70	500000	0.0056	pre-filter
2.60	500000	0.0055	pre-filter
2.63	500000	0.0054	pre-filter
2.59	500000	0.0053	pre-filter

Note, this is with a FixedBitSet that allocates enough space to track every vector, which can get very expensive on large segments. I tried a SparseBitSet, but at my scale of only 500k docs, it was actually slower than the baseline.

It just shows that the margins of gain here may be very slim :/

@github-actions github-actions bot removed the Stale label Feb 22, 2024
@kaivalnp
Contributor Author

Thanks for checking @benwtrent!

We primarily improve cases with a high topK + a selective filter (a good rate of fallback, and a large number of duplicate computations saved). I see ~5% gains in those cases from your numbers, which may not be high enough to justify the extra memory consumption

I'll wait a couple of days in case someone feels this is still worth pursuing, or else I'll close this PR
