
Re-use information from graph traversal during exact search#12820

Closed
kaivalnp wants to merge 2 commits into apache:main from kaivalnp:reuse-graph-search

Conversation

@kaivalnp
Contributor

Description

In KNN queries with a pre-filter, we first perform an approximate graph search and then fall back to exact search based on the cost of the filter (if we visit more nodes than the filter matches, it is cheaper to perform an exact search)

Graph traversal performs some work (like scoring nodes, maintaining the topK, etc.) which can be re-used during exact search, based on the fact that any node which is "visited but not collected" during graph traversal will not be collected during exact search either

If we start exact search from the previous state (the current topK vectors from graph search), we can skip vectors already rejected (i.e. visited) during graph traversal, saving their similarity computations (duplicate work), and re-use the existing min-max heap from the KnnCollector
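The core idea can be sketched with plain java.util.BitSet operations (a hypothetical illustration only; the PR itself works on Lucene bitsets over vector ordinals, and remainingCandidates is a made-up name):

```java
import java.util.BitSet;

public class ReuseVisitedSketch {
  // Exact search only needs to score candidates that the filter accepts
  // and that graph search has NOT already visited: a "visited but not
  // collected" node was already scored and rejected during traversal.
  static int[] remainingCandidates(BitSet filterAccepted, BitSet visitedByGraph) {
    BitSet remaining = (BitSet) filterAccepted.clone();
    remaining.andNot(visitedByGraph); // drop already-scored ordinals
    return remaining.stream().toArray();
  }

  public static void main(String[] args) {
    BitSet accepted = new BitSet();
    accepted.set(0, 8); // filter matches ordinals 0..7
    BitSet visited = new BitSet();
    visited.set(2);
    visited.set(5); // graph search already scored ordinals 2 and 5
    int[] toScore = remainingCandidates(accepted, visited);
    assert toScore.length == 6; // only 6 similarity computations remain
  }
}
```

The saved work is exactly the similarity computations for ordinals in the intersection of "accepted" and "visited".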

I performed some benchmarks using KnnGraphTester by indexing 100k vectors of 100 dimensions, and ran 10k queries with a topK of 1000 across a range of selectivity values (the proportion of all documents accepted by the filter):

baseline

recall latency   nDoc fanout maxConn beamWidth  topK selectivity     type
0.980	 4.84	100000	0	16	100	1000	1.000	  post-filter
0.988	11.94	100000	0	16	100	1000	0.300	  pre-filter
0.988	14.09	100000	0	16	100	1000	0.250	  pre-filter
0.995	20.01	100000	0	16	100	1000	0.200	  pre-filter
1.000	18.86	100000	0	16	100	1000	0.150	  pre-filter
1.000	12.94	100000	0	16	100	1000	0.100	  pre-filter

candidate

recall latency   nDoc fanout maxConn beamWidth  topK selectivity     type
0.980	 4.86	100000	0	16	100	1000	1.000	  post-filter
0.989	12.16	100000	0	16	100	1000	0.300	  pre-filter
0.989	13.64	100000	0	16	100	1000	0.250	  pre-filter
0.994	19.04	100000	0	16	100	1000	0.200	  pre-filter
1.000	17.24	100000	0	16	100	1000	0.150	  pre-filter
1.000	11.61	100000	0	16	100	1000	0.100	  pre-filter

The gains may only be noticeable when topK is large and the filter is restrictive (more nodes visited -> more chances of falling back to exact search -> more duplicate similarity computations saved)

@jpountz
Contributor

jpountz commented Nov 16, 2023

This is an interesting idea. Ideally we would figure out up-front whether it's best to use the graph or not, but I can also imagine that we can't always make the right decision there, so we need the ability to fall back. I wonder if we could make it look a bit nicer API-wise, e.g. could we more generally move the responsibility of tracking which doc IDs have already been collected from the codec to the collector, so that it wouldn't even need changes to the API? I guess that the downside is that it would force us to track this information in the doc ID space, while the codec can do this more efficiently right now by tracking a bit set of vector ordinals.

@kaivalnp
Contributor Author

Thanks @jpountz! I realised something from your comment:

My current implementation has a flaw, because it cannot handle the OrdinalTranslatedKnnCollector correctly: the setVisited call takes the BitSet visited as packed ordinals, but the getVisited call receives a docId (not a vector ordinal), so we would need a reverse IntToIntFunction docIdToVectorOrdinal to map it back to an ordinal

This is straightforward for DenseOffHeapVectorValues or EmptyOffHeapVectorValues (because there is a 1-1 mapping between a doc and an ordinal) but becomes a problem for SparseOffHeapVectorValues, which has vectorOrdinaltoDocId implemented as a DirectMonotonicReader - I believe this stores docIds one after another, so getting the docId for an ordinal is a simple lookup at that offset. However, computing the inverse can become costly (a binary search returning the index) as opposed to the current constant-time lookup
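As a hypothetical sketch of that cost asymmetry (ordToDoc and docIdToOrdinal are made-up names, not the actual Lucene API): the forward map is a single array read, while the inverse needs a binary search over the sorted docIds:

```java
import java.util.Arrays;

public class OrdinalInverse {
  // Forward map: ordToDoc[ordinal] -> docId, with docIds stored in
  // ascending order (the kind of constant-time lookup the monotonic
  // reader provides). The inverse, docId -> ordinal, has no direct
  // index and falls back to a binary search over the same data.
  static int docIdToOrdinal(int[] ordToDoc, int docId) {
    int idx = Arrays.binarySearch(ordToDoc, docId);
    return idx >= 0 ? idx : -1; // -1: this doc has no vector
  }

  public static void main(String[] args) {
    int[] ordToDoc = {3, 7, 10, 42}; // sparse: only 4 docs have vectors
    assert ordToDoc[2] == 10;                 // O(1) forward lookup
    assert docIdToOrdinal(ordToDoc, 10) == 2; // O(log n) inverse
    assert docIdToOrdinal(ordToDoc, 5) == -1; // doc 5 has no vector
  }
}
```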

I wonder how costly it would be to maintain the set of visited docs in the KnnCollector like you mentioned (perhaps using a SparseFixedBitSet)? We already create a BitSet of maxDoc length to hold the filtered docs

In the worst case, we would need another BitSet of the same length to store which docs were visited during graph search, then skip over those in #exactSearch. However, there may be a better opportunity here: since we want to iterate over docs that are "prefiltered but not visited", can we simply #clear the bits whenever we visit a node? We just need to find a way to do this cleanly
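A minimal sketch of that clear-on-visit idea with java.util.BitSet (hypothetical; the real code works on Lucene bitsets, and mutating a shared filter bitset may not be acceptable, which is the "do this cleanly" question):

```java
import java.util.BitSet;

public class ClearOnVisit {
  public static void main(String[] args) {
    BitSet acceptedDocs = new BitSet();
    acceptedDocs.set(0, 10); // filter accepts docs 0..9
    // Clearing a doc's bit when the graph search visits it means the set
    // that remains afterwards is exactly "accepted by the filter but not
    // visited" - the docs exact search still has to score. No second
    // bitset of maxDoc length is needed.
    int[] visitedDuringGraphSearch = {1, 4, 8};
    for (int doc : visitedDuringGraphSearch) {
      acceptedDocs.clear(doc);
    }
    assert acceptedDocs.cardinality() == 7; // 10 accepted - 3 visited
    assert !acceptedDocs.get(4) && acceptedDocs.get(5);
  }
}
```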

Contributor

@vigyasharma vigyasharma left a comment

Exciting change @kaivalnp! I agree with the general direction of reusing the work we've done in approximateSearch() when we fall back to exactSearch(). Would be great to avoid redoing some work.

There are subtleties, however, that might require a deeper change to the code structure, like working in ordinal vs. docId space that @jpountz pointed out, or how to cleanly share the collector between exact and approximate search. We need to think a bit more about those areas.

This PR is still a great first step. It is especially helpful for profiling the potential gains from this change.

Scorer scorer = filterWeight.scorer(ctx);
if (scorer == null) {
KnnCollector collector = getCollector(ctx, cost);
if (collector == null) {
Contributor

Can we get a null collector here?

Contributor Author

The null check comes from DiversifyingChildren{Byte,Float}KnnVectorQuery, where the parentBitSet may be null

In this case, we do not want to perform any searches, since there are no matching docs anyway

Comment on lines +112 to +116
  Bits acceptDocs;
  int cost;
  if (filterWeight == null) {
-   return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE);
+   acceptDocs = liveDocs;
+   cost = Integer.MAX_VALUE;
Contributor

I see you're trying to use acceptDocs across the different search calls here, but I wonder if this particular instance is worth changing. It requires us to use Bits acceptDocs instead of a Bitset, due to which you have to upcast it later in this method.

Since we'll always do approximateSearch when filterWeight is null (there can be no exact search), can we leave the return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE); as is?

Contributor Author

I agree, it looks a bit convoluted now. I'll change it back to the original flow

if ((cost <= k || collector.earlyTerminated()) && acceptDocs instanceof BitSet bitSet) {
// If there are <= k possible matches, or we stopped the kNN search because it visited too
// many nodes, fall back to exact search
return exactSearch(ctx, new BitSetIterator(bitSet, cost), collector);
Contributor

Nice! This should save us from recomputing similarity on docs that we know will get rejected later.

Contributor Author

Yes, this is based on the assumption that any doc "visited but not collected" should be outright rejected in #exactSearch


protected abstract TopDocs approximateSearch(
LeafReaderContext context, Bits acceptDocs, int visitedLimit) throws IOException;
protected KnnCollector getCollector(LeafReaderContext context, int visitLimit)
Contributor

Why do we need context here? We don't seem to be using it.

Contributor

Ah I see we have trouble with DiversifyingNearestChildrenKnnCollector which needs a parent filter bitset to initialize. So approximateSearch() tends to create its own collectors. But our problem here is we want to share the collector between two different types of searches. I get the workaround you have here, but it feels a little trappy to me. Sorry I don't have a good solution either, I need to think about this more deeply... It might be a bigger change.

Contributor Author

Agreed, perhaps we can modify the functions a bit to make it cleaner

topDoc.score = score;
topDoc.doc = doc;
topDoc = queue.updateTop();
if (score > collector.minCompetitiveSimilarity()) {
Contributor

I think using minCompetitiveSimilarity() here makes sense. Worth noting that once we have #12794, exactSearch() will start competing with global competitive scores, something that we don't currently do.
I don't think it should be a problem though; in fact, it seems like the right behavior.

Contributor Author

Interesting! Using all available information (like higher scores from some other segment's results) should be beneficial


  TotalHits totalHits = new TotalHits(acceptIterator.cost(), TotalHits.Relation.EQUAL_TO);
- return new TopDocs(totalHits, topScoreDocs);
+ return new TopDocs(totalHits, collector.topDocs().scoreDocs);
Contributor

Ah, so we cannot return collector.topDocs() like we do here in LeafReader, because the collector has actually visited more nodes than the filter. Good catch!

Contributor Author

Yes, we do not want the relation to be GREATER_THAN_OR_EQUAL_TO

return results != null ? results : NO_RESULTS;
protected void approximateSearch(
LeafReaderContext context, Bits acceptDocs, KnnCollector collector) throws IOException {
if (collector != null) {
Contributor

When would we get a null collector here?

Contributor Author

This is extended by DiversifyingChildrenFloatKnnVectorQuery, which can have null collectors. Perhaps we can remove the check here and instead override the method in the sub-class to include it?

I just added this because users may tend to extend KnnFloatVectorQuery for custom implementations, and need not worry about the collector being null

return result;
}

private static class ParentBlockJoinByteVectorScorer {
Contributor

Do we not need this code anymore?

Contributor Author

This code implements a special #exactSearch for DiversifyingChildren{Byte,Float}KnnVectorQuery where we:

  1. Go over parent docs
  2. Compute the best-scoring child for that parent
  3. Insert child into the queue if the best-scoring child is better than any collected result

Now that we're re-using the KnnCollector for #exactSearch, the existing DiversifyingNearestChildrenKnnCollector implicitly does all of this - via NodeIdCachingHeap
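For illustration only, the per-parent selection the removed scorer performed could be sketched as follows (all names and data here are made up; the real code scores child vectors against the query and maintains a heap):

```java
import java.util.HashMap;
import java.util.Map;

public class ParentBestChildSketch {
  // For each parent doc, only its best-scoring child competes for the
  // global topK - the diversification the removed #exactSearch enforced.
  static Map<Integer, Float> bestChildScorePerParent(Map<Integer, float[]> childScoresByParent) {
    Map<Integer, Float> best = new HashMap<>();
    for (Map.Entry<Integer, float[]> e : childScoresByParent.entrySet()) {
      float max = Float.NEGATIVE_INFINITY;
      for (float s : e.getValue()) {
        max = Math.max(max, s); // keep the best child score
      }
      best.put(e.getKey(), max);
    }
    return best;
  }

  public static void main(String[] args) {
    Map<Integer, float[]> scores = new HashMap<>();
    scores.put(10, new float[] {0.2f, 0.9f}); // parent 10: two children
    scores.put(20, new float[] {0.5f});       // parent 20: one child
    Map<Integer, Float> best = bestChildScorePerParent(scores);
    assert best.get(10) == 0.9f;
    assert best.get(20) == 0.5f;
  }
}
```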

@vigyasharma
Contributor

The gains may be beneficial only when topK is large and the filter is restrictive

Is this because a large topK will cause approximate search to fan out more and have potentially more nodes to visit, while the restrictive filter increases the odds of neighbors with high similarity getting rejected, causing us to visit more than cost(filter) and fall back to exact search?

So we see a 5-10% improvement in latency b/w baseline and candidate?

@kaivalnp
Contributor Author

Yes, the restrictive filter will cause more fallbacks to #exactSearch, and the high topK will mean more nodes visited, i.e. more duplicate work saved

So we see a 5-10% improvement in latency b/w baseline and candidate?

The benchmark is sort of a happy case (high topK), but yes, we see a 5-10% improvement in latency there. I'm not sure which topK / selectivity combinations users generally run; it may be helpful if someone else can replicate the benchmark with values close to their use-case

like working in ordinal v/s docId space that @jpountz pointed out

I took a shot at moving the tracking of visited nodes to the docId space (delegated to the KnnCollector), and the benchmark results are similar to the ones posted above (but the flaw mentioned earlier is addressed)

Member

@benwtrent benwtrent left a comment

I like the createCollector interface and then using the collector-focused leaf searcher function. Maybe extract that to another PR? That could make this one a bit cleaner and easier to reason about, as it's a pretty simple refactor.

However, I do not immediately like prepareScratchState. It seems like the collector is doing too much now.

I need to think a bit more; it seems like prepareScratchState shouldn't be attached to the collector, it's leaking information.

Comment on lines +88 to +91

void prepareScratchState(int maxDoc);

boolean visit(int docId);
Member

We are really spreading the API here. I will need to think more on this, I am not sure.

Comment on lines +73 to +81
@Override
public void prepareScratchState(int maxDoc) {
  if (visited == null) {
    visited = new FixedBitSet(maxDoc + 1);
  } else {
    visited = FixedBitSet.ensureCapacity(visited, maxDoc);
    visited.clear();
  }
}
Member

This is a bad idea on very large graphs; since the number of nodes we visit scales logarithmically, a sparse bitset would be much better.

See: #12789

@github-actions
Contributor

github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 8, 2024
@benwtrent
Member

I have done some more benchmarking and there isn't really a significant improvement. This is over 500k vectors of 1024 dimensions, retrieving the nearest 500 neighbors.

Baseline

latency	nDoc	filter	type
3.11	500000	0.0060	pre-filter
3.11	500000	0.0059	pre-filter
2.94	500000	0.0058	pre-filter
2.90	500000	0.0057	pre-filter
2.81	500000	0.0056	pre-filter
2.77	500000	0.0055	pre-filter
2.65	500000	0.0054	pre-filter
2.80	500000	0.0053	pre-filter

Candidate (this using a FixedBitSet to keep track of visited in a collector)

latency	nDoc	filter	type
2.94	500000	0.0060	pre-filter
2.90	500000	0.0059	pre-filter
2.87	500000	0.0058	pre-filter
2.94	500000	0.0057	pre-filter
2.70	500000	0.0056	pre-filter
2.60	500000	0.0055	pre-filter
2.63	500000	0.0054	pre-filter
2.59	500000	0.0053	pre-filter

Note, this is with a FixedBitSet that allocates enough space to track every vector, which can get very expensive on large segments. I tried a SparseBitSet, but at my scale of only 500k docs, it was actually slower than the baseline.

It just shows that the margins of gain here may be very slim :/

@github-actions github-actions bot removed the Stale label Feb 22, 2024
@kaivalnp
Contributor Author

Thanks for checking @benwtrent!

We primarily improve cases with a high topK + a selective filter (a good rate of fallback, and a large number of duplicate computations saved). I see ~5% gains in those cases from your numbers, which may not be high enough to justify the extra memory consumption

I'll wait a couple of days in case someone feels this is still worth pursuing, or else I'll close this PR
