Re-use information from graph traversal during exact search #12820

kaivalnp wants to merge 2 commits into apache:main from …
Conversation
jpountz left a comment

This is an interesting idea. Ideally we would figure out up-front whether it's best to use the graph or not, but I can also imagine that we can't always make the right decision there, so we need the ability to fall back. I wonder if we could make it look a bit nicer API-wise, e.g. could we more generally move the responsibility of tracking which doc IDs have already been collected from the codec to the collector, so that it wouldn't even need changes to the API? I guess that the downside is that it would force us to track this information in the doc ID space, while the codec can do this more efficiently right now by tracking a bit set of vector ordinals.
Thanks @jpountz! I realised something from your comment: my current implementation has a flaw, because it cannot handle the … This is straightforward for … I wonder how costly it would be to maintain the set of visited docs at the … In the worst case, we would need another …
vigyasharma left a comment
Exciting change @kaivalnp! I agree with the general direction of reusing the work we've done in approximateSearch() when we fall back to exactSearch(). Would be great to avoid redoing some work.

There are subtleties, however, that might require a deeper change to the code structure — like working in ordinal vs. docId space, which @jpountz pointed out, or how to cleanly share the collector between exact and approximate search. We need to think a bit more about those areas.

This PR is still a great first step. It is especially helpful for profiling the potential gains from this change.
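To make the ordinal vs. doc-ID concern above concrete, here is a minimal sketch (illustrative names, not Lucene's API): the codec can track visited vectors by dense ordinal (0..numVectors-1), which is compact, while a collector working in doc-ID space needs the ordinal-to-docId mapping, since not every document in a segment has a vector.

```java
// Sketch only: `OrdToDocSketch`, `toDocSpace`, and the int[] mapping are
// hypothetical stand-ins for the codec's internal ordinal -> docId mapping.
class OrdToDocSketch {
  private final int[] ordToDoc; // sorted doc IDs of the docs that have a vector

  OrdToDocSketch(int[] ordToDoc) {
    this.ordToDoc = ordToDoc;
  }

  int docId(int ord) {
    return ordToDoc[ord];
  }

  // Translate visited ordinals into doc-ID space, as a collector that
  // tracks doc IDs would have to do.
  java.util.BitSet toDocSpace(int[] visitedOrds, int maxDoc) {
    java.util.BitSet docs = new java.util.BitSet(maxDoc);
    for (int ord : visitedOrds) {
      docs.set(ordToDoc[ord]);
    }
    return docs;
  }
}
```

The trade-off jpountz describes falls out directly: the ordinal space is dense (one bit per vector), while the doc-ID space is as large as the segment.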
```java
Scorer scorer = filterWeight.scorer(ctx);
if (scorer == null) {
KnnCollector collector = getCollector(ctx, cost);
if (collector == null) {
```
Can we get a null collector here?
The null check comes from DiversifyingChildren{Byte,Float}KnnVectorQuery, where the parentBitSet may be null
In this case, we do not want to perform any searches, since there are no matching docs anyway
```java
Bits acceptDocs;
int cost;
if (filterWeight == null) {
return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE);
acceptDocs = liveDocs;
cost = Integer.MAX_VALUE;
```
I see you're trying to use acceptDocs across the different search calls here, but I wonder if this particular instance is worth changing. It requires us to use a `Bits acceptDocs` instead of a `BitSet`, due to which you have to upcast it later in this method.

Since we'll always do approximateSearch when `filterWeight` is null (there can be no exact search), can we leave the `return approximateSearch(ctx, liveDocs, Integer.MAX_VALUE);` as is?
I agree, it looks a bit convoluted now. I'll change it back to the original flow
```java
if ((cost <= k || collector.earlyTerminated()) && acceptDocs instanceof BitSet bitSet) {
  // If there are <= k possible matches, or we stopped the kNN search because it visited too
  // many nodes, fall back to exact search
  return exactSearch(ctx, new BitSetIterator(bitSet, cost), collector);
```
Nice! This should save us from recomputing similarity on docs that we know will get rejected later.
Yes, this is based on the assumption that any doc "visited but not collected" should be outright rejected from #exactSearch
```java
protected abstract TopDocs approximateSearch(
    LeafReaderContext context, Bits acceptDocs, int visitedLimit) throws IOException;

protected KnnCollector getCollector(LeafReaderContext context, int visitLimit)
```
Why do we need context here? We don't seem to be using it.
Ah, I see we have trouble with DiversifyingNearestChildrenKnnCollector, which needs a parent filter bitset to initialize, so approximateSearch() tends to create its own collectors. But our problem here is that we want to share the collector between two different types of searches. I get the workaround you have here, but it feels a little trappy to me. Sorry, I don't have a good solution either; I need to think about this more deeply... It might be a bigger change.
Agreed, perhaps we can modify the functions a bit to make it cleaner
```java
topDoc.score = score;
topDoc.doc = doc;
topDoc = queue.updateTop();
if (score > collector.minCompetitiveSimilarity()) {
```
I think using minCompetitiveSimilarity() here makes sense. Worth noting that once we have #12794, exactSearch() will start competing with global competitive scores, something that we don't currently do.
I don't think it should be a problem though, in fact, it seems like the right behavior.
Interesting! Using all available information (like higher scores from some other segment's results) should be beneficial
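The minCompetitiveSimilarity() pattern in the diff above can be sketched in isolation — a size-bounded collector whose "score to beat" is the worst score currently in the top-k queue. This is a simplified stand-in, not Lucene's KnnCollector; the class and method names are illustrative.

```java
import java.util.PriorityQueue;

// Sketch: a top-k collector backed by a min-heap of scores. Until the heap
// is full, every score is competitive; afterwards, a candidate must beat
// the heap's minimum (the worst collected score) to be collected.
class SimpleKnnCollector {
  private final int k;
  private final PriorityQueue<Float> heap = new PriorityQueue<>(); // min-heap

  SimpleKnnCollector(int k) {
    this.k = k;
  }

  // Score a new hit must exceed to enter the top-k queue.
  float minCompetitiveSimilarity() {
    return heap.size() < k ? Float.NEGATIVE_INFINITY : heap.peek();
  }

  // Mirrors the guard in the diff: only touch the queue when the candidate
  // beats the current worst collected score.
  boolean collect(float score) {
    if (score > minCompetitiveSimilarity()) {
      heap.offer(score);
      if (heap.size() > k) {
        heap.poll(); // evict the worst score
      }
      return true;
    }
    return false;
  }
}
```

With the global competitive scores from #12794, the only change to this sketch would be that minCompetitiveSimilarity() could also reflect scores collected on other segments.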
```java
TotalHits totalHits = new TotalHits(acceptIterator.cost(), TotalHits.Relation.EQUAL_TO);
return new TopDocs(totalHits, topScoreDocs);
return new TopDocs(totalHits, collector.topDocs().scoreDocs);
```
Ah, so we cannot return collector.topDocs() like we do here in LeafReader, because the collector has actually visited more nodes than the filter. Good catch!
Yes, we do not want the relation to be GREATER_THAN_OR_EQUAL_TO
```java
return results != null ? results : NO_RESULTS;
protected void approximateSearch(
    LeafReaderContext context, Bits acceptDocs, KnnCollector collector) throws IOException {
if (collector != null) {
```
When would we get a null collector here?
This is extended by the DiversifyingChildrenFloatKnnVectorQuery, which can have null collectors. Perhaps we can remove the check here, but override the method in the sub-class and include the check there?
I just added this because users may tend to extend KnnFloatVectorQuery for custom implementations, and need not worry about the collector being null
```java
return result;
}

private static class ParentBlockJoinByteVectorScorer {
```
Do we not need this code anymore?
This code implements a special #exactSearch for DiversifyingChildren{Byte,Float}KnnVectorQuery where we:
- Go over parent docs
- Compute the best-scoring child for that parent
- Insert child into the queue if the best-scoring child is better than any collected result
Now that we're re-using the KnnCollector for #exactSearch, the existing DiversifyingNearestChildrenKnnCollector implicitly does all of this - via NodeIdCachingHeap
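The per-parent diversification described in the bullets above can be sketched without any Lucene machinery — a child is collected only if it beats the best-scoring child seen so far for the same parent. This is an illustrative stand-in (hypothetical class name), not the NodeIdCachingHeap implementation itself.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: keep only the best-scoring child per parent, so the final
// result set stays diverse across parent documents.
class DiversifyingSketch {
  private final Map<Integer, Float> bestScorePerParent = new HashMap<>();

  // Returns true if this child became the new best for its parent.
  boolean collect(int parentDoc, float score) {
    Float best = bestScorePerParent.get(parentDoc);
    if (best == null || score > best) {
      bestScorePerParent.put(parentDoc, score);
      return true;
    }
    return false;
  }

  float bestScore(int parentDoc) {
    return bestScorePerParent.getOrDefault(parentDoc, Float.NEGATIVE_INFINITY);
  }
}
```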
Is this because a large … So we see a 5-10% improvement in latency between baseline and candidate?
Force-pushed 0362475 to 4ca5b08
Yes, the restrictive filter will cause more fallbacks to …

The benchmark is sort of a happy case (high …

I took a shot at moving the tracking of …
benwtrent left a comment
I like the createCollector interface and then using the collector-focused leaf searcher function. Maybe extract that to another PR? That could make this one a bit cleaner and easier to reason about, as it's a pretty simple refactor.

However, I do not immediately like the prepareScratchState. It seems like the collector is doing too much now.

I need to think a bit more; it seems like prepareScratchState shouldn't be attached to the collector, it's leaking information.
```java
void prepareScratchState(int maxDoc);

boolean visit(int docId);
```
We are really spreading the API here. I will need to think more on this; I am not sure.
```java
@Override
public void prepareScratchState(int maxDoc) {
  if (visited == null) {
    visited = new FixedBitSet(maxDoc + 1);
  } else {
    visited = FixedBitSet.ensureCapacity(visited, maxDoc);
    visited.clear();
  }
}
```
This is a bad idea on very large graphs: since the number of visited nodes scales roughly logarithmically with graph size, a sparse bit set would be much better.

See: #12789
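The concern above is about memory: a dense bit set pays ~1 bit per document in the segment no matter how few docs were actually visited, while a sparse structure pays only per visited doc. The sketch below illustrates this trade-off with stdlib stand-ins (Lucene's SparseFixedBitSet is the real sparse alternative; HashSet merely plays its role here).

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Sketch: dense vs. sparse visited-doc tracking. Both answer the question
// "was this doc visited?" identically; they differ only in memory cost.
class VisitedTracking {
  // Approximate bytes a dense bit set needs for a segment of maxDoc docs.
  static long denseBytes(int maxDoc) {
    return maxDoc / 8L; // ~1 bit per doc, visited or not
  }

  // Verify the two representations agree on every doc.
  static boolean sameAnswers(int maxDoc, int[] visitedDocs) {
    BitSet dense = new BitSet(maxDoc);
    Set<Integer> sparse = new HashSet<>();
    for (int doc : visitedDocs) {
      dense.set(doc);
      sparse.add(doc);
    }
    for (int doc = 0; doc < maxDoc; doc++) {
      if (dense.get(doc) != sparse.contains(doc)) {
        return false;
      }
    }
    return true;
  }
}
```

For an 8M-doc segment, the dense approach costs about 1 MB per query even if the graph search visited only a few thousand nodes, which is the cost benwtrent flags.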
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
I have done some more benchmarking and there isn't really a significant improvement. This is over 500k 1024-dimensional vectors, getting the nearest 500 neighbors.

Baseline: …

Candidate (this, using a FixedBitSet to keep track of visited in a collector): …

Note, this is with a FixedBitSet that allocates enough space to track every vector. This can get very expensive on large segments. I tried a … It just shows that the margins of the gain here may be very slim :/
Thanks for checking @benwtrent! We primarily improve cases of using a high topK + a selective filter (good rate of fallback, large number of duplicate computations). I notice ~5% gains in those cases from your numbers, which may not be high enough for the extra memory consumption.

I'll wait for a couple of days in case someone feels this is still worth pursuing, or else I'll close this PR.
Description
In KNN queries with a pre-filter, we first perform an approximate graph search, and then fall back to exact search based on the cost of the filter (if the graph search would visit more nodes than the filter matches, it is cheaper to perform an exact search).
Graph traversal performs some work (like scoring nodes, maintaining the topK, etc.) which can be re-used during exact search, based on the fact that any node which is "visited but not collected" will not be collected by exact search either. If we start exact search from the previous state (current topK vectors from graph search), we can ignore vectors already rejected (i.e. visited) during graph traversal and save on their similarity computations (duplicate work), plus re-use the existing min-max heap from the KnnCollector.

I performed some benchmarks using KnnGraphTester by indexing 100k vectors of 100 dimensions, and ran 10k queries with a topK of 1000 across a range of selectivity values (denoting what proportion of all documents are accepted by the filter):

baseline: …

candidate: …

The gains may be beneficial only when topK is large and the filter is restrictive (more nodes visited -> more chances of falling back to exact search -> more duplicate similarity computations saved)
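The core saving described above can be sketched end-to-end in a few lines (illustrative names, not Lucene's API): graph search records every node it scores in a visited set, and the exact-search fallback then skips those docs, since "visited but not collected" means the doc was already scored and rejected.

```java
import java.util.BitSet;

// Sketch: count how many similarity computations the exact-search fallback
// still needs once it can skip docs already visited by the graph search.
class FallbackSketch {
  static int exactComputationsNeeded(int maxDoc, int[] visitedDuringGraphSearch) {
    BitSet visited = new BitSet(maxDoc);
    for (int doc : visitedDuringGraphSearch) {
      visited.set(doc); // scored (and possibly rejected) during graph search
    }
    int computed = 0;
    for (int doc = 0; doc < maxDoc; doc++) {
      if (visited.get(doc)) {
        continue; // skip: similarity was already computed during traversal
      }
      computed++; // exact search would compute a vector similarity here
    }
    return computed;
  }
}
```

The more nodes the graph search visits before falling back (large topK, restrictive filter), the more of these computations are skipped, which is exactly where the benchmarks above show gains.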