LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0 #413
jtibshirani merged 5 commits into apache:main from
Conversation
This is a pretty obscure bug; I only noticed it because we use a custom
  for (LeafReaderContext ctx : reader.leaves()) {
-   perLeafResults[ctx.ord] = searchLeaf(ctx, Math.min(k, reader.numDocs()));
+   int numDocs = ctx.reader().numDocs();
+   perLeafResults[ctx.ord] = numDocs > 0 ? searchLeaf(ctx, Math.min(k, numDocs)) : NO_RESULTS;
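The guard in the diff above can be illustrated with a minimal, self-contained sketch. Lucene's `LeafReaderContext` and result types are replaced here with plain ints and arrays, and `searchLeaf` is a hypothetical stand-in that fails on `k <= 0` the way the real search path did:

```java
public class KnnLeafGuardSketch {
  // Stand-in for the empty TopDocs returned for leaves with no live docs.
  static final int[] NO_RESULTS = new int[0];

  // Hypothetical stand-in for searchLeaf: rejects k <= 0, mimicking the original bug.
  static int[] searchLeaf(int k) {
    if (k <= 0) {
      throw new IllegalArgumentException("k must be positive, got " + k);
    }
    return new int[k];
  }

  // The fix: skip leaves with no live docs instead of passing k = 0 to the search.
  static int[] searchWithGuard(int numDocs, int k) {
    return numDocs > 0 ? searchLeaf(Math.min(k, numDocs)) : NO_RESULTS;
  }

  public static void main(String[] args) {
    assert searchWithGuard(0, 10).length == 0;   // empty leaf: no results, no exception
    assert searchWithGuard(5, 10).length == 5;   // k clamped to numDocs
    assert searchWithGuard(100, 3).length == 3;
    System.out.println("ok");
  }
}
```

Run with `java -ea` so the assertions fire; the point is simply that the empty-leaf case short-circuits before any illegal `k` reaches the search method.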
This makes me wonder why we pass min(k, numDocs) rather than just k. Is it to avoid oversizing the heap that collects nearest neighbors?
One reason I'm wondering is that in the past we tried to avoid calling numDocs() unless it was strictly necessary, because it's expensive on reader views that hide subsets of documents (LUCENE-9003).
I also guessed it was to avoid oversizing the heap for tiny segments. I'm not sure how helpful this is, though; maybe we can simplify and just use k here.
I was thinking of passing k here, and moving the logic to avoid oversizing the heap to Lucene90HnswVectorsReader by doing k = min(k, size()) (where size() is the number of docs that have a vector).
This makes sense to me; I pushed a change. Instead of Lucene90HnswVectorsReader, I thought it could make sense to apply the bound in HnswGraph, but this turned out messier because there are separate concepts for topK and numSeed (we're cleaning this up as part of LUCENE-10054).
Thanks for fixing this. It makes sense to me to use size() instead of numDocs(), or even simply k; I wasn't aware of the costly nature of that call. Indeed, the idea here was just to avoid spending extra work on tiny segments, something I noticed all the time in tests but which is probably not much of an issue in reality.
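The lower-level bound discussed in this thread, k = min(k, size()), can be sketched standalone. Note this is not the actual Lucene90HnswVectorsReader code; `size` here is a hypothetical stand-in for the count of documents that have a vector in the segment:

```java
public class VectorReaderClampSketch {
  // Stand-in for size(): the number of documents with a vector in this segment.
  private final int size;

  VectorReaderClampSketch(int size) {
    this.size = size;
  }

  // Clamp k inside the reader so the result heap is never sized larger than
  // the number of vectors, regardless of what the query layer asks for.
  int effectiveK(int k) {
    return Math.min(k, size);
  }

  public static void main(String[] args) {
    VectorReaderClampSketch tiny = new VectorReaderClampSketch(3);
    assert tiny.effectiveK(10) == 3; // tiny segment: heap bounded by vector count
    assert tiny.effectiveK(2) == 2;  // requested k already small enough
    System.out.println("ok");
  }
}
```

Putting the clamp here rather than in KnnVectorQuery means the query layer no longer needs the potentially expensive numDocs() call at all.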
- import org.apache.lucene.index.Term;
- import org.apache.lucene.index.VectorSimilarityFunction;
+ import org.apache.lucene.document.*;
+ import org.apache.lucene.index.*;
Oh, I thought we failed the build on wildcard imports, but apparently we don't. Maybe still use explicit imports to reduce the line changes in this PR?
I also noticed our static analysis is totally fine with it (surprisingly?). I'll need to fix my IntelliJ setup :)
When the reader has no live docs, KnnVectorQuery can error out. This happens because IndexReader#numDocs is 0, and we end up passing an illegal value of k = 0 to the search method.

This commit removes the problematic optimization in KnnVectorQuery and replaces it with a lower-level bound based on the total number of vectors in the segment.