Skip to content

LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0#413

Merged
jtibshirani merged 5 commits intoapache:mainfrom
jtibshirani:knn-query
Oct 27, 2021
Merged

LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0#413
jtibshirani merged 5 commits intoapache:mainfrom
jtibshirani:knn-query

Conversation

@jtibshirani
Copy link
Copy Markdown
Member

@jtibshirani jtibshirani commented Oct 25, 2021

When the reader has no live docs, KnnVectorQuery can error out. This happens
because IndexReader#numDocs is 0, and we end up passing an illegal value of
k = 0 to the search method.

This commit removes the problematic optimization in KnnVectorQuery and
replaces with a lower-level based on the total number of vectors in the segment.

@jtibshirani
Copy link
Copy Markdown
Member Author

This is a pretty obscure bug, I only noticed it because we use a custom FilterLeafReader as part of our security implementation. I wasn't able to reproduce it using regular deletes or even soft deletes.

for (LeafReaderContext ctx : reader.leaves()) {
perLeafResults[ctx.ord] = searchLeaf(ctx, Math.min(k, reader.numDocs()));
int numDocs = ctx.reader().numDocs();
perLeafResults[ctx.ord] = numDocs > 0 ? searchLeaf(ctx, Math.min(k, numDocs)) : NO_RESULTS;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me wonder why we pass min(k, numDocs) rather than just k. Is it to avoid oversizing the heap that collect nearest neighbors?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One reason why I'm wondering this is because in the past we tried to avoid calling numDocs()unless it was strictly necessary in the past because it's expensive on reader views that hide subsets of documents (LUCENE-9003).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also guessed it was to avoid oversizing the heap for tiny segments. I'm not sure how helpful this is though, maybe we can simplify and just use k here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of passing k here, and moving the logic to avoid oversizing the heap to Lucene90HnswVectorsReader by doing k = min(k, size()) (where size() is the number of docs that have a vector).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me, I pushed a change. Instead of Lucene90HnswVectorsReader, I thought it could make sense to apply the bound in HnswGraph. But this turned out messier because there's separate concepts for topK and numSeed (we're cleaning this up as part of LUCENE-10054).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this - it makes sense to me use size() instead of numDocs(), or even simply k; I wasn't aware of the costly nature of that call. Indeed the idea here was just to avoid spending extra work on tiny segments; something I noticed all the time in tests, but which is probably not much of an issue in reality.

@jtibshirani jtibshirani requested a review from msokolov October 25, 2021 20:56
import org.apache.lucene.index.Term;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I thought we failed the build on wildcard imports, but apparently we don't. Maybe still use explicit imports to reduce line changes of this PR?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also noticed our static analysis is totally fine with it (surprisingly?) I'll need to fix my IntelliJ setup :)

@jtibshirani jtibshirani merged commit abd5ec4 into apache:main Oct 27, 2021
@jtibshirani jtibshirani deleted the knn-query branch October 27, 2021 18:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants