Fix failing BaseVectorSimilarityQueryTestCase#testApproximate by kaivalnp · Pull Request #12922 · apache/lucene

kaivalnp · 2023-12-12T07:14:59Z

Discovered in #12921, and introduced in #12679

The first issue is that we weren't advancing the VectorScorer here -- so it was still un-positioned while trying to compute the similarity score

Earlier in the PR, the underlying delegate of the FilteredDocIdSetIterator was scorer.iterator() (see here) -- so we didn't need to explicitly advance it

Later, we decided to maintain parity to AbstractKnnVectorQuery and introduce filtering in AbstractVectorSimilarityQuery (see this commit) to determine the visitLimit of approximate search -- after which the underlying iterator changed to the accepted docs (see here) and I missed advancing the VectorScorer explicitly..

After doing so, we no longer get the original java.lang.ArrayIndexOutOfBoundsException -- but the BaseVectorSimilarityQueryTestCase#testApproximate starts failing because it falls back to exact search, as the limit of the prefilter is met during graph search

Relaxed the parameters of the test to fix this (making the filter less restrictive, and trying to visit a fewer number of nodes so that approximate search completes without hitting its limit)

Sorry for missing this earlier!

benwtrent · 2023-12-12T12:48:16Z

lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java

+              // Advance the scorer
+              if (!scorer.advanceExact(doc)) {
+                return false;
+              }
+


This tells me that none of the filtered cases drop to using exact search? If they did they would match on docs that were actually filtered out.

Could you add a test exercising this path?

This tells me that none of the filtered cases drop to using exact search?

We do have a test case that covers this: testRandomFilter randomly filters a sub-range of documents and expects all of them to be found -- which always falls back to exact search (because a traversalSimilarity of Float.NEGATIVE_INFINITY here will visit the entire graph and exhaust any limit)

While running this test, I put debug points here to check the scorer, and it was always unpositioned (docid = -1) but doesn't throw an error in many cases

The error is thrown when the underlying values are randomly chosen to SimpleTextFloatVectorValues - try ./gradlew test --tests TestFloatVectorSimilarityQuery.testRandomFilter -Dtests.seed=119135B1F0803918 for a failing case (as mentioned here, thanks @epotyom!)

In other cases (for example when values are DenseOffHeapVectorValues), the scorer does not fail even when unpositioned and returns some (garbage) value - but the count of results is still correct (and asserted)

I wonder if we can put an assert values.docID() != -1 before this and this line to ensure it is positioned before trying to compute scores? -- I tried this offline, and the issue was caught immediately

I added a commit to demonstrate this^, please let me know if we can improve this / do it some other way?

adding an assert is nice :). In reality it would throw trying to read garbage.

Discovered in #12921, and introduced in #12679 The first issue is that we weren't advancing the `VectorScorer` [here](https://github.com/apache/lucene/blob/cf13a9295052288b748ed8f279f05ee26f3bfd5f/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L257-L262) -- so it was still un-positioned while trying to compute the similarity score Earlier in the PR, the underlying delegate of the `FilteredDocIdSetIterator` was `scorer.iterator()` (see [here](https://github.com/apache/lucene/blob/cad565439be512ac6e95a698007b1fc971173f00/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L107)) -- so we didn't need to explicitly advance it Later, we decided to maintain parity to `AbstractKnnVectorQuery` and introduce filtering in `AbstractVectorSimilarityQuery` (see [this commit](5096790)) to determine the `visitLimit` of approximate search -- after which the underlying iterator changed to the accepted docs (see [here](https://github.com/apache/lucene/blob/5096790f281e477c529a7c8311aeb353ccdffdeb/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L255)) and I missed advancing the `VectorScorer` explicitly.. After doing so, we no longer get the original `java.lang.ArrayIndexOutOfBoundsException` -- but the `BaseVectorSimilarityQueryTestCase#testApproximate` starts failing because it falls back to exact search, as the limit of the prefilter is met during graph search Relaxed the parameters of the test to fix this (making the filter less restrictive, and trying to visit a fewer number of nodes so that approximate search completes without hitting its limit) Sorry for missing this earlier!

kaivalnp added 2 commits December 12, 2023 06:42

Advance VectorScorer to the correct result while iterating

7d9d634

Relax parameters of BaseVectorSimilarityQueryTestCase#testApproximate

a2b94db

benwtrent reviewed Dec 12, 2023

View reviewed changes

Assert that VectorScorer is positioned

fce91b9

benwtrent merged commit 6c5dcc1 into apache:main Dec 13, 2023

benwtrent mentioned this pull request Dec 13, 2023

Reproducible failure in TestFloatVectorSimilarityQuery.testApproximate #12921

Closed

kaivalnp deleted the fix-scorer branch January 2, 2024 15:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix failing BaseVectorSimilarityQueryTestCase#testApproximate#12922

Fix failing BaseVectorSimilarityQueryTestCase#testApproximate#12922
benwtrent merged 3 commits intoapache:mainfrom
kaivalnp:fix-scorer

kaivalnp commented Dec 12, 2023

Uh oh!

benwtrent Dec 12, 2023

Uh oh!

kaivalnp Dec 12, 2023

Uh oh!

kaivalnp Dec 12, 2023

Uh oh!

benwtrent Dec 12, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kaivalnp commented Dec 12, 2023

Uh oh!

benwtrent Dec 12, 2023

Choose a reason for hiding this comment

Uh oh!

kaivalnp Dec 12, 2023

Choose a reason for hiding this comment

Uh oh!

kaivalnp Dec 12, 2023

Choose a reason for hiding this comment

Uh oh!

benwtrent Dec 12, 2023

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants