LUCENE-10436: Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery with FieldExistsQuery#767
Conversation
…ery and KnnVectorFieldExistsQuery with FieldExistsQuery
|
Quick question: shall I also merge existing test cases from |
mikemccand
left a comment
There was a problem hiding this comment.
Awesome refactoring/simplification @zacharymorn! Thank you.
+1 -- that would avoid Deprecation warnings compiling our own test cases? But I do not think it is a blocker to first pushing this -- progress not perfection! |
jpountz
left a comment
There was a problem hiding this comment.
+1 on merging test cases too
We should keep the deprecated queries on branch_9x but on main we could remove them in a follow-up?
| } | ||
|
|
||
| // nocommit this seems to be generalizable to norms and knn as well given LUCENE-9334, and thus | ||
| // could be moved to the new FieldExistsQuery? |
There was a problem hiding this comment.
Thanks for the confirmation! I do have one follow-up question though. When I looked into this further, I noticed PointValues was used in the current implementation (pasted below for ease of reference) for determining if DocValuesFieldExistsQuery could be re-written to MatchAllDocsQuery:
@Override
public Query rewrite(IndexReader reader) throws IOException {
boolean allReadersRewritable = true;
for (LeafReaderContext context : reader.leaves()) {
LeafReader leaf = context.reader();
Terms terms = leaf.terms(field);
PointValues pointValues = leaf.getPointValues(field);
if ((terms == null || terms.getDocCount() != leaf.maxDoc())
&& (pointValues == null || pointValues.getDocCount() != leaf.maxDoc())) {
allReadersRewritable = false;
break;
}
}
if (allReadersRewritable) {
return new MatchAllDocsQuery();
}
return super.rewrite(reader);
}
I thought PointValues and DocValues are separate indexed values and hence we would use one of the leaf.getXXXDocValues method here instead? On the other hand, the returned values of those methods such as NumericDocValues and BinaryDocValues don't seems to have a method to retrieve the associated doc count. I felt I might be missing something here but will look further into it.
There was a problem hiding this comment.
Lucene 9 introduced a new index-time requirement that a field always uses the same data structures. Ie. if a document has both points and doc values for a field, then all other documents need to either have neither points and doc values for the field, or both points and doc values. It makes it legal to do this sort of optimization. https://issues.apache.org/jira/browse/LUCENE-9334
There was a problem hiding this comment.
Thanks @jpountz for the clarification! I guess to put my question differently, this current implementation doesn't seems to cover the case where the docs only have DocValues, but not PointValues fields, as asserted in this existing test case:
I would imagine in this scenario DocValuesFieldExistsQuery should be overwritten to MatchAllDocsQuery, since all docs that doc values field and value?
There was a problem hiding this comment.
But anyway, I have given this a try in e323b49, please let me know if this looks correct to you.
There was a problem hiding this comment.
It would be nice if we could detect when all documents have a doc value but unfortunately we do not have an inedx statistic we can use to check.
| default: | ||
| throw new AssertionError(); | ||
| } | ||
| } |
There was a problem hiding this comment.
Maybe add an else condition that throws an exception. This query is about finding documents that have a value for a field. If the field exists but doesn't index any of the data structures that can help figure out whether it actually exists, then it means users haven't indexed data correctly, or they are using this query while they shouldn't?
| } | |
| } else { | |
| throw new IllegalStateException("FieldExistsQuery requires that the field indexes doc values, norms or vectors, but field '" + fieldInfo.name + "' exists and indexes neither of these data structures"); | |
| } |
There was a problem hiding this comment.
Makes sense! I've implemented it in 97344ef.
| return super.count(context); | ||
| } | ||
|
|
||
| return 0; |
There was a problem hiding this comment.
Let's throw an exception here too?
|
Thanks @mikemccand @jpountz for the reviews and suggestions! I've gone ahead and merge the test cases, and will remove the deprecated queries as well in follow-up PR. |
…index norms, knn vectors nor doc values
jpountz
left a comment
There was a problem hiding this comment.
One minor comment, otherwise LGTM.
| } | ||
| } else if (fieldInfo.getDocValuesType() != DocValuesType.NONE | ||
| || leaf.terms(field) != null | ||
| || leaf.getPointValues(field) != null) { // the field indexes doc values or points |
There was a problem hiding this comment.
Oh, I think you added these checks because the old implementation did not verify that the field info had doc values enabled, but this was leniency. I would only check that the field info has doc values here.
There was a problem hiding this comment.
I gave that a try in d4d9a3f, but it failed fa few tests (that I fixed), mostly having to do with BinaryPoint or StringField fields don't pass the condition fieldInfo.getDocValuesType() != DocValuesType.NONE. Could you let me know if it looks correct to you ?
|
Since at search phase, vector's all docs of all fields have been loaded into memory, when |
Thanks @LuXugang for the feedback! Since this issue focuses on deprecating / migrating existing exits queries, I feel this can be a follow-up discussion / issue? What do you think @jpountz @jtibshirani ? |
|
+1 to discuss improvements to this query separately. For this specific one, I have a bias towards not relying on implementation details of codecs, I'm hoping that docs of vectors move to disk in the near future. |
jpountz
left a comment
There was a problem hiding this comment.
Thanks, I think we should fix tests differently, but things make sense to me otherwise.
| final IndexReader reader = iw.getReader(); | ||
| iw.close(); | ||
|
|
||
| expectThrows(IllegalStateException.class, () -> new FieldExistsQuery("dim").rewrite(reader)); |
There was a problem hiding this comment.
can you fix the indexing logic to index a field with doc values instead, and keep checking that the query rewrites to a MatchAllDocsQuery?
There was a problem hiding this comment.
Updated in ffcd329. I also indexed in a StringField , without which the assertion of query rewriting to MatchAllDocsQuery would fail.
| iw.close(); | ||
|
|
||
| assertFalse((new FieldExistsQuery("dim")).rewrite(reader) instanceof MatchAllDocsQuery); | ||
| expectThrows(IllegalStateException.class, () -> new FieldExistsQuery("f").rewrite(reader)); |
There was a problem hiding this comment.
And likewise here, can you index points in addition to doc values for dim and doc values in addition to terms for f?
|
|
||
| assertNormsCountWithShortcut(searcher, "text", randomNumDocs); | ||
| assertNormsCountWithShortcut(searcher, "doesNotExist", 0); | ||
| expectThrows(IllegalStateException.class, () -> searcher.count(new FieldExistsQuery("text_n"))); |
There was a problem hiding this comment.
checking for an exception here makes sense to me however 👍
Thanks @jpountz for the review and suggestions! I have updated the tests accordingly, and also put in a comment for future reference. Please let me know if it looks good to you. |
| Document doc = new Document(); | ||
| doc.add(new BinaryPoint("dim", new byte[4], new byte[4])); | ||
| doc.add(new DoubleDocValuesField("f", 2.0)); | ||
| doc.add(new StringField("f", random().nextBoolean() ? "yes" : "no", Store.NO)); |
There was a problem hiding this comment.
I think we should keep the name and intention of this test by indexing a point field rather than a StringField, in addition to the doc-values field.
No problem, thanks @jpountz for all the reviews and suggestions as well! I'll open a follow-up PR after merging this as you suggested to remove the deprecated classes for @mikemccand, please let me know if you have further feedback for this PR as well. I plan to merge this in the next few days. |
…ery and KnnVectorFieldExistsQuery with FieldExistsQuery (apache#767)
| case SORTED_SET: | ||
| iterator = context.reader().getSortedSetDocValues(field); | ||
| break; | ||
| case NONE: |
There was a problem hiding this comment.
Spotless: this case seems never reachable?
Description / Solution
Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery with FieldExistsQuery
Tests
Passed existing tests (especially
TestNormsFieldExistsQuery,TestKnnVectorFieldExistsQueryandTestDocValuesFieldExistsQueryvia./gradlew check -Pvalidation.git.failOnModified=falsewithnocommitignoredChecklist
Please review the following and check all that apply:
mainbranch../gradlew check.