LUCENE-10436: Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery with FieldExistsQuery by zacharymorn · Pull Request #767 · apache/lucene

zacharymorn · 2022-03-26T02:43:55Z

Description / Solution

Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery with FieldExistsQuery

Tests

Passed existing tests (especially TestNormsFieldExistsQuery, TestKnnVectorFieldExistsQuery and TestDocValuesFieldExistsQuery) via ./gradlew check -Pvalidation.git.failOnModified=false with nocommit ignored.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.

zacharymorn · 2022-03-26T02:44:44Z

Quick question: shall I also merge the existing test cases from TestNormsFieldExistsQuery, TestKnnVectorFieldExistsQuery and TestDocValuesFieldExistsQuery?

mikemccand

Awesome refactoring/simplification @zacharymorn! Thank you.

mikemccand · 2022-03-29T17:04:35Z

> Quick question: shall I also merge the existing test cases from TestNormsFieldExistsQuery, TestKnnVectorFieldExistsQuery and TestDocValuesFieldExistsQuery?

+1 -- that would avoid Deprecation warnings compiling our own test cases? But I do not think it is a blocker to first pushing this -- progress not perfection!

jpountz

+1 on merging test cases too

We should keep the deprecated queries on branch_9x but on main we could remove them in a follow-up?

jpountz · 2022-03-29T17:40:24Z

lucene/core/src/java/org/apache/lucene/search/DocValuesFieldExistsQuery.java

-  }
-
+  // nocommit this seems to be generalizable to norms and knn as well given LUCENE-9334, and thus
+  // could be moved to the new FieldExistsQuery?


Thanks for the confirmation! I do have one follow-up question though. When I looked into this further, I noticed that PointValues is used in the current implementation (pasted below for ease of reference) to determine whether DocValuesFieldExistsQuery can be rewritten to MatchAllDocsQuery:

    @Override
    public Query rewrite(IndexReader reader) throws IOException {
      boolean allReadersRewritable = true;
      for (LeafReaderContext context : reader.leaves()) {
        LeafReader leaf = context.reader();
        Terms terms = leaf.terms(field);
        PointValues pointValues = leaf.getPointValues(field);
        if ((terms == null || terms.getDocCount() != leaf.maxDoc())
            && (pointValues == null || pointValues.getDocCount() != leaf.maxDoc())) {
          allReadersRewritable = false;
          break;
        }
      }
      if (allReadersRewritable) {
        return new MatchAllDocsQuery();
      }
      return super.rewrite(reader);
    }

I thought PointValues and DocValues are separately indexed values, and hence we would use one of the leaf.getXXXDocValues methods here instead? On the other hand, the values returned by those methods, such as NumericDocValues and BinaryDocValues, don't seem to have a method to retrieve the associated doc count. I feel I might be missing something here, but will look further into it.

Lucene 9 introduced a new index-time requirement that a field always uses the same data structures. I.e., if a document has both points and doc values for a field, then every other document must have either both points and doc values for that field, or neither. This is what makes this sort of optimization legal. https://issues.apache.org/jira/browse/LUCENE-9334

Thanks @jpountz for the clarification! I guess, to put my question differently, the current implementation doesn't seem to cover the case where the docs only have DocValues fields but no PointValues fields, as asserted in this existing test case:

lucene/lucene/core/src/test/org/apache/lucene/search/TestDocValuesFieldExistsQuery.java

Lines 103 to 126 in 69b040f

    public void testNoRewriteWithDocValues() throws IOException {
      Directory dir = newDirectory();
      RandomIndexWriter iw = new RandomIndexWriter(random(), dir);
      final int numDocs = atLeast(100);
      for (int i = 0; i < numDocs; ++i) {
        Document doc = new Document();
        doc.add(new NumericDocValuesField("dv1", 1));
        doc.add(new SortedNumericDocValuesField("dv2", 1));
        doc.add(new SortedNumericDocValuesField("dv2", 2));
        iw.addDocument(doc);
      }
      iw.commit();
      final IndexReader reader = iw.getReader();
      iw.close();

      assertFalse(
          (new DocValuesFieldExistsQuery("dv1")).rewrite(reader) instanceof MatchAllDocsQuery);
      assertFalse(
          (new DocValuesFieldExistsQuery("dv2")).rewrite(reader) instanceof MatchAllDocsQuery);
      assertFalse(
          (new DocValuesFieldExistsQuery("dv3")).rewrite(reader) instanceof MatchAllDocsQuery);
      reader.close();
      dir.close();
    }

I would imagine that in this scenario DocValuesFieldExistsQuery should be rewritten to MatchAllDocsQuery, since all docs have that doc values field and a value?

But anyway, I have given this a try in e323b49, please let me know if this looks correct to you.

It would be nice if we could detect when all documents have a doc value, but unfortunately we do not have an index statistic we can use to check.
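The rewrite condition discussed in this thread can be modelled without Lucene. The sketch below is an illustration only: `ExistsRewriteSketch` and `SegmentStats` are hypothetical names, and a count of -1 is an assumed stand-in for `leaf.terms(field)` or `leaf.getPointValues(field)` returning null. It mirrors the quoted loop: a segment blocks the rewrite unless its terms or its points report a doc count equal to `maxDoc`.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of the quoted rewrite check; hypothetical names, no Lucene types. */
final class ExistsRewriteSketch {

  /** Hypothetical per-segment statistics for one field; -1 means "structure not indexed". */
  static final class SegmentStats {
    final int maxDoc;
    final int termsDocCount;
    final int pointsDocCount;

    SegmentStats(int maxDoc, int termsDocCount, int pointsDocCount) {
      this.maxDoc = maxDoc;
      this.termsDocCount = termsDocCount;
      this.pointsDocCount = pointsDocCount;
    }
  }

  /** Returns true only if every segment proves, via terms or points, that all docs have the field. */
  static boolean rewritesToMatchAll(List<SegmentStats> segments) {
    for (SegmentStats s : segments) {
      boolean termsCoverAllDocs = s.termsDocCount == s.maxDoc;
      boolean pointsCoverAllDocs = s.pointsDocCount == s.maxDoc;
      if (!termsCoverAllDocs && !pointsCoverAllDocs) {
        return false; // this segment cannot prove full coverage
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Points on every doc (doc values implied by the consistency rule): rewritable.
    SegmentStats pointsAndDocValues = new SegmentStats(100, -1, 100);
    // Doc values only: no terms or points statistic exists, so no rewrite.
    SegmentStats docValuesOnly = new SegmentStats(100, -1, -1);
    System.out.println(rewritesToMatchAll(Arrays.asList(pointsAndDocValues))); // true
    System.out.println(rewritesToMatchAll(Arrays.asList(docValuesOnly))); // false
  }
}
```

The doc-values-only case is the one asserted by testNoRewriteWithDocValues: with neither a terms nor a points doc count available, the model has nothing to compare against `maxDoc`, so it declines to rewrite.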

jpountz · 2022-03-30T06:24:42Z

lucene/core/src/java/org/apache/lucene/search/FieldExistsQuery.java

+            default:
+              throw new AssertionError();
+          }
+        }


Maybe add an else condition that throws an exception. This query is about finding documents that have a value for a field. If the field exists but doesn't index any of the data structures that can help figure out whether it actually exists, then it means users haven't indexed data correctly, or they are using this query while they shouldn't?

Suggested change

          }
        } else {
          throw new IllegalStateException("FieldExistsQuery requires that the field indexes doc values, norms or vectors, but field '" + fieldInfo.name + "' exists and indexes neither of these data structures");
        }

Makes sense! I've implemented it in 97344ef.
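The branch suggested in this thread can be sketched in isolation. This is a hedged model, not the merged code: `ExistsSourceSketch`, `FieldFlags` and `chooseSource` are hypothetical names, the order in which norms, vectors and doc values are checked is an assumption, and treating an absent field as "none" is also an assumption. The point is only the final branch: a field that exists but indexes none of the three structures is reported as a usage error instead of silently matching nothing.

```java
/** Plain-Java model of the suggested check; hypothetical names, no Lucene types. */
final class ExistsSourceSketch {

  /** Hypothetical stand-in for the per-field flags a segment would expose. */
  static final class FieldFlags {
    final String name;
    final boolean hasNorms;
    final boolean hasVectors;
    final boolean hasDocValues;

    FieldFlags(String name, boolean hasNorms, boolean hasVectors, boolean hasDocValues) {
      this.name = name;
      this.hasNorms = hasNorms;
      this.hasVectors = hasVectors;
      this.hasDocValues = hasDocValues;
    }
  }

  /** Picks the structure used to find documents that have a value for the field. */
  static String chooseSource(FieldFlags field) {
    if (field == null) {
      return "none"; // assumption: a field absent from the segment matches no documents
    }
    if (field.hasNorms) {
      return "norms";
    } else if (field.hasVectors) {
      return "vectors";
    } else if (field.hasDocValues) {
      return "docValues";
    } else {
      // The field exists but offers nothing that can answer "does this doc have a value?"
      throw new IllegalStateException(
          "FieldExistsQuery requires that the field indexes doc values, norms or vectors, but field '"
              + field.name
              + "' exists and indexes neither of these data structures");
    }
  }
}
```

Failing loudly here trades leniency for an early signal that the field was indexed in a way this query cannot use.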

jpountz · 2022-03-30T06:30:25Z

lucene/core/src/java/org/apache/lucene/search/FieldExistsQuery.java

+          return super.count(context);
+        }
+
+        return 0;


Let's throw an exception here too?

zacharymorn · 2022-03-31T04:11:28Z

Thanks @mikemccand @jpountz for the reviews and suggestions! I've gone ahead and merged the test cases, and will remove the deprecated queries in a follow-up PR.

jpountz

One minor comment, otherwise LGTM.

jpountz · 2022-04-01T07:24:34Z

lucene/core/src/java/org/apache/lucene/search/FieldExistsQuery.java

+        }
+      } else if (fieldInfo.getDocValuesType() != DocValuesType.NONE
+          || leaf.terms(field) != null
+          || leaf.getPointValues(field) != null) { // the field indexes doc values or points


Oh, I think you added these checks because the old implementation did not verify that the field info had doc values enabled, but this was leniency. I would only check that the field info has doc values here.

I gave that a try in d4d9a3f, but it failed a few tests (which I fixed), mostly because BinaryPoint or StringField fields don't pass the condition fieldInfo.getDocValuesType() != DocValuesType.NONE. Could you let me know if it looks correct to you?

LuXugang · 2022-04-02T03:46:57Z

Since at search time the docs of all vector fields have already been loaded into memory, when FieldExistsQuery acts as the lead iterator, should we always try to supply a Scorer backed by vectors if the field indexes vectors? If so, should we implement Weight#scorerSupplier?

zacharymorn · 2022-04-04T01:45:50Z

> Since at search time the docs of all vector fields have already been loaded into memory, when FieldExistsQuery acts as the lead iterator, should we always try to supply a Scorer backed by vectors if the field indexes vectors? If so, should we implement Weight#scorerSupplier?

Thanks @LuXugang for the feedback! Since this issue focuses on deprecating / migrating the existing exists queries, I feel this can be a follow-up discussion / issue? What do you think, @jpountz @jtibshirani?

jpountz · 2022-04-04T16:10:46Z

+1 to discuss improvements to this query separately. For this specific one, I have a bias towards not relying on implementation details of codecs; I'm hoping that the docs of vectors move to disk in the near future.

jpountz

Thanks, I think we should fix tests differently, but things make sense to me otherwise.

jpountz · 2022-04-04T16:15:28Z

lucene/core/src/test/org/apache/lucene/search/TestFieldExistsQuery.java

+    final IndexReader reader = iw.getReader();
+    iw.close();
+
+    expectThrows(IllegalStateException.class, () -> new FieldExistsQuery("dim").rewrite(reader));


can you fix the indexing logic to index a field with doc values instead, and keep checking that the query rewrites to a MatchAllDocsQuery?

Updated in ffcd329. I also indexed a StringField, without which the assertion that the query rewrites to MatchAllDocsQuery would fail.

jpountz · 2022-04-04T16:17:32Z

lucene/core/src/test/org/apache/lucene/search/TestFieldExistsQuery.java

+    iw.close();
+
+    assertFalse((new FieldExistsQuery("dim")).rewrite(reader) instanceof MatchAllDocsQuery);
+    expectThrows(IllegalStateException.class, () -> new FieldExistsQuery("f").rewrite(reader));


And likewise here, can you index points in addition to doc values for dim and doc values in addition to terms for f?

jpountz · 2022-04-04T16:19:09Z

lucene/core/src/test/org/apache/lucene/search/TestFieldExistsQuery.java

+
+    assertNormsCountWithShortcut(searcher, "text", randomNumDocs);
+    assertNormsCountWithShortcut(searcher, "doesNotExist", 0);
+    expectThrows(IllegalStateException.class, () -> searcher.count(new FieldExistsQuery("text_n")));


checking for an exception here makes sense to me however 👍

zacharymorn · 2022-04-05T05:04:45Z

> Thanks, I think we should fix tests differently, but things make sense to me otherwise.

Thanks @jpountz for the review and suggestions! I have updated the tests accordingly, and also put in a comment for future reference. Please let me know if it looks good to you.

jpountz · 2022-04-05T06:47:42Z

lucene/core/src/test/org/apache/lucene/search/TestFieldExistsQuery.java

      Document doc = new Document();
-      doc.add(new BinaryPoint("dim", new byte[4], new byte[4]));
+      doc.add(new DoubleDocValuesField("f", 2.0));
+      doc.add(new StringField("f", random().nextBoolean() ? "yes" : "no", Store.NO));


I think we should keep the name and intention of this test by indexing a point field rather than a StringField, in addition to the doc-values field.

Updated in 61fedf3.

jpountz

Thank you!

zacharymorn · 2022-04-05T07:45:28Z

> Thank you!

No problem, thanks @jpountz for all the reviews and suggestions as well! I'll open a follow-up PR after merging this as you suggested to remove the deprecated classes for main.

@mikemccand, please let me know if you have further feedback for this PR as well. I plan to merge this in the next few days.

LuXugang · 2022-04-08T03:50:12Z

lucene/core/src/java/org/apache/lucene/search/FieldExistsQuery.java

+            case SORTED_SET:
+              iterator = context.reader().getSortedSetDocValues(field);
+              break;
+            case NONE:


Spotless: this case seems never reachable?

LUCENE-10436: Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery with FieldExistsQuery

ca90e6d

zacharymorn requested a review from jpountz March 26, 2022 02:44

Merge branch 'main' into LUCENE-10436-FieldExistsQuery

325d297

mikemccand approved these changes Mar 29, 2022

jpountz reviewed Mar 30, 2022

zacharymorn added 3 commits March 30, 2022 15:48

Rename test cases to prep for test cases merge

8dfd157

Merge test cases

b5fc6a9

Merge branch 'LUCENE-10436-FieldExistsQuery' of https://github.com/zacharymorn/lucene into LUCENE-10436-FieldExistsQuery

f44e3eb

zacharymorn added 2 commits March 30, 2022 21:33

throw exception when FieldExistsQuery is used when the field doesnot index norms, knn vectors nor doc values

97344ef

Update change entry

1ce3c9c

zacharymorn requested review from jpountz and mikemccand March 31, 2022 04:36

Move rewrite impl into FieldExistsQuery as well

e323b49

jpountz reviewed Apr 1, 2022

Address feedback

d4d9a3f

zacharymorn requested a review from jpountz April 4, 2022 01:26

jpountz approved these changes Apr 4, 2022

zacharymorn added 2 commits April 4, 2022 21:40

Address feedback to update test

ffcd329

Add comment

3f9df9b

zacharymorn requested a review from jpountz April 5, 2022 05:04

jpountz reviewed Apr 5, 2022

Address feedback for point value field

61fedf3

zacharymorn requested a review from jpountz April 5, 2022 07:40

jpountz approved these changes Apr 5, 2022

Merge branch 'main' into LUCENE-10436-FieldExistsQuery

ee6a68f

zacharymorn merged commit 91e2940 into apache:main Apr 6, 2022

zacharymorn added a commit to zacharymorn/lucene that referenced this pull request Apr 6, 2022

LUCENE-10436: Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery with FieldExistsQuery (apache#767)

180b386

zacharymorn mentioned this pull request Apr 6, 2022

LUCENE-10436: (Backporting) Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery with FieldExistsQuery #791

Merged

zacharymorn added a commit that referenced this pull request Apr 7, 2022

LUCENE-10436: Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery with FieldExistsQuery (#767) (#791)

a42326b

LuXugang reviewed Jul 15, 2022
