Skip to content

LUCENE-10436: Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery with FieldExistsQuery#767

Merged
zacharymorn merged 13 commits intoapache:mainfrom
zacharymorn:LUCENE-10436-FieldExistsQuery
Apr 6, 2022
Merged

LUCENE-10436: Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery with FieldExistsQuery#767
zacharymorn merged 13 commits intoapache:mainfrom
zacharymorn:LUCENE-10436-FieldExistsQuery

Conversation

@zacharymorn
Copy link
Copy Markdown
Contributor

Description / Solution

Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery with FieldExistsQuery

Tests

Passed existing tests (especially TestNormsFieldExistsQuery, TestKnnVectorFieldExistsQuery and TestDocValuesFieldExistsQuery via ./gradlew check -Pvalidation.git.failOnModified=false with nocommit ignored

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

…ery and KnnVectorFieldExistsQuery with FieldExistsQuery
@zacharymorn zacharymorn requested a review from jpountz March 26, 2022 02:44
@zacharymorn
Copy link
Copy Markdown
Contributor Author

Quick question: shall I also merge existing test cases from TestNormsFieldExistsQuery, TestKnnVectorFieldExistsQuery and TestDocValuesFieldExistsQuery as well ?

Copy link
Copy Markdown
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome refactoring/simplification @zacharymorn! Thank you.

@mikemccand
Copy link
Copy Markdown
Member

Quick question: shall I also merge existing test cases from TestNormsFieldExistsQuery, TestKnnVectorFieldExistsQuery and TestDocValuesFieldExistsQuery as well ?

+1 -- that would avoid Deprecation warnings compiling our own test cases? But I do not think it is a blocker to first pushing this -- progress not perfection!

Copy link
Copy Markdown
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on merging test cases too

We should keep the deprecated queries on branch_9x but on main we could remove them in a follow-up?

}

// nocommit this seems to be generalizable to norms and knn as well given LUCENE-9334, and thus
// could be moved to the new FieldExistsQuery?
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the confirmation! I do have one follow-up question though. When I looked into this further, I noticed PointValues was used in the current implementation (pasted below for ease of reference) for determining if DocValuesFieldExistsQuery could be re-written to MatchAllDocsQuery:

@Override
  public Query rewrite(IndexReader reader) throws IOException {
    boolean allReadersRewritable = true;
    for (LeafReaderContext context : reader.leaves()) {
      LeafReader leaf = context.reader();
      Terms terms = leaf.terms(field);
      PointValues pointValues = leaf.getPointValues(field);
      if ((terms == null || terms.getDocCount() != leaf.maxDoc())
          && (pointValues == null || pointValues.getDocCount() != leaf.maxDoc())) {
        allReadersRewritable = false;
        break;
      }
    }
    if (allReadersRewritable) {
      return new MatchAllDocsQuery();
    }
    return super.rewrite(reader);
  }

I thought PointValues and DocValues are separate indexed values and hence we would use one of the leaf.getXXXDocValues method here instead? On the other hand, the returned values of those methods such as NumericDocValues and BinaryDocValues don't seems to have a method to retrieve the associated doc count. I felt I might be missing something here but will look further into it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lucene 9 introduced a new index-time requirement that a field always uses the same data structures. Ie. if a document has both points and doc values for a field, then all other documents need to either have neither points and doc values for the field, or both points and doc values. It makes it legal to do this sort of optimization. https://issues.apache.org/jira/browse/LUCENE-9334

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jpountz for the clarification! I guess to put my question differently, this current implementation doesn't seems to cover the case where the docs only have DocValues, but not PointValues fields, as asserted in this existing test case:

public void testNoRewriteWithDocValues() throws IOException {
Directory dir = newDirectory();
RandomIndexWriter iw = new RandomIndexWriter(random(), dir);
final int numDocs = atLeast(100);
for (int i = 0; i < numDocs; ++i) {
Document doc = new Document();
doc.add(new NumericDocValuesField("dv1", 1));
doc.add(new SortedNumericDocValuesField("dv2", 1));
doc.add(new SortedNumericDocValuesField("dv2", 2));
iw.addDocument(doc);
}
iw.commit();
final IndexReader reader = iw.getReader();
iw.close();
assertFalse(
(new DocValuesFieldExistsQuery("dv1")).rewrite(reader) instanceof MatchAllDocsQuery);
assertFalse(
(new DocValuesFieldExistsQuery("dv2")).rewrite(reader) instanceof MatchAllDocsQuery);
assertFalse(
(new DocValuesFieldExistsQuery("dv3")).rewrite(reader) instanceof MatchAllDocsQuery);
reader.close();
dir.close();
}

I would imagine in this scenario DocValuesFieldExistsQuery should be overwritten to MatchAllDocsQuery, since all docs that doc values field and value?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But anyway, I have given this a try in e323b49, please let me know if this looks correct to you.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we could detect when all documents have a doc value but unfortunately we do not have an inedx statistic we can use to check.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I see.

default:
throw new AssertionError();
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add an else condition that throws an exception. This query is about finding documents that have a value for a field. If the field exists but doesn't index any of the data structures that can help figure out whether it actually exists, then it means users haven't indexed data correctly, or they are using this query while they shouldn't?

Suggested change
}
} else {
throw new IllegalStateException("FieldExistsQuery requires that the field indexes doc values, norms or vectors, but field '" + fieldInfo.name + "' exists and indexes neither of these data structures");
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! I've implemented it in 97344ef.

return super.count(context);
}

return 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's throw an exception here too?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@zacharymorn
Copy link
Copy Markdown
Contributor Author

Thanks @mikemccand @jpountz for the reviews and suggestions! I've gone ahead and merge the test cases, and will remove the deprecated queries as well in follow-up PR.

Copy link
Copy Markdown
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment, otherwise LGTM.

}
} else if (fieldInfo.getDocValuesType() != DocValuesType.NONE
|| leaf.terms(field) != null
|| leaf.getPointValues(field) != null) { // the field indexes doc values or points
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I think you added these checks because the old implementation did not verify that the field info had doc values enabled, but this was leniency. I would only check that the field info has doc values here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave that a try in d4d9a3f, but it failed fa few tests (that I fixed), mostly having to do with BinaryPoint or StringField fields don't pass the condition fieldInfo.getDocValuesType() != DocValuesType.NONE. Could you let me know if it looks correct to you ?

@LuXugang
Copy link
Copy Markdown
Member

LuXugang commented Apr 2, 2022

Since at search phase, vector's all docs of all fields have been loaded into memory, when FieldExistsQuery as a lead iterator, should we always try to supply a Scorer by vector if vector fields were indexed. so should we implement Weight#scorerSupplier ?

@zacharymorn zacharymorn requested a review from jpountz April 4, 2022 01:26
@zacharymorn
Copy link
Copy Markdown
Contributor Author

Since at search phase, vector's all docs of all fields have been loaded into memory, when FieldExistsQuery as a lead iterator, should we always try to supply a Scorer by vector if vector fields were indexed. so should we implement Weight#scorerSupplier ?

Thanks @LuXugang for the feedback! Since this issue focuses on deprecating / migrating existing exits queries, I feel this can be a follow-up discussion / issue? What do you think @jpountz @jtibshirani ?

@jpountz
Copy link
Copy Markdown
Contributor

jpountz commented Apr 4, 2022

+1 to discuss improvements to this query separately. For this specific one, I have a bias towards not relying on implementation details of codecs, I'm hoping that docs of vectors move to disk in the near future.

Copy link
Copy Markdown
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think we should fix tests differently, but things make sense to me otherwise.

final IndexReader reader = iw.getReader();
iw.close();

expectThrows(IllegalStateException.class, () -> new FieldExistsQuery("dim").rewrite(reader));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you fix the indexing logic to index a field with doc values instead, and keep checking that the query rewrites to a MatchAllDocsQuery?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in ffcd329. I also indexed in a StringField , without which the assertion of query rewriting to MatchAllDocsQuery would fail.

iw.close();

assertFalse((new FieldExistsQuery("dim")).rewrite(reader) instanceof MatchAllDocsQuery);
expectThrows(IllegalStateException.class, () -> new FieldExistsQuery("f").rewrite(reader));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And likewise here, can you index points in addition to doc values for dim and doc values in addition to terms for f?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.


assertNormsCountWithShortcut(searcher, "text", randomNumDocs);
assertNormsCountWithShortcut(searcher, "doesNotExist", 0);
expectThrows(IllegalStateException.class, () -> searcher.count(new FieldExistsQuery("text_n")));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checking for an exception here makes sense to me however 👍

@zacharymorn
Copy link
Copy Markdown
Contributor Author

Thanks, I think we should fix tests differently, but things make sense to me otherwise.

Thanks @jpountz for the review and suggestions! I have updated the tests accordingly, and also put in a comment for future reference. Please let me know if it looks good to you.

@zacharymorn zacharymorn requested a review from jpountz April 5, 2022 05:04
Document doc = new Document();
doc.add(new BinaryPoint("dim", new byte[4], new byte[4]));
doc.add(new DoubleDocValuesField("f", 2.0));
doc.add(new StringField("f", random().nextBoolean() ? "yes" : "no", Store.NO));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep the name and intention of this test by indexing a point field rather than a StringField, in addition to the doc-values field.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 61fedf3 .

@zacharymorn zacharymorn requested a review from jpountz April 5, 2022 07:40
Copy link
Copy Markdown
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@zacharymorn
Copy link
Copy Markdown
Contributor Author

zacharymorn commented Apr 5, 2022

Thank you!

No problem, thanks @jpountz for all the reviews and suggestions as well! I'll open a follow-up PR after merging this as you suggested to remove the deprecated classes for main.

@mikemccand, please let me know if you have further feedback for this PR as well. I plan to merge this in the next few days.

@zacharymorn zacharymorn merged commit 91e2940 into apache:main Apr 6, 2022
zacharymorn added a commit to zacharymorn/lucene that referenced this pull request Apr 6, 2022
…ery and KnnVectorFieldExistsQuery with FieldExistsQuery (apache#767)
zacharymorn added a commit that referenced this pull request Apr 7, 2022
…ery and KnnVectorFieldExistsQuery with FieldExistsQuery (#767) (#791)
case SORTED_SET:
iterator = context.reader().getSortedSetDocValues(field);
break;
case NONE:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spotless: this case seems never reachable?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants