LUCENE-9999: CombinedFieldQuery can fail with an exception when document is missing fields#185
Conversation
…ent is missing some fields.
jpountz
left a comment
There was a problem hiding this comment.
LGTM besides a small comment I left.
| if (norms != null) { | ||
| boolean found = norms.advanceExact(doc); | ||
| assert found; | ||
| if (norms != null && norms.advanceExact(doc)) { |
There was a problem hiding this comment.
I believe that the old logic was correct?
The only case I can think of when it might fail is when some fields are indexed with norms and other fields without norms, but I'm not sure we should even try to support this case with CombinedFieldQuery.
There was a problem hiding this comment.
We allow a StringField and a TextField to be used at the moment. We could disallow that upfront (all fields or none have norms enabled) because the old logic would fail with weird exceptions without this change.
There was a problem hiding this comment.
+1 from me to disallow it upfront, the combined score won't be meaningful if there is a mix of norms being disabled/ enabled.
jtibshirani
left a comment
There was a problem hiding this comment.
Thank you @jimczi for fixing this!
| if (norms != null) { | ||
| boolean found = norms.advanceExact(doc); | ||
| assert found; | ||
| if (norms != null && norms.advanceExact(doc)) { |
There was a problem hiding this comment.
+1 from me to disallow it upfront, the combined score won't be meaningful if there is a mix of norms being disabled/ enabled.
| } | ||
|
|
||
| public void testSameScore() throws IOException { | ||
| public void testSparseFieldSameScore() throws IOException { |
There was a problem hiding this comment.
It's a little unclear to me what we aim to test with the 'same score' checks. The testCopyField test seems to best capture the core behavior -- we could have a version of testCopyField where one field is sparse?
|
@jimczi asked if I could finish the PR -- I pushed some changes:
Let me know if it still looks okay or if you have other comments. |
jimczi
left a comment
There was a problem hiding this comment.
Thanks @jtibshirani!
| throw new IllegalArgumentException( | ||
| getClass().getSimpleName() | ||
| + " requires norms to be consistent across fields: some fields cannot" | ||
| + " have norms enabled, while others have norms disabled"); |
There was a problem hiding this comment.
It's kind of an edge case but should we also throw the exception if scoring is not needed ?
There was a problem hiding this comment.
Properly throwing an error would require a more invasive change, since we rewrite the query very early if scores aren't needed:
MultiNormsLeafSimScorer just to be clearer/ more solid?
This also made me notice -- I'm using the fact that LeafReader#getNormsValues is null to detect that norms were disabled. This seems accurate but I couldn't confirm it's really part of the method contract. I wonder if checking field infos is better.
There was a problem hiding this comment.
Doing more research, I don't think it's possible to detect if norms are disabled upfront: LeafReader#getNormsValues could be null if a field uses norms but happens to be missing from a segment. Using field infos would have the same problem.
I opted to just document this requirement (as we do for the fact norms must be additive and encoded in a certain way). If the calling application has its own notion of schema, it can enforce the norms requirements upfront.
This commit moves a fix for a bug in MultiNormsLeafSimScorer in Lucene that fixes an error in the new CombinedFieldQuery for missing values. Its based on a PR for LUCENE-9999 (apache/lucene#185). This is a temporary copy of the affected query and its updated dependencies that should be removed again once we are able to use the original fix from Lucene. Relastes to apache/lucene#185
This commit moves a fix for a bug in MultiNormsLeafSimScorer in Lucene that fixes an error in the new CombinedFieldQuery for missing values. Its based on a PR for LUCENE-9999 (apache/lucene#185). This is a temporary copy of the affected query and its updated dependencies that should be removed again once we are able to use the original fix from Lucene. Relastes to apache/lucene#185
This commit moves a fix for a bug in MultiNormsLeafSimScorer in Lucene that fixes an error in the new CombinedFieldQuery for missing values. Its based on a PR for LUCENE-9999 (apache/lucene#185). This is a temporary copy of the affected query and its updated dependencies that should be removed again once we are able to use the original fix from Lucene. Relastes to apache/lucene#185
This commit moves a fix for a bug in MultiNormsLeafSimScorer in Lucene that fixes an error in the new CombinedFieldQuery for missing values. Its based on a PR for LUCENE-9999 (apache/lucene#185). This is a temporary copy of the affected query and its updated dependencies that should be removed again once we are able to use the original fix from Lucene. Relates to apache/lucene#185
This commit moves a fix for a bug in MultiNormsLeafSimScorer in Lucene that fixes an error in the new CombinedFieldQuery for missing values. Its based on a PR for LUCENE-9999 (apache/lucene#185). This is a temporary copy of the affected query and its updated dependencies that should be removed again once we are able to use the original fix from Lucene. Relastes to apache/lucene#185 Co-authored-by: Christoph Büscher <cbuescher@posteo.de>
|
@jpountz would you be able to take another look? |
jpountz
left a comment
There was a problem hiding this comment.
Maybe something we could check could be verifying that all fields that have a non-null FieldInfo have the same value for hasNorms on a per-segment basis. It wouldn't cover 100% of misuse, but probably most of it.
|
I tried out your idea. I opted to do it on a global basis in I'm going to merge, but feel free to give post merge feedback if anything seems off. |
…ent is missing fields (apache#185) This change fixes a bug in `MultiNormsLeafSimScorer` that assumes that each field should have a norm for every term/document. As part of the fix, it adds validation that the fields have consistent norms settings.
This change fixes a bug in
MultiNormsLeafSimScorerthat assumes that eachfield should have a norm for every term/document.
As part of the fix, it adds validation that the fields have consistent norms
settings.