Fix IAE on cross_fields query introduced in 7.0.1#41938
Fix IAE on cross_fields query introduced in 7.0.1#41938jimczi merged 1 commit intoelastic:masterfrom
Conversation
If the max doc in the index is greater than the minimum total term frequency among the requested fields we need to adjust max doc to be equal to the min ttf. This was removed by mistake when fixing elastic#41125. Closes elastic#41934
|
Pinging @elastic/es-search |
|
@elasticmachine run elasticsearch-ci/1 |
|
@jimczi, thanks for fixing that. This pull request is labeled with v7.2.0. Does it mean this fix will come with v7.2.0 only? |
markharwood
left a comment
There was a problem hiding this comment.
I can verify that this fixes the TTF>=DF assertion failure but I'm not certain whether lowering DF or raising TTF would be the right fix scoring-wise.
The main motivation for BlendedTermQuery was to fix the problem of multi-field-search favouring the wrong context for a term (default Lucene prefers lastName:mark over firstName:mark). To fix that we had to pick the most common field for a term and use that as the choice of DF.
So we want to take max(DF) of the fields containing a term to make it more boring.
Perhaps the fix should be to raise TTF rather than lower DF given our general motivation is to raise DF.
I confess I don't know how TTF is generally used in scoring - I know higher TFs are preferred because it indicates a single doc is repeatedly using a term but not sure what value is attached to TTF.
My assumption is fixing the assertion break by raising TTF is preferable to lowering DF given we want to make terms seem more boring, not rarer and therefore more interesting.
I'd like to merge this fix first since this is a bug introduced in a patch version and we can discuss better ways to handle this in a follow up ? We also want to switch to the new BM25FQuery in Lucene which doesn't alter the ttf or df artificially. |
That's the plan, yes. It's too late for 7.1.0 but the fix should be in the next release after that. |
|
@elasticmachine run elasticsearch-ci/1 |
|
@jimczi Why is this not being put into the very next patch release? The very point of a patch release is to fix bugs. Including this severe bug that has made search completely unusable for us. This bug was introduced in a patch release and has no good reason not to be fixed in a patch release. What was supposed to be a painless upgrade to 7.0.1 has literally broken search for us and we are supposed to wait for 2 more minor releases? Forget 7.2.0 and even 7.1.0, this needs to be a priority to make it into 7.0.2! |
If the max doc in the index is greater than the minimum total term frequency among the requested fields we need to adjust max doc to be equal to the min ttf. This was removed by mistake when fixing elastic#41125. Closes elastic#41934
In #89016 we adjusted the `cross_fields` scoring formula to prevent negative scores. This fix accidentally dropped another important fix that was added in #41938. Specifically, we need to make sure to take the minimum between the document frequency (`actualDf`) and the minimum total term frequency (`minTTF`). Otherwise, we can produce invalid term statistics where the total term frequency is less than the document frequency. Fixes #90275
In #89016 we adjusted the `cross_fields` scoring formula to prevent negative scores. This fix accidentally dropped another important fix that was added in #41938. Specifically, we need to make sure to take the minimum between the document frequency (`actualDf`) and the minimum total term frequency (`minTTF`). Otherwise, we can produce invalid term statistics where the total term frequency is less than the document frequency. Fixes #90275
In elastic#89016 we adjusted the `cross_fields` scoring formula to prevent negative scores. This fix accidentally dropped another important fix that was added in elastic#41938. Specifically, we need to make sure to take the minimum between the document frequency (`actualDf`) and the minimum total term frequency (`minTTF`). Otherwise, we can produce invalid term statistics where the total term frequency is less than the document frequency. Fixes elastic#90275
In #89016 we adjusted the `cross_fields` scoring formula to prevent negative scores. This fix accidentally dropped another important fix that was added in #41938. Specifically, we need to make sure to take the minimum between the document frequency (`actualDf`) and the minimum total term frequency (`minTTF`). Otherwise, we can produce invalid term statistics where the total term frequency is less than the document frequency. Fixes #90275
If the max doc in the index is greater than the minimum total term frequency
among the requested fields we need to adjust max doc to be equal to the min ttf.
This was removed by mistake when fixing #41125.
Closes #41934