Skip to content

Ensure cross_fields always uses valid term statistics#90278

Merged
jtibshirani merged 7 commits intoelastic:mainfrom
jtibshirani:cross-fields
Sep 23, 2022
Merged

Ensure cross_fields always uses valid term statistics#90278
jtibshirani merged 7 commits intoelastic:mainfrom
jtibshirani:cross-fields

Conversation

@jtibshirani
Copy link
Copy Markdown
Contributor

In #89016 we adjusted the cross_fields scoring formula to prevent negative
scores. This fix accidentally dropped another important fix that was added in
#41938. Specifically, we need to make sure to take the minimum between the
document frequency (actualDf) and the minimum total term frequency
(minTTF). Otherwise, we can produce invalid term statistics where the total
term frequency is less than the document frequency.

This PR adds two safeguards to prevent this sort of regression in the future:

  • Adds a new comment explaining the adjustment, and marks it "IMPORTANT".
  • Adds a new test BlendedTermQueryTests#testRandomFields that adds a fairly
    large number of documents with random fields and values. I tried running this
    test 100 times, and checked it would have caught this issue.

Fixes #90275

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine elasticsearchmachine added v8.6.0 Team:Search Meta label for search team labels Sep 22, 2022
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @jtibshirani, I've created a changelog YAML for you.

@jtibshirani
Copy link
Copy Markdown
Contributor Author

I marked this as "non-issue" since it is unreleased on 8.5 and 7.17. However it's a regression in 8.4.2. When I backport I'll make sure to mark it "regression" and add a changes entry.

@jtibshirani
Copy link
Copy Markdown
Contributor Author

@elasticmachine run elasticsearch-ci/part-1

Copy link
Copy Markdown
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jtibshirani jtibshirani merged commit 1acc278 into elastic:main Sep 23, 2022
@jtibshirani jtibshirani deleted the cross-fields branch September 23, 2022 17:35
jtibshirani added a commit that referenced this pull request Sep 23, 2022
In #89016 we adjusted the `cross_fields` scoring formula to prevent negative
scores. This fix accidentally dropped another important fix that was added in
#41938. Specifically, we need to make sure to take the minimum between the
document frequency (`actualDf`) and the minimum total term frequency
(`minTTF`). Otherwise, we can produce invalid term statistics where the total
term frequency is less than the document frequency.

Fixes #90275
jtibshirani added a commit to jtibshirani/elasticsearch that referenced this pull request Sep 23, 2022
In elastic#89016 we adjusted the `cross_fields` scoring formula to prevent negative
scores. This fix accidentally dropped another important fix that was added in
elastic#41938. Specifically, we need to make sure to take the minimum between the
document frequency (`actualDf`) and the minimum total term frequency
(`minTTF`). Otherwise, we can produce invalid term statistics where the total
term frequency is less than the document frequency.

Fixes elastic#90275
jtibshirani added a commit that referenced this pull request Sep 23, 2022
In #89016 we adjusted the `cross_fields` scoring formula to prevent negative
scores. This fix accidentally dropped another important fix that was added in
#41938. Specifically, we need to make sure to take the minimum between the
document frequency (`actualDf`) and the minimum total term frequency
(`minTTF`). Otherwise, we can produce invalid term statistics where the total
term frequency is less than the document frequency.

Fixes #90275
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>non-issue :Search Relevance/Ranking Scoring, rescoring, rank evaluation. Team:Search Meta label for search team v7.17.7 v8.4.3 v8.5.0 v8.6.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

“totalTermFreq must be at least docFreq” error after upgrading to 8.4.2

3 participants