Fix Laplace scorer to multiply by alpha (and not add)#27125
Fix Laplace scorer to multiply by alpha (and not add)#27125jimczi merged 2 commits intoelastic:masterfrom
Conversation
|
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
|
@s1monw Do you have an opinion on those changes? |
|
Multiply by alpha seems the right thing to do but I don't think that |
The idea of this smoothing comes from the MLE (Maximum Likelihood Estimation). We compute the probability of a term
I know that, and that's why I've also fixed |
I don't necessarily know much about this anymore. I can totally look into it but that might be as good as your opinions.
I tend to agree with this. The frequency for the bi-/tri-grams is calculated using a proportional value ie. docfreq or TTF if available. We use a proportional value for the vocabulary size, that seems correct to me. |
|
@elasticmachine ok to test |
|
Thanks @s1monw. Do you also have an opinion about my other question -- should Regarding whether to smooth by Dividing by That that Thoughts? |
For the bigram/trigram case we use the The rest tests are failing due to the new formula: Can you adapt these tests to the new formula ? |
|
I am pretty sure that we are roughly on the same page, and since I'm not sure if you had typos in your response, I would like to re-iterate what you wrote:
The formula that we use is
Agreed. Except that it's divided by the frequency of the (n-1 gram), which is either With that, I think we are on the same page, and I'll go fix LaplaceScorer to also override
Hmm, it didn't fail when I ran |
We are on the same page, ok with the naming.
Same here, naming issue. Thanks for using the right one ;).
To run all the tests you should use
Ok. I am a bit concerned because it's the default scoring that we're changing but I am good if we consider this as a breaking change (same for the bigram/trigram change actually). The |
5f05f64 to
404ffe8
Compare
|
@jimczi I've fixed the code as we discussed. I also hope that I've addressed all test failures. |
|
Thanks @jimczi! |
|
Looks good guys thanks for fixing this @shaie |
* master: Enhances exists queries to reduce need for `_field_names` (elastic#26930) Added new terms_set query Set request body to required to reflect the code base (elastic#27188) Update Docker docs for 6.0.0-rc2 (elastic#27166) Add version 6.0.0 Docs: restore now fails if it encounters incompatible settings (elastic#26933) Convert index blocks to cluster block exceptions (elastic#27050) [DOCS] Link remote info API in Cross Cluster Search docs page Fix Laplace scorer to multiply by alpha (and not add) (elastic#27125) [DOCS] Clarify migrate guide and search request validation Raise IllegalArgumentException if query validation failed (elastic#26811) prevent duplicate fields when mixing parent and root nested includes (elastic#27072) TopHitsAggregator must propagate calls to `setScorer`. (elastic#27138)
* master: Remove checkpoint tracker bit sets setting Fix stable BWC branch detection logic Fix logic detecting unreleased versions Enhances exists queries to reduce need for `_field_names` (elastic#26930) Added new terms_set query Set request body to required to reflect the code base (elastic#27188) Update Docker docs for 6.0.0-rc2 (elastic#27166) Add version 6.0.0 Docs: restore now fails if it encounters incompatible settings (elastic#26933) Convert index blocks to cluster block exceptions (elastic#27050) [DOCS] Link remote info API in Cross Cluster Search docs page Fix Laplace scorer to multiply by alpha (and not add) (elastic#27125) [DOCS] Clarify migrate guide and search request validation Raise IllegalArgumentException if query validation failed (elastic#26811) prevent duplicate fields when mixing parent and root nested includes (elastic#27072) TopHitsAggregator must propagate calls to `setScorer`. (elastic#27138)
* master: Lazy initialize checkpoint tracker bit sets Remove checkpoint tracker bit sets setting Fix stable BWC branch detection logic Fix logic detecting unreleased versions Enhances exists queries to reduce need for `_field_names` (#26930) Added new terms_set query Set request body to required to reflect the code base (#27188) Update Docker docs for 6.0.0-rc2 (#27166) Add version 6.0.0 Docs: restore now fails if it encounters incompatible settings (#26933) Convert index blocks to cluster block exceptions (#27050) [DOCS] Link remote info API in Cross Cluster Search docs page Fix Laplace scorer to multiply by alpha (and not add) (#27125) [DOCS] Clarify migrate guide and search request validation Raise IllegalArgumentException if query validation failed (#26811) prevent duplicate fields when mixing parent and root nested includes (#27072) TopHitsAggregator must propagate calls to `setScorer`. (#27138)
Laplace scorer seems to incorrectly apply
alpha. According to Wikipedia https://en.wikipedia.org/wiki/Additive_smoothing, the formula should be(C_i + alpha) / (N + alpha * V), where in our caseC_idenotes the frequency of the term, andNandVdenote thesum_ttfandnum_termsrespectively.I've also fixed the implementation to add
numTermsand notvocabSizeper the additive smoothing formula (numTerms == num_termsandvocabSize == sum_ttf). This is also supported by other research papers I find on the web -- you should add one per unique term, and not one (or alpha) per term occurrence.In that regard, shouldn't
LaplaceScoreralso overridescoreUnigram? Currently it inherits the implementation fromWordScorerwhich implements add-one smoothing, but in Laplace, seems that we should implement add-k.