Fix Laplace scorer to multiply by alpha (and not add) by shaie · Pull Request #27125 · elastic/elasticsearch

shaie · 2017-10-26T12:48:29Z

The Laplace scorer seems to apply alpha incorrectly. According to Wikipedia (https://en.wikipedia.org/wiki/Additive_smoothing), the formula should be (C_i + alpha) / (N + alpha * V), where in our case C_i denotes the frequency of the term, and N and V denote the sum_ttf and num_terms respectively.

I've also fixed the implementation to add numTerms and not vocabSize, per the additive smoothing formula (numTerms == num_terms and vocabSize == sum_ttf). This is also supported by other research papers I found on the web -- you should add one per unique term, and not one (or alpha) per term occurrence.
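As a minimal sketch of the formula quoted above (this is not the Elasticsearch code; the function name and toy counts are hypothetical), alpha is multiplied by the number of unique terms V in the divider, which keeps the smoothed probabilities summing to 1:

```python
def additive_smoothing(count, total, vocab_size, alpha):
    """Add-alpha estimate: (C_i + alpha) / (N + alpha * V)."""
    return (count + alpha) / (total + alpha * vocab_size)


# Toy counts (hypothetical); "fox" is in the vocabulary but was never observed.
counts = {"quick": 3, "brown": 1, "fox": 0}
total = sum(counts.values())   # N: all term occurrences
vocab_size = len(counts)       # V: number of unique terms
alpha = 0.5

probs = {t: additive_smoothing(c, total, vocab_size, alpha) for t, c in counts.items()}
total_mass = sum(probs.values())  # stays at 1 because alpha is added once per unique term
```

The unseen term gets a small non-zero probability, and the mass it receives is taken proportionally from the observed terms.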

In that regard, shouldn't LaplaceScorer also override scoreUnigram? Currently it inherits the implementation from WordScorer, which implements add-one smoothing, but for Laplace it seems that we should implement add-k.

elasticmachine · 2017-10-26T12:48:31Z

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

jpountz · 2017-10-27T13:51:57Z

@s1monw Do you have an opinion on those changes?

jimczi · 2017-10-27T15:23:51Z

Multiply by alpha seems the right thing to do but I don't think that numTerms should be used. The frequency for each term is computed as the number of occurrences of this term across all documents so using the total term frequency for the entire corpus seems appropriate.
Also numTerms is unknown when the shard contains more than one segment so the value will be -1 most of the time (see #27149). I think we should always use the total term frequency and remove numTerms entirely from the formula.
Although I am not sure about the original intentions so I am also interested in what @s1monw thinks of this ;)

shaie · 2017-10-27T15:50:31Z

> Multiply by alpha seems the right thing to do but I don't think that numTerms should be used. The frequency for each term is computed as the number of occurrences of this term across all documents so using the total term frequency for the entire corpus seems appropriate.

The idea of this smoothing comes from MLE (Maximum Likelihood Estimation). We compute the probability of a term P(t1) as ttf(t1) / sum_ttf. In order to account for non-existent terms, which should still have some small probability, we smooth the probability. In Laplace, aka Additive, smoothing you can add k, where k is between 0 and 1. If you add 1 to a term in the numerator, you have to account for that in the divider by adding one for every unique term (to keep the probabilities the same). If you add k < 1, then you add k*numTerms to the divider. That's the probability theory behind it. If we add vocabSize == sum_ttf, we add a much larger value to the divider than we should, and therefore lower the probability of all other terms more than we want to.
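To illustrate the argument, here is a small sketch with made-up counts (the helper name and the numbers are assumptions, not values from any index). It sums the smoothed probabilities over all terms, once adding k per unique term and once adding k per term occurrence:

```python
def smoothed_mass(counts, k, extra_observations):
    """Sum of (ttf + k) / (sum_ttf + k * extra_observations) over all terms."""
    sum_ttf = sum(counts.values())
    divider = sum_ttf + k * extra_observations
    return sum((ttf + k) / divider for ttf in counts.values())


counts = {"a": 50, "b": 30, "c": 20}  # hypothetical total term frequencies
k = 0.5

# k added once per unique term (numTerms = 3): the probabilities still sum to 1.
mass_unique = smoothed_mass(counts, k, len(counts))

# k added once per occurrence (sum_ttf = 100): the divider grows far more than
# the numerators do, so every term's probability is discounted much harder.
mass_sum_ttf = smoothed_mass(counts, k, sum(counts.values()))
```

With these numbers the second variant leaves only about two thirds of the probability mass on the known terms, which is the over-discounting described above.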

> Also numTerms is unknown when the shard contains more than one segment so the value will be -1 most of the time (see #27149).

I know that, and that's why I've also fixed numTerms to equal reader.maxDoc() as a (very low) approximation. But I think it's better than adding k * vocabSize since it will discount the probability of existing terms less.

s1monw · 2017-10-29T13:48:46Z

> Although I am not sure about the original intentions so I am also interested in what @s1monw thinks of this ;)

I don't necessarily know much about this anymore. I can totally look into it but that might be as good as your opinions.

> Multiply by alpha seems the right thing to do but I don't think that numTerms should be used. The frequency for each term is computed as the number of occurrences of this term across all documents so using the total term frequency for the entire corpus seems appropriate.
> Also numTerms is unknown when the shard contains more than one segment so the value will be -1 most of the time (see #27149). I think we should always use the total term frequency and remove numTerms entirely from the formula.

I tend to agree with this. The frequency for the bi-/tri-grams is calculated using a proportional value, i.e. docfreq or TTF if available. We use a proportional value for the vocabulary size, which seems correct to me.

s1monw · 2017-10-29T14:13:36Z

@elasticmachine ok to test

shaie · 2017-10-29T14:21:26Z

Thanks @s1monw. Do you also have an opinion about my other question -- should LaplaceScorer also override scoreUnigram to implement k-additive smoothing (instead of falling back to add-one)?

Regarding whether to smooth by sum_ttf (aka vocabulary size) or numTerms, I still think it's wrong to divide by sum_ttf as it shaves off much more of the probability than you want in an add-k smoothing. E.g. take a look here: https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf#subsection.4.4.2, where V is previously defined as the "total extra observations", in our case the total number of unique terms.

Dividing by vocabSize == sum_ttf is just too high. On a not-so-large Wikipedia index (w/ English analysis, 5.6M docs), there are 5.4 billion term occurrences. I don't know what API exists to get the number of unique terms, but I believe it's much smaller, and I think that dividing by such a large number is not the right thing to do.

That numTerms cannot always be computed is OK, as this patch also fixes it to default to reader.maxDoc() (which is needed for WordScorer.scoreUnigram() anyway), and that just means it will shave off less probability.

Thoughts?

jimczi · 2017-10-30T08:00:37Z

> Regarding whether to smooth by sum_ttf (aka vocabulary size) or numTerms, I still think it's wrong to divide by sum_ttf as it shaves off much more of the probability than you want in an add-k smoothing. E.g. take a look here: https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf#subsection.4.4.2, where V is previously defined as the "total extra observations", in our case the total number of unique terms.

For the bigram/trigram case we use the relative frequency that for the bigram x_y is computed with sum_ttf(x_y) / sum_ttf(x) when ttf is available and tf(x_y) / tf(x) when it's not.
With add-k smoothing the formula is augmented with (sum_ttf(x_y) + k) / (sum_ttf(x) + kV).
For this case I agree that V can be maxDocs or numTerms, it's just an augmentation of the divider but the important part is that it's divided by the total term frequency of the prefix (n-1 gram) which is always greater than the sum_ttf of the entire ngram.
Sorry it was unclear but my comment was about the unigram case where we need to divide by total_term_freq of the entire corpus in order to get the probability of the term: sum_ttf(x) / ttf. It's not about the smoothing which can be any external observations as long as it's the same for all terms. So the formula for unigram would have to keep the ttf divider:
(sum_ttf(x) + k) / (ttf + kV) where V could be numTerms or maxDocs. Since numTerms is only available when a single segment is used I think it would be preferable to always use maxDocs but that's just for clarity. I am ok changing this formula if we fix the bigram/trigram case with your diff but the divider would be: vocabularySize + (k * maxDocs), keeping the total term frequency is important to make sure that the computed value is always <= 1.
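A rough sketch of the unigram case described here, with neutral parameter names to sidestep the ttf/sum_ttf labelling (all names and numbers are hypothetical, and V is stood in for by a made-up maxDoc value): keeping the corpus-wide frequency in the divider bounds the score by 1 for any V >= 1.

```python
def score_unigram(term_freq, corpus_freq, k, v):
    """Add-k unigram estimate: (term_freq + k) / (corpus_freq + k * v)."""
    return (term_freq + k) / (corpus_freq + k * v)


corpus_freq = 1000   # total occurrences of all terms in the field (made up)
max_doc = 200        # stand-in for V, as suggested for the multi-segment case
k = 0.5

seen = score_unigram(40, corpus_freq, k, max_doc)
unseen = score_unigram(0, corpus_freq, k, max_doc)
# Even a term that accounts for every occurrence cannot exceed 1.
extreme = score_unigram(corpus_freq, corpus_freq, k, max_doc)
```

Because the same k * V is added for every term, the ranking between terms depends only on their frequencies.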

The REST tests are failing due to the new formula:

14:40:44 FAILURE 0.09s | DocsClientYamlTestSuiteIT.test {yaml=reference/search/suggesters/phrase-suggest/line_86} <<< FAILURES!
14:40:44    > Throwable #1: java.lang.AssertionError: Failure at [reference/search/suggesters/phrase-suggest:103]: $body didn't match expected value:

Can you adapt these tests to the new formula?

shaie · 2017-10-30T09:13:03Z

I am pretty sure that we are roughly on the same page, and since I'm not sure if you had typos in your response, I would like to re-iterate what you wrote:

> for the bigram x_y is computed with sum_ttf(x_y) / sum_ttf(x) when ttf is available and tf(x_y) / tf(x) when it's not.

The formula that we use is (k + freq(x_y)) / (freq(x) + kV). freq() is computed either as ttf = total_term_freq or df = document freq. Just to be clear, sum_ttf denotes the sum of all term frequencies in a field, and ttf denotes the total term frequency of a single term.
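The bigram formula restated above can be sketched as follows. This is only an illustration, not the LaplaceScorer code: freq() is represented by plain numbers that could be either ttf or df, and the value used for V is a hypothetical maxDoc.

```python
def score_bigram(freq_xy, freq_x, k, v):
    """Add-k bigram estimate: (k + freq(x_y)) / (freq(x) + k * V)."""
    return (k + freq_xy) / (freq_x + k * v)


k = 0.5
v = 1000        # hypothetical maxDoc used as V
freq_x = 10     # frequency of the prefix (the n-1 gram)

observed = score_bigram(6, freq_x, k, v)    # bigram seen 6 times after the prefix
unobserved = score_bigram(0, freq_x, k, v)  # bigram never seen after the prefix
```

Since freq(x_y) can never exceed freq(x), the result stays at or below 1 whenever V >= 1, and an unseen bigram still receives a small non-zero score.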

> but the important part is that it's divided by the total term frequency of the prefix (n-1 gram) which is always greater than the sum_ttf of the entire ngram.

Agreed. Except that it's divided by the frequency of the (n-1 gram), which is either ttf or df.

With that, I think we are on the same page, and I'll go fix LaplaceScorer to also override scoreUnigram and implement with add-k smoothing.

> The REST tests are failing due to the new formula:

Hmm, it didn't fail when I ran gradle test. Should I run something else (maybe gradle itest :) )?

jimczi · 2017-10-30T10:13:45Z

> The formula that we use is (k + freq(x_y)) / (freq(x) + kV). freq() is computed either as ttf = total_term_freq or df = document freq. Just to be clear, sum_ttf denotes the sum of all term frequencies in a field, and ttf denotes the total term frequency of a single term.

We are on the same page, ok with the naming.

> Agreed. Except that it's divided by the frequency of the (n-1 gram), which is either ttf or df.

Same here, naming issue. Thanks for using the right one ;).

> Hmm, it didn't fail when I ran gradle test

To run all the tests you should use gradle check. gradle test only runs unit tests.

> With that, I think we are on the same page, and I'll go fix LaplaceScorer to also override scoreUnigram and implement with add-k smoothing.

Ok. I am a bit concerned because it's the default scoring that we're changing but I am good if we consider this as a breaking change (same for the bigram/trigram change actually). The phrase_suggester is not easy to tune right so some users may rely on the current formula even though it was a bit buggy.

shaie · 2017-10-30T19:26:22Z

@jimczi I've fixed the code as we discussed. I also hope that I've addressed all test failures.

jimczi

Thanks @shaie !
As discussed I'll merge the change in master only (7.0).

shaie · 2017-10-31T11:08:08Z

Thanks @jimczi!

s1monw · 2017-11-01T09:17:46Z

Looks good guys thanks for fixing this @shaie

* master: Enhances exists queries to reduce need for `_field_names` (elastic#26930) Added new terms_set query Set request body to required to reflect the code base (elastic#27188) Update Docker docs for 6.0.0-rc2 (elastic#27166) Add version 6.0.0 Docs: restore now fails if it encounters incompatible settings (elastic#26933) Convert index blocks to cluster block exceptions (elastic#27050) [DOCS] Link remote info API in Cross Cluster Search docs page Fix Laplace scorer to multiply by alpha (and not add) (elastic#27125) [DOCS] Clarify migrate guide and search request validation Raise IllegalArgumentException if query validation failed (elastic#26811) prevent duplicate fields when mixing parent and root nested includes (elastic#27072) TopHitsAggregator must propagate calls to `setScorer`. (elastic#27138)


dakrone added the :Search/Search Search-related issues that do not fall into other categories label Oct 26, 2017

dakrone requested a review from jpountz October 26, 2017 16:55

martijnvg mentioned this pull request Oct 27, 2017

Fix division by zero in phrase suggester that causes assertion to fail #27149

Merged

shaie added 2 commits October 30, 2017 21:22

Fix Laplace scorer to multiply by alpha (and not add)

1097f4e

Override scoreUnigram and fix doc test

404ffe8

shaie force-pushed the fix-laplace-smoothing branch from 5f05f64 to 404ffe8 Compare October 30, 2017 19:24

jimczi added >breaking v7.0.0 labels Oct 31, 2017

jimczi approved these changes Oct 31, 2017


jimczi merged commit bd02619 into elastic:master Oct 31, 2017

shaie deleted the fix-laplace-smoothing branch November 1, 2017 07:15

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019


7 participants