
[ML] Boost weighting for multiple adjacent words.#1903

Merged
edsavage merged 8 commits into elastic:master from edsavage:categorization_token_weighting
May 28, 2021

Conversation

@edsavage
Contributor

In an effort to categorise the most important parts of a message, give a boost to the weighting when 3 or more dictionary words in the message are adjacent to one another.

Relates to #1724
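The adjacency boost described above can be sketched roughly as follows. This is an illustrative sketch only: the dictionary contents, base weights, and boost amount are assumptions, not the actual ml-cpp values.

```python
# Hypothetical sketch of boosting the weight of runs of >= 3 adjacent
# dictionary words. All constants below are illustrative assumptions.
DICTIONARY = {"service", "was", "started", "reaper"}

BASE_WEIGHT = 1          # assumed weight for a non-dictionary token
DICT_WEIGHT = 6          # assumed weight for a dictionary word
ADJACENCY_BOOST = 5      # assumed extra weight per word in a boosted run
MIN_RUN = 3              # boost only applies to runs of 3+ dictionary words

def weight_tokens(tokens):
    """Return (token, weight) pairs, boosting runs of MIN_RUN+ dictionary words."""
    weights = [DICT_WEIGHT if t.lower() in DICTIONARY else BASE_WEIGHT
               for t in tokens]
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in DICTIONARY:
            # Find the end of this run of adjacent dictionary words.
            j = i
            while j < len(tokens) and tokens[j].lower() in DICTIONARY:
                j += 1
            if j - i >= MIN_RUN:
                # Run is long enough: boost every word in it.
                for k in range(i, j):
                    weights[k] += ADJACENCY_BOOST
            i = j
        else:
            i += 1
    return list(zip(tokens, weights))
```

With these assumed constants, "Service abcd was started" gets no boost (the longest dictionary run, "was started", has only 2 words), while every token of "Service reaper was started" is boosted, since all 4 words form one adjacent dictionary run.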

@droberts195

Some of the Java integration test failures suggest there's a bug somewhere.

For example, testStopOnWarn failed with this:

Expected: <warn>
 but: was <ok>

at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
at org.junit.Assert.assertThat(Assert.java:956)
at org.junit.Assert.assertThat(Assert.java:923)
at org.elasticsearch.xpack.ml.integration.CategorizationIT.lambda$testStopOnWarn$1(CategorizationIT.java:400)
at org.elasticsearch.xpack.ml.integration.CategorizationIT.testStopOnWarn(CategorizationIT.java:426)

Given that testStopOnWarn repeatedly feeds a single message for each partition, it seems impossible that changes to token weighting could affect whether this test sees 100 identical messages in a row. So it seems an unintentional change to the warn/ok status has been introduced somewhere.

edsavage added 2 commits May 27, 2021 11:58
Ensure that stateful functor is reset after each message string
tokenisation
Removing unnecessary casts from unit tests

@droberts195 droberts195 left a comment


LGTM if it passes CI now

@edsavage edsavage merged commit db7aeb9 into elastic:master May 28, 2021
edsavage added a commit to edsavage/ml-cpp that referenced this pull request May 28, 2021
edsavage added a commit that referenced this pull request May 28, 2021
Backports #1903
@edsavage edsavage deleted the categorization_token_weighting branch May 28, 2021 09:28
droberts195 added a commit to droberts195/ml-cpp that referenced this pull request May 23, 2022
In elastic#1903 we changed dictionary weighting in categorization to give
higher weighting when there were 3 or more adjacent dictionary
words. This was the first time that we'd ever had the situation
where the same token could have a different weight in different
messages. Unfortunately the way this interacted with us requiring
equal weights when checking for common tokens meant tokens could
be bizarrely removed from categories. For example, with the
following two messages we'd put them in the same category but say
that "started" was not a common token:

- Service abcd was started
- Service reaper was started

This happens because "abcd" is not a dictionary word but "reaper"
is, so then "started" has weight 6 in the first message but weight
31 in the second. Considering "started" to NOT be a common token
in this case is extremely bad both intuitively and for the accuracy
of drilldown searches.

Therefore this PR changes the categorization code to consider
tokens equal if their token IDs are equal, even if their weights
are different. Weights are now used only to compute the distance
between different tokens.
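As a rough illustration of this fix (the token IDs and weights below are made up for the example, not the actual ml-cpp representation):

```python
# Illustrative sketch: a token is an (id, weight) pair. Two tokens are
# "the same" iff their IDs match; weights no longer affect equality.

def common_tokens(msg_a, msg_b):
    """IDs of tokens appearing in both messages, ignoring weights."""
    ids_b = {tid for tid, _ in msg_b}
    return [tid for tid, _ in msg_a if tid in ids_b]

# "Service abcd was started" vs "Service reaper was started": "started"
# carries weight 6 in one message and 31 in the other, but it is still
# a common token because the token IDs match.
msg1 = [("service", 6), ("abcd", 1), ("was", 6), ("started", 6)]
msg2 = [("service", 31), ("reaper", 31), ("was", 31), ("started", 31)]
```

Here common_tokens(msg1, msg2) includes "started" despite the differing weights, which is the intuitively correct result for the two example messages.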

This necessitates another change. It is no longer as simple as it
used to be to calculate the highest and lowest possible total
weight of a message that might possibly be considered similar to
the current message. This calculation now needs to take account of
possible adjacency weighting, either in the current message or in
the messages being considered as matches. (This also has the side
effect that we'll do a higher number of expensive Levenshtein
distance calculations, as fewer potential matches will be discarded
early by the simple weight check.)
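The widened weight-bound pre-filter might look roughly like this. The similarity threshold and the maximum possible adjacency contribution are assumptions chosen for illustration, not the actual ml-cpp values:

```python
# Rough sketch of the weight-bound pre-filter described above. The
# constants are assumptions; the real bounds calculation differs.

SIMILARITY = 0.7          # assumed required similarity fraction
MAX_ADJACENCY_EXTRA = 25  # assumed max extra weight adjacency boosting
                          # could add per token

def weight_bounds(total_weight, n_tokens, similarity=SIMILARITY):
    """Lowest/highest total weight a candidate match may have.

    The window must allow for adjacency boosting on either side: the
    current message's tokens might be boosted in the candidate, or vice
    versa, so both bounds are widened by the maximum possible adjacency
    contribution.
    """
    slack = n_tokens * MAX_ADJACENCY_EXTRA
    low = similarity * total_weight - slack
    high = total_weight / similarity + slack
    return low, high

def maybe_similar(candidate_weight, current_weight, n_tokens):
    """Cheap check run before the expensive Levenshtein comparison."""
    low, high = weight_bounds(current_weight, n_tokens)
    return low <= candidate_weight <= high
```

Because the bounds are widened by the adjacency slack, more candidates pass this cheap check than before, which is exactly the side effect noted above: more of the expensive Levenshtein distance calculations get run.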
droberts195 added a commit that referenced this pull request May 23, 2022
droberts195 added a commit to droberts195/ml-cpp that referenced this pull request May 24, 2022

Backport of elastic#2277
droberts195 added a commit to droberts195/ml-cpp that referenced this pull request May 24, 2022

Backport of elastic#2277
droberts195 added a commit that referenced this pull request May 24, 2022

Backport of #2277
droberts195 added a commit that referenced this pull request May 24, 2022

Backport of #2277
