
[ML] Boost weighting for multiple adjacent words.#1903

Merged
edsavage merged 8 commits into elastic:master from edsavage:categorization_token_weighting
May 28, 2021

Conversation

@edsavage
Contributor

In an effort to categorise the most important parts of a message, give a boost to the weighting when 3 or more dictionary words in the message are adjacent to one another.

Relates to #1724
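The adjacency boost described above can be sketched roughly as follows. This is an illustrative sketch only: the dictionary contents, base weights, and boost amount are assumptions, not the actual ml-cpp values.

```python
# Hypothetical sketch of boosting the weight of runs of >= 3 adjacent
# dictionary words. All constants below are illustrative assumptions.
DICTIONARY = {"service", "was", "started", "reaper"}

BASE_WEIGHT = 1          # assumed weight for a non-dictionary token
DICT_WEIGHT = 6          # assumed weight for a dictionary word
ADJACENCY_BOOST = 5      # assumed extra weight per word in a boosted run
MIN_RUN = 3              # boost only applies to runs of 3+ dictionary words

def weight_tokens(tokens):
    """Return (token, weight) pairs, boosting runs of MIN_RUN+ dictionary words."""
    weights = [DICT_WEIGHT if t.lower() in DICTIONARY else BASE_WEIGHT
               for t in tokens]
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in DICTIONARY:
            # Find the end of this run of adjacent dictionary words.
            j = i
            while j < len(tokens) and tokens[j].lower() in DICTIONARY:
                j += 1
            if j - i >= MIN_RUN:
                # Run is long enough: boost every word in it.
                for k in range(i, j):
                    weights[k] += ADJACENCY_BOOST
            i = j
        else:
            i += 1
    return list(zip(tokens, weights))
```

With these assumed constants, "Service abcd was started" gets no boost (the longest dictionary run, "was started", has only 2 words), while every token of "Service reaper was started" is boosted, since all 4 words form one adjacent dictionary run.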

@droberts195

Some of the Java integration test failures suggest there's a bug somewhere.

For example, testStopOnWarn failed with this:

Expected: <warn>
 but: was <ok>

at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
at org.junit.Assert.assertThat(Assert.java:956)
at org.junit.Assert.assertThat(Assert.java:923)
at org.elasticsearch.xpack.ml.integration.CategorizationIT.lambda$testStopOnWarn$1(CategorizationIT.java:400)
at org.elasticsearch.xpack.ml.integration.CategorizationIT.testStopOnWarn(CategorizationIT.java:426)

Given that testStopOnWarn repeatedly feeds a single message for each partition, it seems impossible that changes to token weighting could affect whether this test sees 100 identical messages in a row. So it seems an unintentional change to the warn/ok status has been introduced somewhere.

edsavage added 2 commits May 27, 2021 11:58
Ensure that stateful functor is reset after each message string
tokenisation
Removing unnecessary casts from unit tests

@droberts195 droberts195 left a comment


LGTM if it passes CI now

@edsavage edsavage merged commit db7aeb9 into elastic:master May 28, 2021
edsavage added a commit to edsavage/ml-cpp that referenced this pull request May 28, 2021
edsavage added a commit that referenced this pull request May 28, 2021
Backports #1903
@edsavage edsavage deleted the categorization_token_weighting branch May 28, 2021 09:28
droberts195 added a commit to droberts195/ml-cpp that referenced this pull request May 23, 2022
In elastic#1903 we changed dictionary weighting in categorization to give
higher weighting when there were 3 or more adjacent dictionary
words. This was the first time that we'd ever had the situation
where the same token could have a different weight in different
messages. Unfortunately the way this interacted with us requiring
equal weights when checking for common tokens meant tokens could
be bizarrely removed from categories. For example, with the
following two messages we'd put them in the same category but say
that "started" was not a common token:

- Service abcd was started
- Service reaper was started

This happens because "abcd" is not a dictionary word but "reaper"
is, so then "started" has weight 6 in the first message but weight
31 in the second. Considering "started" to NOT be a common token
in this case is extremely bad both intuitively and for the accuracy
of drilldown searches.

Therefore this PR changes the categorization code to consider
tokens equal if their token IDs are equal, even if their weights
are different. Weights are now used only to compute the distance
between different tokens.
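As a rough illustration of this fix (the token IDs and weights below are made up for the example, not the actual ml-cpp representation):

```python
# Illustrative sketch: a token is an (id, weight) pair. Two tokens are
# "the same" iff their IDs match; weights no longer affect equality.

def common_tokens(msg_a, msg_b):
    """IDs of tokens appearing in both messages, ignoring weights."""
    ids_b = {tid for tid, _ in msg_b}
    return [tid for tid, _ in msg_a if tid in ids_b]

# "Service abcd was started" vs "Service reaper was started": "started"
# carries weight 6 in one message and 31 in the other, but it is still
# a common token because the token IDs match.
msg1 = [("service", 6), ("abcd", 1), ("was", 6), ("started", 6)]
msg2 = [("service", 31), ("reaper", 31), ("was", 31), ("started", 31)]
```

Here common_tokens(msg1, msg2) includes "started" despite the differing weights, which is the intuitively correct result for the two example messages.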

This necessitates another change. It is no longer as simple as it
used to be to calculate the highest and lowest possible total
weight of a message that might possibly be considered similar to
the current message. This calculation now needs to take account of
possible adjacency weighting, either in the current message or in
the messages being considered as matches. (This also has the side
effect that we'll do a higher number of expensive Levenshtein
distance calculations, as fewer potential matches will be discarded
early by the simple weight check.)
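The widened weight-bound pre-filter might look roughly like this. The similarity threshold and the maximum possible adjacency contribution are assumptions chosen for illustration, not the actual ml-cpp values:

```python
# Rough sketch of the weight-bound pre-filter described above. The
# constants are assumptions; the real bounds calculation differs.

SIMILARITY = 0.7          # assumed required similarity fraction
MAX_ADJACENCY_EXTRA = 25  # assumed max extra weight adjacency boosting
                          # could add per token

def weight_bounds(total_weight, n_tokens, similarity=SIMILARITY):
    """Lowest/highest total weight a candidate match may have.

    The window must allow for adjacency boosting on either side: the
    current message's tokens might be boosted in the candidate, or vice
    versa, so both bounds are widened by the maximum possible adjacency
    contribution.
    """
    slack = n_tokens * MAX_ADJACENCY_EXTRA
    low = similarity * total_weight - slack
    high = total_weight / similarity + slack
    return low, high

def maybe_similar(candidate_weight, current_weight, n_tokens):
    """Cheap check run before the expensive Levenshtein comparison."""
    low, high = weight_bounds(current_weight, n_tokens)
    return low <= candidate_weight <= high
```

Because the bounds are widened by the adjacency slack, more candidates pass this cheap check than before, which is exactly the side effect noted above: more of the expensive Levenshtein distance calculations get run.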
droberts195 added a commit that referenced this pull request May 23, 2022
droberts195 added a commit to droberts195/ml-cpp that referenced this pull request May 24, 2022

Backport of elastic#2277
droberts195 added a commit to droberts195/ml-cpp that referenced this pull request May 24, 2022

Backport of elastic#2277
droberts195 added a commit that referenced this pull request May 24, 2022

Backport of #2277
droberts195 added a commit that referenced this pull request May 24, 2022

Backport of #2277
