[ML] Make ml_standard tokenizer the default for new categorization jobs #73605

Merged
droberts195 merged 1 commit into elastic:7.x from droberts195:ml_standard_tokenizer_for_new_cat_jobs_7x
Jun 2, 2021

@droberts195

Categorization jobs created once the entire cluster is upgraded to
version 7.14 or higher will default to using the new ml_standard
tokenizer rather than the previous default of the ml_classic
tokenizer, and will incorporate the new first_non_blank_line char
filter so that categorization is based purely on the first non-blank
line of each message.
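
As a rough illustration of what "first non-blank line" means here, the following Python sketch mimics the idea on a multi-line message. It is a toy stand-in, not the actual char filter implementation; the function name and the sample message are hypothetical.

```python
def first_non_blank_line(message: str) -> str:
    """Toy stand-in: return the first line that contains non-whitespace text."""
    for line in message.splitlines():
        if line.strip():
            return line
    return ""  # no non-blank line found


# Hypothetical log message with leading blank lines and a stack trace.
sample = "\n\nNullPointerException in worker 7\n  at com.example.Foo.bar(Foo.java:42)\n"
print(first_non_blank_line(sample))
```

With a message like this, only the exception line would feed into categorization and the stack trace lines would be ignored.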

The difference between the ml_classic and ml_standard tokenizers
is that ml_classic splits on slashes and colons, so it creates multiple
tokens from URLs and filesystem paths, whereas ml_standard attempts
to keep URLs, email addresses and filesystem paths as single tokens.
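
To make the difference concrete, here is a deliberately simplified Python sketch. The two functions are hypothetical approximations written for this example: they only model the slash/colon behaviour described above and ignore every other rule the real tokenizers apply.

```python
import re


def toy_classic_tokens(message: str) -> list:
    # Toy approximation of ml_classic: split on whitespace, slashes and colons.
    return [t for t in re.split(r"[\s/:]+", message) if t]


def toy_standard_tokens(message: str) -> list:
    # Toy approximation of ml_standard: split on whitespace only,
    # so URLs and filesystem paths survive as single tokens.
    return message.split()


msg = "Failed to open /var/log/app.log via http://example.com/status"
print(toy_classic_tokens(msg))
print(toy_standard_tokens(msg))
```

In the first output the path and URL are broken into several tokens; in the second each stays whole.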

It is still possible to configure the ml_classic tokenizer if you
prefer: just provide a categorization_analyzer within your
analysis_config and whichever tokenizer you choose (which could be
ml_classic or any other Elasticsearch tokenizer) will be used.
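
A minimal sketch of such a configuration, built as a Python dict and serialized to JSON. The field name `message`, the detector and the bucket span are assumptions chosen only to make the example complete; the part that matters is the explicit `categorization_analyzer`.

```python
import json

# Hypothetical analysis_config for a categorization job that keeps ml_classic.
analysis_config = {
    "bucket_span": "15m",                      # assumed value
    "categorization_field_name": "message",    # assumed field name
    "categorization_analyzer": {
        "tokenizer": "ml_classic",             # explicit choice overrides the new default
    },
    "detectors": [
        {"function": "count", "by_field_name": "mlcategory"},
    ],
}

body = json.dumps({"analysis_config": analysis_config}, indent=2)
print(body)
```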

To opt out of using first_non_blank_line as a default char filter,
you must explicitly specify a categorization_analyzer that does not
include it.
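
For example, an analyzer along these lines would keep the ml_standard tokenizer while leaving out the char filter. This is a sketch of the relevant fragment only, not a complete job definition, and it does not attempt to reproduce the rest of the default analyzer.

```python
# Explicit analyzer: ml_standard tokenizer, no first_non_blank_line char filter.
categorization_analyzer = {
    "char_filter": [],           # nothing listed, so first_non_blank_line is not applied
    "tokenizer": "ml_standard",
}
print(categorization_analyzer)
```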

If no categorization_analyzer is specified but categorization_filters
are specified, then the categorization filters are converted to char
filters that are applied after first_non_blank_line.
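
The ordering can be sketched in Python as follows. The helper name is hypothetical, and representing each converted filter as a `pattern_replace` char filter is an assumption based on how the Elasticsearch documentation describes categorization_filters; this description itself only states that they come after first_non_blank_line.

```python
def char_filters_from_categorization_filters(categorization_filters):
    """Hypothetical helper: build the effective char filter list."""
    # first_non_blank_line always comes first ...
    char_filters = ["first_non_blank_line"]
    # ... followed by one char filter per categorization filter, in order.
    # Assumption: each regex becomes a pattern_replace char filter.
    for pattern in categorization_filters:
        char_filters.append({"type": "pattern_replace", "pattern": pattern})
    return char_filters


filters = [r"\[statement:.*\]"]  # example regex, chosen for illustration
print(char_filters_from_categorization_filters(filters))
```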

Backport of #72805

droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jun 1, 2021
Once elastic#73605 is merged this test should pass on master.
@droberts195 droberts195 merged commit 8cf1fdc into elastic:7.x Jun 2, 2021
@droberts195 droberts195 deleted the ml_standard_tokenizer_for_new_cat_jobs_7x branch June 2, 2021 06:04
droberts195 added a commit that referenced this pull request Jun 2, 2021
Once #73605 is merged this test should pass on master.