[ML] Fix end offset for first_non_blank_line char_filter by droberts195 · Pull Request #73828 · elastic/elasticsearch

droberts195 · 2021-06-07T11:48:57Z

When the input gets chopped by a char_filter immediately after
a token, that token must be reported as ending at the very end
of the original input, otherwise analysis will have incorrect
offsets when multiple field values are analyzed in the same
_analyze request.

The pattern_replace filter works like this. This PR changes
the new first_non_blank_line filter to work in the same way.

Fixes elastic/kibana#101255
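To make the offset problem concrete, here is a minimal Python sketch of the idea. It is not the Elasticsearch or Lucene implementation. The helper names (`first_non_blank_line`, `analyze_value`, `analyze_values`), the whitespace tokenizer, the `map_chop_to_end` switch, and the offset gap of 1 between values are all assumptions made for illustration. The "old" mapping shown is just one plausible way the end offset could fall short of the original input length.

```python
import re
from typing import List, Tuple

Token = Tuple[str, int, int]  # (term, start_offset, end_offset)


def first_non_blank_line(text: str) -> Tuple[str, int]:
    """Hypothetical stand-in for the char_filter: return the first line
    containing non-whitespace, plus its start offset in the original text."""
    pos = 0
    for line in text.split("\n"):
        if line.strip():
            return line, pos
        pos += len(line) + 1  # +1 for the newline that split() removed
    return "", len(text)


def analyze_value(text: str, map_chop_to_end: bool) -> Tuple[List[Token], int]:
    """Tokenize the filtered text on whitespace and map offsets back to the
    original input. Returns the tokens and the final offset for this value."""
    line, start = first_non_blank_line(text)

    def correct(offset: int) -> int:
        # With the fix, the end of the chopped output maps to the very end
        # of the original input, not to the point where the chop happened.
        if map_chop_to_end and offset == len(line):
            return len(text)
        return start + offset

    tokens = [(m.group(), correct(m.start()), correct(m.end()))
              for m in re.finditer(r"\S+", line)]
    return tokens, correct(len(line))


def analyze_values(values: List[str], map_chop_to_end: bool,
                   offset_gap: int = 1) -> List[Token]:
    """Analyze several field values in one request. Each value's offsets are
    shifted by the accumulated final offsets of the preceding values."""
    result: List[Token] = []
    base = 0
    for value in values:
        tokens, final_offset = analyze_value(value, map_chop_to_end)
        result.extend((term, base + s, base + e) for term, s, e in tokens)
        base += final_offset + offset_gap  # assumed gap between values
    return result


values = ["first line\nsecond line", "next value"]
old = analyze_values(values, map_chop_to_end=False)
fixed = analyze_values(values, map_chop_to_end=True)
print(old)    # second value starts at 11, inside the first value's original text
print(fixed)  # second value starts at 23, after the full 22-character first value
```

In this sketch the first value is 22 characters long. Without the end-of-input mapping, its final offset is 10, so the tokens of the second value are reported at offsets that still fall inside the first value. With the mapping, "line" ends at 22 and the second value's tokens start at 23.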

elasticmachine · 2021-06-07T11:49:00Z

Pinging @elastic/ml-core (Team:ML)

droberts195 · 2021-06-07T11:49:33Z

>non-issue because this is fixing unreleased functionality.

przemekwitek · 2021-06-07T11:55:38Z

x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/ml/ml_standard_analyze.yml

+            ],
+            "tokenizer" : "ml_standard",
+            "filter" : [
+              { "type" : "stop", "stopwords": [


Are these stopwords needed for this particular test case? I can't see them in the "text" sentences.

Yes, true, they do not matter for this test case. I would like to keep them in the config because the aim is to test the analyzer we default to in production, so I will add a test where they do matter.

droberts195 · 2021-06-07T16:37:30Z

@elasticmachine update branch

przemekwitek

LGTM

When the input gets chopped by a char_filter immediately after a token, that token must be reported as ending at the very end of the original input, otherwise analysis will have incorrect offsets when multiple field values are analyzed in the same _analyze request. The pattern_replace filter works like this. This PR changes the new first_non_blank_line filter to work in the same way. Backport of elastic#73828

Now that elastic#73882 is merged the test should pass on master. Relates elastic#73828

droberts195 added the >non-issue, :ml (Machine learning), v8.0.0, and v7.14.0 labels on Jun 7, 2021

elasticmachine added the Team:ML label (Meta label for the ML team) on Jun 7, 2021

benwtrent approved these changes Jun 7, 2021

przemekwitek reviewed Jun 7, 2021

Address review comment and test blacklisting

b8a99ba

Merge branch 'master' into fix_end_offset_for_first_non_blank_line_fi…

99a71f1

…lter

droberts195 merged commit 334ad82 into elastic:master Jun 7, 2021

droberts195 deleted the fix_end_offset_for_first_non_blank_line_filter branch June 7, 2021 17:18

przemekwitek reviewed Jun 8, 2021

droberts195 mentioned this pull request Jun 8, 2021

[ML] Fix end offset for first_non_blank_line char_filter #73882

Merged

droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jun 8, 2021

[ML] Unmute REST compat test after backport

d0135f1

Now that elastic#73882 is merged the test should pass on master. Relates elastic#73828

droberts195 mentioned this pull request Jun 8, 2021

[ML] Unmute REST compat test after backport #73885

Merged

droberts195 added a commit that referenced this pull request Jun 8, 2021

[ML] Unmute REST compat test after backport (#73885)

58e51a0

Now that #73882 is merged the test should pass on master. Relates #73828

jgowdyelastic mentioned this pull request Jun 22, 2021

[ML] Fixing categorization token highlighting for multi-line messages elastic/kibana#103007

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

droberts195 merged 3 commits into elastic:master from droberts195:fix_end_offset_for_first_non_blank_line_filter
