Skip to content

[ML] Fix end offset for first_non_blank_line char_filter#73828

Merged
droberts195 merged 3 commits intoelastic:masterfrom
droberts195:fix_end_offset_for_first_non_blank_line_filter
Jun 7, 2021
Merged

[ML] Fix end offset for first_non_blank_line char_filter#73828
droberts195 merged 3 commits intoelastic:masterfrom
droberts195:fix_end_offset_for_first_non_blank_line_filter

Conversation

@droberts195
Copy link
Copy Markdown

When the input gets chopped by a char_filter immediately after
a token, that token must be reported as ending at the very end
of the original input, otherwise analysis will have incorrect
offsets when multiple field values are analyzed in the same
_analyze request.

The pattern_replace filter works like this. This PR changes
the new first_non_blank_line filter to work in the same way.

Fixes elastic/kibana#101255

When the input gets chopped by a char_filter immediately after
a token, that token must be reported as ending at the very end
of the original input, otherwise analysis will have incorrect
offsets when multiple field values are analyzed in the same
_analyze request.

The pattern_replace filter works like this.  This PR changes
the new first_non_blank_line filter to work in the same way.

Fixes elastic/kibana#101255
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jun 7, 2021
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/ml-core (Team:ML)

@droberts195
Copy link
Copy Markdown
Author

>non-issue because this is fixing unreleased functionality.

],
"tokenizer" : "ml_standard",
"filter" : [
{ "type" : "stop", "stopwords": [
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these stopwords needed for this particular test case? I can't see them in the "text" sentences.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, true, they should not matter. I would like to keep them in the config because the aim is to test the analyzer we default to in production. So I will add a test where they do matter.

@droberts195
Copy link
Copy Markdown
Author

@elasticmachine update branch

@droberts195 droberts195 merged commit 334ad82 into elastic:master Jun 7, 2021
@droberts195 droberts195 deleted the fix_end_offset_for_first_non_blank_line_filter branch June 7, 2021 17:18
Copy link
Copy Markdown

@przemekwitek przemekwitek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jun 8, 2021
When the input gets chopped by a char_filter immediately after
a token, that token must be reported as ending at the very end
of the original input, otherwise analysis will have incorrect
offsets when multiple field values are analyzed in the same
_analyze request.

The pattern_replace filter works like this.  This PR changes
the new first_non_blank_line filter to work in the same way.

Backport of elastic#73828
droberts195 added a commit that referenced this pull request Jun 8, 2021
When the input gets chopped by a char_filter immediately after
a token, that token must be reported as ending at the very end
of the original input, otherwise analysis will have incorrect
offsets when multiple field values are analyzed in the same
_analyze request.

The pattern_replace filter works like this.  This PR changes
the new first_non_blank_line filter to work in the same way.

Backport of #73828
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jun 8, 2021
Now that elastic#73882 is merged the test should pass on master.

Relates elastic#73828
droberts195 added a commit that referenced this pull request Jun 8, 2021
Now that #73882 is merged the test should pass on master.

Relates #73828
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:ml Machine learning >non-issue Team:ML Meta label for the ML team v7.14.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[ML] Token highlighting is completely haywire on long messages in categorization wizard with new default analyzer

5 participants