Allow custom characters in token_chars of ngram tokenizers#49250
Merged
cbuescher merged 2 commits intoelastic:masterfrom Nov 20, 2019
Merged
Allow custom characters in token_chars of ngram tokenizers#49250cbuescher merged 2 commits intoelastic:masterfrom
cbuescher merged 2 commits intoelastic:masterfrom
Conversation
Collaborator
|
Pinging @elastic/es-search (:Search/Analysis) |
Currently the `token_chars` setting in both `edgeNGram` and `ngram` tokenizers only allows for a list of predefined character classes, which might not fit every use case. For example, including underscore "_" in a token would currently require the `punctuation` class which comes with a lot of other characters. This change adds an additional "custom" option to the `token_chars` setting, which requires an additional `custom_token_chars` setting to be present and which will be interpreted as a set of characters to inlcude into a token. Closes elastic#25894
Member
Author
|
@elasticmachine run elasticsearch-ci/packaging-sample-matrix |
romseygeek
approved these changes
Nov 19, 2019
Contributor
romseygeek
left a comment
There was a problem hiding this comment.
LGTM, one nit about the exception message but no need for another go-round.
| if (matcher == null) { | ||
| throw new IllegalArgumentException("Unknown token type: '" + characterClass + "', must be one of " + MATCHERS.keySet()); | ||
| if (characterClass.equals("custom") == false) { | ||
| throw new IllegalArgumentException("Unknown token type: '" + characterClass + "', must be one of " + MATCHERS.keySet()); |
Contributor
There was a problem hiding this comment.
I think we need to include custom in the list here as well?
Member
Author
There was a problem hiding this comment.
Sure, although the message here is a bit unwieldy as it is already. It seems to include a ton of unicode categories from the the java.lang.Character class via the MATCHERS, e.g. currently:
Unknown token type: 'letters', must be one of [symbol, private_use, paragraph_separator, start_punctuation, unassigned, enclosing_mark, connector_punctuation, letter_number, other_number, math_symbol, lowercase_letter, space_separator, surrogate, initial_quote_punctuation, decimal_digit_number, digit, other_punctuation, dash_punctuation, currency_symbol, non_spacing_mark, format, modifier_letter, control, uppercase_letter, other_symbol, end_punctuation, modifier_symbol, other_letter, line_separator, titlecase_letter, letter, punctuation, combining_spacing_mark, final_quote_punctuation, whitespace]
So the "custom" will be burried somewhere in there. If you have any ideas to improve this let me know, otherwise I'll merge this after I pushed the changes and tests passed.
cbuescher
pushed a commit
that referenced
this pull request
Nov 20, 2019
Currently the `token_chars` setting in both `edgeNGram` and `ngram` tokenizers only allows for a list of predefined character classes, which might not fit every use case. For example, including underscore "_" in a token would currently require the `punctuation` class which comes with a lot of other characters. This change adds an additional "custom" option to the `token_chars` setting, which requires an additional `custom_token_chars` setting to be present and which will be interpreted as a set of characters to inlcude into a token. Closes #25894
This was referenced Feb 3, 2020
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Currently the
token_charssetting in bothedgeNGramandngramtokenizersonly allows for a list of predefined character classes, which might not fit
every use case. For example, including underscore "_" in a token would currently
require the
punctuationclass which comes with a lot of other characters.This change adds an additional "custom" option to the
token_charssetting,which requires an additional
custom_token_charssetting to be present andwhich will be interpreted as a set of characters to inlcude into a token.
Closes #25894