Use preconfigured filters correctly in Analyze API#43568
Use preconfigured filters correctly in Analyze API#43568romseygeek merged 9 commits intoelastic:masterfrom
Conversation
|
Pinging @elastic/es-search |
|
This is an interesting failure: the pre-configured EdgeNGramTokenizer uses the lucene default values for min and max, which are both 1; however, the elasticsearch |
cbuescher
left a comment
There was a problem hiding this comment.
The main change in AnalysisRegistry makes sense to me looking at the underlying issue. When reading the test I was wondering if it would be better (and/or easier) to start testing AnalysisRegistry#buildCustomAnalyzer() more directly in AnalysisRegistryTests instead of going through TransportAnalyzeActionTests. Not a heard requirement for this PR but maybe there are things that could be pulled out or added easily.
+1, that makes sense, I'll open a follow-up issue |
cbuescher
left a comment
There was a problem hiding this comment.
Left a minor comment around testing, feel free to disregard if you don't like the idea. Rest LGTM
| VersionUtils.randomVersionBetween(random(), Version.V_7_0_0, VersionUtils.getPreviousVersion(Version.V_7_3_0))) | ||
| .put("index.analysis.analyzer.my_analyzer.tokenizer", "edge_ngram") | ||
| .build(); | ||
| IndexSettings idxSettings = IndexSettingsModule.newIndexSettings("index", indexSettings); |
There was a problem hiding this comment.
Maybe it would make sense to factor out the settings creation into a private helper that takes the version and the tokenizer name as arguments. Its more or less the same in all four cases and takes up a bit of space.
| .put("index.analysis.analyzer.my_analyzer.tokenizer", "standard") | ||
| .putList("index.analysis.analyzer.my_analyzer.filter", "word_delimiter_graph") | ||
| .build(); | ||
| IndexSettings idxSettings = IndexSettingsModule.newIndexSettings("index", indexSettings); |
When a named token filter or char filter is passed as part of an Analyze API request with no index, we currently try and build the relevant filter using no index settings. However, this can miss cases where there is a pre-configured filter defined in the analysis registry. One example here is the elision filter, which has a pre-configured version built with the french elision set; when used as part of normal analysis, this preconfigured set is used, but when used as part of the Analyze API we end up with NPEs because it tries to instantiate the filter with no index settings. This commit changes the Analyze API to check for pre-configured filters in the case that the request has no index defined, and is using a name rather than a custom definition for a filter. It also changes the pre-configured `word_delimiter_graph` filter and `edge_ngram` tokenizer to make their settings consistent with the defaults used when creating them with no settings Closes #43002 Closes #43621 Closes #43582
…r is used (#43684) #26625 deprecated delimited_payload_filter and added tests to check that warnings would be emitted when both a normal and pre-configured filter were used. Unfortunately, due to a bug in the Analyze API, the pre- configured filter check was never actually triggered, and it turns out that the deprecation warning was not in fact being emitted in this case. #43568 fixed the Analyze API bug, which then surfaced this on backport. This commit ensures that the preconfigured filter also emits the warnings and triggers an error if a new index tries to use a preconfigured delimited_payload_filter
edgeNGram and NGram tokenizers and token filters were deprecated. They have not been supported in indices created from 8.0, hence their support can entirely be removed from main. The version related logic around the min grams can also be removed as it refers to 7.x which we no longer need to support. Relates to elastic#50376, elastic#50862, elastic#43568
edgeNGram and NGram tokenizers and token filters were deprecated. They have not been supported in indices created from 8.0, hence their support can entirely be removed from main. The version related logic around the min grams can also be removed as it refers to 7.x which we no longer need to support. Relates to #50376, #50862, #43568
…c#113009) edgeNGram and NGram tokenizers and token filters were deprecated. They have not been supported in indices created from 8.0, hence their support can entirely be removed from main. The version related logic around the min grams can also be removed as it refers to 7.x which we no longer need to support. Relates to elastic#50376, elastic#50862, elastic#43568
When a named token filter or char filter is passed as part of an Analyze API
request with no index, we currently try and build the relevant filter using no
index settings. However, this can miss cases where there is a pre-configured
filter defined in the analysis registry. One example here is the elision filter, which
has a pre-configured version built with the french elision set; when used as part
of normal analysis, this preconfigured set is used, but when used as part of the
Analyze API we end up with NPEs because it tries to instantiate the filter with
no index settings.
This commit changes the Analyze API to check for pre-configured filters in the case
that the request has no index defined, and is using a name rather than a custom
definition for a filter.
Relates to #43002