Use preconfigured filters correctly in Analyze API by romseygeek · Pull Request #43568 · elastic/elasticsearch

romseygeek · 2019-06-25T08:56:33Z

When a named token filter or char filter is passed as part of an Analyze API
request with no index, we currently try to build the relevant filter using no
index settings. However, this can miss cases where there is a pre-configured
filter defined in the analysis registry. One example is the elision filter, which
has a pre-configured version built with the French elision set; when used as part
of normal analysis, this pre-configured set is used, but when used as part of the
Analyze API we end up with NPEs because it tries to instantiate the filter with
no index settings.

This commit changes the Analyze API to check for pre-configured filters in the case
that the request has no index defined, and is using a name rather than a custom
definition for a filter.

Relates to #43002
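
A minimal sketch of the lookup order described above, for illustration only. `FilterLookupSketch` and its methods are hypothetical stand-ins, not Elasticsearch classes, and filters are modelled as plain strings. The assumption is that a name is first checked against the pre-configured filters and only then handed to a settings-based factory with empty settings.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-in for an analysis registry; not an Elasticsearch class.
final class FilterLookupSketch {

    // Filters that are usable without any settings (e.g. elision with French articles).
    private final Map<String, Supplier<String>> preConfigured = new HashMap<>();

    // Factories that build a filter from settings; some may fail when given none.
    private final Map<String, Function<Map<String, String>, String>> factories = new HashMap<>();

    void registerPreConfigured(String name, Supplier<String> filter) {
        preConfigured.put(name, filter);
    }

    void registerFactory(String name, Function<Map<String, String>, String> factory) {
        factories.put(name, factory);
    }

    // Resolve a filter that a request with no index refers to by name only.
    String resolveByName(String name) {
        Supplier<String> pre = preConfigured.get(name);
        if (pre != null) {
            // Prefer the pre-configured version so no settings are needed.
            return pre.get();
        }
        Function<Map<String, String>, String> factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("no filter registered under [" + name + "]");
        }
        // Fall back to the global factory, called with empty settings.
        return factory.apply(Collections.emptyMap());
    }
}
```

With a pre-configured `elision` entry registered, `resolveByName("elision")` returns it without ever calling the settings-based factory, which is the path that would otherwise fail on missing settings.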

elasticmachine · 2019-06-25T08:56:35Z

Pinging @elastic/es-search


romseygeek · 2019-06-25T11:54:30Z

This is an interesting failure: the pre-configured EdgeNGramTokenizer uses the Lucene default values for min and max, which are both 1; however, the Elasticsearch EdgeNGramTokenizerFactory instead uses default values from NGramTokenizer, giving a min of 1 and a max of 2. This is a pretty clear bug: settings of tokenizer=edge_ngram and tokenizer.type=edge_ngram give you different configurations.
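
To make the difference concrete, here is a small self-contained sketch of edge n-gram generation for a single term. It is not the Lucene or Elasticsearch implementation; `EdgeNGramDefaultsSketch` and `edgeNGrams` are hypothetical names, and the sketch ignores token character classes and multi-term input.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: emits the prefixes of one term, as an edge n-gram tokenizer would.
final class EdgeNGramDefaultsSketch {

    // Returns the prefixes of term whose length lies in [minGram, maxGram].
    static List<String> edgeNGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= maxGram && len <= term.length(); len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }
}
```

With min=1 and max=1, `edgeNGrams("quick", 1, 1)` yields `[q]`; with min=1 and max=2, `edgeNGrams("quick", 1, 2)` yields `[q, qu]`. So the two ways of naming the tokenizer would produce different tokens for the same text.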

cbuescher

The main change in AnalysisRegistry makes sense to me looking at the underlying issue. When reading the test I was wondering if it would be better (and/or easier) to start testing AnalysisRegistry#buildCustomAnalyzer() more directly in AnalysisRegistryTests instead of going through TransportAnalyzeActionTests. Not a hard requirement for this PR, but maybe there are things that could be pulled out or added easily.


romseygeek · 2019-06-26T08:18:12Z

> When reading the test I was wondering if it would be better (and/or easier) to start testing AnalysisRegistry#buildCustomAnalyzer() more directly in AnalysisRegistryTests instead of going through TransportAnalyzeActionTests

+1, that makes sense, I'll open a follow-up issue

cbuescher

Left a minor comment around testing, feel free to disregard if you don't like the idea. Rest LGTM

cbuescher · 2019-06-26T12:59:52Z

...analysis-common/src/test/java/org/elasticsearch/analysis/common/EdgeNGramTokenizerTests.java

+                    VersionUtils.randomVersionBetween(random(), Version.V_7_0_0, VersionUtils.getPreviousVersion(Version.V_7_3_0)))
+                .put("index.analysis.analyzer.my_analyzer.tokenizer", "edge_ngram")
+                .build();
+            IndexSettings idxSettings = IndexSettingsModule.newIndexSettings("index", indexSettings);


Maybe it would make sense to factor out the settings creation into a private helper that takes the version and the tokenizer name as arguments. It's more or less the same in all four cases and takes up a bit of space.
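
A rough sketch of what such a helper could look like. This is an assumption-laden illustration, not the code that was eventually committed: it uses a plain `Map` in place of the Elasticsearch settings object seen in the diff, the class and method names are hypothetical, and the version is passed as a plain string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the test helper; a real test would build Elasticsearch settings instead.
final class SettingsHelperSketch {

    // Builds the settings shared by the test cases: the index-created version plus
    // an analyzer named my_analyzer that uses the given tokenizer.
    static Map<String, String> analyzerSettings(String createdVersion, String tokenizerName) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("index.version.created", createdVersion);
        settings.put("index.analysis.analyzer.my_analyzer.tokenizer", tokenizerName);
        return settings;
    }
}
```

Each test case would then reduce to one call such as `analyzerSettings("7.0.0", "edge_ngram")`, with only the version and tokenizer name varying.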

cbuescher · 2019-06-26T13:02:19Z

...c/test/java/org/elasticsearch/analysis/common/WordDelimiterGraphTokenFilterFactoryTests.java

+                .put("index.analysis.analyzer.my_analyzer.tokenizer", "standard")
+                .putList("index.analysis.analyzer.my_analyzer.filter", "word_delimiter_graph")
+                .build();
+            IndexSettings idxSettings = IndexSettingsModule.newIndexSettings("index", indexSettings);



When a named token filter or char filter is passed as part of an Analyze API request with no index, we currently try to build the relevant filter using no index settings. However, this can miss cases where there is a pre-configured filter defined in the analysis registry. One example is the elision filter, which has a pre-configured version built with the French elision set; when used as part of normal analysis, this pre-configured set is used, but when used as part of the Analyze API we end up with NPEs because it tries to instantiate the filter with no index settings.

This commit changes the Analyze API to check for pre-configured filters in the case that the request has no index defined, and is using a name rather than a custom definition for a filter.

It also changes the pre-configured `word_delimiter_graph` filter and `edge_ngram` tokenizer to make their settings consistent with the defaults used when creating them with no settings.

Closes #43002
Closes #43621
Closes #43582

…r is used (#43684)

#26625 deprecated delimited_payload_filter and added tests to check that warnings would be emitted when both a normal and pre-configured filter were used. Unfortunately, due to a bug in the Analyze API, the pre-configured filter check was never actually triggered, and it turns out that the deprecation warning was not in fact being emitted in this case. #43568 fixed the Analyze API bug, which then surfaced this on backport.

This commit ensures that the preconfigured filter also emits the warnings and triggers an error if a new index tries to use a preconfigured delimited_payload_filter.

edgeNGram and NGram tokenizers and token filters were deprecated. They have not been supported in indices created from 8.0, hence their support can entirely be removed from main. The version related logic around the min grams can also be removed as it refers to 7.x which we no longer need to support. Relates to elastic#50376, elastic#50862, elastic#43568


Use preconfig filters when no index and no config

7bdeed0

romseygeek added >bug, :Search Relevance/Analysis, v8.0.0, v7.3.0 labels Jun 25, 2019

romseygeek self-assigned this Jun 25, 2019

romseygeek mentioned this pull request Jun 25, 2019

Require [articles] setting in elision filter #43083

Merged

romseygeek added 2 commits June 25, 2019 11:35

Fall back to global components if there are no prebuilt ones under that name

9442e41

Add unit-test for fallback

04d5155

cbuescher self-requested a review June 25, 2019 13:19

Add fix for elastic#43582

fd4086b

cbuescher reviewed Jun 25, 2019


romseygeek added 2 commits June 26, 2019 09:01

Fixes elastic#43621

8c01ed3

Merge remote-tracking branch 'origin/master' into analyze-preconfig-filters

4e4d08a

cbuescher approved these changes Jun 26, 2019


romseygeek added 3 commits June 26, 2019 15:41

Merge remote-tracking branch 'origin/master' into analyze-preconfig-filters

99ba487

dry up tests a bit

b593496

checkstyle

62e9822

romseygeek merged commit fbefb46 into elastic:master Jun 27, 2019

romseygeek deleted the analyze-preconfig-filters branch June 27, 2019 08:08

romseygeek mentioned this pull request Jun 27, 2019

Issue deprecation warnings for preconfigured delimited_payload_filter #43684

Merged

Mpdreamz mentioned this pull request Aug 7, 2019

[meta] 7.3 Release elastic/elasticsearch-net#4001

Closed

16 tasks

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

javanna mentioned this pull request Sep 17, 2024

Remove deprecations and 7.x related code from analysis common #113009

Merged

romseygeek merged 9 commits into elastic:master from romseygeek:analyze-preconfig-filters


4 participants
