Skip to content

Allow TokenFilterFactories to rewrite themselves against their preceding chain#33702

Merged
romseygeek merged 10 commits intoelastic:masterfrom
romseygeek:synonym-chains
Sep 19, 2018
Merged

Allow TokenFilterFactories to rewrite themselves against their preceding chain#33702
romseygeek merged 10 commits intoelastic:masterfrom
romseygeek:synonym-chains

Conversation

@romseygeek
Copy link
Copy Markdown
Contributor

@romseygeek romseygeek commented Sep 14, 2018

We currently special-case SynonymFilterFactory and SynonymGraphFilterFactory, which need to know their predecessors in the analysis chain in order to correctly analyze their synonym lists. This special-casing doesn't work with Referring filter factories, such as the Multiplexer or Conditional filters. We also have a number of filters (eg the Multiplexer) that will break synonyms when they appear before them in a chain, because they produce multiple tokens at the same position.

This commit adds two methods to the TokenFilterFactory interface.

  • getChainAwareTokenFilterFactory() allows a filter factory to rewrite itself against its preceding filter chain, or to resolve references to other filters. It replaces ReferringFilterFactory and CustomAnalyzerProvider.checkAndApplySynonymFilter, and by default returns this.
  • runForSynonyms() defines whether or not a filter should be applied when building a synonym list Analyzer. By default it returns true.

Fixes #33609

@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-search-aggs

@romseygeek
Copy link
Copy Markdown
Contributor Author

This would also allow us to move the Synonym filters back into the common analysis module, as the special-casing is no longer required.

Another follow-up could be to add a runForNormalizer method, replacing the slightly hacky MultiTermComponent instance checks, and making it much easier to add new filters to the normalization whitelist.

@jpountz
Copy link
Copy Markdown
Contributor

jpountz commented Sep 14, 2018

I like the idea in general. Maybe we should have eg. TokenFilterFactory getSynonymAnalysisComponent or something similar instead of runForSynonyms to be consistent with multi-term analysis?

Another follow-up could be to add a runForNormalizer method, replacing the slightly hacky MultiTermComponent instance checks, and making it much easier to add new filters to the normalization whitelist.

Agreed that MultiTermComponent is a bit hacky + not type safe which is a pity, we need to fix it. Hopefully we can do it in Lucene at the same time so that the way that this problem is handled on both sides remains similar.

Copy link
Copy Markdown
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments, this is a great change for the ootb user-experience.


@Override
public TokenFilterFactory getSynonymFilter() {
return null;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be null or a dummy factory that doesn't wrap the token stream? I'm afraid that we might be exposing ourselves to null-pointer exceptions?

public TokenFilterFactory getChainAwareTokenFilterFactory(TokenizerFactory tokenizer, List<CharFilterFactory> charFilters,
List<TokenFilterFactory> previousTokenFilters,
Function<String, TokenFilterFactory> allFilters) {
if (filters == null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we worry about concurrency? Caching makes me a little uncomfortable due to the fact that it implies we expect that this will always be called with the same arguments, should we change the API or try to detect misusage?

@romseygeek
Copy link
Copy Markdown
Contributor Author

test this please

@romseygeek
Copy link
Copy Markdown
Contributor Author

I refactored things a bit to remove the caching. ScriptedConditionTokenFilter and MultiplexerTokenFilter now generate everything from scratch and return new TokenFilterFactory instances.

Copy link
Copy Markdown
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@romseygeek romseygeek merged commit 5107949 into elastic:master Sep 19, 2018
@romseygeek romseygeek deleted the synonym-chains branch September 19, 2018 14:52
romseygeek added a commit that referenced this pull request Sep 19, 2018
…ing chain (#33702)

We currently special-case SynonymFilterFactory and SynonymGraphFilterFactory, which need to
know their predecessors in the analysis chain in order to correctly analyze their synonym lists. This
special-casing doesn't work with Referring filter factories, such as the Multiplexer or Conditional
filters. We also have a number of filters (eg the Multiplexer) that will break synonyms when they
appear before them in a chain, because they produce multiple tokens at the same position.

This commit adds two methods to the TokenFilterFactory interface.

* `getChainAwareTokenFilterFactory()` allows a filter factory to rewrite itself against its preceding
  filter chain, or to resolve references to other filters. It replaces `ReferringFilterFactory` and
  `CustomAnalyzerProvider.checkAndApplySynonymFilter`, and by default returns `this`.
* `getSynonymFilter()` defines whether or not a filter should be applied when building a synonym
  list `Analyzer`. By default it returns `true`.

Fixes #33609
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

keyword_repeat and multiplexer don't play well with subsequent synonym filters

4 participants