Allow specific characters in token_chars of edge ngram tokenizer in addition to classes #25894

@edudar

I bumped into this while implementing autocomplete/typeahead functionality with highlighting.

My index settings are:

  analysis:
    tokenizer:
      autocomplete_highlight:
        type: edgeNGram
        min_gram: 1
        max_gram: 15
        token_chars: ["letter", "digit"]
    filter:
      autocomplete_ngram:
        type: edgeNGram
        min_gram: 1
        max_gram: 15
    analyzer:
      autocomplete_index:
        type: custom
        tokenizer: icu_tokenizer
        filter: [standard, icu_normalizer, icu_folding, autocomplete_ngram]
      autocomplete_search:
        type: custom
        tokenizer: icu_tokenizer
        filter: [standard, icu_normalizer, icu_folding, stop]
      autocomplete_highlight:
        type: custom
        tokenizer: autocomplete_highlight
        filter: [standard, icu_normalizer, icu_folding]

I search on the autocomplete field and highlight on autocomplete_highlight. Everything works fine until a search query contains _. icu_tokenizer keeps it, while the autocomplete_highlight tokenizer drops it because it keeps only letters and digits. I can't keep _ on its own: the only option is to add the full punctuation class, which brings in a whole load of additional symbols that I don't need and would then have to remove.

It would be helpful to be able to specify exact characters to keep, such as _.
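
To make the request concrete, here is a rough sketch of what such an option could look like on the tokenizer from my settings. The `custom` class and the `custom_token_chars` key are assumptions used purely for illustration; they are not part of the settings shown above, and the final naming would be up to the maintainers.

```yaml
analysis:
  tokenizer:
    autocomplete_highlight:
      type: edgeNGram
      min_gram: 1
      max_gram: 15
      # "custom" is an assumed extra class meaning "also keep the characters listed below"
      token_chars: ["letter", "digit", "custom"]
      # Assumed key: the exact characters to treat as part of a token
      custom_token_chars: "_"
```

With something like this, a value such as foo_bar would stay a single word for the highlight tokenizer, the same way icu_tokenizer treats it, without pulling in the rest of the punctuation class.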

At the moment I've implemented a char_filter that replaces _ with -, but that's suboptimal: _ is considered part of a word (as it is in icu_tokenizer) and is expected to match rather than be ignored.
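
For reference, the workaround can be sketched roughly as below, using the built-in mapping char filter. The filter name `underscore_to_hyphen` is made up for this example, and attaching it to the two icu_tokenizer analyzers is an assumption about where it needs to go.

```yaml
analysis:
  char_filter:
    # Hypothetical name; "mapping" is the built-in char filter type
    underscore_to_hyphen:
      type: mapping
      mappings: ["_ => -"]
  analyzer:
    autocomplete_index:
      type: custom
      # Assumption: applied before icu_tokenizer so it no longer sees "_"
      char_filter: [underscore_to_hyphen]
      tokenizer: icu_tokenizer
      filter: [standard, icu_normalizer, icu_folding, autocomplete_ngram]
    autocomplete_search:
      type: custom
      char_filter: [underscore_to_hyphen]
      tokenizer: icu_tokenizer
      filter: [standard, icu_normalizer, icu_folding, stop]
```

The effect is that every analyzer treats _ as a separator, so highlighting lines up again, but the _ itself is lost from the tokens.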
