Ngram/Edgengram filters don't work with keyword repeat filters #22478

@gibrown

Elasticsearch version: 2.3.3

Plugins installed: [analysis-icu analysis-smartcn delete-by-query lang-javascript whatson analysis-kuromoji analysis-stempel elasticsearch-inquisitor head langdetect statsd/]

Description of the problem including expected versus actual behavior:

I want to index edgengrams from 3 to 15 chars, but also keep the original token in the field. This is being used for search-as-you-type functionality. For both speed and relevancy reasons we've settled on 3 as the minimum number of characters that makes sense, but that leaves gaps for languages that aren't whitespace-separated and for short words like 'pi'.

I thought I could do this using keyword_repeat and unique filters in my analyzer, but that doesn't seem to work with edgengram filters. Maybe I'm doing it wrong, but I haven't come up with a workaround yet.

Steps to reproduce:

PUT test_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edgengram_analyzer": {
          "filter": [
            "icu_normalizer",
            "icu_folding",
            "keyword_repeat",
            "edgengram_filter",
            "unique_filter"
          ],
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        },
        "default": {
          "filter": [
            "icu_normalizer",
            "icu_folding"
          ],
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        }
      },
      "filter": {
        "unique_filter": {
          "type": "unique",
          "only_on_same_position": "true"
        },
        "edgengram_filter": {
          "type": "edgeNGram",
          "min_gram": "3",
          "max_gram": "15"
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_text": {
          "type": "string",
          "similarity": "BM25",
          "analyzer": "default",
          "fields": {
            "ngram": {
              "type": "string",
              "term_vector": "with_positions_offsets",
              "similarity": "BM25",
              "analyzer": "edgengram_analyzer",
              "search_analyzer": "default"
            },
            "word_count": {
              "type": "token_count",
              "analyzer": "default"
            }
          }
        }
      }
    }
  }
}

GET test_analyzer/_analyze 
{
  "analyzer": "edgengram_analyzer", 
  "text":     "Is this déjà vu?"
}

Output:

{
  "tokens": [
    {
      "token": "thi",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "dej",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 2
    },
    {
      "token": "deja",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 2
    }
  ]
}

I'd expect to get the tokens: is, thi, this, dej, deja, vu
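
To make the expected output concrete, here is a minimal Python sketch of the behavior I'm after. It is not Elasticsearch or Lucene code and says nothing about how the real filters work internally; the function name `edge_ngrams_keep_original` and the hard-coded token list (standing in for the output of the ICU tokenizer and folding filters) are assumptions for illustration only.

```python
def edge_ngrams_keep_original(tokens, min_gram=3, max_gram=15):
    """Return (position, term) pairs: edge n-grams plus the original token.

    Hypothetical helper that models the desired result, not the actual
    keyword_repeat / edgeNGram / unique filter chain.
    """
    out = []
    for position, token in enumerate(tokens):
        # Prefixes of length min_gram..max_gram, capped at the token length.
        longest = min(len(token), max_gram)
        grams = [token[:n] for n in range(min_gram, longest + 1)]
        # Always add the original token, then drop duplicates at this position.
        seen = set()
        for term in grams + [token]:
            if term not in seen:
                seen.add(term)
                out.append((position, term))
    return out


# Assumed output of tokenizing and folding "Is this déjà vu?"
tokens = ["is", "this", "deja", "vu"]
terms = [term for _, term in edge_ngrams_keep_original(tokens)]
print(terms)
```

Tokens shorter than `min_gram` ("is", "vu") produce no n-grams here, so they only survive because the original token is kept alongside the grams. Those are exactly the tokens missing from the actual `_analyze` output.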

The problem gets worse for languages that aren't whitespace-separated, where much of the text is tokenized into one character per token.

I could search across multiple fields, but that prevents me from matching on phrases and using those phrase matches to boost results. For instance, if the user types in "hi ther", we should be able to match instances where the content had "hi there" and use that to boost those exact matches. We do this by adding a simple should clause:

"bool": {
  "must": [
    {
      "multi_match": {
        "fields": [
          "mlt_content.default.ngram"
        ],
        "query": "hi ther",
        "operator": "and",
        "type": "cross_fields"
      }
    }
  ],
  "should": [
    {
      "multi_match": {
        "type": "phrase",
        "fields": [
          "mlt_content.default.ngram"
        ],
        "query": "hi ther"
      }
    }
  ]
}
