Preconfigured edge_ngram tokenizer has incorrect defaults #43582

@romseygeek

Description

The docs state:

With the default settings, the `edge_ngram` tokenizer treats the initial text as a
single token and produces N-grams with minimum length `1` and maximum length
`2`:

This is correct if you define a new tokenizer of type `edge_ngram`, like so:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "my_ngram"
        }
      },
      "tokenizer" : {
        "my_ngram" : {
          "type" : "edge_ngram"
        }
      }
    }
  }
}
GET test/_analyze
{
  "analyzer" : "default",
  "text" : "test"
}
{
  "tokens" : [
    {
      "token" : "t",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "te",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    }
  ]
}

However, if you instead use the preconfigured `edge_ngram` tokenizer, you only get n-grams of size 1:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "edge_ngram"
        }
      }
    }
  }
}
GET test/_analyze
{
  "analyzer" : "default",
  "text" : "test"
}
{
  "tokens" : [
    {
      "token" : "t",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    }
  ]
}

We should change the preconfigured tokenizer to correspond to the documentation.
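
Until the defaults are aligned, a possible workaround (a sketch, using the documented `min_gram`/`max_gram` settings of the `edge_ngram` tokenizer) is to define a custom tokenizer with the gram lengths set explicitly, so behaviour does not depend on the preconfigured defaults:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "my_ngram"
        }
      },
      "tokenizer" : {
        "my_ngram" : {
          "type" : "edge_ngram",
          "min_gram" : 1,
          "max_gram" : 2
        }
      }
    }
  }
}

With these explicit settings, `GET test/_analyze` on `"test"` should return both `t` and `te`, matching the documented behaviour.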
