Ngram/Edgengram filters don't work with keyword repeat filters #22478
Description
Elasticsearch version: 2.3.3
Plugins installed: [analysis-icu, analysis-smartcn, delete-by-query, lang-javascript, whatson, analysis-kuromoji, analysis-stempel, elasticsearch-inquisitor, head, langdetect, statsd]
Description of the problem including expected versus actual behavior:
I want to index edge n-grams from 3 to 15 characters, but also keep the original token in the field. This is used for search-as-you-type functionality. For both speed and relevancy reasons we've settled on 3 as the minimum number of characters that makes sense, but that leaves gaps for languages that aren't whitespace-separated and for short words like 'pi'.
I thought I could do this using the keyword_repeat and unique filters in my analyzer, but that doesn't seem to work with edge n-gram filters. Maybe I'm doing it wrong, but I haven't found a workaround yet.
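To make the intent concrete, here's a rough pure-Python sketch (my own simplified model, not Lucene's actual token-stream machinery) of what I expect the chain keyword_repeat → edgeNGram → unique(only_on_same_position) to produce:

```python
def edge_ngrams(token, min_gram=3, max_gram=15):
    """Edge n-grams of a token; empty when the token is shorter than min_gram."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def intended_chain(tokens, min_gram=3, max_gram=15):
    """keyword_repeat keeps the original token at the same position,
    edgeNGram expands the copy, and unique drops same-position duplicates."""
    out = []
    for tok in tokens:
        seen = set()
        # the n-grams plus the keyword_repeat copy of the original token
        for candidate in edge_ngrams(tok, min_gram, max_gram) + [tok]:
            if candidate not in seen:  # unique with only_on_same_position
                seen.add(candidate)
                out.append(candidate)
    return out

print(intended_chain(["is", "this", "deja", "vu"]))
# → ['is', 'thi', 'this', 'dej', 'deja', 'vu']
```

Short tokens like "is" and "vu" produce no n-grams, but the keyword_repeat copy should still survive.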
Steps to reproduce:
PUT test_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edgengram_analyzer": {
          "filter": [
            "icu_normalizer",
            "icu_folding",
            "keyword_repeat",
            "edgengram_filter",
            "unique_filter"
          ],
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        },
        "default": {
          "filter": [
            "icu_normalizer",
            "icu_folding"
          ],
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        }
      },
      "filter": {
        "unique_filter": {
          "type": "unique",
          "only_on_same_position": "true"
        },
        "edgengram_filter": {
          "type": "edgeNGram",
          "min_gram": "3",
          "max_gram": "15"
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_text": {
          "type": "string",
          "similarity": "BM25",
          "analyzer": "default",
          "fields": {
            "ngram": {
              "type": "string",
              "term_vector": "with_positions_offsets",
              "similarity": "BM25",
              "analyzer": "edgengram_analyzer",
              "search_analyzer": "default"
            },
            "word_count": {
              "type": "token_count",
              "analyzer": "default"
            }
          }
        }
      }
    }
  }
}
GET test_analyzer/_analyze
{
  "analyzer": "edgengram_analyzer",
  "text": "Is this déjà vu?"
}
Output:
{
  "tokens": [
    {
      "token": "thi",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "dej",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 2
    },
    {
      "token": "deja",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 2
    }
  ]
}
I'd expect to get the tokens: is, thi, this, dej, deja, vu.
The problem gets worse for languages without whitespace separation, where many characters are tokenized into single-character tokens that fall below min_gram.
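My reading of the output above is that the edgeNGram filter transforms every incoming token, including the duplicate that keyword_repeat injected, and silently drops anything shorter than min_gram. A sketch of that observed behavior (again a simplified model of mine, not the actual Lucene code):

```python
def observed_chain(tokens, min_gram=3, max_gram=15):
    """Every token is n-grammed, originals included, so tokens shorter
    than min_gram vanish from the stream entirely."""
    out = []
    for tok in tokens:
        grams = [tok[:n] for n in range(min_gram, min(len(tok), max_gram) + 1)]
        out.extend(dict.fromkeys(grams))  # dedupe per position, keep order
    return out

print(observed_chain(["is", "this", "deja", "vu"]))
# → ['thi', 'this', 'dej', 'deja']
```

That matches the four tokens returned by _analyze: "is" and "vu" are gone even though keyword_repeat should have protected them.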
I could search across multiple fields instead, but that prevents me from matching on phrases and using those phrase matches to boost results. For instance, if the user types "hi ther", we should be able to match documents containing "hi there" and use that to boost those near-exact matches. We do this by adding a simple should clause:
"bool": {
  "must": [
    {
      "multi_match": {
        "fields": [
          "mlt_content.default.ngram"
        ],
        "query": "hi ther",
        "operator": "and",
        "type": "cross_fields"
      }
    }
  ],
  "should": [
    {
      "multi_match": {
        "type": "phrase",
        "fields": [
          "mlt_content.default.ngram"
        ],
        "query": "hi ther"
      }
    }
  ]
}