[DOCS] Reformat n-gram token filter docs #49438
jrodewig merged 4 commits into elastic:master from jrodewig:reformat.gram-token-filters
Conversation
Reformats the edge n-gram and n-gram token filter docs. Changes include:
* Adds title abbreviations
* Updates the descriptions and adds Lucene links
* Reformats parameter definitions
* Adds analyze and custom analyzer snippets
* Adds notes explaining differences between the edge n-gram and n-gram filters

Additional changes:
* Switches titles to use "n-gram" throughout
* Fixes a typo in the edge n-gram tokenizer docs
* Adds an explicit anchor for the `index.max_ngram_diff` setting
Pinging @elastic/es-docs (>docs)
Pinging @elastic/es-search (:Search/Analysis)
romseygeek left a comment
Thanks @jrodewig - I left one question and one comment.
`side`::
(Optional, string)
Deprecated. Indicates whether to truncate tokens from the `front` or `back`.
Defaults to `front`.
Maybe add a note here that, rather than using `side: back`, users should add a `reverse` filter before and after this filter.
Thanks for this suggestion. Added with 111bf9b.
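To illustrate the suggestion above, here is a minimal Python sketch of why wrapping an edge n-gram step between two reversals yields n-grams anchored at the end of a token. It is only a simulation: `edge_ngrams` and `back_edge_ngrams` are hypothetical helper names, not Elasticsearch or Lucene APIs, and the sketch assumes the filter emits plain prefixes of each token.

```python
def edge_ngrams(token, min_gram=1, max_gram=2):
    """Return the prefixes of `token` with lengths min_gram..max_gram."""
    # Never ask for a prefix longer than the token itself.
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def back_edge_ngrams(token, min_gram=1, max_gram=2):
    """Simulate reverse -> edge n-gram -> reverse on a single token."""
    reversed_token = token[::-1]                      # first reverse step
    grams = edge_ngrams(reversed_token, min_gram, max_gram)
    return [gram[::-1] for gram in grams]             # second reverse step


print(edge_ngrams("quick"))       # ['q', 'qu']
print(back_edge_ngrams("quick"))  # ['k', 'ck']
```

The second reversal is what restores the original character order, so the result is suffixes such as `ck` rather than `kc`.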
--------------------------------------------------
[ t, q, b, f, j ]
--------------------------------------------------
I'm a bit confused here, as the default settings are min_gram of 1 and max_gram of 2, which should surely produce [ t, th, q, qu, b, br, f, fo, j, ju ]?
I experimented with this a bit more in 8.0 and 7.4.2 and found some odd behavior.
The following _analyze request produces only unigrams:
GET _analyze
{
  "tokenizer": "standard",
  "filter": [ "edge_ngram" ],
  "text": "the quick brown fox jumps"
}
However, treating edge_ngram as a custom filter with the standard defaults produces both unigrams and bigrams:
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "edge_ngram" }
  ],
  "text": "the quick brown fox jumps"
}
I updated the analyze example to use the custom filter format with aeab02b.
If you can, let me know if this is a bug, undocumented but expected behavior, or just my misunderstanding of how the _analyze API works. I'm happy to create a bug issue or document this behavior if needed.
Thanks!
This looks like a discrepancy between the pre-configured token filter and the default settings for a custom filter: the pre-configured filter uses a min_gram and max_gram of 1, but the custom defaults are 1 and 2. It's a bit weird, but it's always been done like that, so I guess we should just call it out explicitly in the documentation?
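The difference between the two settings can be reproduced with a small Python simulation. This is a sketch under stated assumptions, not the Lucene implementation: `edge_ngrams` and `analyze` are hypothetical helpers, and a whitespace split stands in for the `standard` tokenizer, which is enough for this lowercase, punctuation-free sample text.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Return the prefixes of `token` with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze(text, min_gram, max_gram):
    """Split on whitespace (stand-in tokenizer), then emit edge n-grams per token."""
    return [
        gram
        for token in text.split()
        for gram in edge_ngrams(token, min_gram, max_gram)
    ]


text = "the quick brown fox jumps"

# min_gram = max_gram = 1, as described for the pre-configured filter
print(analyze(text, 1, 1))
# ['t', 'q', 'b', 'f', 'j']

# min_gram = 1, max_gram = 2, as described for the custom filter defaults
print(analyze(text, 1, 2))
# ['t', 'th', 'q', 'qu', 'b', 'br', 'f', 'fo', 'j', 'ju']
```

The first result matches the unigram-only output in the docs snippet under review, and the second matches the output the reviewer expected from the documented defaults.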
Reformats the edge n-gram and n-gram token filter docs as part of #44726.