|
| 1 | +[[stemming]] |
| 2 | +=== Stemming |
| 3 | + |
| 4 | +_Stemming_ is the process of reducing a word to its root form. This ensures |
| 5 | +variants of a word match during a search. |
| 6 | + |
| 7 | +For example, `walking` and `walked` can be stemmed to the same root word: |
| 8 | +`walk`. Once stemmed, an occurrence of either word would match the other in a |
| 9 | +search. |
| 10 | + |
| 11 | +Stemming is language-dependent but often involves removing prefixes and |
| 12 | +suffixes from words. |
| 13 | + |
| 14 | +In some cases, the root form of a stemmed word may not be a real word. For |
| 15 | +example, `jumping` and `jumpiness` can both be stemmed to `jumpi`. While `jumpi` |
| 16 | +isn't a real English word, it doesn't matter for search; if all variants of a |
| 17 | +word are reduced to the same root form, they will match correctly. |
| 18 | + |
| 19 | +[[stemmer-token-filters]] |
| 20 | +==== Stemmer token filters |
| 21 | + |
| 22 | +In {es}, stemming is handled by stemmer <<analyzer-anatomy-token-filters,token |
| 23 | +filters>>. These token filters can be categorized based on how they stem words: |
| 24 | + |
| 25 | +* <<algorithmic-stemmers,Algorithmic stemmers>>, which stem words based on a set |
| 26 | +of rules |
| 27 | +* <<dictionary-stemmers,Dictionary stemmers>>, which stem words by looking them |
| 28 | +up in a dictionary |
| 29 | + |
| 30 | +Because stemming changes tokens, we recommend using the same stemmer token |
| 31 | +filters during <<analysis-index-search-time,index and search analysis>>. |
| 32 | + |
| 33 | +[[algorithmic-stemmers]] |
| 34 | +==== Algorithmic stemmers |
| 35 | + |
| 36 | +Algorithmic stemmers apply a series of rules to each word to reduce it to its |
| 37 | +root form. For example, an algorithmic stemmer for English may remove the `-s` |
| 38 | +and `-es` suffixes from the end of plural words. |
| 39 | + |
| 40 | +Algorithmic stemmers have a few advantages: |
| 41 | + |
| 42 | +* They require little setup and usually work well out of the box. |
| 43 | +* They use little memory. |
| 44 | +* They are typically faster than <<dictionary-stemmers,dictionary stemmers>>. |
| 45 | + |
| 46 | +However, most algorithmic stemmers only alter the existing text of a word. This |
| 47 | +means they may not work well with irregular words that don't contain their root |
| 48 | +form, such as: |
| 49 | + |
| 50 | +* `be`, `are`, and `am` |
| 51 | +* `mouse` and `mice` |
| 52 | +* `foot` and `feet` |
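
As a rough illustration of rule-based stemming and its limits, the following Python sketch strips a few English suffixes. It is a toy under stated assumptions: the `SUFFIX_RULES` list and the `toy_stem` function are hypothetical and are not the rules used by any {es} token filter.

```python
# Toy rule-based stemmer. Illustration only: these rules are hypothetical
# and far simpler than the Porter or Snowball algorithms.
SUFFIX_RULES = ["ing", "ed", "es", "s"]  # checked in this order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    # No rule applies, so the word is returned unchanged.
    return word


# Regular variants collapse to a shared root.
print(toy_stem("walking"), toy_stem("walked"))
# Irregular variants do not: no suffix rule can relate `mice` to `mouse`.
print(toy_stem("mice"), toy_stem("mouse"))
```

Because the rules only rewrite the text of the word itself, `mice` and `mouse` stay distinct tokens and would not match each other in a search.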
| 53 | + |
| 54 | +The following token filters use algorithmic stemming: |
| 55 | + |
| 56 | +* <<analysis-stemmer-tokenfilter,`stemmer`>>, which provides algorithmic |
| 57 | +stemming for several languages, some with additional variants. |
| 58 | +* <<analysis-kstem-tokenfilter,`kstem`>>, a stemmer for English that combines |
| 59 | +algorithmic stemming with a built-in dictionary. |
| 60 | +* <<analysis-porterstem-tokenfilter,`porter_stem`>>, our recommended algorithmic |
| 61 | +stemmer for English. |
| 62 | +* <<analysis-snowball-tokenfilter,`snowball`>>, which uses |
| 63 | +http://snowball.tartarus.org/[Snowball]-based stemming rules for several |
| 64 | +languages. |
| 65 | + |
| 66 | +[[dictionary-stemmers]] |
| 67 | +==== Dictionary stemmers |
| 68 | + |
| 69 | +Dictionary stemmers look up words in a provided dictionary, replacing unstemmed |
| 70 | +word variants with stemmed words from the dictionary. |
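
Conceptually, the lookup can be sketched in Python as follows. This is a simplification: the `TOY_DICTIONARY` mapping and the `toy_dictionary_stem` function are hypothetical, and real dictionary stemmers typically also apply prefix and suffix rules that ship with the dictionary.

```python
# Toy dictionary stemmer. Illustration only: the dictionary below is
# hand-written and hypothetical, not taken from a real dictionary file.
TOY_DICTIONARY = {
    "mice": "mouse",
    "feet": "foot",
    "are": "be",
    "am": "be",
}


def toy_dictionary_stem(word: str, dictionary: dict = TOY_DICTIONARY) -> str:
    """Return the dictionary's stem for a word, or the word itself if absent."""
    return dictionary.get(word, word)


# Irregular variants map to a shared root.
print(toy_dictionary_stem("mice"), toy_dictionary_stem("feet"))
# Words missing from the dictionary pass through unstemmed.
print(toy_dictionary_stem("broker"))
```

The pass-through behavior for unknown words shows why results depend so heavily on how complete the dictionary is.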
| 71 | + |
| 72 | +In theory, dictionary stemmers are well suited for: |
| 73 | + |
| 74 | +* Stemming irregular words |
| 75 | +* Discerning between words that are spelled similarly but not related |
| 76 | +conceptually, such as: |
| 77 | +** `organ` and `organization` |
| 78 | +** `broker` and `broken` |
| 79 | + |
| 80 | +In practice, algorithmic stemmers typically outperform dictionary stemmers. This |
| 81 | +is because dictionary stemmers have the following disadvantages: |
| 82 | + |
| 83 | +* *Dictionary quality* + |
| 84 | +A dictionary stemmer is only as good as its dictionary. To work well, these |
| 85 | +dictionaries must include a significant number of words, be updated regularly, |
| 86 | +and change with language trends. Often, by the time a dictionary has been made |
| 87 | +available, it's incomplete and some of its entries are already outdated. |
| 88 | + |
| 89 | +* *Size and performance* + |
| 90 | +A dictionary stemmer must load all words, prefixes, and suffixes from its |
| 91 | +dictionary into memory. This can use a significant amount of RAM. Low-quality |
| 92 | +dictionaries may also be less efficient with prefix and suffix removal, which |
| 93 | +can slow the stemming process significantly. |
| 94 | + |
| 95 | +You can use the <<analysis-hunspell-tokenfilter,`hunspell`>> token filter to |
| 96 | +perform dictionary stemming. |
| 97 | + |
| 98 | +[TIP] |
| 99 | +==== |
| 100 | +If available, we recommend trying an algorithmic stemmer for your language |
| 101 | +before using the <<analysis-hunspell-tokenfilter,`hunspell`>> token filter. |
| 102 | +==== |
| 103 | + |
| 104 | +[[control-stemming]] |
| 105 | +==== Control stemming |
| 106 | + |
| 107 | +Sometimes stemming can produce shared root words that are spelled similarly but |
| 108 | +not related conceptually. For example, a stemmer may reduce both `skies` and |
| 109 | +`skiing` to the same root word: `ski`. |
| 110 | + |
| 111 | +To prevent this and better control stemming, you can use the following token |
| 112 | +filters: |
| 113 | + |
| 114 | +* <<analysis-stemmer-override-tokenfilter,`stemmer_override`>>, which lets you |
| 115 | +define rules for stemming specific tokens. |
| 116 | +* <<analysis-keyword-marker-tokenfilter,`keyword_marker`>>, which marks |
| 117 | +specified tokens as keywords. Keyword tokens are not stemmed by subsequent |
| 118 | +stemmer token filters. |
| 119 | +* <<analysis-condition-tokenfilter,`conditional`>>, which can be used to mark |
| 120 | +tokens as keywords, similar to the `keyword_marker` filter. |
| 121 | + |
| 122 | + |
| 123 | +For built-in <<analysis-lang-analyzer,language analyzers>>, you can also use the |
| 124 | +<<_excluding_words_from_stemming,`stem_exclusion`>> parameter to specify a list |
| 125 | +of words that won't be stemmed. |
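
The idea behind keyword marking and stem exclusion can be sketched in Python as follows. This is a conceptual sketch only: `stem_unless_keyword` and `strip_plural_s` are hypothetical names, and the stand-in stemmer is much simpler than any {es} stemmer token filter.

```python
from typing import Callable, Iterable, List, Set


def strip_plural_s(token: str) -> str:
    """Hypothetical stand-in stemmer: drop a trailing `s` from longer words."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def stem_unless_keyword(
    tokens: Iterable[str],
    stemmer: Callable[[str], str],
    keywords: Set[str],
) -> List[str]:
    """Stem each token unless it is listed as a protected keyword."""
    return [token if token in keywords else stemmer(token) for token in tokens]


# `skies` is protected and left as-is; `runs` is stemmed as usual.
print(stem_unless_keyword(["skies", "runs"], strip_plural_s, {"skies"}))
```

Protected tokens skip the stemmer entirely, so they can only match their exact form in a search.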