
Commit 1c4e60e

[DOCS] Add stemming concept docs (#55156)
Adds conceptual documentation for stemming, including:

* An overview of why stemming is helpful in search
* Algorithmic vs. dictionary stemming
* Token filters used to control stemming, such as `stemmer_override`, `keyword_marker`, and `conditional`
1 parent 76170ed commit 1c4e60e

3 files changed

Lines changed: 128 additions & 0 deletions

File tree

docs/reference/analysis/analyzers/lang-analyzer.asciidoc

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -48,6 +48,7 @@ the config, or by using an external stopwords file by setting
4848
`stopwords_path`. Check <<analysis-stop-analyzer,Stop Analyzer>> for
4949
more details.
5050

51+
[[_excluding_words_from_stemming]]
5152
===== Excluding words from stemming
5253

5354
The `stem_exclusion` parameter allows you to specify an array

docs/reference/analysis/concepts.asciidoc

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,8 +8,10 @@ This section explains the fundamental concepts of text analysis in {es}.
88

99
* <<analyzer-anatomy>>
1010
* <<analysis-index-search-time>>
11+
* <<stemming>>
1112
* <<token-graphs>>
1213

1314
include::anatomy.asciidoc[]
1415
include::index-search-time.asciidoc[]
16+
include::stemming.asciidoc[]
1517
include::token-graphs.asciidoc[]
Lines changed: 125 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,125 @@
1+
[[stemming]]
2+
=== Stemming
3+
4+
_Stemming_ is the process of reducing a word to its root form. This ensures
5+
variants of a word match during a search.
6+
7+
For example, `walking` and `walked` can be stemmed to the same root word:
8+
`walk`. Once stemmed, an occurrence of either word would match the other in a
9+
search.
10+
11+
Stemming is language-dependent but often involves removing prefixes and
12+
suffixes from words.
13+
14+
In some cases, the root form of a stemmed word may not be a real word. For
15+
example, `jumping` and `jumpiness` can both be stemmed to `jumpi`. While `jumpi`
16+
isn't a real English word, it doesn't matter for search; if all variants of a
17+
word are reduced to the same root form, they will match correctly.
18+
19+
[[stemmer-token-filters]]
20+
==== Stemmer token filters
21+
22+
In {es}, stemming is handled by stemmer <<analyzer-anatomy-token-filters,token
23+
filters>>. These token filters can be categorized based on how they stem words:
24+
25+
* <<algorithmic-stemmers,Algorithmic stemmers>>, which stem words based on a set
26+
of rules
27+
* <<dictionary-stemmers,Dictionary stemmers>>, which stem words by looking them
28+
up in a dictionary
29+
30+
Because stemming changes tokens, we recommend using the same stemmer token
31+
filters during <<analysis-index-search-time,index and search analysis>>.
32+
33+
[[algorithmic-stemmers]]
34+
==== Algorithmic stemmers
35+
36+
Algorithmic stemmers apply a series of rules to each word to reduce it to its
37+
root form. For example, an algorithmic stemmer for English may remove the `-s`
38+
and `-es` suffixes from the end of plural words.
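As a rough illustration of rule-based stemming, the following Python sketch strips a few plural endings. The rule table and the `toy_stem` function are assumptions made up for this example; they are far simpler than the rules used by the `porter_stem` or `snowball` token filters.

```python
# Toy rule-based stemmer: an illustration only, not the Porter or Snowball algorithm.
# Rules are checked in order, so longer suffixes come first.
SUFFIX_RULES = [
    ("sses", "ss"),  # "classes" -> "class"
    ("ies", "i"),    # "ponies"  -> "poni"
    ("es", ""),      # "boxes"   -> "box"
    ("s", ""),       # "cats"    -> "cat"
]

def toy_stem(word):
    """Apply the first matching suffix rule; leave short words and "-ss" words alone."""
    if len(word) <= 3 or word.endswith("ss"):
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

for word in ["cats", "boxes", "classes", "mice", "feet"]:
    print(word, "->", toy_stem(word))
```

Because the rules only rewrite the characters already in the word, irregular plurals such as `mice` and `feet` pass through unchanged.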
39+
40+
Algorithmic stemmers have a few advantages:
41+
42+
* They require little setup and usually work well out of the box.
43+
* They use little memory.
44+
* They are typically faster than <<dictionary-stemmers,dictionary stemmers>>.
45+
46+
However, most algorithmic stemmers only alter the existing text of a word. This
47+
means they may not work well with irregular words that don't contain their root
48+
form, such as:
49+
50+
* `be`, `are`, and `am`
51+
* `mouse` and `mice`
52+
* `foot` and `feet`
53+
54+
The following token filters use algorithmic stemming:
55+
56+
* <<analysis-stemmer-tokenfilter,`stemmer`>>, which provides algorithmic
57+
stemming for several languages, some with additional variants.
58+
* <<analysis-kstem-tokenfilter,`kstem`>>, a stemmer for English that combines
59+
algorithmic stemming with a built-in dictionary.
60+
* <<analysis-porterstem-tokenfilter,`porter_stem`>>, our recommended algorithmic
61+
stemmer for English.
62+
* <<analysis-snowball-tokenfilter,`snowball`>>, which uses
63+
http://snowball.tartarus.org/[Snowball]-based stemming rules for several
64+
languages.
65+
66+
[[dictionary-stemmers]]
67+
==== Dictionary stemmers
68+
69+
Dictionary stemmers look up words in a provided dictionary, replacing unstemmed
70+
word variants with stemmed words from the dictionary.
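Conceptually, this is a lookup table. The Python sketch below uses a tiny hand-written dictionary as a stand-in; `STEM_DICTIONARY` and `dictionary_stem` are hypothetical names, and a real dictionary stemmer works from much larger word lists and affix rules.

```python
# Hypothetical, hand-written dictionary for illustration only.
STEM_DICTIONARY = {
    "mice": "mouse",
    "feet": "foot",
    "are": "be",
    "am": "be",
}

def dictionary_stem(word):
    """Return the dictionary's stem for a word, or the word itself if it is not listed."""
    return STEM_DICTIONARY.get(word.lower(), word)

for word in ["mice", "feet", "am", "organ", "organization"]:
    print(word, "->", dictionary_stem(word))
```

Words that are missing from the dictionary, such as `organization` here, are returned unchanged, which is why coverage matters so much for this approach.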
71+
72+
In theory, dictionary stemmers are well suited for:
73+
74+
* Stemming irregular words
75+
* Discerning between words that are spelled similarly but not related
76+
conceptually, such as:
77+
** `organ` and `organization`
78+
** `broker` and `broken`
79+
80+
In practice, algorithmic stemmers typically outperform dictionary stemmers. This
81+
is because dictionary stemmers have the following disadvantages:
82+
83+
* *Dictionary quality* +
84+
A dictionary stemmer is only as good as its dictionary. To work well, these
85+
dictionaries must include a significant number of words, be updated regularly,
86+
and change with language trends. Often, by the time a dictionary has been made
87+
available, it's incomplete and some of its entries are already outdated.
88+
89+
* *Size and performance* +
90+
A dictionary stemmer must load all words, prefixes, and suffixes from its
91+
dictionary into memory. This can use a significant amount of RAM. Low-quality
92+
dictionaries may also be less efficient with prefix and suffix removal, which
93+
can slow the stemming process significantly.
94+
95+
You can use the <<analysis-hunspell-tokenfilter,`hunspell`>> token filter to
96+
perform dictionary stemming.
97+
98+
[TIP]
99+
====
100+
If available, we recommend trying an algorithmic stemmer for your language
101+
before using the <<analysis-hunspell-tokenfilter,`hunspell`>> token filter.
102+
====
103+
104+
[[control-stemming]]
105+
==== Control stemming
106+
107+
Sometimes stemming can produce shared root words that are spelled similarly but
108+
not related conceptually. For example, a stemmer may reduce both `skies` and
109+
`skiing` to the same root word: `ski`.
110+
111+
To prevent this and better control stemming, you can use the following token
112+
filters:
113+
114+
* <<analysis-stemmer-override-tokenfilter,`stemmer_override`>>, which lets you
115+
define rules for stemming specific tokens.
116+
* <<analysis-keyword-marker-tokenfilter,`keyword_marker`>>, which marks
117+
specified tokens as keywords. Keyword tokens are not stemmed by subsequent
118+
stemmer token filters.
119+
* <<analysis-condition-tokenfilter,`conditional`>>, which can be used to mark
120+
tokens as keywords, similar to the `keyword_marker` filter.
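The interaction between these filters and a later stemmer can be modeled with a small Python sketch. Everything here is a simplified assumption: `Token`, `override`, `mark_keywords`, and `stem` are hypothetical helpers rather than {es} APIs, and the toy rules are chosen only so that `skies` and `skiing` collide on `ski`.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    keyword: bool = False  # keyword tokens are skipped by the later stemming step

def toy_stem(text):
    # Hypothetical rules chosen so "skies" and "skiing" both become "ski".
    for suffix in ("ing", "es"):
        if text.endswith(suffix):
            return text[: -len(suffix)]
    return text

def override(tokens, rules):
    # Modeled on an override step: replace matching tokens and protect the result.
    return [Token(rules[t.text], keyword=True) if t.text in rules else t for t in tokens]

def mark_keywords(tokens, keywords):
    # Modeled on a keyword-marking step: flag listed tokens without changing them.
    return [Token(t.text, keyword=t.keyword or t.text in keywords) for t in tokens]

def stem(tokens):
    # Stem every token that is not flagged as a keyword.
    return [t if t.keyword else Token(toy_stem(t.text)) for t in tokens]

tokens = [Token("skies"), Token("skiing")]
plain = [t.text for t in stem(tokens)]
overridden = [t.text for t in stem(override(tokens, {"skies": "sky"}))]
protected = [t.text for t in stem(mark_keywords(tokens, {"skies"}))]
print(plain, overridden, protected)
```

In this model the override or keyword step must run before the stemmer, because the stemmer only checks the flag that the earlier step has set.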
121+
122+
123+
For built-in <<analysis-lang-analyzer,language analyzers>>, you can also use the
124+
<<_excluding_words_from_stemming,`stem_exclusion`>> parameter to specify a list
125+
of words that won't be stemmed.
