@@ -15,28 +15,31 @@ documents that contain related words like `fast fox` or `foxes leap`.
1515[[tokenization]]
1616=== Tokenization
1717
18- Analysis makes full-text search possible by breaking an text down into smaller
19- chunks, called _tokens_. In most cases, these tokens are individual words.
18+ Analysis makes full-text search possible through _tokenization_: breaking a text
19+ down into smaller chunks, called _tokens_. In most cases, these tokens are
20+ individual words.
2021
21- For example, without analysis, the text `Quick brown fox` can only be
22- matched by searches for the exact string `Quick brown fox`. With analysis,
23- the text is converted to the tokens `[Quick, brown, fox]`, which can
24- be matched by searches for `Quick fox`, `fox brown`, or other variations.
22+ If you index the phrase `the quick brown fox jumps` as a single string and the
23+ user searches for `quick fox`, it isn't considered a match. However, if you
24+ tokenize the phrase and index each word separately, the terms in the query
25+ string can be looked up individually. The phrase can then be matched by
26+ searches for `quick fox`, `fox brown`, or other variations.
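The matching behavior described above can be sketched in a few lines of Python. This is a toy whitespace tokenizer and term lookup, not Elasticsearch's actual analysis chain; the `tokenize` and `matches` helpers are illustrative names only:

```python
# Toy sketch: index a phrase as whitespace-delimited tokens, then treat a
# query as a match if every query term appears among the indexed tokens.
def tokenize(text):
    return text.split()

indexed = set(tokenize("the quick brown fox jumps"))

def matches(query):
    return all(term in indexed for term in tokenize(query))

# The whole query string is not itself an indexed token...
assert "quick fox" not in indexed
# ...but once tokenized, its individual terms can be looked up.
assert matches("quick fox")
assert matches("fox brown")
```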
2527
26- While improved, this example search experience still has a few problems:
28+ [discrete]
29+ [[normalization]]
30+ === Normalization
31+
32+ Tokenization enables matching on individual terms, but each token is still
33+ matched literally. This means:
2734
2835* A search for `Quick` would not match `quick`, even though you likely want
2936either term to match the other
3037
31- * `fox` and `foxes` share the same root word. However,
32- a search for `foxes` would not match `fox` or vice versa.
38+ * Although `fox` and `foxes` share the same root word, a search for `foxes`
39+ would not match `fox` or vice versa.
3340
34- * While `jumps` and `leaps` don't share a root word, they are synonyms and have
35- a similar meaning. However, a search for one would not match the other.
36-
37- [discrete]
38- [[normalization]]
39- === Normalization
41+ * A search for `jumps` would not match `leaps`. While they don't share a root
42+ word, they are synonyms and have a similar meaning.
4043
4144To solve these problems, text analysis can _normalize_ these tokens into a
4245standard format. This allows you to match tokens that are not exactly the same
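The idea can be sketched as a toy normalization step. This is not Elasticsearch's analyzer implementation; the lowercase step, the tiny stem table, and the synonym map are all simplified stand-ins for real token filters:

```python
# Toy normalization sketch: lowercase each token, reduce a few known
# plurals to their root, and map synonyms to a shared term, so that
# differently written tokens compare equal.
STEMS = {"foxes": "fox", "jumps": "jump", "leaps": "leap"}
SYNONYMS = {"leap": "jump"}

def normalize(token):
    token = token.lower()              # `Quick` and `quick` now match
    token = STEMS.get(token, token)    # `foxes` and `fox` now match
    return SYNONYMS.get(token, token)  # `jumps` and `leaps` now match

assert normalize("Quick") == normalize("quick")
assert normalize("foxes") == normalize("fox")
assert normalize("jumps") == normalize("leaps")
```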