Commit b476887

Tokenization rework
1 parent 8b6766a commit b476887

1 file changed

Lines changed: 18 additions & 15 deletions

docs/reference/analysis/overview.asciidoc

@@ -15,28 +15,31 @@ documents that contain related words like `fast fox` or `foxes leap`.
 [[tokenization]]
 === Tokenization
 
-Analysis makes full-text search possible by breaking an text down into smaller
-chunks, called _tokens_. In most cases, these tokens are individual words.
+Analysis makes full-text search possible through _tokenization_: breaking a text
+down into smaller chunks, called _tokens_. In most cases, these tokens are
+individual words.
 
-For example, without analysis, the text `Quick brown fox` can only be
-matched by searches for the exact string `Quick brown fox`. With analysis,
-the text is converted to the tokens `[Quick, brown, fox]`, which can
-be matched by searches for `Quick fox`, `fox brown`, or other variations.
+If you index the phrase `the quick brown fox jumps` as a single string and the
+user searches for `quick fox`, it isn't considered a match. However, if you
+tokenize the phrase and index each word separately, the terms in the query
+string can be looked up individually. This means they can be matched by searches
+for `quick fox`, `fox brown`, or other variations.
 
-While improved, this example search experience still has a few problems:
+[discrete]
+[[normalization]]
+=== Normalization
+
+Tokenization enables matching on individual terms, but each token is still
+matched literally. This means:
 
 * A search for `Quick` would not match `quick`, even though you likely want
 either term to match the other
 
-* `fox` and `foxes` share the same root word. However,
-a search for `foxes` would not match `fox` or vice versa.
+* Although `fox` and `foxes` share the same root word, a search for `foxes`
+would not match `fox` or vice versa.
 
-* While `jumps` and `leaps` don't share a root word, they are synonyms and have
-a similar meaning. However, a search for one would not match the other.
-
-[discrete]
-[[normalization]]
-=== Normalization
+* A search for `jumps` would not match `leaps`. While they don't share a root
+word, they are synonyms and have a similar meaning.
 
 To solve these problems, text analysis can _normalize_ these tokens into a
 standard format. This allows you to match tokens that are not exactly the same
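
As a rough illustration of the behavior the revised text describes, the following Python sketch indexes the phrase `the quick brown fox jumps` with a toy whitespace tokenizer and looks up query terms individually. The helpers `tokenize`, `build_index`, and `search` are hypothetical names invented for this example and are not Elasticsearch APIs. The sketch assumes plain whitespace splitting with no normalization, which is enough to show both the benefit of tokenization and the literal-matching gaps listed in the bullets.

```python
# Illustrative sketch only: a toy whitespace tokenizer and an in-memory
# inverted index. All helper names are hypothetical, not Elasticsearch APIs.
from collections import defaultdict


def tokenize(text):
    """Split text into tokens on whitespace, with no normalization."""
    return text.split()


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token in the query."""
    postings = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*postings) if postings else set()


docs = {1: "the quick brown fox jumps"}
index = build_index(docs)

# Indexed as a single string, the phrase only matches the exact string.
whole_string_match = docs[1] == "quick fox"

# Indexed token by token, each query term is looked up individually.
tokenized_match = search(index, "quick fox")
reordered_match = search(index, "fox brown")

# Tokens are still matched literally, so these lookups find nothing.
case_miss = search(index, "Quick")
stem_miss = search(index, "foxes")
synonym_miss = search(index, "leaps")
```

In this sketch the three misses correspond to the three bullets in the Normalization section: case differences, shared root words, and synonyms. Handling them would need a normalization step applied to both the indexed text and the query, which is deliberately left out here.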
