[DOCS] Add overview page to analysis topic #50515
jrodewig merged 4 commits into elastic:master from jrodewig:analysis-overview
Pinging @elastic/es-docs (>docs)
Pinging @elastic/es-search (:Search/Analysis)
debadair left a comment:
Left some editorial suggestions. I think we need to make a call WRT what goes in the part intros vs subsections.
@@ -0,0 +1,75 @@
debadair commented:
It feels like the "overview" needs to come before the discussion and examples of index time vs search time analysis.
Do we want to follow the pattern of having separate overview sections for parts, or incorporate the overview into the part intros? It feels like we either want the part intros to be brief and contain the part nav (like the Anomaly detection topic), or provide the high-level overview for the part. I think we want to avoid having a meaty part intro AND a separate overview section.
jrodewig replied:
+1 to moving index time vs search time analysis after this section. I plan to make that change, but I think it works better as a separate PR to keep the diff size manageable. IMO the index time vs search time analysis section needs some work and could probably be split into a concept section and separate configuration tutorials.
Also agree re: avoiding a meaty intro AND a separate overview. My intent is to mirror the structure in machine learning and only provide a brief overview + nav: https://www.elastic.co/guide/en/machine-learning/master/xpack-ml.html
I plan to approach these changes iteratively, which means leaving the current top-level analysis topic as-is until I open those PRs.
_tokens_ or _terms_ which are added to the inverted index for searching.
_Text analysis_ is the process of converting text, like the body of any email,
into _tokens_ or _terms_ which are added to the inverted index for searching.
Analysis is performed by an <<analysis-analyzers,_analyzer_>> which can be
debadair commented:
This makes it sound like text analysis is only tokenization. It's almost impossible to pack everything into a single sentence. I'd expand on this and incorporate some mention of filters. Maybe something like:
Text analysis applies a combination of character filters, tokenizers, and token filters to transform a string of text into a collection of tokens that can be efficiently indexed and searched. Character filters add, remove, or modify characters in the text before a tokenizer breaks it down into individual tokens. For example, you could use a character filter to remove HTML tags so they aren't indexed. Once the text is tokenized, token filters add, remove, or modify tokens to inject synonyms, remove stopwords, and reduce each remaining token to its most basic form (stem).
At index time, the analyzed tokens are added to the index. At search time, the search terms undergo the same analysis so they can be compared to the indexed tokens.
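To make the suggested description concrete, here is a minimal Python sketch of such an analysis chain. It is only an illustration: the function names, the stopword list, the synonym map, and the suffix-stripping stemmer are all hypothetical stand-ins, not Elasticsearch APIs or its actual algorithms.

```python
import re

# Hypothetical stand-ins for analyzer components; none of these are Elasticsearch APIs.
STOPWORDS = {"a", "an", "the", "over"}
SYNONYMS = {"fast": "quick"}  # map a synonym to one canonical token


def strip_html(text):
    """Character filter: replace HTML tags with spaces before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: break the text into individual word tokens."""
    return re.findall(r"\w+", text)


def naive_stem(token):
    """Token filter: crude suffix stripping, for illustration only."""
    for suffix in ("es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Run character filter -> tokenizer -> token filters, in that order."""
    tokens = tokenize(strip_html(text))
    tokens = [t.lower() for t in tokens]                 # lowercase
    tokens = [SYNONYMS.get(t, t) for t in tokens]        # fold synonyms
    tokens = [t for t in tokens if t not in STOPWORDS]   # drop stopwords
    return [naive_stem(t) for t in tokens]               # reduce to a stem


# Index time: analyze the document. Search time: analyze the query the same way.
indexed = set(analyze("<p>The quick brown foxes jump over the lazy dog</p>"))
query = analyze("Fast fox jumps")
print(all(term in indexed for term in query))
```

Because the query goes through the same chain as the document, `Fast fox jumps` is reduced to tokens that can be compared directly with the indexed ones.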
jrodewig replied:
My goal was just to change "Analysis" to "Text analysis." I feel like this better differentiates the topic from other forms of analysis, like machine learning, etc.
I agree that the definition here needs a refresh, but I'd prefer to do that as part of a separate PR.
Text analysis enables {es} to perform full-text search, where the search returns
all _relevant_ results rather than just exact matches.

For example, a full-text search for `Quick fox jumps` should not only return
documents containing that exact search string but also documents containing
similar or related words, like `fast fox` or even `foxes leap`.
debadair commented:
Maybe flip it around to something like:
Searching unstructured text and returning the most relevant results requires being able to find more than exact matches. If you search for `Quick fox jumps`, you probably want the document that contains `A quick brown fox jumps over the lazy dog`, and you might also want documents that contain related phrases like `fast fox` or `foxes leap`. To enable this behavior, ES can perform text analysis at both index and query time.
It feels like this rationale should really come before the definition of text analysis, but if you keep the separate overview, you could follow this with a more detailed description of text analysis as an intro to the tokenization and normalization info.
jrodewig replied:
Reworded this based on your suggestion with 8b6766a.
However, I did remove the bit about index and query time analysis. I don't feel that level of detail is needed at this point. It'll also make it easier to relocate the current index/search time analysis info into their new home.
[discrete]
[[analysis-customization]]
=== Customize text analysis

debadair commented:
I think it might be helpful to introduce the notion of an analysis chain, and use that to lead into defining what an analyzer is. (Which is essentially the embodiment of the analysis chain.)
It would be nice to illustrate this with a diagram.
jrodewig replied:
My primary goal with this section is to show that users can customize analysis at the granular level. I didn't feel I could do that without a very high-level explanation of an analyzer, but I want to avoid mentioning specific components if possible.
I feel like a rework of "Anatomy of an analyzer" would be a better home for the analysis chain info:
https://www.elastic.co/guide/en/elasticsearch/reference/master/analyzer-anatomy.html
[[tokenization]]
=== Tokenization

Analysis makes full-text search possible by breaking an text down into smaller
debadair commented:
I'd put the focus on tokenization over analysis, and provide a bit more of the "why". Something like:
Indexing the contents of a field as individual tokens rather than a single string makes full-text search possible. If you index the phrase the quick brown fox jumps as a single string and the user searches for quick fox, it isn't considered a match. However, if you tokenize the phrase and index each word separately, the terms in the query string can be looked up individually.
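The difference can be sketched in a few lines of Python. This is a toy model under stated assumptions: whitespace splitting stands in for a real tokenizer, a dictionary of sets stands in for an inverted index, and `build_inverted_index` and `search` are hypothetical helpers rather than anything from Elasticsearch.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each whitespace-separated token to the IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every term in the query."""
    postings = [index.get(term, set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()


docs = {1: "the quick brown fox jumps"}

# Indexing the phrase as a single string: only the exact string can be looked up.
whole_string_index = {text: {doc_id} for doc_id, text in docs.items()}
print(whole_string_index.get("quick fox", set()))  # no match

# Indexing each token separately: query terms are looked up individually.
token_index = build_inverted_index(docs)
print(search(token_index, "quick fox"))  # matches document 1
```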
jrodewig replied:
Reworded with b476887. I did keep an updated version of the original intro para.
the text is converted to the tokens `[Quick, brown, fox]`, which can
be matched by searches for `Quick fox`, `fox brown`, or other variations.

While improved, this example search experience still has a few problems:
debadair commented:
I'd be inclined to push the bulk of this transition to the beginning of the following section.
"This example search experience..." is kind of awkward. I'd be inclined to say something like:
Tokenization enables matching on individual terms, but each token is still matched literally. This means that a search for "quick" won't match "Quick", and searching for "foxes" won't match "fox". You can improve matching by normalizing the tokens to a standard form.
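A small Python sketch of that point, assuming a deliberately crude `normalize` helper (lowercasing plus stripping a trailing `es`). The helper is hypothetical and far simpler than a real lowercase filter or stemmer.

```python
def normalize(token):
    """Lowercase the token and strip a trailing 'es' (toy stemming, for illustration)."""
    token = token.lower()
    if token.endswith("es") and len(token) > 4:
        token = token[:-2]
    return token


indexed_tokens = ["Quick", "brown", "fox"]

literal_index = set(indexed_tokens)                         # tokens stored as-is
normalized_index = {normalize(t) for t in indexed_tokens}   # tokens stored in standard form

for term in ("quick", "foxes"):
    literal_hit = term in literal_index
    normalized_hit = normalize(term) in normalized_index
    print(term, literal_hit, normalized_hit)
```

Both search terms miss against the literal tokens and match once the indexed tokens and the search terms are normalized the same way.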
jrodewig replied:
I reworded as suggested and moved most of this content into the Normalization section: 8b6766a
<<analysis-standard-analyzer,standard analyzer>>, which works well for most use
cases right out of the box.

If you want to further tailor your search experience, you can choose a different
debadair commented:
"Further tailor" doesn't seem right here, as you haven't tailored it yet.
Thanks as always for your review @debadair. I made several updates based on your suggestions. I also left some responses to better clarify my intent and eventual plans for the Analysis topic. Re: part intros, I like the brief overview + nav structure used in the machine learning docs: That means I'll eventually need to overhaul and relocate the current part intro for Analysis. However, I feel that's a work effort worthy of its own PR.
debadair left a comment:
++ For addressing the content changes separately.
Thanks as always @debadair!
Adds a 'text analysis overview' page to the analysis topic docs.

The goals of this page are:

* Concisely summarize the analysis process while avoiding in-depth concepts, tutorials, or API examples
* Explain why analysis is important, largely through highlighting problems with full-text searches missing analysis
* Highlight how analysis can be used to improve search results
This page is inspired by the machine learning overview pages: