[DOCS] Add overview page to analysis topic by jrodewig · Pull Request #50515 · elastic/elasticsearch

jrodewig · 2019-12-27T22:57:56Z

Adds a 'text analysis overview' page to the analysis topic docs.

The goals of this page are:

* Concisely summarize the analysis process while avoiding in-depth concepts, tutorials, or API examples
* Explain why analysis is important, largely through highlighting problems with full-text searches missing analysis
* Highlight how analysis can be used to improve search results

This page is inspired by the machine learning overview pages:

elasticmachine · 2019-12-27T22:57:58Z

Pinging @elastic/es-docs (>docs)

elasticmachine · 2019-12-27T22:57:59Z

Pinging @elastic/es-search (:Search/Analysis)

debadair

Left some editorial suggestions. I think we need to make a call WRT what goes in the part intros vs subsections.

debadair · 2020-01-07T00:47:08Z

docs/reference/analysis/overview.asciidoc

@@ -0,0 +1,75 @@
+


It feels like the "overview" needs to come before the discussion and examples of index time vs search time analysis.

Do we want to follow the pattern of having separate overview sections for parts, or incorporate the overview into the part intros? It feels like we either want the part intros to be brief and contain the part nav (like the Anomaly detection topic), or provide the high-level overview for the part. I think we want to avoid having a meaty part intro AND a separate overview section.

+1 to moving index time vs search time analysis after this section. I plan to make that change, but I think it works better as a separate PR to keep the diff size manageable. IMO the index time vs search time analysis section needs some work and could probably be split into a concept section and separate configuration tutorials.

Also agree re: avoiding a meaty intro AND a separate overview. My intent is to mirror the structure in machine learning and only provide a brief overview + nav: https://www.elastic.co/guide/en/machine-learning/master/xpack-ml.html

I plan to approach these changes iteratively, which means leaving the current top-level analysis topic as-is until I open those PRs.

debadair · 2020-01-07T01:56:40Z

docs/reference/analysis.asciidoc

-_tokens_ or _terms_ which are added to the inverted index for searching.
+_Text analysis_ is the process of converting text, like the body of any email,
+into _tokens_ or _terms_ which are added to the inverted index for searching.
 Analysis is performed by an <<analysis-analyzers,_analyzer_>> which can be


This makes it sound like text analysis is only tokenization. It's almost impossible to pack everything into a single sentence. I'd expand on this and incorporate some mention of filters. Maybe something like:

Text analysis applies a combination of character filters, tokenizers, and token filters to transform a string of text into a collection of tokens that can be efficiently indexed and searched. Character filters add, remove, or modify characters in the text before a tokenizer breaks it down into individual tokens. For example, you could use a character filter to remove HTML tags so they aren't indexed. Once the text is tokenized, token filters add, remove, or modify tokens to inject synonyms, remove stopwords, and reduce each remaining token to its most basic form (stem).

At index time, the analyzed tokens are added to the index. At search time, the search terms undergo the same analysis so they can be compared to the indexed tokens.

My goal was just to change "Analysis" to "Text analysis." I feel like this better differentiates the topic from other forms of analysis, like machine learning, etc.

I agree that the definition here needs a refresh, but I'd prefer to do that as part of a separate PR.
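
For readers following this thread, the chain described in the suggested wording above (character filters, then a tokenizer, then token filters) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how {es} implements analysis: the function names, the stopword list, and the toy stemming rule are all hypothetical.

```python
import re

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"the", "a", "an", "over"}


def strip_html(text):
    """Character filter: replace HTML tags with spaces before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: split the text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Token filter: drop tokens found in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(tokens):
    """Token filter: toy stemmer that strips a trailing 'es' or 's'."""
    stemmed = []
    for t in tokens:
        if t.endswith("es") and len(t) > 4:
            t = t[:-2]
        elif t.endswith("s") and len(t) > 3:
            t = t[:-1]
        stemmed.append(t)
    return stemmed


def analyze(text):
    """Run the chain: character filter, tokenizer, then token filters in order."""
    tokens = tokenize(strip_html(text))
    for token_filter in (lowercase, remove_stopwords, naive_stem):
        tokens = token_filter(tokens)
    return tokens


# Index time and search time use the same chain, so the tokens are comparable.
indexed_tokens = analyze("<p>The Quick foxes jumps</p>")
query_tokens = analyze("quick FOX")
```

Because the same `analyze` function runs on both the document and the query, every token in `query_tokens` also appears in `indexed_tokens`, which is the point of the last paragraph of the suggestion.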

debadair · 2020-01-07T02:23:06Z

docs/reference/analysis/overview.asciidoc

+Text analysis enables {es} to perform full-text search, where the search returns
+all _relevant_ results rather than just exact matches.
+
+For example, a full-text search for `Quick fox jumps` should not only return
+documents containing that exact search string but also documents containing
+similar or related words, like  `fast fox` or even `foxes leap`.


Maybe flip it around to something like:

Searching unstructured text and returning the most relevant results requires being able to find more than exact matches. If you search for Quick fox jumps, you probably want the document that contains A quick brown fox jumps over the lazy dog, and you might also want documents that contain related phrases like fast fox or foxes leap. To enable this behavior, ES can perform text analysis during both index and query time.

It feels like this rationale should really come before the definition of text analysis, but if you keep the separate overview, you could follow this with a more detailed description of text analysis as an intro to the tokenization and normalization info.

Reworded this based on your suggestion with 8b6766a.

However, I did remove the bit about index and query time analysis. I don't feel that level of detail is needed at this point. It'll also make it easier to relocate the current index/search time analysis info into their new home.

debadair · 2020-01-07T02:38:28Z

docs/reference/analysis/overview.asciidoc

+[discrete]
+[[analysis-customization]]
+=== Customize text analysis
+


I think it might be helpful to introduce the notion of an analysis chain, and use that to lead into defining what an analyzer is. (Which is essentially the embodiment of the analysis chain.)

It would be nice to illustrate this with a diagram.

My primary goal with this section is to show that users can customize analysis at the granular level. I didn't feel I could do that without a very high-level explanation of an analyzer, but I want to avoid mentioning specific components if possible.

I feel like a rework of "Anatomy of an analyzer" would be a better home for the analysis chain info:
https://www.elastic.co/guide/en/elasticsearch/reference/master/analyzer-anatomy.html

Makes sense!

debadair · 2020-01-07T03:34:55Z

docs/reference/analysis/overview.asciidoc

+[[tokenization]]
+=== Tokenization
+
+Analysis makes full-text search possible by breaking an text down into smaller


I'd put the focus on tokenization over analysis, and provide a bit more of the "why". Something like:

Indexing the contents of a field as individual tokens rather than a single string makes full-text search possible. If you index the phrase the quick brown fox jumps as a single string and the user searches for quick fox, it isn't considered a match. However, if you tokenize the phrase and index each word separately, the terms in the query string can be looked up individually.

Reworded with b476887. I did keep an updated version of the original intro para.
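
The single-string versus tokenized comparison in the suggestion above can be made concrete with a toy inverted index. This is a rough sketch under simplifying assumptions (whitespace tokenization, all query terms required to match); `build_index`, `search`, and `whole_string` are hypothetical names, not {es} APIs.

```python
from collections import defaultdict


def build_index(docs, tokenizer):
    """Map each term produced by the tokenizer to the IDs of docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenizer(text):
            index[term].add(doc_id)
    return index


def search(index, query, tokenizer):
    """Return the IDs of docs that contain every term in the query."""
    terms = tokenizer(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


def whole_string(text):
    """'Tokenizer' that keeps the entire field value as one term."""
    return [text]


docs = {1: "the quick brown fox jumps"}

# Indexed as a single string: the query only matches the exact field value.
string_index = build_index(docs, whole_string)
string_hits = search(string_index, "quick fox", whole_string)

# Indexed as individual words: each query term is looked up separately.
token_index = build_index(docs, str.split)
token_hits = search(token_index, "quick fox", str.split)
```

Here `string_hits` is empty while `token_hits` contains document `1`, mirroring the `quick fox` example in the comment.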

debadair · 2020-01-07T03:40:12Z

docs/reference/analysis/overview.asciidoc

+the text is converted to the tokens `[Quick, brown, fox]`, which can
+be matched by searches for `Quick fox`, `fox brown`, or other variations.
+
+While improved, this example search experience still has a few problems:


I'd be inclined to push the bulk of this transition to the beginning of the following section.

"This example search experience..." is kind of awkward. I'd be inclined to say something like:

Tokenization enables matching on individual terms, but each token is still matched literally. This means that a search for "quick" won't match "Quick", and searching for "foxes" won't match "fox". You can improve matching by normalizing the tokens to a standard form.

I reworded as suggested and moved most of this content into the Normalization section: 8b6766a
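
As a small illustration of the literal-matching problem described above, the following sketch compares matching with and without normalization. It assumes a deliberately crude normalizer (lowercasing plus stripping a trailing `es` or `s`); real stemmers are far more careful, and the names `normalize`, `identity`, and `matches` are hypothetical.

```python
def identity(token):
    """No normalization: tokens are compared exactly as written."""
    return token


def normalize(token):
    """Toy normalizer: lowercase, then strip a trailing 'es' or 's'."""
    token = token.lower()
    if token.endswith("es") and len(token) > 4:
        return token[:-2]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def matches(doc_tokens, query, normalizer):
    """True if every query word matches some document token after normalizing both."""
    indexed = {normalizer(t) for t in doc_tokens}
    return all(normalizer(q) in indexed for q in query.split())


doc_tokens = ["Quick", "brown", "fox"]

literal_quick = matches(doc_tokens, "quick", identity)      # case differs
literal_foxes = matches(doc_tokens, "foxes", identity)      # plural differs
normalized_quick = matches(doc_tokens, "quick", normalize)
normalized_foxes = matches(doc_tokens, "foxes", normalize)
```

With `identity`, neither `quick` nor `foxes` matches the indexed tokens; with `normalize` applied to both sides, both do.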

debadair · 2020-01-07T03:41:47Z

docs/reference/analysis/overview.asciidoc

+<<analysis-standard-analyzer,standard analyzer>>, which works well for most use
+cases right out of the box.
+
+If you want to further tailor your search experience, you can choose a different


"Further tailor" doesn't seem right here, as you haven't tailored it yet.

Removed "further" with 2075e5c.

jrodewig · 2020-01-07T14:39:39Z

Thanks as always for your review @debadair.

I made several updates based on your suggestions. I also left some responses to better clarify my intent and eventual plans for the Analysis topic.

Re: part intros, I like the brief overview + nav structure used in the machine learning docs:
https://www.elastic.co/guide/en/machine-learning/master/xpack-ml.html

That means I'll eventually need to overhaul and relocate the current part intro for Analysis. However, I feel that's a work effort worthy of its own PR.

debadair

++ For addressing the content changes separately.

jrodewig · 2020-01-08T17:35:46Z

Thanks as always @debadair!

jrodewig · 2020-01-08T18:57:09Z

master: 495ce1a
7.x: 9d1567b
7.5: 973b9a2

[DOCS] Add overview page to analysis topic

ef62ff7

jrodewig added the >docs (General docs changes) and :Search Relevance/Analysis (How text is split into tokens) labels Dec 27, 2019

jrodewig requested a review from debadair December 27, 2019 22:57

kat257 mentioned this pull request Jan 6, 2020

[DOCS] Reorganize, rewrite and add examples to analysis topics #44726

Closed

82 tasks

debadair suggested changes Jan 7, 2020

jrodewig added 3 commits January 7, 2020 08:08

Reword intro para

8b6766a

Tokenization rework

b476887

Remove further

2075e5c

jrodewig requested a review from debadair January 7, 2020 15:49

debadair approved these changes Jan 8, 2020

jrodewig added v7.5.2 v7.6.0 v8.0.0 labels Jan 8, 2020

jrodewig merged commit 495ce1a into elastic:master Jan 8, 2020

jrodewig deleted the analysis-overview branch January 8, 2020 18:53

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
