[DOCS] Add overview page to analysis topic #50515

Merged
jrodewig merged 4 commits into elastic:master from jrodewig:analysis-overview
Jan 8, 2020

Conversation

@jrodewig
Contributor

@jrodewig jrodewig commented Dec 27, 2019

Adds a 'text analysis overview' page to the analysis topic docs.

The goals of this page are:

  • Concisely summarize the analysis process while avoiding in-depth concepts, tutorials, or API examples
  • Explain why analysis is important, largely through highlighting problems with full-text searches missing analysis
  • Highlight how analysis can be used to improve search results

This page is inspired by the machine learning overview pages:

@jrodewig jrodewig added the labels >docs (General docs changes) and :Search Relevance/Analysis (How text is split into tokens) Dec 27, 2019
@jrodewig jrodewig requested a review from debadair December 27, 2019 22:57
@elasticmachine
Collaborator

Pinging @elastic/es-docs (>docs)

@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Analysis)

Contributor

@debadair debadair left a comment

Left some editorial suggestions. I think we need to make a call WRT what goes in the part intros vs subsections.

@@ -0,0 +1,75 @@

Contributor

It feels like the "overview" needs to come before the discussion and examples of index time vs search time analysis.

Do we want to follow the pattern of having separate overview sections for parts, or incorporate the overview into the part intros? It feels like we either want the part intros to be brief and contain the part nav (like the Anomaly detection topic), or provide the high-level overview for the part. I think we want to avoid having a meaty part intro AND a separate overview section.

Contributor Author

@jrodewig jrodewig Jan 7, 2020

+1 to moving index time vs search time analysis after this section. I plan to make that change, but I think it works better as a separate PR to keep the diff size manageable. IMO the index time vs search time analysis section needs some work and could probably be split into a concept section and separate configuration tutorials.

Also agree re: avoiding a meaty intro AND a separate overview. My intent is to mirror the structure in machine learning and only provide a brief overview + nav: https://www.elastic.co/guide/en/machine-learning/master/xpack-ml.html

I plan to approach these changes iteratively, which means leaving the current top-level analysis topic as-is until I open those PRs.

_tokens_ or _terms_ which are added to the inverted index for searching.
_Text analysis_ is the process of converting text, like the body of any email,
into _tokens_ or _terms_ which are added to the inverted index for searching.
Analysis is performed by an <<analysis-analyzers,_analyzer_>> which can be
Contributor

This makes it sound like text analysis is only tokenization. It's almost impossible to pack everything into a single sentence. I'd expand on this and incorporate some mention of filters. Maybe something like:

Text analysis applies a combination of character filters, tokenizers, and token filters to transform a string of text into a collection of tokens that can be efficiently indexed and searched. Character filters add, remove, or modify characters in the text before a tokenizer breaks it down into individual tokens. For example, you could use a character filter to remove HTML tags so they aren't indexed. Once the text is tokenized, token filters add, remove, or modify tokens to inject synonyms, remove stopwords, and reduce each remaining token to its most basic form (stem).

At index time, the analyzed tokens are added to the index. At search time, the search terms undergo the same analysis so they can be compared to the indexed tokens.
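
To make the chain described above concrete, here is a minimal Python sketch. It is a toy illustration, not how {es} implements analysis: the function names, the stopword list, and the crude suffix-stripping "stemmer" are all assumptions made for this example.

```python
import re

# Hypothetical stopword list, chosen only for this example.
STOPWORDS = {"a", "an", "the", "over"}


def strip_html(text):
    """Character filter: replace HTML tags with spaces before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: break the text into runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Token filter: drop tokens found in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(tokens):
    """Token filter: crude plural stripping, a stand-in for a real stemmer."""
    stemmed = []
    for t in tokens:
        if t.endswith("es") and len(t) > 4:
            stemmed.append(t[:-2])
        elif t.endswith("s") and len(t) > 3:
            stemmed.append(t[:-1])
        else:
            stemmed.append(t)
    return stemmed


def analyze(text):
    """Run the character filter, the tokenizer, then each token filter in order."""
    tokens = tokenize(strip_html(text))
    for token_filter in (lowercase, remove_stopwords, naive_stem):
        tokens = token_filter(tokens)
    return tokens


# The same function is applied to the document and to the search terms.
indexed_tokens = analyze("<p>The Quick foxes</p>")
query_tokens = analyze("quick fox")
```

Because the document text and the search terms go through the same `analyze` function, both end up as the same tokens and can be compared directly.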

Contributor Author

@jrodewig jrodewig Jan 7, 2020

My goal was just to change "Analysis" to "Text analysis." I feel like this better differentiates the topic from other forms of analysis, like machine learning, etc.

I agree that the definition here needs a refresh, but I'd prefer to do that as part of a separate PR.

Comment on lines +7 to +12
Text analysis enables {es} to perform full-text search, where the search returns
all _relevant_ results rather than just exact matches.

For example, a full-text search for `Quick fox jumps` should not only return
documents containing that exact search string but also documents containing
similar or related words, like `fast fox` or even `foxes leap`.
Contributor

Maybe flip it around to something like:

Searching unstructured text and returning the most relevant results requires being able to find more than exact matches. If you search for Quick fox jumps, you probably want the document that contains A quick brown fox jumps over the lazy dog, and you might also want documents that contain related phrases like fast fox or foxes leap. To enable this behavior, ES can perform text analysis during both index and query time.

It feels like this rationale should really come before the definition of text analysis, but if you keep the separate overview, you could follow this with a more detailed description of text analysis as an intro to the tokenization and normalization info.

Contributor Author

@jrodewig jrodewig Jan 7, 2020

Reworded this based on your suggestion with 8b6766a.

However, I did remove the bit about index and query time analysis. I don't feel that level of detail is needed at this point. It'll also make it easier to relocate the current index/search time analysis info into their new home.

[discrete]
[[analysis-customization]]
=== Customize text analysis

Contributor

I think it might be helpful to introduce the notion of an analysis chain, and use that to lead into defining what an analyzer is. (Which is essentially the embodiment of the analysis chain.)

It would be nice to illustrate this with a diagram.

Contributor Author

My primary goal with this section is to show that users can customize analysis at the granular level. I didn't feel I could do that without a very high-level explanation of an analyzer, but I want to avoid mentioning specific components if possible.

I feel like a rework of "Anatomy of an analyzer" would be a better home for the analysis chain info:
https://www.elastic.co/guide/en/elasticsearch/reference/master/analyzer-anatomy.html

Contributor

Makes sense!

[[tokenization]]
=== Tokenization

Analysis makes full-text search possible by breaking an text down into smaller
Contributor

I'd put the focus on tokenization over analysis, and provide a bit more of the "why". Something like:

Indexing the contents of a field as individual tokens rather than a single string makes full-text search possible. If you index the phrase the quick brown fox jumps as a single string and the user searches for quick fox, it isn't considered a match. However, if you tokenize the phrase and index each word separately, the terms in the query string can be looked up individually.
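
A small Python sketch of that contrast, under the assumption of a toy whitespace tokenizer and an in-memory dictionary standing in for the inverted index (the names `exact_index`, `inverted`, and `search_tokens` are made up for this example):

```python
from collections import defaultdict

docs = {1: "the quick brown fox jumps"}

# Indexing the field as a single string: only the full phrase can match.
exact_index = {text: doc_id for doc_id, text in docs.items()}

# Indexing each word separately: a toy inverted index from token to doc ids.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        inverted[token].add(doc_id)


def search_tokens(query):
    """Return the ids of documents that contain every term in the query."""
    postings = [inverted.get(term, set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()


single_string_hit = "quick fox" in exact_index  # no match on the whole string
tokenized_hits = search_tokens("quick fox")     # each term is looked up on its own
```

With the single-string index, `quick fox` finds nothing because it is not the full phrase. With the tokenized index, each query term is looked up individually and document 1 is returned.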

Contributor Author

Reworded with b476887. I did keep an updated version of the original intro para.

the text is converted to the tokens `[Quick, brown, fox]`, which can
be matched by searches for `Quick fox`, `fox brown`, or other variations.

While improved, this example search experience still has a few problems:
Contributor

I'd be inclined to push the bulk of this transition to the beginning of the following section.

"This example search experience..." is kind of awkward. I'd be inclined to say something like:

Tokenization enables matching on individual terms, but each token is still matched literally. This means that a search for "quick" won't match "Quick", and searching for "foxes" won't match "fox". You can improve matching by normalizing the tokens to a standard form.
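
The literal-matching problem can be sketched in a few lines of Python. The `normalize` helper below is a hypothetical stand-in (lowercasing plus a crude plural strip), not a real stemmer and not what any {es} analyzer does internally:

```python
indexed_tokens = ["Quick", "brown", "fox"]


def literal_match(term, tokens):
    """Match only when the term is identical to an indexed token."""
    return term in tokens


def normalize(token):
    """Hypothetical normalizer: lowercase, then strip a trailing 'es'."""
    token = token.lower()
    if token.endswith("es") and len(token) > 4:
        return token[:-2]
    return token


def normalized_match(term, tokens):
    """Normalize both the search term and the indexed tokens before comparing."""
    return normalize(term) in {normalize(t) for t in tokens}


literal_quick = literal_match("quick", indexed_tokens)
normalized_quick = normalized_match("quick", indexed_tokens)
normalized_foxes = normalized_match("foxes", indexed_tokens)
```

Literal matching misses `quick` because the indexed token is `Quick`. Once both sides are normalized, `quick` and `foxes` both match.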

Contributor Author

I reworded as suggested and moved most of this content into the Normalization section: 8b6766a

<<analysis-standard-analyzer,standard analyzer>>, which works well for most use
cases right out of the box.

If you want to further tailor your search experience, you can choose a different
Contributor

"Further tailor" doesn't seem right here, as you haven't tailored it yet.

Contributor Author

Removed "further" with 2075e5c.

@jrodewig
Contributor Author

jrodewig commented Jan 7, 2020

Thanks as always for your review @debadair.

I made several updates based on your suggestions. I also left some responses to better clarify my intent and eventual plans for the Analysis topic.

Re: part intros, I like the brief overview + nav structure used in the machine learning docs:
https://www.elastic.co/guide/en/machine-learning/master/xpack-ml.html

That means I'll eventually need to overhaul and relocate the current part intro for Analysis. However, I feel that's a work effort worthy of its own PR.

@jrodewig jrodewig requested a review from debadair January 7, 2020 15:49
Contributor

@debadair debadair left a comment

++ For addressing the content changes separately.

@jrodewig
Contributor Author

jrodewig commented Jan 8, 2020

Thanks as always @debadair!

@jrodewig jrodewig merged commit 495ce1a into elastic:master Jan 8, 2020
@jrodewig jrodewig deleted the analysis-overview branch January 8, 2020 18:53
jrodewig added a commit that referenced this pull request Jan 8, 2020
Adds a 'text analysis overview' page to the analysis topic docs.

The goals of this page are:

* Concisely summarize the analysis process while avoiding in-depth concepts, tutorials, or API examples
* Explain why analysis is important, largely through highlighting problems with full-text searches missing analysis
* Highlight how analysis can be used to improve search results
@jrodewig
Contributor Author

jrodewig commented Jan 8, 2020

master: 495ce1a
7.x: 9d1567b
7.5: 973b9a2

SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
Adds a 'text analysis overview' page to the analysis topic docs.

The goals of this page are:

* Concisely summarize the analysis process while avoiding in-depth concepts, tutorials, or API examples
* Explain why analysis is important, largely through highlighting problems with full-text searches missing analysis
* Highlight how analysis can be used to improve search results
Labels

>docs General docs changes :Search Relevance/Analysis How text is split into tokens v7.5.2 v7.6.0 v8.0.0-alpha1

4 participants