[Docs] Clarify Whitespace tokenizer behaviour with tokens longer than 255 characters #26641
Closed
Labels
:Search Relevance/Analysis · >docs · Team:Search Relevance · help wanted · adoptme
Description
Currently the Whitespace tokenizer splits tokens longer than 255 characters into separate tokens by default. This is surprising to some users (see #26601 for an example of why this can be confusing). We should document this better (other tokenizers like the Standard tokenizer already explain this in the docs where the max_token_length parameter is described). Maybe we should also check that other tokenizers exhibiting this behaviour have a small note in the docs.
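
For illustration, a minimal sketch of how to reproduce the behaviour via the `_analyze` API, assuming a local Elasticsearch instance on `localhost:9200` and the Python `requests` library (not part of this issue, just an example):

```python
import requests

# A single 300-character "word" with no whitespace in it.
long_token = "a" * 300

# Run it through the whitespace tokenizer using the _analyze API.
resp = requests.post(
    "http://localhost:9200/_analyze",
    json={"tokenizer": "whitespace", "text": long_token},
)
resp.raise_for_status()

# Print the length and offsets of each emitted token.
for token in resp.json()["tokens"]:
    print(len(token["token"]), token["start_offset"], token["end_offset"])

# Expected: the 300-character input comes back as two tokens,
# a 255-character token followed by a 45-character token,
# which is the splitting behaviour this issue asks us to document.
```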