Commit d755fbd

[DOCS] Reformat length token filter docs (#49805)

* Adds a title abbreviation
* Updates the description and adds a Lucene link
* Reformats the parameters section
* Adds analyze, custom analyzer, and custom filter snippets

Relates to #44726.

1 parent 605bd22, commit d755fbd

1 file changed: 165 additions & 11 deletions
@@ -1,16 +1,170 @@
 [[analysis-length-tokenfilter]]
-=== Length Token Filter
+=== Length token filter
+++++
+<titleabbrev>Length</titleabbrev>
+++++
 
-A token filter of type `length` that removes words that are too long or
-too short for the stream.
+Removes tokens shorter or longer than specified character lengths.
+For example, you can use the `length` filter to exclude tokens shorter than 2
+characters and tokens longer than 5 characters.
 
-The following are settings that can be set for a `length` token filter
-type:
+This filter uses Lucene's
+https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html[LengthFilter].
 
-[cols="<,<",options="header",]
-|===========================================================
-|Setting |Description
-|`min` |The minimum number. Defaults to `0`.
-|`max` |The maximum number. Defaults to `Integer.MAX_VALUE`, which is `2^31-1` or 2147483647.
-|===========================================================
+[TIP]
+====
+The `length` filter removes entire tokens. If you'd prefer to shorten tokens to
+a specific length, use the <<analysis-truncate-tokenfilter,`truncate`>> filter.
+====
 
+[[analysis-length-tokenfilter-analyze-ex]]
+==== Example
+
+The following <<indices-analyze,analyze API>> request uses the `length`
+filter to remove tokens longer than 4 characters:
+
+[source,console]
+--------------------------------------------------
+GET _analyze
+{
+  "tokenizer": "whitespace",
+  "filter": [
+    {
+      "type": "length",
+      "min": 0,
+      "max": 4
+    }
+  ],
+  "text": "the quick brown fox jumps over the lazy dog"
+}
+--------------------------------------------------
+
+The filter produces the following tokens:
+
+[source,text]
+--------------------------------------------------
+[ the, fox, over, the, lazy, dog ]
+--------------------------------------------------
+
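The filtering in the request above can be sketched in plain Python (a toy illustration of the behavior, not Elasticsearch code): whitespace tokenization followed by a character-length check.

```python
def length_filter(tokens, min_length=0, max_length=2**31 - 1):
    """Keep only tokens whose character length is within [min_length, max_length]."""
    return [t for t in tokens if min_length <= len(t) <= max_length]

# Whitespace tokenization, as in the analyze request above
tokens = "the quick brown fox jumps over the lazy dog".split()

# min=0, max=4 drops the 5-character tokens: quick, brown, jumps
print(length_filter(tokens, min_length=0, max_length=4))
# ['the', 'fox', 'over', 'the', 'lazy', 'dog']
```

Note that the filter drops whole tokens rather than trimming them, which matches the TIP above about preferring `truncate` when shortening is wanted.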
+/////////////////////
+[source,console-result]
+--------------------------------------------------
+{
+  "tokens": [
+    {
+      "token": "the",
+      "start_offset": 0,
+      "end_offset": 3,
+      "type": "word",
+      "position": 0
+    },
+    {
+      "token": "fox",
+      "start_offset": 16,
+      "end_offset": 19,
+      "type": "word",
+      "position": 3
+    },
+    {
+      "token": "over",
+      "start_offset": 26,
+      "end_offset": 30,
+      "type": "word",
+      "position": 5
+    },
+    {
+      "token": "the",
+      "start_offset": 31,
+      "end_offset": 34,
+      "type": "word",
+      "position": 6
+    },
+    {
+      "token": "lazy",
+      "start_offset": 35,
+      "end_offset": 39,
+      "type": "word",
+      "position": 7
+    },
+    {
+      "token": "dog",
+      "start_offset": 40,
+      "end_offset": 43,
+      "type": "word",
+      "position": 8
+    }
+  ]
+}
+--------------------------------------------------
+/////////////////////
+
+[[analysis-length-tokenfilter-analyzer-ex]]
+==== Add to an analyzer
+
+The following <<indices-create-index,create index API>> request uses the
+`length` filter to configure a new
+<<analysis-custom-analyzer,custom analyzer>>.
+
+[source,console]
+--------------------------------------------------
+PUT length_example
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "standard_length": {
+          "tokenizer": "standard",
+          "filter": [ "length" ]
+        }
+      }
+    }
+  }
+}
+--------------------------------------------------
+
+[[analysis-length-tokenfilter-configure-parms]]
+==== Configurable parameters
+
+`min`::
+(Optional, integer)
+Minimum character length of a token. Shorter tokens are excluded from the
+output. Defaults to `0`.
+
+`max`::
+(Optional, integer)
+Maximum character length of a token. Longer tokens are excluded from the output.
+Defaults to `Integer.MAX_VALUE`, which is `2^31-1` or `2147483647`.
+
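As a quick sanity check of those defaults, a small Python sketch (illustration only, not Elasticsearch code): with `min` at `0` and `max` at `Integer.MAX_VALUE`, no realistic token can fall outside the allowed range, so nothing is removed.

```python
MAX_INT = 2**31 - 1  # Java's Integer.MAX_VALUE, the default `max`

def length_filter(tokens, min_length=0, max_length=MAX_INT):
    # With the defaults, 0 <= len(t) <= 2147483647 always holds
    return [t for t in tokens if min_length <= len(t) <= max_length]

tokens = ["a", "token", "of", "any", "length"]
print(length_filter(tokens))  # defaults exclude nothing
print(MAX_INT)                # 2147483647
```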
+[[analysis-length-tokenfilter-customize]]
+==== Customize
+
+To customize the `length` filter, duplicate it to create the basis
+for a new custom token filter. You can modify the filter using its configurable
+parameters.
+
+For example, the following request creates a custom `length` filter that removes
+tokens shorter than 2 characters and tokens longer than 10 characters:
+
+[source,console]
+--------------------------------------------------
+PUT length_custom_example
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "whitespace_length_2_to_10_char": {
+          "tokenizer": "whitespace",
+          "filter": [ "length_2_to_10_char" ]
+        }
+      },
+      "filter": {
+        "length_2_to_10_char": {
+          "type": "length",
+          "min": 2,
+          "max": 10
+        }
+      }
+    }
+  }
+}
+--------------------------------------------------
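The effect of the `length_2_to_10_char` settings can be mimicked with a short Python sketch (a toy model of the behavior; the sample sentence is a made-up illustration):

```python
def length_filter(tokens, min_length, max_length):
    """Mimics a `length` filter with explicit min/max, like `length_2_to_10_char`."""
    return [t for t in tokens if min_length <= len(t) <= max_length]

# "I" (1 char) and "extraordinarily" (15 chars) fall outside [2, 10]
tokens = "I read an extraordinarily long book".split()
print(length_filter(tokens, min_length=2, max_length=10))
# ['read', 'an', 'long', 'book']
```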
