Commit ade72b9

[DOCS] Reformat keep types and keep words token filter docs (#49604)

* Adds title abbreviations
* Updates the descriptions and adds Lucene links
* Reformats parameter definitions
* Adds analyze and custom analyzer snippets
* Adds explanations of token types to keep types token filter and tokenizer docs

1 parent: 86a40f6
3 files changed: 283 additions & 119 deletions
This file: 145 additions & 80 deletions
[[analysis-keep-types-tokenfilter]]
=== Keep types token filter
++++
<titleabbrev>Keep types</titleabbrev>
++++

Keeps or removes tokens of a specific type. For example, you can use this filter
to change `3 quick foxes` to `quick foxes` by keeping only `<ALPHANUM>`
(alphanumeric) tokens.

[NOTE]
.Token types
====
Token types are set by the <<analysis-tokenizers,tokenizer>> when converting
characters to tokens. Token types can vary between tokenizers.

For example, the <<analysis-standard-tokenizer,`standard`>> tokenizer can
produce a variety of token types, including `<ALPHANUM>`, `<HANGUL>`, and
`<NUM>`. Simpler tokenizers, like the
<<analysis-lowercase-tokenizer,`lowercase`>> tokenizer, only produce the `word`
token type.

Certain token filters can also add token types. For example, the
<<analysis-synonym-tokenfilter,`synonym`>> filter can add the `<SYNONYM>` token
type.
====

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/core/TypeTokenFilter.html[TypeTokenFilter].
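To see which types a tokenizer assigns, you can pass text through the
<<indices-analyze,analyze API>> with no token filters. The following request is
an illustrative sketch; the `type` field of each token in the response shows
the assigned token type:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "text": "3 quick foxes"
}
--------------------------------------------------

For `3 quick foxes`, the `standard` tokenizer assigns `3` the `<NUM>` type and
`quick` and `foxes` the `<ALPHANUM>` type.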
[[analysis-keep-types-tokenfilter-analyze-include-ex]]
==== Include example

The following <<indices-analyze,analyze API>> request uses the `keep_types`
filter to keep only `<NUM>` (numeric) tokens from `1 quick fox 2 lazy dogs`.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<NUM>" ]
    }
  ],
  "text": "1 quick fox 2 lazy dogs"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ 1, 2 ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "1",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 12,
      "end_offset": 13,
      "type": "<NUM>",
      "position": 3
    }
  ]
}
--------------------------------------------------
/////////////////////
[[analysis-keep-types-tokenfilter-analyze-exclude-ex]]
==== Exclude example

The following <<indices-analyze,analyze API>> request uses the `keep_types`
filter to remove `<NUM>` tokens from `1 quick fox 2 lazy dogs`. Note the `mode`
parameter is set to `exclude`.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<NUM>" ],
      "mode": "exclude"
    }
  ],
  "text": "1 quick fox 2 lazy dogs"
}
--------------------------------------------------
The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ quick, fox, lazy, dogs ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 2,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "fox",
      "start_offset": 8,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "lazy",
      "start_offset": 14,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "dogs",
      "start_offset": 19,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 5
    }
  ]
}
--------------------------------------------------
/////////////////////
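The include and exclude behavior shown above can be sketched in a few lines of
Python. This is an illustrative simulation of the filter's semantics, not the
Lucene implementation; the `keep_types` function name and the `(text, type)`
token tuples are assumptions made for the sketch.

```python
# Illustrative sketch of keep_types semantics (not the Lucene implementation).
# Each token is a (text, type) pair, as a tokenizer would produce.

def keep_types(tokens, types, mode="include"):
    """Keep tokens whose type is in `types` (include) or drop them (exclude)."""
    wanted = set(types)
    if mode == "include":
        return [tok for tok in tokens if tok[1] in wanted]
    if mode == "exclude":
        return [tok for tok in tokens if tok[1] not in wanted]
    raise ValueError("mode must be 'include' or 'exclude'")

# Tokens the standard tokenizer would produce for "1 quick fox 2 lazy dogs"
tokens = [("1", "<NUM>"), ("quick", "<ALPHANUM>"), ("fox", "<ALPHANUM>"),
          ("2", "<NUM>"), ("lazy", "<ALPHANUM>"), ("dogs", "<ALPHANUM>")]

print(keep_types(tokens, ["<NUM>"]))                  # [('1', '<NUM>'), ('2', '<NUM>')]
print(keep_types(tokens, ["<NUM>"], mode="exclude"))  # quick, fox, lazy, dogs
```

As in the filter itself, `mode` defaults to `include`, so passing only `types`
keeps the listed token types and removes everything else.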
[[analysis-keep-types-tokenfilter-configure-parms]]
==== Configurable parameters

`types`::
(Required, array of strings)
List of token types to keep or remove.

`mode`::
(Optional, string)
Indicates whether to keep or remove the specified token types.
Valid values are:

`include`:::
(Default) Keep only the specified token types.

`exclude`:::
Remove the specified token types.
[[analysis-keep-types-tokenfilter-customize]]
==== Customize and add to an analyzer

To customize the `keep_types` filter, duplicate it to create the basis for a
new custom token filter. You can modify the filter using its configurable
parameters.

For example, the following <<indices-create-index,create index API>> request
uses a custom `keep_types` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>. The custom `keep_types` filter
keeps only `<ALPHANUM>` (alphanumeric) tokens.

[source,console]
--------------------------------------------------
PUT keep_types_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [ "extract_alpha" ]
        }
      },
      "filter": {
        "extract_alpha": {
          "type": "keep_types",
          "types": [ "<ALPHANUM>" ]
        }
      }
    }
  }
}
--------------------------------------------------
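Once the index exists, you can try out the custom analyzer with the
<<indices-analyze,analyze API>>. This follow-up request is a sketch; the index
and analyzer names come from the example above:

[source,console]
--------------------------------------------------
GET keep_types_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "3 quick foxes"
}
--------------------------------------------------

Because the custom filter keeps only `<ALPHANUM>` tokens, only `quick` and
`foxes` should appear in the response.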
