[[analysis-keep-types-tokenfilter]]
=== Keep types token filter
++++
<titleabbrev>Keep types</titleabbrev>
++++

Keeps or removes tokens of a specific type. For example, you can use this filter
to change `3 quick foxes` to `quick foxes` by keeping only `<ALPHANUM>`
(alphanumeric) tokens.

[NOTE]
.Token types
====
Token types are set by the <<analysis-tokenizers,tokenizer>> when converting
characters to tokens. Token types can vary between tokenizers.

For example, the <<analysis-standard-tokenizer,`standard`>> tokenizer can
produce a variety of token types, including `<ALPHANUM>`, `<HANGUL>`, and
`<NUM>`. Simpler tokenizers, like the
<<analysis-lowercase-tokenizer,`lowercase`>> tokenizer, only produce the `word`
token type.

Certain token filters can also add token types. For example, the
<<analysis-synonym-tokenfilter,`synonym`>> filter can add the `<SYNONYM>` token
type.
====
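
To check which token types a tokenizer assigns, you can run an
<<indices-analyze,analyze API>> request and inspect the `type` attribute of each
token in the response. The sample text below is purely illustrative:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "text": "quick 1 fox"
}
--------------------------------------------------

In the response, `quick` and `fox` have the token type `<ALPHANUM>`, while `1`
has the token type `<NUM>`.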

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/core/TypeTokenFilter.html[TypeTokenFilter].

[[analysis-keep-types-tokenfilter-analyze-include-ex]]
==== Include example

The following <<indices-analyze,analyze API>> request uses the `keep_types`
filter to keep only `<NUM>` (numeric) tokens from `1 quick fox 2 lazy dogs`.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<NUM>" ]
    }
  ],
  "text": "1 quick fox 2 lazy dogs"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ 1, 2 ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "1",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 12,
      "end_offset": 13,
      "type": "<NUM>",
      "position": 3
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-keep-types-tokenfilter-analyze-exclude-ex]]
==== Exclude example

The following <<indices-analyze,analyze API>> request uses the `keep_types`
filter to remove `<NUM>` tokens from `1 quick fox 2 lazy dogs`. Note the `mode`
parameter is set to `exclude`.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<NUM>" ],
      "mode": "exclude"
    }
  ],
  "text": "1 quick fox 2 lazy dogs"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ quick, fox, lazy, dogs ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 2,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "fox",
      "start_offset": 8,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "lazy",
      "start_offset": 14,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "dogs",
      "start_offset": 19,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 5
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-keep-types-tokenfilter-configure-parms]]
==== Configurable parameters

`types`::
(Required, array of strings)
List of token types to keep or remove.

`mode`::
(Optional, string)
Indicates whether to keep or remove the specified token types.
Valid values are:

`include`:::
(Default) Keep only the specified token types.

`exclude`:::
Remove the specified token types.

[[analysis-keep-types-tokenfilter-customize]]
==== Customize and add to an analyzer

To customize the `keep_types` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.

For example, the following <<indices-create-index,create index API>> request
uses a custom `keep_types` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>. The custom `keep_types` filter
keeps only `<ALPHANUM>` (alphanumeric) tokens.

[source,console]
--------------------------------------------------
PUT keep_types_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [ "extract_alpha" ]
        }
      },
      "filter": {
        "extract_alpha": {
          "type": "keep_types",
          "types": [ "<ALPHANUM>" ]
        }
      }
    }
  }
}
--------------------------------------------------
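
To verify the new analyzer, you can run an <<indices-analyze,analyze API>>
request against the index. A request like the following should return only the
alphanumeric tokens `quick` and `foxes`, dropping the `<NUM>` token `3`:

[source,console]
--------------------------------------------------
GET keep_types_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "3 quick foxes"
}
--------------------------------------------------
// TEST[continued]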