Analysis: Pattern Tokenizer

The pattern tokenizer lets you define a tokenizer that uses a regular expression to break text into tokens. The `pattern` parameter accepts the regular expression, and `flags` accepts the common ES level regex flags.

It also accepts `group` (defaults to -1). From the docs:

group=-1 (the default) is equivalent to "split".  In this case, the tokens will be equivalent to the output from String#split(java.lang.String) (without empty tokens).

Using group >= 0 selects the matching group as the token.  For example, if you have:

```
pattern = \'([^\']+)\'
group = 0

input = aaa 'bbb' 'ccc'
```

the output will be two tokens: 'bbb' and 'ccc' (including the ' marks).  With the same input but using group=1, the output would be: bbb and ccc (no ' marks).
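To make the two modes concrete, here is a minimal sketch of the described semantics. It is written in Python with the standard `re` module, not taken from Elasticsearch or Lucene; `pattern_tokenize` is a hypothetical helper, and the `\s+` pattern in the last call is only an illustration.

```python
import re

def pattern_tokenize(pattern, text, group=-1):
    """Hypothetical helper emulating the described `pattern`/`group` semantics."""
    regex = re.compile(pattern)
    if group < 0:
        # Split mode: keep the text between matches, dropping empty tokens.
        pieces, last = [], 0
        for match in regex.finditer(text):
            pieces.append(text[last:match.start()])
            last = match.end()
        pieces.append(text[last:])
        return [piece for piece in pieces if piece]
    # Group mode: each match contributes the selected group as one token.
    return [m.group(group) for m in regex.finditer(text) if m.group(group)]

text = "aaa 'bbb' 'ccc'"
quoted = r"'([^']+)'"

print(pattern_tokenize(quoted, text, group=0))  # whole match, quotes included
print(pattern_tokenize(quoted, text, group=1))  # first capture group, no quotes
print(pattern_tokenize(r"\s+", text))           # group=-1: split on whitespace
```

In this sketch, group 0 is the whole match, so the quote marks stay in the token, while group 1 is only the text captured inside them.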


Analysis: Pattern Tokenizer #928