Analysis: Pattern Tokenizer #928

@kimchy

The pattern tokenizer lets you define a tokenizer that uses a regular expression to break text into tokens. The pattern parameter accepts the regular expression, and the flags parameter accepts the common ES-level regex flags.

It also accepts group (defaults to -1). From the docs:

group=-1 (the default) is equivalent to "split". In this case, the tokens will be equivalent to the output from String#split(java.lang.String) (without empty tokens).
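To make the "split" behaviour concrete, here is a minimal sketch using only java.util.regex. It is not the tokenizer's actual implementation; the class name SplitModeSketch and the helper splitTokens are hypothetical, and the sample pattern and input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitModeSketch {

    // Hypothetical helper (not part of the tokenizer's API): emulates
    // group = -1 by splitting the input on the pattern and dropping
    // any empty strings the split produces.
    static List<String> splitTokens(String regex, String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : Pattern.compile(regex).split(input)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The doubled comma yields an empty string, which is discarded.
        System.out.println(splitTokens(",", "a,,b,c")); // prints [a, b, c]
    }
}
```

Here the pattern describes the separators rather than the tokens, which is the opposite of the group >= 0 case.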

Using group >= 0 selects the matching group as the token. For example, if you have:

pattern = \'([^\']+)\'
group = 0

input = aaa 'bbb' 'ccc'

the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks).
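The group example above can be reproduced with plain java.util.regex. This is a sketch under assumptions, not the tokenizer's real code: the class name GroupModeSketch and the helper groupTokens are hypothetical, and skipping empty or unmatched groups is an assumption made here for safety.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupModeSketch {

    // Hypothetical helper (not part of the tokenizer's API): emulates
    // group >= 0 by emitting the chosen capture group of every match.
    static List<String> groupTokens(String regex, int group, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            String token = matcher.group(group);
            // Assumption: groups that did not participate or are empty are skipped.
            if (token != null && !token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String regex = "'([^']+)'";
        String input = "aaa 'bbb' 'ccc'";
        System.out.println(groupTokens(regex, 0, input)); // prints ['bbb', 'ccc']
        System.out.println(groupTokens(regex, 1, input)); // prints [bbb, ccc]
    }
}
```

Group 0 is the whole match, so the quote marks are kept; group 1 is the first parenthesised sub-expression, so they are dropped. The unquoted aaa never matches the pattern and produces no token in either case.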
