
Use case: Lexing/Tokenization #445

@kraigher

Description


I am looking at writing a lexer/tokenizer using this crate. It seems RegexSet is not quite what I need, since it matches all regexes in parallel. For a lexer/tokenizer I would want to specify regexes in priority order and avoid matching a lower-priority regex if a higher-priority regex has already matched. Consider the following simple example:

let tokenizer = regex::RegexTokenizer::new([
    "\"(.*?)\"",
    "[a-zA-Z_]+",
    "-?[1-9][0-9]*"]);

// NOTE: My specific use case requires stateful lexing
//       where I need to switch regex based on the previous token,
//       so I do not use an iterator over matches.
let mut start = 0;
while start < code.len() {
    match tokenizer.match(&code[start..]) {
        Some(m) => {
            match m.index {
                0 => println!("string = \"{}\"", m.get(1)),
                1 => println!("identifier = \"{}\"", m.get(0)),
                2 => println!("integer = \"{}\"", m.get(0)),
            }
            start += m.len();
        }
        None => {
            println!("error");
            break;
        }
    }
}

So my question: Is there anything the regex crate can do to make this use case easier and better performing? Currently I build one single normal regex using | (the or operator) and a named group (?P<name>...) for each token type, followed by a long list of if-statements checking captures.name("nameX"). Is there a better way to solve this use case using the regex crate? A sketch of that workaround follows below.
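For reference, a minimal sketch of that workaround with the regex crate. This is not a proposed API: the token patterns, the group names (string, identifier, integer), the sample input, and the whitespace handling are all illustrative assumptions.

use regex::Regex;

fn main() {
    // One alternation, each token type wrapped in a named group.
    // Branches are listed in priority order.
    let re = Regex::new(
        r#"(?P<string>"(?:.*?)")|(?P<identifier>[a-zA-Z_]+)|(?P<integer>-?[1-9][0-9]*)"#,
    )
    .unwrap();

    let code = r#"foo "bar" 42"#; // illustrative input
    let mut start = 0;
    while start < code.len() {
        match re.captures(&code[start..]) {
            Some(caps) => {
                // Overall match; offsets are relative to the searched slice.
                let m = caps.get(0).unwrap();
                if caps.name("string").is_some() {
                    println!("string = {}", m.as_str());
                } else if caps.name("identifier").is_some() {
                    println!("identifier = {}", m.as_str());
                } else if caps.name("integer").is_some() {
                    println!("integer = {}", m.as_str());
                }
                // Advance past the match (this also skips anything before it,
                // e.g. whitespace between tokens).
                start += m.end();
            }
            None => {
                println!("error");
                break;
            }
        }
    }
}

Since alternation in the regex crate prefers earlier branches, listing the patterns in priority order approximates the desired behavior, at the cost of the if-chain over named groups.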
