
Use case: Lexing/Tokenization #445

@kraigher

Description


I am looking at writing a lexer/tokenizer using this crate. It seems RegexSet is not quite what I need, since it matches all regexes in parallel. For a lexer/tokenizer I would want to specify regexes in priority order and avoid matching a lower-priority regex if a higher-priority regex has already matched. Consider the following simple example:

let tokenizer = regex::RegexTokenizer::new([
    "\"(.*?)\"",
    "[a-zA-Z_]+",
    "-?[1-9][0-9]*"]);

// NOTE: My specific use case requires stateful lexing
//       where I need to switch regex based on the previous token,
//       so I do not use an iterator over matches.
let mut start = 0;
while start < code.len() {
    match tokenizer.match(&code[start..]) {
        Some(m) => {
            match m.index {
                0 => println!("string = \"{}\"", m.get(1)),
                1 => println!("identifier = \"{}\"", m.get(0)),
                2 => println!("integer = \"{}\"", m.get(0)),
            }
            start += m.len();
        }
        None => {
            println!("error");
            break;
        }
    }
}

So my question: Is there anything the regex crate can do to make this use case easier and better performing? Currently I build one single normal regex using | (the or operator) and a named group (?P<name>...) for each token type, followed by a long list of if-statements checking captures.name("nameX"). Is there a better way to solve this use case using the regex crate? A sketch of that workaround follows below.
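For reference, a minimal sketch of that workaround with the regex crate. This is not a proposed API: the token patterns, the group names (string, identifier, integer), the sample input, and the whitespace handling are all illustrative assumptions.

use regex::Regex;

fn main() {
    // One alternation, each token type wrapped in a named group.
    // Branches are listed in priority order.
    let re = Regex::new(
        r#"(?P<string>"(?:.*?)")|(?P<identifier>[a-zA-Z_]+)|(?P<integer>-?[1-9][0-9]*)"#,
    )
    .unwrap();

    let code = r#"foo "bar" 42"#; // illustrative input
    let mut start = 0;
    while start < code.len() {
        match re.captures(&code[start..]) {
            Some(caps) => {
                // Overall match; offsets are relative to the searched slice.
                let m = caps.get(0).unwrap();
                if caps.name("string").is_some() {
                    println!("string = {}", m.as_str());
                } else if caps.name("identifier").is_some() {
                    println!("identifier = {}", m.as_str());
                } else if caps.name("integer").is_some() {
                    println!("integer = {}", m.as_str());
                }
                // Advance past the match (this also skips anything before it,
                // e.g. whitespace between tokens).
                start += m.end();
            }
            None => {
                println!("error");
                break;
            }
        }
    }
}

Since alternation in the regex crate prefers earlier branches, listing the patterns in priority order approximates the desired behavior, at the cost of the if-chain over named groups.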
