Please refer to the following gist for the test code: https://gist.github.com/birkenfeld/ffc2f0da8b93da33f9fbd5a08a8790f3
In short, I'm experimenting with porting Pygments to Rust. The main task is implementing a scanner/lexer driven by a list of regexes for each scanner state. The scanner is an Iterator; on each iteration it goes through the regex list in order and tries to match each regex against the remaining text (all regexes are anchored at the beginning with \A; in Python, re.match() is used). When a match is found, the remaining string is advanced past it, and on the next iteration the search starts again at the beginning of the list. The scanner uses captures() instead of find() because I want to assign different token types to subgroups within a single match, without having to introduce extra states for that.
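For reference, the Python side of this scanner loop looks roughly like the following sketch (the state table, token names, and rules here are made up for illustration; they are not the actual Pygments lexer definitions):

```python
import re

# Hypothetical state table: each state maps to an ordered list of
# (pattern, token-or-group-tokens, next-state) rules.  All patterns are
# effectively anchored because re.match() only matches at the given position.
RULES = {
    "root": [
        (re.compile(r"(\w+)(=)"), ("name", "operator"), "value"),
        (re.compile(r"\s+"), ("whitespace",), "root"),
    ],
    "value": [
        (re.compile(r"\d+"), ("number",), "root"),
    ],
}

def scan(text):
    state = "root"
    pos = 0
    while pos < len(text):
        for pattern, tokens, next_state in RULES[state]:
            m = pattern.match(text, pos)  # anchored match, like \A on a slice
            if not m:
                continue
            if len(tokens) == 1:
                yield tokens[0], m.group()
            else:
                # the captures() use case: assign a token type per subgroup
                for tok, group in zip(tokens, m.groups()):
                    yield tok, group
            pos = m.end()
            state = next_state
            break  # restart at the top of the rule list
        else:
            yield "error", text[pos]
            pos += 1

print(list(scan("x=42")))
```

On each match the position is advanced and the rule list is retried from the top, which is the loop structure the Rust port mirrors with captures().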
Now, I've noticed that making the haystack longer, e.g. by appending copies of the test input, makes the matching slower and slower. You can see this in the benchmark output in the gist: each timing is for scanning the same string, but as a slice of a haystack that is 1x, 2x, etc. the base string. The timings should therefore stay constant (and they do when using .find()).
Since the patterns are anchored at the start of the haystack, and are structured like the ones in the example (i.e. no pattern has to scan through to the end of the haystack), it seems to me that the matching time should not depend on the haystack length. Is this an optimization problem (e.g. searching the whole haystack for literals without applying the constraint imposed by the \A assertion)?
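To make the expectation concrete, here is a small Python analogue (illustrative only; the pattern is invented and this does not reproduce the Rust slowdown, it just shows what anchored semantics imply):

```python
import re

# re.match() is implicitly anchored at the start, like \A in the regex crate.
pat = re.compile(r"[a-z]+=\d+")

base = "width=100 rest of the input " * 50

for factor in (1, 2, 4, 8):
    haystack = base * factor  # longer haystack, identical prefix
    m = pat.match(haystack)
    # The anchored match is fully determined by the prefix, so the result
    # (and ideally the work done) is independent of the total length.
    print(factor, m.group())
```

Every iteration prints the same match regardless of how long the haystack is, which is why I would expect the cost per anchored match to be constant as well.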
Sorry for the long issue and code; I hope it's understandable.