Please refer to the following gist for the test code: https://gist.github.com/birkenfeld/ffc2f0da8b93da33f9fbd5a08a8790f3
In short, I'm experimenting with porting Pygments to Rust. The main task is implementing a scanner/lexer driven by a list of regexes for each scanner state. The scanner is an Iterator; on each iteration it goes through the regex list in order and tries to match each regex against the remaining text (all regexes are anchored at the beginning with \A; in Python, re.match() is used). When a match is found, the remaining string is advanced past it, and on the next iteration the search starts again at the beginning of the list. The scanner uses captures() instead of find() because I want to assign different token types to subgroups within a single match, without having to introduce extra states for that.
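For reference, the Python side of this scanner loop looks roughly like the following sketch (the state table, token names, and rules here are made up for illustration; they are not the actual Pygments lexer definitions):

```python
import re

# Hypothetical state table: each state maps to an ordered list of
# (pattern, token-or-group-tokens, next-state) rules.  All patterns are
# effectively anchored because re.match() only matches at the given position.
RULES = {
    "root": [
        (re.compile(r"(\w+)(=)"), ("name", "operator"), "value"),
        (re.compile(r"\s+"), ("whitespace",), "root"),
    ],
    "value": [
        (re.compile(r"\d+"), ("number",), "root"),
    ],
}

def scan(text):
    state = "root"
    pos = 0
    while pos < len(text):
        for pattern, tokens, next_state in RULES[state]:
            m = pattern.match(text, pos)  # anchored match, like \A on a slice
            if not m:
                continue
            if len(tokens) == 1:
                yield tokens[0], m.group()
            else:
                # the captures() use case: assign a token type per subgroup
                for tok, group in zip(tokens, m.groups()):
                    yield tok, group
            pos = m.end()
            state = next_state
            break  # restart at the top of the rule list
        else:
            yield "error", text[pos]
            pos += 1

print(list(scan("x=42")))
```

On each match the position is advanced and the rule list is retried from the top, which is the loop structure the Rust port mirrors with captures().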
Now, I've noticed that making the haystack longer, e.g. by appending copies of the test input, makes the matching slower and slower. You can see this in the benchmark output in the gist: each timing is for scanning the same string, but as a slice of a haystack that is 1x, 2x, etc. the base string. The timings should therefore stay constant (and they do when using .find()).
Since the patterns are anchored at the start of the haystack, and are structured like the ones in the example (i.e. no pattern has to scan through to the end of the haystack), it seems to me that the matching time should not depend on the haystack length. Is this an optimization problem (e.g. searching the whole haystack for literals without applying the constraint imposed by the \A assertion)?
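To make the expectation concrete, here is a small Python analogue (illustrative only; the pattern is invented and this does not reproduce the Rust slowdown, it just shows what anchored semantics imply):

```python
import re

# re.match() is implicitly anchored at the start, like \A in the regex crate.
pat = re.compile(r"[a-z]+=\d+")

base = "width=100 rest of the input " * 50

for factor in (1, 2, 4, 8):
    haystack = base * factor  # longer haystack, identical prefix
    m = pat.match(haystack)
    # The anchored match is fully determined by the prefix, so the result
    # (and ideally the work done) is independent of the total length.
    print(factor, m.group())
```

Every iteration prints the same match regardless of how long the haystack is, which is why I would expect the cost per anchored match to be constant as well.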
Sorry for the long issue and code; I hope it's understandable.