Skip to content

Finding matches with capturing subgroups anchored at beginning depends on length of haystack? #215

@birkenfeld

Description

@birkenfeld

Please refer to the following gist for the test code: https://gist.github.com/birkenfeld/ffc2f0da8b93da33f9fbd5a08a8790f3

In short, I'm experimenting with porting Pygments to Rust. The main task is implementing a scanner/lexer using lists of regexes for each state of the scanner. The scanner is an Iterator, and on each iteration, it goes through the regex list in order and tries to match every regex against the remaining text (all regexes are anchored at the beginning with \A, in Python re.match() is used). When a match is found, the remaining string is updated, and the search begins at the beginning of the list on the next iteration. The scanner uses captures() instead of find() because I want to use subgroups to assign different token types to them within one match, without having to use states for that.

Now, I've noticed that making the haystack longer, e.g. by appending copies of the test input, makes the matching slower and slower. You can see that in the benchmark output in the gist, where the timing is for scanning through the same string every time, but within a slice that is 1x, 2x, etc. this base string. I.e. the timings should stay constant (and they do when using .find()).

It seems strange to me that when the patterns are anchored at the start of the haystack, and are structured like the ones in the example (i.e. there are no patterns that have to scan through to the end of the haystack) that the time to match should not depend on the haystack length. Is this a problem of optimization (e.g. looking in the whole haystack for literals without applying the special constraints due to the \A assertion)?

Sorry for the long issue and code, hope it's understandable.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions