Conversation
This enables RegexSets to short-circuit when: 1. All patterns are anchored to the beginning of the input. 2. All patterns have either matched or will never match. We make this happen by checking whether all NFA states in a DFA state are match states, when a DFA match is observed. If all NFA states are match states, and since all match states are final states, we know that the current set of matches will never change. Since we don't care about reporting location information, we can quit. N.B. If no matches can be found, then the DFA will short circuit using its normal mechanism.
|
Thanks for the note! You'll be pleased to know that for my use case, the timings dropped from to a nice Alas, there's still no speedup compared to sequentially matching all expressions, which clocks in at This is possibly quite dependent on the individual lexer (the number of regexes per state, for example), so I'll definitely continue to try RegexSet with different configurations as I add them. |
|
That is great news that And you're right, it is dependent. In particular, on the number of mismatches before finding a match. For example, if you order your regexes by most likely to match to least likely, then depending on the distribution, one could definitely get comparable performance. In any case, thank you so much for testing out this PR and confirming that it is, in fact, fixed. :-) |
This enables RegexSets to short-circuit when:
We make this happen by checking whether all NFA states in a DFA state
are match states, when a DFA match is observed. If all NFA states are
match states, and since all match states are final states, we know that
the current set of matches will never change. Since we don't care about
reporting location information, we can quit.
N.B. If no matches can be found, then the DFA will short circuit using its
normal mechanism.