Merged
Conversation
Typically, when a DFA blows up in size, it happens for two reasons: 1. It accumulates many states. 2. Each state accumulates more and more NFA states. Our previous approximation for the size of the DFA accounted for (1) but used a constant for the size of (2). This can turn out to result in very large differences (in the MBs) between the approximate and actual size of the DFA. Since computing the actual size is expensive, we compute it as a sum as states are added. The end result is that we more stringently respect the memory set by the caller.
The specific problem here is that our literal search doesn't know about anchors, so it will try to search all of the detected literals in a regex. In a regex like `a|^b`, the literal `b` should only be searched for at the beginning of the haystack and in no other place. The right way to fix this is probably to make the literal detector smarter, but the literal detector is already too complex. Instead, this commit detects whether a regex is partially anchored (that is, when the regex has at least one matchable sub-expression that is anchored), and if so, disables the literal engine. Note that this doesn't disable all literal optimizations, just the optimization that opts out of regex engines entirely. Both the DFA and the NFA will still use literal prefixes to search. Namely, if it searches and finds a literal that needs to be anchored but isn't in the haystack, then the regex engine rules it out as a false positive. Fixes rust-lang#268.
If the caller asks for captures, and the DFA runs, and there's a match, and there are actually captures in the regex, then the haystack sent to the NFA is shortened to correspond to only the match plus some room at the end for matching zero-width assertions. This "room at the end" needs to be big enough to at least fit an UTF-8 encoded Unicode codepoint. Fixes rust-lang#264.
Docopt uses lazy_static! 2.x, but lazy_static required a new minimum Rust version in 2.1.
This was referenced Oct 13, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This fixes both #264 and #268. It also fixes an unreported bug where the DFA cache size could grow (a lot) bigger than the bound set by the caller.