Fix stack exhaustion when parsing large regex alternating sequences.#58254
Fix stack exhaustion when parsing large regex alternating sequences.#58254CyrusNajmabadi merged 11 commits intodotnet:mainfrom
Conversation
| if (node is RegexAlternationNode alternationNode) | ||
| { | ||
| return AlternationToElement(alternationNode, alternationNode.SequenceList.NodesAndTokens.Length); | ||
| } |
There was a problem hiding this comment.
this change in the test harness was so that we could keep all our current baselines. without this, 90% of all tests needed a baseline change.
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
the file that follows is a large json file of 19000 tests the runtime validates their regex system with.
| var tokenText = token.Text; | ||
|
|
||
| if (startDelimiter.Length > 0 && !tokenText.StartsWith(startDelimiter)) | ||
| if (startDelimiter.Length > 0 && !tokenText.StartsWith(startDelimiter, StringComparison.Ordinal)) |
There was a problem hiding this comment.
assert fired before in cases with interesting unicode strings where the local-based StartsWith was not the same as the Ordinal StartsWith.
...Core/Portable/EmbeddedLanguages/RegularExpressions/LanguageServices/RegexSyntaxClassifier.cs
Show resolved
Hide resolved
| current = new RegexAlternationNode( | ||
| current, ConsumeCurrentToken(allowTrivia: true), ParseSequence(consumeCloseParen)); | ||
| var barToken = ConsumeCurrentToken(allowTrivia: true); | ||
| if (isConditional && builder.Count > 1) |
There was a problem hiding this comment.
Took me a while to parse this in my head. I think >= 3 is easier to read and understand, meaning we already saw Sequence|AnotherSequence and we don't need more than.
Both should be functionally equivalent, so it's a minor suggestion one can agree or disagree with.
There was a problem hiding this comment.
sure. i don't mind changing.
...atures/Core/Portable/EmbeddedLanguages/RegularExpressions/RegexParser.CaptureInfoAnalyzer.cs
Show resolved
Hide resolved
| public RegexAlternationNode(RegexAlternatingSequenceList sequenceList) | ||
| : base(RegexKind.Alternation) | ||
| { | ||
| Debug.Assert(sequenceList.NodesAndTokens.Length > 0); |
There was a problem hiding this comment.
A bit more restrictive assert.
| Debug.Assert(sequenceList.NodesAndTokens.Length > 0); | |
| Debug.Assert(sequenceList.NodesAndTokens.Length % 2 == 1); |
Fixes #58186
Our prior data representation for
a|b|c|d|ewas a deeply nested tree like so:This gets very bad with large trees and produces trees that we have to bail out on. We handle this in a graceful fashion, but it's still expensive and will cause continual internal failures while processing.
This PR changes things so that we just store this as a simple separated list (thanks @ryzngard for adding that!) which has no issues.
This PR also adds the real-world test cases that hte runtime has found in permissible repros. This gives good coverage and is how we found both the stack exhaustion issue, as well as a bogus assert in our code.