Skip to content

Fix stack exhaustion when parsing large regex alternating sequences.#58254

Merged
CyrusNajmabadi merged 11 commits intodotnet:mainfrom
CyrusNajmabadi:regexStack
Dec 13, 2021
Merged

Fix stack exhaustion when parsing large regex alternating sequences.#58254
CyrusNajmabadi merged 11 commits intodotnet:mainfrom
CyrusNajmabadi:regexStack

Conversation

@CyrusNajmabadi
Copy link
Copy Markdown
Contributor

Fixes #58186

Our prior data representation for a|b|c|d|e was a deeply nested tree like so:

          e
         /
     c  |  d
    /
a  |  b

This gets very bad with large trees and produces trees that we have to bail out on. We handle this in a graceful fashion, but it's still expensive and will cause continual internal failures while processing.

This PR changes things so that we just store this as a simple separated list (thanks @ryzngard for adding that!) which has no issues.

This PR also adds the real-world test cases that hte runtime has found in permissible repros. This gives good coverage and is how we found both the stack exhaustion issue, as well as a bogus assert in our code.

@CyrusNajmabadi CyrusNajmabadi requested a review from a team as a code owner December 10, 2021 19:21
@ghost ghost added the Area-IDE label Dec 10, 2021
if (node is RegexAlternationNode alternationNode)
{
return AlternationToElement(alternationNode, alternationNode.SequenceList.NodesAndTokens.Length);
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change in the test harness was so that we could keep all our current baselines. without this, 90% of all tests needed a baseline change.

}
}
}
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the file that follows is a large json file of 19000 tests the runtime validates their regex system with.

var tokenText = token.Text;

if (startDelimiter.Length > 0 && !tokenText.StartsWith(startDelimiter))
if (startDelimiter.Length > 0 && !tokenText.StartsWith(startDelimiter, StringComparison.Ordinal))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert fired before in cases with interesting unicode strings where the local-based StartsWith was not the same as the Ordinal StartsWith.

current = new RegexAlternationNode(
current, ConsumeCurrentToken(allowTrivia: true), ParseSequence(consumeCloseParen));
var barToken = ConsumeCurrentToken(allowTrivia: true);
if (isConditional && builder.Count > 1)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took me a while to parse this in my head. I think >= 3 is easier to read and understand, meaning we already saw Sequence|AnotherSequence and we don't need more than.

Both should be functionally equivalent, so it's a minor suggestion one can agree or disagree with.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. i don't mind changing.

Copy link
Copy Markdown
Member

@Youssef1313 Youssef1313 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loved this! ❤️

public RegexAlternationNode(RegexAlternatingSequenceList sequenceList)
: base(RegexKind.Alternation)
{
Debug.Assert(sequenceList.NodesAndTokens.Length > 0);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit more restrictive assert.

Suggested change
Debug.Assert(sequenceList.NodesAndTokens.Length > 0);
Debug.Assert(sequenceList.NodesAndTokens.Length % 2 == 1);

@CyrusNajmabadi CyrusNajmabadi merged commit 148f9e0 into dotnet:main Dec 13, 2021
@ghost ghost added this to the Next milestone Dec 13, 2021
@Cosifne Cosifne removed this from the Next milestone Jan 5, 2022
@Cosifne Cosifne added this to the 17.1.P3 milestone Jan 5, 2022
@CyrusNajmabadi CyrusNajmabadi deleted the regexStack branch February 1, 2022 18:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Move away from binary trees in regex parser

4 participants