ReDoS: restrict the edges considered in polynomial-redos for complex regular expressions#12543
ReDoS: restrict the edges considered in polynomial-redos for complex regular expressions#12543erik-krogh merged 4 commits intogithub:mainfrom
Conversation
|
If I remember correctly there are other tools that also try to find ReDoS. Do these tools also fail/take very long on this specific regex? |
It is a problem with the implementation strategy I've taken in implementing this. E.g. the Java implementation from the author of the research I've based my implementation on does not experience the same issue. The bottom-up evaluation nature of CodeQL hurts here. |
tausbn
left a comment
There was a problem hiding this comment.
One typo, otherwise LGTM. 👍
|
LGTM if you want to merge like this. But if you want to try and fix it in a more principled way I think this would work: (possibly for a future PR)
|
Fixes #12528
The root cause is that some regular expressions just have a massive search space.
I've tried various solutions, but I couldn't find anything good.
So I've made an approach where a simplified search is done when a regular expression is determined to be sufficiently complex (currently measured by the amount of states in the NFA).
Consider this simplified regex:
/\w*(A|B)(C|D)(E|F)\w*Y/There are many different paths from the first
\wto the second in the NFA, and that blows up the complexity of the analysis.When the complexity metric is hit the search will only consider one of the alternatives for each location where the regex is branching.
This causes us to miss some results, but it's very few results that are lost. E.g. the polynomial-redos in the above regex will still be flagged.
An evaluation on Ruby/Python/JavaScript (with the standard nightly source-suite) shows no lost results.
(The Ruby evaluation shows lost results, but we still flag other parts of the same regular expression).
I don't like using a complexity metric, but I didn't find any better approach.
Also drive-by fix that stops Python from parsing regular expression in extracted dependencies.
This made debugging way easier, and I feel that it is a good change.
(I'm quite sure I found an ugly way of achieving my goal, better solutions are very welcome).