Swift: Query for REDOS (Regular Expression Denial Of Service)#13548
Conversation
|
QHelp previews: java/ql/src/Security/CWE/CWE-730/PolynomialReDoS.qhelpPolynomial regular expression used on uncontrolled dataSome regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to nk or even 2n. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match. The regular expression engine provided by Java uses a backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower. Typically, a regular expression is affected by this problem if it contains a repetition of the form Note that Java versions 9 and above have some mitigations against ReDoS; however they aren't perfect and more complex regular expressions can still be affected by this problem. RecommendationModify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter. Alternatively, an alternate regex library that guarantees linear time execution, such as Google's RE2J, may be used. ExampleConsider this use of a regular expression, which removes all leading and trailing whitespace in a string: Pattern.compile("^\\s+|\\s+$").matcher(text).replaceAll("") // BADThe sub-expression This ultimately means that the time cost of trimming a string is quadratic in the length of the string. So a string like Avoid this problem by rewriting the regular expression to not contain the ambiguity about when to start matching whitespace sequences. For instance, by using a negative look-behind ( Note that the sub-expression ExampleAs a similar, but slightly subtler problem, consider the regular expression that matches lines with numbers, possibly written using scientific notation: "^0\\.\\d+E?\\d+$"" The problem with this regular expression is in the sub-expression This is problematic for strings that do not end with a digit. Such a string will force the regular expression engine to process each digit sequence once per digit in the sequence, again leading to a quadratic time complexity. To make the processing faster, the regular expression should be rewritten such that the two ExampleSometimes it is unclear how a regular expression can be rewritten to avoid the problem. In such cases, it often suffices to limit the length of the input string. For instance, the following regular expression is used to match numbers, and on some non-number inputs it can have quadratic time complexity: Pattern.matches("^(\\+|-)?(\\d+|(\\d*\\.\\d*))?(E|e)?([-+])?(\\d+)?$", str); It is not immediately obvious how to rewrite this regular expression to avoid the problem. However, you can mitigate performance issues by limiting the length to 1000 characters, which will always finish in a reasonable amount of time. if (str.length() > 1000) {
throw new IllegalArgumentException("Input too long");
}
Pattern.matches("^(\\+|-)?(\\d+|(\\d*\\.\\d*))?(E|e)?([-+])?(\\d+)?$", str); References
java/ql/src/Security/CWE/CWE-730/ReDoS.qhelpInefficient regular expressionSome regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to nk or even 2n. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match. The regular expression engine provided by Java uses a backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower. Typically, a regular expression is affected by this problem if it contains a repetition of the form Note that Java versions 9 and above have some mitigations against ReDoS; however they aren't perfect and more complex regular expressions can still be affected by this problem. RecommendationModify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter. Alternatively, an alternate regex library that guarantees linear time execution, such as Google's RE2J, may be used. ExampleConsider this regular expression: ^_(__|.)+_$Its sub-expression This problem can be avoided by rewriting the regular expression to remove the ambiguity between the two branches of the alternative inside the repetition: ^_(__|[^_])+_$References
javascript/ql/src/Performance/PolynomialReDoS.qhelpPolynomial regular expression used on uncontrolled dataSome regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to nk or even 2n. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match. The regular expression engines provided by many popular JavaScript platforms use backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower. Typically, a regular expression is affected by this problem if it contains a repetition of the form RecommendationModify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter. ExampleConsider this use of a regular expression, which removes all leading and trailing whitespace in a string: text.replace(/^\s+|\s+$/g, ''); // BADThe sub-expression This ultimately means that the time cost of trimming a string is quadratic in the length of the string. So a string like Avoid this problem by rewriting the regular expression to not contain the ambiguity about when to start matching whitespace sequences. For instance, by using a negative look-behind ( Note that the sub-expression ExampleAs a similar, but slightly subtler problem, consider the regular expression that matches lines with numbers, possibly written using scientific notation: /^0\.\d+E?\d+$/.test(str) // BADThe problem with this regular expression is in the sub-expression This is problematic for strings that do not end with a digit. Such a string will force the regular expression engine to process each digit sequence once per digit in the sequence, again leading to a quadratic time complexity. To make the processing faster, the regular expression should be rewritten such that the two ExampleSometimes it is unclear how a regular expression can be rewritten to avoid the problem. In such cases, it often suffices to limit the length of the input string. For instance, the following regular expression is used to match numbers, and on some non-number inputs it can have quadratic time complexity: /^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.test(str) // BADIt is not immediately obvious how to rewrite this regular expression to avoid the problem. However, you can mitigate performance issues by limiting the length to 1000 characters, which will always finish in a reasonable amount of time. if (str.length > 1000) {
throw new Error("Input too long");
}
/^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.test(str)References
javascript/ql/src/Performance/ReDoS.qhelpInefficient regular expressionSome regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to nk or even 2n. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match. The regular expression engines provided by many popular JavaScript platforms use backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower. Typically, a regular expression is affected by this problem if it contains a repetition of the form RecommendationModify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter. ExampleConsider this regular expression: /^_(__|.)+_$/Its sub-expression This problem can be avoided by rewriting the regular expression to remove the ambiguity between the two branches of the alternative inside the repetition: /^_(__|[^_])+_$/References
python/ql/src/Security/CWE-730/PolynomialReDoS.qhelpPolynomial regular expression used on uncontrolled dataSome regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to nk or even 2n. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match. The regular expression engine provided by Python uses a backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower. Typically, a regular expression is affected by this problem if it contains a repetition of the form RecommendationModify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter. ExampleConsider this use of a regular expression, which removes all leading and trailing whitespace in a string: re.sub(r"^\s+|\s+$", "", text) # BADThe sub-expression This ultimately means that the time cost of trimming a string is quadratic in the length of the string. So a string like Avoid this problem by rewriting the regular expression to not contain the ambiguity about when to start matching whitespace sequences. For instance, by using a negative look-behind ( Note that the sub-expression ExampleAs a similar, but slightly subtler problem, consider the regular expression that matches lines with numbers, possibly written using scientific notation: ^0\.\d+E?\d+$ # BADThe problem with this regular expression is in the sub-expression This is problematic for strings that do not end with a digit. Such a string will force the regular expression engine to process each digit sequence once per digit in the sequence, again leading to a quadratic time complexity. To make the processing faster, the regular expression should be rewritten such that the two ExampleSometimes it is unclear how a regular expression can be rewritten to avoid the problem. In such cases, it often suffices to limit the length of the input string. For instance, the following regular expression is used to match numbers, and on some non-number inputs it can have quadratic time complexity: match = re.search(r'^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$', str) It is not immediately obvious how to rewrite this regular expression to avoid the problem. However, you can mitigate performance issues by limiting the length to 1000 characters, which will always finish in a reasonable amount of time. if len(str) > 1000:
raise ValueError("Input too long")
match = re.search(r'^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$', str) References
python/ql/src/Security/CWE-730/ReDoS.qhelpInefficient regular expressionSome regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to nk or even 2n. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match. The regular expression engine provided by Python uses a backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower. Typically, a regular expression is affected by this problem if it contains a repetition of the form RecommendationModify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter. ExampleConsider this regular expression: ^_(__|.)+_$Its sub-expression This problem can be avoided by rewriting the regular expression to remove the ambiguity between the two branches of the alternative inside the repetition: ^_(__|[^_])+_$References
ruby/ql/src/queries/security/cwe-1333/PolynomialReDoS.qhelpPolynomial regular expression used on uncontrolled dataSome regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to nk or even 2n. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match. The regular expression engine used by the Ruby interpreter (MRI) uses backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower. Note that Ruby 3.2 and later have implemented a caching mechanism that completely eliminates the worst-case time complexity for the regular expressions flagged by this query. The regular expressions flagged by this query are therefore only problematic for Ruby versions prior to 3.2. Typically, a regular expression is affected by this problem if it contains a repetition of the form RecommendationModify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter. ExampleConsider this use of a regular expression, which removes all leading and trailing whitespace in a string: text.gsub!(/^\s+|\s+$/, '') # BADThe sub-expression This ultimately means that the time cost of trimming a string is quadratic in the length of the string. So a string like Avoid this problem by rewriting the regular expression to not contain the ambiguity about when to start matching whitespace sequences. For instance, by using a negative look-behind ( Note that the sub-expression ExampleAs a similar, but slightly subtler problem, consider the regular expression that matches lines with numbers, possibly written using scientific notation: /^0\.\d+E?\d+$/ # BADThe problem with this regular expression is in the sub-expression This is problematic for strings that do not end with a digit. Such a string will force the regular expression engine to process each digit sequence once per digit in the sequence, again leading to a quadratic time complexity. To make the processing faster, the regular expression should be rewritten such that the two ExampleSometimes it is unclear how a regular expression can be rewritten to avoid the problem. In such cases, it often suffices to limit the length of the input string. For instance, the following regular expression is used to match numbers, and on some non-number inputs it can have quadratic time complexity: is_matching = /^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.match?(str)It is not immediately obvious how to rewrite this regular expression to avoid the problem. However, you can mitigate performance issues by limiting the length to 1000 characters, which will always finish in a reasonable amount of time. if str.length > 1000
raise ArgumentError, "Input too long"
end
is_matching = /^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.match?(str)References
ruby/ql/src/queries/security/cwe-1333/ReDoS.qhelpInefficient regular expressionSome regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to nk or even 2n. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match. The regular expression engine used by the Ruby interpreter (MRI) uses backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower. Note that Ruby 3.2 and later have implemented a caching mechanism that completely eliminates the worst-case time complexity for the regular expressions flagged by this query. The regular expressions flagged by this query are therefore only problematic for Ruby versions prior to 3.2. Typically, a regular expression is affected by this problem if it contains a repetition of the form RecommendationModify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter. ExampleConsider this regular expression: /^_(__|.)+_$/Its sub-expression This problem can be avoided by rewriting the regular expression to remove the ambiguity between the two branches of the alternative inside the repetition: /^_(__|[^_])+_$/References
swift/ql/src/queries/Security/CWE-1333/ReDoS.qhelpInefficient regular expressionSome regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to nk or even 2n. Such regular expressions can negatively affect performance, and potentially allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match. The regular expression engine used by Swift uses backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower. Typically, a regular expression is affected by this problem if it contains a repetition of the form RecommendationModify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time complexity does not matter. ExampleConsider the following regular expression: /^_(__|.)+_$/Its sub-expression This problem can be avoided by rewriting the regular expression to remove the ambiguity between the two branches of the alternative inside the repetition: /^_(__|[^_])+_$/References
|
|
|
||
| let nsregex3 = try? NSRegularExpression(pattern: ".*") // GOOD (safe regex) | ||
| _ = nsregex3?.stringByReplacingMatches(in: tainted, range: NSRange(location: 0, length: tainted.utf16.count), withTemplate: "") | ||
| } |
There was a problem hiding this comment.
There are many more test cases for the core logic of this query, i.e. identifying REDOS vulnerable regular expressions, in the library test file https://github.com/github/codeql/blob/main/swift/ql/test/library-tests/regex/redos_variants.swift . These include realistic cases and edge cases designed to test the limits of the engine.
There was a problem hiding this comment.
Sounds good. As long as this query delegates to the same implementation, there's no need to copy those test cases here.
There was a problem hiding this comment.
Yep, the query tests mostly just confirm that everything is wired up correctly and presents the results in a reasonable manner.
d10c
left a comment
There was a problem hiding this comment.
LGTM with one minor suggestion. Will this be run as part of the standard query suite? If so, might want to do a DCA run before merging.
|
|
||
| let nsregex3 = try? NSRegularExpression(pattern: ".*") // GOOD (safe regex) | ||
| _ = nsregex3?.stringByReplacingMatches(in: tainted, range: NSRange(location: 0, length: tainted.utf16.count), withTemplate: "") | ||
| } |
There was a problem hiding this comment.
Sounds good. As long as this query delegates to the same implementation, there's no need to copy those test cases here.
We should definitely have a DCA run for this. It'll go directly into the Code Scanning suite as it:
This library also frequently appear as the cause of various timeouts in other languages 😄. |
|
Thanks for reviewing...
As @MathiasVP said this will run in the standard suite due to its metadata. We sometimes start queries at a lower
Definitely. I've just started a run.
Yeah, we've already encountered (and fixed) a timeout in one of the test regexps in the library PR (#13470). I didn't encounter any problems on the MRVA top 1000 though. In any case I'm moderately confident any further problems we encounter will be isolated to specific forms of regex (ironically, given what this query is about) and can be investigated and solved. |
|
DCA LGTM. Analysis time is not measurably slower, which if anything is a slight surprise as this is the first query to use the new regex library. |
|
@github/docs-content-codeql please could you confirm this one hasn't been forgotten? |
|
Note that the qhelp preview generated all sorts of stuff, only the Swift |
subatoi
left a comment
There was a problem hiding this comment.
Looks good 👍 —just had some very minor Docs suggestions
Co-authored-by: Ben Ahmady <32935794+subatoi@users.noreply.github.com>
|
Thanks @subatoi, all suggestions accepted! Now this just needs a re-approval from someone and merging. |
Adds a query for REDOS (Regular Expression Denial Of Service), built atop the regular expressions library from #13470.
At present there are significant limitations (the largest of which being that the regex library doesn't support regular expression literals yet), but I believe the query adds value in its current state, and will find more results as we improve the library.
TODO: