Swift: Query for REDOS (Regular Expression Denial Of Service) by geoffw0 · Pull Request #13548 · github/codeql

geoffw0 · 2023-06-23T15:51:02Z

Adds a query for REDOS (Regular Expression Denial Of Service), built atop the regular expressions library from #13470.

At present there are significant limitations (the largest of which being that the regex library doesn't support regular expression literals yet), but I believe the query adds value in its current state, and will find more results as we improve the library.

TODO:

team review
doc review
DCA run

github-actions · 2023-06-23T15:52:18Z

QHelp previews:

java/ql/src/Security/CWE/CWE-730/PolynomialReDoS.qhelp

Polynomial regular expression used on uncontrolled data

Some regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to n^k or even 2ⁿ. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match.

The regular expression engine provided by Java uses a backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower.

Typically, a regular expression is affected by this problem if it contains a repetition of the form r* or r+ where the sub-expression r is ambiguous in the sense that it can match some string in multiple ways. More information about the precise circumstances can be found in the references.

Note that Java versions 9 and above have some mitigations against ReDoS; however they aren't perfect and more complex regular expressions can still be affected by this problem.

Recommendation

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter. Alternatively, an alternate regex library that guarantees linear time execution, such as Google's RE2J, may be used.

Example

Consider this use of a regular expression, which removes all leading and trailing whitespace in a string:

Pattern.compile("^\\s+|\\s+$").matcher(text).replaceAll("") // BAD

The sub-expression "\\s+$" will match the whitespace characters in text from left to right, but it can start matching anywhere within a whitespace sequence. This is problematic for strings that do not end with a whitespace character. Such a string will force the regular expression engine to process each whitespace sequence once per whitespace character in the sequence.

This ultimately means that the time cost of trimming a string is quadratic in the length of the string. So a string like "a b" will take milliseconds to process, but a similar string with a million spaces instead of just one will take several minutes.

Avoid this problem by rewriting the regular expression to not contain the ambiguity about when to start matching whitespace sequences. For instance, by using a negative look-behind ("^\\s+|(?<!\\s)\\s+$"), or just by using the built-in trim method (text.trim()).

Note that the sub-expression "^\\s+" is not problematic as the ^ anchor restricts when that sub-expression can start matching, and as the regular expression engine matches from left to right.

Example

As a similar, but slightly subtler problem, consider the regular expression that matches lines with numbers, possibly written using scientific notation:

"^0\\.\\d+E?\\d+$""

The problem with this regular expression is in the sub-expression \d+E?\d+ because the second \d+ can start matching digits anywhere after the first match of the first \d+ if there is no E in the input string.

This is problematic for strings that do not end with a digit. Such a string will force the regular expression engine to process each digit sequence once per digit in the sequence, again leading to a quadratic time complexity.

To make the processing faster, the regular expression should be rewritten such that the two \d+ sub-expressions do not have overlapping matches: "^0\\.\\d+(E\\d+)?$".

Example

Sometimes it is unclear how a regular expression can be rewritten to avoid the problem. In such cases, it often suffices to limit the length of the input string. For instance, the following regular expression is used to match numbers, and on some non-number inputs it can have quadratic time complexity:

Pattern.matches("^(\\+|-)?(\\d+|(\\d*\\.\\d*))?(E|e)?([-+])?(\\d+)?$", str);

It is not immediately obvious how to rewrite this regular expression to avoid the problem. However, you can mitigate performance issues by limiting the length to 1000 characters, which will always finish in a reasonable amount of time.

if (str.length() > 1000) {
    throw new IllegalArgumentException("Input too long");
}

Pattern.matches("^(\\+|-)?(\\d+|(\\d*\\.\\d*))?(E|e)?([-+])?(\\d+)?$", str);

References

OWASP: Regular expression Denial of Service - ReDoS.
Wikipedia: ReDoS.
Wikipedia: Time complexity.
James Kirrage, Asiri Rathnayake, Hayo Thielecke: Static Analysis for Regular Expression Denial-of-Service Attack.
Common Weakness Enumeration: CWE-1333.
Common Weakness Enumeration: CWE-730.
Common Weakness Enumeration: CWE-400.

java/ql/src/Security/CWE/CWE-730/ReDoS.qhelp

Inefficient regular expression

Some regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to n^k or even 2ⁿ. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match.

The regular expression engine provided by Java uses a backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower.

Typically, a regular expression is affected by this problem if it contains a repetition of the form r* or r+ where the sub-expression r is ambiguous in the sense that it can match some string in multiple ways. More information about the precise circumstances can be found in the references.

Note that Java versions 9 and above have some mitigations against ReDoS; however they aren't perfect and more complex regular expressions can still be affected by this problem.

Recommendation

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter. Alternatively, an alternate regex library that guarantees linear time execution, such as Google's RE2J, may be used.

Example

Consider this regular expression:

^_(__|.)+_$

Its sub-expression "(__|.)+?" can match the string "__" either by the first alternative "__" to the left of the "|" operator, or by two repetitions of the second alternative "." to the right. Thus, a string consisting of an odd number of underscores followed by some other character will cause the regular expression engine to run for an exponential amount of time before rejecting the input.

This problem can be avoided by rewriting the regular expression to remove the ambiguity between the two branches of the alternative inside the repetition:

^_(__|[^_])+_$

References

OWASP: Regular expression Denial of Service - ReDoS.
Wikipedia: ReDoS.
Wikipedia: Time complexity.
James Kirrage, Asiri Rathnayake, Hayo Thielecke: Static Analysis for Regular Expression Denial-of-Service Attack.
Common Weakness Enumeration: CWE-1333.
Common Weakness Enumeration: CWE-730.
Common Weakness Enumeration: CWE-400.

javascript/ql/src/Performance/PolynomialReDoS.qhelp

Polynomial regular expression used on uncontrolled data

Some regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to n^k or even 2ⁿ. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match.

The regular expression engines provided by many popular JavaScript platforms use backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower.

Typically, a regular expression is affected by this problem if it contains a repetition of the form r* or r+ where the sub-expression r is ambiguous in the sense that it can match some string in multiple ways. More information about the precise circumstances can be found in the references.

Recommendation

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter.

Example

Consider this use of a regular expression, which removes all leading and trailing whitespace in a string:

text.replace(/^\s+|\s+$/g, ''); // BAD

The sub-expression "\s+$" will match the whitespace characters in text from left to right, but it can start matching anywhere within a whitespace sequence. This is problematic for strings that do not end with a whitespace character. Such a string will force the regular expression engine to process each whitespace sequence once per whitespace character in the sequence.

This ultimately means that the time cost of trimming a string is quadratic in the length of the string. So a string like "a b" will take milliseconds to process, but a similar string with a million spaces instead of just one will take several minutes.

Avoid this problem by rewriting the regular expression to not contain the ambiguity about when to start matching whitespace sequences. For instance, by using a negative look-behind (/^\s+|(?<!\s)\s+$/g), or just by using the built-in trim method (text.trim()).

Note that the sub-expression "^\s+" is not problematic as the ^ anchor restricts when that sub-expression can start matching, and as the regular expression engine matches from left to right.

Example

As a similar, but slightly subtler problem, consider the regular expression that matches lines with numbers, possibly written using scientific notation:

/^0\.\d+E?\d+$/.test(str) // BAD

The problem with this regular expression is in the sub-expression \d+E?\d+ because the second \d+ can start matching digits anywhere after the first match of the first \d+ if there is no E in the input string.

This is problematic for strings that do not end with a digit. Such a string will force the regular expression engine to process each digit sequence once per digit in the sequence, again leading to a quadratic time complexity.

To make the processing faster, the regular expression should be rewritten such that the two \d+ sub-expressions do not have overlapping matches: ^0\.\d+(E\d+)?$.

Example

Sometimes it is unclear how a regular expression can be rewritten to avoid the problem. In such cases, it often suffices to limit the length of the input string. For instance, the following regular expression is used to match numbers, and on some non-number inputs it can have quadratic time complexity:

/^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.test(str) // BAD

It is not immediately obvious how to rewrite this regular expression to avoid the problem. However, you can mitigate performance issues by limiting the length to 1000 characters, which will always finish in a reasonable amount of time.

if (str.length > 1000) {
    throw new Error("Input too long");
}

/^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.test(str)

References

OWASP: Regular expression Denial of Service - ReDoS.
Wikipedia: ReDoS.
Wikipedia: Time complexity.
James Kirrage, Asiri Rathnayake, Hayo Thielecke: Static Analysis for Regular Expression Denial-of-Service Attack.
Common Weakness Enumeration: CWE-1333.
Common Weakness Enumeration: CWE-730.
Common Weakness Enumeration: CWE-400.

javascript/ql/src/Performance/ReDoS.qhelp

Inefficient regular expression

Some regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to n^k or even 2ⁿ. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match.

The regular expression engines provided by many popular JavaScript platforms use backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower.

Typically, a regular expression is affected by this problem if it contains a repetition of the form r* or r+ where the sub-expression r is ambiguous in the sense that it can match some string in multiple ways. More information about the precise circumstances can be found in the references.

Recommendation

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter.

Example

Consider this regular expression:

/^_(__|.)+_$/

Its sub-expression "(__|.)+?" can match the string "__" either by the first alternative "__" to the left of the "|" operator, or by two repetitions of the second alternative "." to the right. Thus, a string consisting of an odd number of underscores followed by some other character will cause the regular expression engine to run for an exponential amount of time before rejecting the input.

This problem can be avoided by rewriting the regular expression to remove the ambiguity between the two branches of the alternative inside the repetition:

/^_(__|[^_])+_$/

References

OWASP: Regular expression Denial of Service - ReDoS.
Wikipedia: ReDoS.
Wikipedia: Time complexity.
James Kirrage, Asiri Rathnayake, Hayo Thielecke: Static Analysis for Regular Expression Denial-of-Service Attack.
Common Weakness Enumeration: CWE-1333.
Common Weakness Enumeration: CWE-730.
Common Weakness Enumeration: CWE-400.

python/ql/src/Security/CWE-730/PolynomialReDoS.qhelp

Polynomial regular expression used on uncontrolled data

Some regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to n^k or even 2ⁿ. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match.

The regular expression engine provided by Python uses a backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower.

Typically, a regular expression is affected by this problem if it contains a repetition of the form r* or r+ where the sub-expression r is ambiguous in the sense that it can match some string in multiple ways. More information about the precise circumstances can be found in the references.

Recommendation

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter.

Example

Consider this use of a regular expression, which removes all leading and trailing whitespace in a string:

re.sub(r"^\s+|\s+$", "", text) # BAD

The sub-expression "\s+$" will match the whitespace characters in text from left to right, but it can start matching anywhere within a whitespace sequence. This is problematic for strings that do not end with a whitespace character. Such a string will force the regular expression engine to process each whitespace sequence once per whitespace character in the sequence.

This ultimately means that the time cost of trimming a string is quadratic in the length of the string. So a string like "a b" will take milliseconds to process, but a similar string with a million spaces instead of just one will take several minutes.

Avoid this problem by rewriting the regular expression to not contain the ambiguity about when to start matching whitespace sequences. For instance, by using a negative look-behind (^\s+|(?<!\s)\s+$), or just by using the built-in strip method (text.strip()).

Note that the sub-expression "^\s+" is not problematic as the ^ anchor restricts when that sub-expression can start matching, and as the regular expression engine matches from left to right.

Example

As a similar, but slightly subtler problem, consider the regular expression that matches lines with numbers, possibly written using scientific notation:

^0\.\d+E?\d+$ # BAD

The problem with this regular expression is in the sub-expression \d+E?\d+ because the second \d+ can start matching digits anywhere after the first match of the first \d+ if there is no E in the input string.

This is problematic for strings that do not end with a digit. Such a string will force the regular expression engine to process each digit sequence once per digit in the sequence, again leading to a quadratic time complexity.

To make the processing faster, the regular expression should be rewritten such that the two \d+ sub-expressions do not have overlapping matches: ^0\.\d+(E\d+)?$.

Example

Sometimes it is unclear how a regular expression can be rewritten to avoid the problem. In such cases, it often suffices to limit the length of the input string. For instance, the following regular expression is used to match numbers, and on some non-number inputs it can have quadratic time complexity:

match = re.search(r'^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$', str)

It is not immediately obvious how to rewrite this regular expression to avoid the problem. However, you can mitigate performance issues by limiting the length to 1000 characters, which will always finish in a reasonable amount of time.

if len(str) > 1000:
    raise ValueError("Input too long")

match = re.search(r'^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$', str)

References

OWASP: Regular expression Denial of Service - ReDoS.
Wikipedia: ReDoS.
Wikipedia: Time complexity.
James Kirrage, Asiri Rathnayake, Hayo Thielecke: Static Analysis for Regular Expression Denial-of-Service Attack.
Common Weakness Enumeration: CWE-1333.
Common Weakness Enumeration: CWE-730.
Common Weakness Enumeration: CWE-400.

python/ql/src/Security/CWE-730/ReDoS.qhelp

Inefficient regular expression

Some regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to n^k or even 2ⁿ. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match.

The regular expression engine provided by Python uses a backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower.

Typically, a regular expression is affected by this problem if it contains a repetition of the form r* or r+ where the sub-expression r is ambiguous in the sense that it can match some string in multiple ways. More information about the precise circumstances can be found in the references.

Recommendation

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter.

Example

Consider this regular expression:

^_(__|.)+_$

Its sub-expression "(__|.)+?" can match the string "__" either by the first alternative "__" to the left of the "|" operator, or by two repetitions of the second alternative "." to the right. Thus, a string consisting of an odd number of underscores followed by some other character will cause the regular expression engine to run for an exponential amount of time before rejecting the input.

This problem can be avoided by rewriting the regular expression to remove the ambiguity between the two branches of the alternative inside the repetition:

^_(__|[^_])+_$

References

OWASP: Regular expression Denial of Service - ReDoS.
Wikipedia: ReDoS.
Wikipedia: Time complexity.
James Kirrage, Asiri Rathnayake, Hayo Thielecke: Static Analysis for Regular Expression Denial-of-Service Attack.
Common Weakness Enumeration: CWE-1333.
Common Weakness Enumeration: CWE-730.
Common Weakness Enumeration: CWE-400.

ruby/ql/src/queries/security/cwe-1333/PolynomialReDoS.qhelp

Polynomial regular expression used on uncontrolled data

Some regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to n^k or even 2ⁿ. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match.

The regular expression engine used by the Ruby interpreter (MRI) uses backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower.

Note that Ruby 3.2 and later have implemented a caching mechanism that completely eliminates the worst-case time complexity for the regular expressions flagged by this query. The regular expressions flagged by this query are therefore only problematic for Ruby versions prior to 3.2.

Typically, a regular expression is affected by this problem if it contains a repetition of the form r* or r+ where the sub-expression r is ambiguous in the sense that it can match some string in multiple ways. More information about the precise circumstances can be found in the references.

Recommendation

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter.

Example

Consider this use of a regular expression, which removes all leading and trailing whitespace in a string:

text.gsub!(/^\s+|\s+$/, '') # BAD

The sub-expression "\s+$" will match the whitespace characters in text from left to right, but it can start matching anywhere within a whitespace sequence. This is problematic for strings that do not end with a whitespace character. Such a string will force the regular expression engine to process each whitespace sequence once per whitespace character in the sequence.

This ultimately means that the time cost of trimming a string is quadratic in the length of the string. So a string like "a b" will take milliseconds to process, but a similar string with a million spaces instead of just one will take several minutes.

Avoid this problem by rewriting the regular expression to not contain the ambiguity about when to start matching whitespace sequences. For instance, by using a negative look-behind (/^\s+|(?<!\s)\s+$/), or just by using the built-in strip method (text.strip!).

Note that the sub-expression "^\s+" is not problematic as the ^ anchor restricts when that sub-expression can start matching, and as the regular expression engine matches from left to right.

Example

As a similar, but slightly subtler problem, consider the regular expression that matches lines with numbers, possibly written using scientific notation:

/^0\.\d+E?\d+$/ # BAD

The problem with this regular expression is in the sub-expression \d+E?\d+ because the second \d+ can start matching digits anywhere after the first match of the first \d+ if there is no E in the input string.

This is problematic for strings that do not end with a digit. Such a string will force the regular expression engine to process each digit sequence once per digit in the sequence, again leading to a quadratic time complexity.

To make the processing faster, the regular expression should be rewritten such that the two \d+ sub-expressions do not have overlapping matches: /^0\.\d+(E\d+)?$/.

Example

Sometimes it is unclear how a regular expression can be rewritten to avoid the problem. In such cases, it often suffices to limit the length of the input string. For instance, the following regular expression is used to match numbers, and on some non-number inputs it can have quadratic time complexity:

is_matching = /^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.match?(str)

It is not immediately obvious how to rewrite this regular expression to avoid the problem. However, you can mitigate performance issues by limiting the length to 1000 characters, which will always finish in a reasonable amount of time.

if str.length > 1000
    raise ArgumentError, "Input too long"
end

is_matching = /^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.match?(str)

References

OWASP: Regular expression Denial of Service - ReDoS.
Wikipedia: ReDoS.
Wikipedia: Time complexity.
James Kirrage, Asiri Rathnayake, Hayo Thielecke: Static Analysis for Regular Expression Denial-of-Service Attack.
Common Weakness Enumeration: CWE-1333.
Common Weakness Enumeration: CWE-730.
Common Weakness Enumeration: CWE-400.

ruby/ql/src/queries/security/cwe-1333/ReDoS.qhelp

Inefficient regular expression

Some regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to n^k or even 2ⁿ. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match.

The regular expression engine used by the Ruby interpreter (MRI) uses backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower.

Note that Ruby 3.2 and later have implemented a caching mechanism that completely eliminates the worst-case time complexity for the regular expressions flagged by this query. The regular expressions flagged by this query are therefore only problematic for Ruby versions prior to 3.2.

Typically, a regular expression is affected by this problem if it contains a repetition of the form r* or r+ where the sub-expression r is ambiguous in the sense that it can match some string in multiple ways. More information about the precise circumstances can be found in the references.

Recommendation

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter.

Example

Consider this regular expression:

/^_(__|.)+_$/

Its sub-expression "(__|.)+?" can match the string "__" either by the first alternative "__" to the left of the "|" operator, or by two repetitions of the second alternative "." to the right. Thus, a string consisting of an odd number of underscores followed by some other character will cause the regular expression engine to run for an exponential amount of time before rejecting the input.

This problem can be avoided by rewriting the regular expression to remove the ambiguity between the two branches of the alternative inside the repetition:

/^_(__|[^_])+_$/

References

OWASP: Regular expression Denial of Service - ReDoS.
Wikipedia: ReDoS.
Wikipedia: Time complexity.
James Kirrage, Asiri Rathnayake, Hayo Thielecke: Static Analysis for Regular Expression Denial-of-Service Attack.
Common Weakness Enumeration: CWE-1333.
Common Weakness Enumeration: CWE-730.
Common Weakness Enumeration: CWE-400.

swift/ql/src/queries/Security/CWE-1333/ReDoS.qhelp

Inefficient regular expression

Some regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to n^k or even 2ⁿ. Such regular expressions can negatively affect performance, and potentially allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match.

The regular expression engine used by Swift uses backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower.

Typically, a regular expression is affected by this problem if it contains a repetition of the form r* or r+ where the sub-expression r is ambiguous in the sense that it can match some string in multiple ways. More information about the precise circumstances can be found in the references.

Recommendation

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time complexity does not matter.

Example

Consider the following regular expression:

/^_(__|.)+_$/

Its sub-expression "(__|.)+" can match the string "__" either by the first alternative "__" to the left of the "|" operator, or by two repetitions of the second alternative "." to the right. Therefore, a string consisting of an odd number of underscores followed by some other character will cause the regular expression engine to run for an exponential amount of time before rejecting the input.

This problem can be avoided by rewriting the regular expression to remove the ambiguity between the two branches of the alternative inside the repetition:

/^_(__|[^_])+_$/

References

OWASP: Regular expression Denial of Service - ReDoS.
Wikipedia: ReDoS.
Wikipedia: Time complexity.
James Kirrage, Asiri Rathnayake, Hayo Thielecke: Static Analysis for Regular Expression Denial-of-Service Attack.
Common Weakness Enumeration: CWE-1333.
Common Weakness Enumeration: CWE-730.
Common Weakness Enumeration: CWE-400.

geoffw0 · 2023-06-23T15:52:27Z

+
+    let nsregex3 = try? NSRegularExpression(pattern: ".*") // GOOD (safe regex)
+    _ = nsregex3?.stringByReplacingMatches(in: tainted, range: NSRange(location: 0, length: tainted.utf16.count), withTemplate: "")
+}


There are many more test cases for the core logic of this query, i.e. identifying REDOS vulnerable regular expressions, in the library test file https://github.com/github/codeql/blob/main/swift/ql/test/library-tests/regex/redos_variants.swift . These include realistic cases and edge cases designed to test the limits of the engine.

Sounds good. As long as this query delegates to the same implementation, there's no need to copy those test cases here.

Yep, the query tests mostly just confirm that everything is wired up correctly and presents the results in a reasonable manner.

d10c

LGTM with one minor suggestion. Will this be run as part of the standard query suite? If so, might want to do a DCA run before merging.

d10c · 2023-07-03T15:00:55Z

+
+    let nsregex3 = try? NSRegularExpression(pattern: ".*") // GOOD (safe regex)
+    _ = nsregex3?.stringByReplacingMatches(in: tainted, range: NSRange(location: 0, length: tainted.utf16.count), withTemplate: "")
+}


Sounds good. As long as this query delegates to the same implementation, there's no need to copy those test cases here.

MathiasVP · 2023-07-03T15:07:35Z

LGTM with one minor suggestion. Will this be run as part of the standard query suite? If so, might want to do a DCA run before merging.

We should definitely have a DCA run for this. It'll go directly into the Code Scanning suite as it:

Is tagged as a security query, and
is tagged with @precision high.

This library also frequently appear as the cause of various timeouts in other languages 😄.

geoffw0 · 2023-07-03T16:43:11Z

Thanks for reviewing...

Will this be run as part of the standard query suite?

As @MathiasVP said this will run in the standard suite due to its metadata. We sometimes start queries at a lower @precision but this tends to be when there are known and common false positives, and I've not seen that.

If so, might want to do a DCA run before merging.

Definitely. I've just started a run.

This library also frequently appear as the cause of various timeouts in other languages 😄.

Yeah, we've already encountered (and fixed) a timeout in one of the test regexps in the library PR (#13470). I didn't encounter any problems on the MRVA top 1000 though. In any case I'm moderately confident any further problems we encounter will be isolated to specific forms of regex (ironically, given what this query is about) and can be investigated and solved.

geoffw0 · 2023-07-04T07:44:48Z

DCA LGTM. Analysis time is not measurably slower, which if anything is a slight surprise as this is the first query to use the new regex library.

d10c

LGTM! (Oh, actually let's wait for doc review)

geoffw0 · 2023-07-10T19:30:44Z

@github/docs-content-codeql please could you confirm this one hasn't been forgotten?

geoffw0 · 2023-07-10T19:32:09Z

Note that the qhelp preview generated all sorts of stuff, only the Swift .qhelp is under review here.

subatoi

Looks good 👍 —just had some very minor Docs suggestions

Co-authored-by: Ben Ahmady <32935794+subatoi@users.noreply.github.com>

geoffw0 · 2023-07-13T18:23:07Z

Thanks @subatoi, all suggestions accepted!

Now this just needs a re-approval from someone and merging.

geoffw0 added 6 commits June 23, 2023 16:04

Swift: Copy query from Ruby.

a957ce4

Swift: Trivial fixes for the query.

4a46946

Swift: Copy qhelp from Ruby.

54a8f9e

Swift: Revise the qhelp.

1f58152

Swift: Add query tests.

bf1254f

Swift: Change note.

ddd3e88

geoffw0 added the Swift label Jun 23, 2023

geoffw0 requested a review from a team as a code owner June 23, 2023 15:51

github-actions Bot added the documentation label Jun 23, 2023

geoffw0 commented Jun 23, 2023

View reviewed changes

d10c reviewed Jul 3, 2023

View reviewed changes

Swift: Grammar.

a050de7

geoffw0 added the ready-for-doc-review This PR requires and is ready for review from the GitHub docs team. label Jul 3, 2023

d10c previously approved these changes Jul 4, 2023

View reviewed changes

subatoi previously approved these changes Jul 13, 2023

View reviewed changes

Comment thread swift/ql/src/queries/Security/CWE-1333/ReDoSIntroduction.inc.qhelp Outdated

Comment thread swift/ql/src/queries/Security/CWE-1333/ReDoS.qhelp Outdated

Comment thread swift/ql/src/queries/Security/CWE-1333/ReDoS.qhelp Outdated

subatoi self-assigned this Jul 13, 2023

Apply suggestions from code review

962c16d

Co-authored-by: Ben Ahmady <32935794+subatoi@users.noreply.github.com>

geoffw0 dismissed stale reviews from subatoi and d10c via 962c16d July 13, 2023 18:20

subatoi approved these changes Jul 14, 2023

View reviewed changes

geoffw0 merged commit 1c8297b into github:main Jul 14, 2023

Conversation

geoffw0 commented Jun 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Polynomial regular expression used on uncontrolled data

Recommendation

Example

Example

Example

References

Inefficient regular expression

Recommendation

Example

References

Polynomial regular expression used on uncontrolled data

Recommendation

Example

Example

Example

References

Inefficient regular expression

Recommendation

Example

References

Polynomial regular expression used on uncontrolled data

Recommendation

Example

Example

Example

References

Inefficient regular expression

Recommendation

Example

References

Polynomial regular expression used on uncontrolled data

Recommendation

Example

Example

Example

References

Inefficient regular expression

Recommendation

Example

References

Inefficient regular expression

Recommendation

Example

References

Uh oh!

geoffw0 Jun 23, 2023

Choose a reason for hiding this comment

Uh oh!

d10c Jul 3, 2023

Choose a reason for hiding this comment

Uh oh!

geoffw0 Jul 3, 2023

Choose a reason for hiding this comment

Uh oh!

d10c left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

d10c Jul 3, 2023

Choose a reason for hiding this comment

Uh oh!

MathiasVP commented Jul 3, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

geoffw0 commented Jul 3, 2023

Uh oh!

geoffw0 commented Jul 4, 2023

Uh oh!

d10c left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

geoffw0 commented Jul 10, 2023

Uh oh!

geoffw0 commented Jun 23, 2023 •

edited

Loading

github-actions Bot commented Jun 23, 2023 •

edited

Loading

MathiasVP commented Jul 3, 2023 •

edited

Loading

d10c left a comment •

edited

Loading