
Prevent ReDoS in Spanish sentence splitting regex #1084

Merged
Gallaecio merged 3 commits into scrapinghub:master from
Sjord:fix-spanish-regexdos
Jan 11, 2023


@Sjord
Contributor

@Sjord Sjord commented Oct 12, 2022

In Spanish, questions start with an upside down question mark:

¿Vos bueno?

This was already handled in the original regex, but that regex was vulnerable to regular expression denial of service (ReDoS). The new regex matches either a normal end-of-sentence character optionally followed by a ¿ or ¡, or a ¿ or ¡ on its own. One behavioral change is that the normal end-of-sentence character (.!?;…) now has to come before the ¡ or ¿, but I think this is acceptable.
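The two-branch shape described above can be sketched roughly as follows. This is a minimal illustration, not the pattern in `dateparser/languages/locale.py`: the names `SPLIT_RE` and `split_sentences` are hypothetical, and whitespace handling is left out for brevity. Because the two branches start with disjoint character classes and there are no nested quantifiers, the engine has only one way to match at each position.

```python
import re

# Hypothetical pattern for illustration only (not dateparser's actual regex).
# Branch 1: one or more sentence terminators, optionally followed by ¿ or ¡.
# Branch 2: a ¿ or ¡ on its own.
SPLIT_RE = re.compile(r"[.!?;…]+[¿¡]?|[¿¡]")


def split_sentences(text):
    """Split text on the pattern and drop empty or whitespace-only parts."""
    parts = (part.strip() for part in SPLIT_RE.split(text))
    return [part for part in parts if part]


sentences = split_sentences("Hola. ¿Vos bueno? ¡Claro!")
print(sentences)  # ['Hola', 'Vos bueno', 'Claro']
```

With this shape, a long run of ¿ characters is matched one character at a time, so the work grows linearly with the input length.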

This PR also adds some Spanish test cases. These hit the sentence splitting logic, but the exact result of the splitting is not tested.

Fixes #869

@Sjord Sjord marked this pull request as draft October 12, 2022 12:31
@Sjord Sjord marked this pull request as ready for review October 12, 2022 12:49
@Gallaecio
Contributor

Closing and reopening to re-trigger CI jobs…

@Gallaecio Gallaecio closed this Jan 11, 2023
@Gallaecio Gallaecio reopened this Jan 11, 2023
Sjord added 2 commits January 11, 2023 15:56
In Spanish, questions start with an upside down question mark:

> ¿Vos bueno?

This was already handled in the original regex, but that regex was vulnerable to regular expression denial of service (ReDoS). The new regex matches either a normal end-of-sentence character optionally followed by a ¿ or ¡, or a ¿ or ¡ on its own. One behavioral change is that the normal end-of-sentence character (.!?;…) now has to come before the ¡ or ¿, but I think this is acceptable.

This PR also adds some Spanish test cases. These hit the sentence splitting logic, but the exact result of the splitting is not tested.

Fixes scrapinghub#869
Consume whitespace if there is any, but still match if there isn't. This makes the most sense for a \n followed immediately by ¿. It also means the regex doesn't have to backtrack when there is no whitespace after a line ending.
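The difference between required and optional whitespace can be illustrated with two small patterns. Both are hypothetical and written only for this comparison; they are not taken from the dateparser source. The sketch assumes a newline counts as a line ending that may be followed directly by ¿.

```python
import re

# Hypothetical patterns for illustration only.
# REQUIRES_SPACE insists on at least one whitespace character after the terminator.
REQUIRES_SPACE = re.compile(r"[.!?;…\n]\s+[¿¡]?")
# OPTIONAL_SPACE consumes whitespace if present but also matches without it.
OPTIONAL_SPACE = re.compile(r"[.!?;…\n]\s*[¿¡]?")

text = "Hola\n¿Vos bueno?"

# The newline is followed directly by ¿, so the strict pattern finds nothing.
print(REQUIRES_SPACE.search(text))  # None

# The lenient pattern matches the newline together with the ¿ that follows it.
print(repr(OPTIONAL_SPACE.search(text).group()))  # '\n¿'
```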
@Sjord Sjord force-pushed the fix-spanish-regexdos branch from fe59143 to ef78400 Compare January 11, 2023 14:56
@Sjord
Contributor Author

Sjord commented Jan 11, 2023

I rebased to master.

The previous code wouldn't remove multiple empty strings in a row, because it
modified the list while looping over it. We use `filter` with the default identity
function instead.
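The pitfall described in this commit message can be reproduced in isolation. The functions below are hypothetical stand-ins, assuming the previous code removed items from the list inside a `for` loop over that same list; they are not copied from dateparser.

```python
def remove_empty_in_loop(items):
    """Buggy: removing from a list while iterating over it skips elements."""
    for item in items:
        if not item:
            items.remove(item)
    return items


def remove_empty_with_filter(items):
    """filter(None, ...) keeps only truthy items, so every empty string is dropped."""
    return list(filter(None, items))


parts = ["a", "", "", "b"]

# After the first "" is removed, the second "" shifts into the slot the loop
# has already visited, so it is never examined.
print(remove_empty_in_loop(list(parts)))      # ['a', '', 'b']
print(remove_empty_with_filter(list(parts)))  # ['a', 'b']
```

Passing `None` as the first argument makes `filter` use the identity function, which is what "the default identity function" refers to here.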
@serhii73 serhii73 requested a review from wRAR January 11, 2023 16:26
@Gallaecio Gallaecio merged commit 769e4c0 into scrapinghub:master Jan 11, 2023
@serhii73
Collaborator

We have a new release with this PR. Thank you @Sjord

Development

Successfully merging this pull request may close these issues.

SECURITY: bad regex pattern in 'dateparser/languages/locale.py' will cause 'ReDos' security problem.

4 participants