update dmp submodule to fix a buffer overflow by beqabeqa473 · Pull Request #28 · JoshData/fast_diff_match_patch

beqabeqa473 · 2024-02-09T16:57:47Z

There was a problem in dmp project, where it was possible to get a fatal crash, when converting diff back to lines because of heap overflow.
It is fixed in this submodule.

JoshData · 2024-02-13T00:19:29Z

It looks like the commit isn't in the upstream repo, is that right?

beqabeqa473 · 2024-02-13T06:32:45Z

Yes, and i am not sure if they will accept pr as last commit was 8 years ago

…

On 2/13/24, Joshua Tauberer ***@***.***> wrote: It looks like the commit isn't in the upstream repo, is that right? -- Reply to this email directly or view it on GitHub: #28 (comment) You are receiving this because you authored the thread. Message ID: ***@***.***>

-- with best regards Beqa Gozalishvili Tell: +995593454005 Email: ***@***.*** Web: https://gozaltech.org Skype: beqabeqa473 Telegram: https://t.me/gozaltech facebook: https://facebook.com/gozaltech twitter: https://twitter.com/beqabeqa473 Instagram: https://instagram.com/beqa.gozalishvili

JoshData · 2024-02-14T18:46:33Z

It would be difficult for me to review your changes to know whether I should accept them here, and I'm not sure if it is meaningful to have a submodule point to a commit that isn't in the repository that it points to. So for both reasons I don't think I can merge this PR as is.

Is there a similar fix in Google's original DMP library, which the repository that the submodule points to is based on?

beqabeqa473 · 2024-02-14T19:02:19Z

I am not sure. I will try to make a pr to dmp-cpp-stl but i am not sure it will be merged unfortunately. I

…

On 2/14/24, Joshua Tauberer ***@***.***> wrote: It would be difficult for me to review your changes to know whether I should accept them here, and I'm not sure if it is meaningful to have a submodule point to a commit that isn't in the repository that it points to. So for both reasons I don't think I can merge this PR as is. Is there a similar fix in Google's original DMP library, which the repository that the submodule points to is based on? -- Reply to this email directly or view it on GitHub: #28 (comment) You are receiving this because you authored the thread. Message ID: ***@***.***>

-- with best regards Beqa Gozalishvili Tell: +995593454005 Email: ***@***.*** Web: https://gozaltech.org Skype: beqabeqa473 Telegram: https://t.me/gozaltech facebook: https://facebook.com/gozaltech twitter: https://twitter.com/beqabeqa473 Instagram: https://instagram.com/beqa.gozalishvili

beqabeqa473 · 2024-02-17T17:45:41Z

I made a pr again diff-match-patch -cpp-stl but have not hope this to be merged unfortunately because of inactivity of author.

We have two ways, you can review and merge this pr as is, or fork my fork to your account to be sure nothing will happen and make a commit with that fork.

This issue is very serious, and makes whole application to crash.

fast-diff-match-patch is used in nvda screenreader and users of this screenreader are suffering from this issue.
it is possible to downgrade to diffmatchpatch-python, but as testing revealed, it is 4 to 10 times slower than fast-diff-match-patch.

JoshData · 2024-02-17T23:05:28Z

Got it. Ok let me find time to review what you did and figure out how to best include it.

Do you have a test case that I can run to see the crash?

beqabeqa473 · 2024-02-18T07:13:48Z

Unfortunately, just using dmp without nvda no, i don't have. It is possible to reliably get a crash when you try to pass large data to it, like git log -P 10000 in nvaccess/nvda repository, with nvda running.

…

On 2/18/24, Joshua Tauberer ***@***.***> wrote: Got it. Ok let me find time to review what you did and figure out how to best include it. Do you have a test case that I can run to see the crash? -- Reply to this email directly or view it on GitHub: #28 (comment) You are receiving this because you authored the thread. Message ID: ***@***.***>

-- with best regards Beqa Gozalishvili Tell: +995593454005 Email: ***@***.*** Web: https://gozaltech.org Skype: beqabeqa473 Telegram: https://t.me/gozaltech facebook: https://facebook.com/gozaltech twitter: https://twitter.com/beqabeqa473 Instagram: https://instagram.com/beqa.gozalishvili

Danstiv · 2024-02-18T08:16:58Z

Hello @JoshData
You can use this test script to see the problem.

Test text files contains some commits from NVDA repo, searching for diff before crashing takes a while (20 seconds on my machine).

beqabeqa473 · 2024-02-24T14:31:43Z

  hello @JoshData  By the chance, did you have a possibility to look at pr?

JoshData · 2024-03-04T04:37:26Z

I've had a chance to look now. I was able to reproduce the crash with @Danstiv's example (thanks for that!).

In DMP's method of doing a line-by-line diff, it maps each unique line to an integer, stores the integer as a character in a new string, and then does a diff over those fake characters. The DMP library doesn't check that the character type of the string can hold integers for as many unique lines that occurs. In this library, when comparing byte strings, it runs DMP using 'char' as the string's underlying character type, which means it won't be able to store more than ~255 unique lines. In the example, there are ~45,000 unique lines. The issue seems to be that since 'char' is signed, it doesn't correctly round-trip all the integers.

Rather than patching the DMP library, I think the right thing to do would be for me to avoid using the char type and always use a larger unsigned integer, i.e. wchar_t, the type used when comparing Python strings. This will also ensure that more unique lines can be represented in the string's character type, which would stay true to the DMP library's intent.

I'll try to get a fix done by the end of next weekend.

Edit: A work-around would be to do a comparison using Python strings (or unicode strings in Py2) rather than byte strings.

In #28, it was found that using a signed character type ('char') for the DMP library, the cast to and from an unsigned integer type doesn't round-trip correctly in the DMP library. This occurs when the number of unique lines in the input texts exceeds the number of positive values in the signed type. I think. We use std::string, based on char, when comparing Python bytestrings. (Regular strings use std::wstring, which is based on wchar_t, which I think is unsigned on all platforms.) Rather than patch the library to correct the cast, I'm changing this library to use wstring even for bytestring comparisons. This not only fixes the cast bug, but it also ensures that there is space to store more than 256 unique lines in the input. The original suggested fix to the upstream library was: ``` From 673dfc9d4bcb8cd0234894f5a245e9824abf5fbc Mon Sep 17 00:00:00 2001 From: Beka Gozalishvili <beqaprogger@gmail.com> Date: Fri, 9 Feb 2024 20:37:19 +0400 Subject: [PATCH] Fix buffer overflow in toLines function which was causing a fatal crash --- diff_match_patch.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/diff_match_patch.h b/diff_match_patch.h index 47531c9..9d184bc 100644 --- a/diff_match_patch.h +++ b/diff_match_patch.h @@ -682,7 +682,8 @@ class diff_match_patch { for (typename Diffs::iterator cur_diff = diffs.begin(); cur_diff != diffs.end(); ++cur_diff) { string_t text; for (int y = 0; y < (int)(*cur_diff).text.length(); y++) { - const LinePtr& lp = lineArray[static_cast<size_t>((*cur_diff).text[y])]; + typedef typename std::make_unsigned<typename string_t::value_type>::type unsigned_value_type; + const LinePtr& lp = lineArray[static_cast<size_t>(static_cast<unsigned_value_type>((*cur_diff).text[y]))]; text.append(lp.first, lp.second); } (*cur_diff).text.swap(text); ```

JoshData · 2024-03-06T19:39:01Z

I pushed a fix to the main branch. Hopefully this fixes it. If you can give it a try before I post a release that would be appreciated. (I am trying to get some pre-built packages to test ahead of time.)

Danstiv · 2024-03-06T21:05:05Z

@JoshData,
Unfortunately, another incorrect behavior is now observed.
When running on the same data the function hangs probably forever, I waited for about two minutes but didn't get the diff.

In #28, it was found that using a signed character type ('char') for the DMP library, the cast to and from an unsigned integer type doesn't round-trip correctly in the DMP library. This occurs when the number of unique lines in the input texts exceeds the number of positive values in the signed type. I think. We use std::string, based on char, when comparing Python bytestrings. (Regular strings use std::wstring, which is based on wchar_t, which I think is unsigned on all platforms.) Rather than patch the library to correct the cast, I'm changing this library to use wstring even for bytestring comparisons. This not only fixes the cast bug, but it also ensures that there is space to store more than 256 unique lines in the input. The original suggested fix to the upstream library was: ``` From 673dfc9d4bcb8cd0234894f5a245e9824abf5fbc Mon Sep 17 00:00:00 2001 From: Beka Gozalishvili <beqaprogger@gmail.com> Date: Fri, 9 Feb 2024 20:37:19 +0400 Subject: [PATCH] Fix buffer overflow in toLines function which was causing a fatal crash --- diff_match_patch.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/diff_match_patch.h b/diff_match_patch.h index 47531c9..9d184bc 100644 --- a/diff_match_patch.h +++ b/diff_match_patch.h @@ -682,7 +682,8 @@ class diff_match_patch { for (typename Diffs::iterator cur_diff = diffs.begin(); cur_diff != diffs.end(); ++cur_diff) { string_t text; for (int y = 0; y < (int)(*cur_diff).text.length(); y++) { - const LinePtr& lp = lineArray[static_cast<size_t>((*cur_diff).text[y])]; + typedef typename std::make_unsigned<typename string_t::value_type>::type unsigned_value_type; + const LinePtr& lp = lineArray[static_cast<size_t>(static_cast<unsigned_value_type>((*cur_diff).text[y]))]; text.append(lp.first, lp.second); } (*cur_diff).text.swap(text); ```

JoshData · 2024-03-09T13:55:47Z

It takes 14 minutes to complete for me. I'm not sure this is incorrect behavior since these files are quite large. To check that it isn't related to this PR, I changed the test to read the files as string (rather than bytes) and I also reverted to an earlier version of the library - same time.

In #28, it was found that using a signed character type ('char') for the DMP library, the cast to and from an unsigned integer type doesn't round-trip correctly in the DMP library. This occurs when the number of unique lines in the input texts exceeds the number of positive values in the signed type. I think. We use std::string, based on char, when comparing Python bytestrings. (Regular strings use std::wstring, which is based on wchar_t, which I think is unsigned on all platforms.) Rather than patch the library to correct the cast, I'm changing this library to use wstring even for bytestring comparisons. This not only fixes the cast bug, but it also ensures that there is space to store more than 256 unique lines in the input. The original suggested fix to the upstream library was: ``` From 673dfc9d4bcb8cd0234894f5a245e9824abf5fbc Mon Sep 17 00:00:00 2001 From: Beka Gozalishvili <beqaprogger@gmail.com> Date: Fri, 9 Feb 2024 20:37:19 +0400 Subject: [PATCH] Fix buffer overflow in toLines function which was causing a fatal crash --- diff_match_patch.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/diff_match_patch.h b/diff_match_patch.h index 47531c9..9d184bc 100644 --- a/diff_match_patch.h +++ b/diff_match_patch.h @@ -682,7 +682,8 @@ class diff_match_patch { for (typename Diffs::iterator cur_diff = diffs.begin(); cur_diff != diffs.end(); ++cur_diff) { string_t text; for (int y = 0; y < (int)(*cur_diff).text.length(); y++) { - const LinePtr& lp = lineArray[static_cast<size_t>((*cur_diff).text[y])]; + typedef typename std::make_unsigned<typename string_t::value_type>::type unsigned_value_type; + const LinePtr& lp = lineArray[static_cast<size_t>(static_cast<unsigned_value_type>((*cur_diff).text[y]))]; text.append(lp.first, lp.second); } (*cur_diff).text.swap(text); ```

Danstiv · 2024-03-09T18:22:57Z

@JoshData,
But with the initial fix the processing is much faster.

Fix by @beqabeqa473

Danstiv@Danstiv-PC UCRT64 /d/temp/dmp
$ time py -3.11 test_dmp.py
Before diff
after diff

real    0m14.218s
user    0m0.000s
sys     0m0.000s

Fix by @JoshData

Danstiv@Danstiv-PC UCRT64 /d/temp/dmp
$ time py -3.11 test_dmp.py
Before diff
after diff

real    6m44.171s
user    0m0.015s
sys     0m0.046s

It seems your fix has caused some performance degradation.

I'm also attaching new files for testing, collected from the NVDA screen reader when running git log.
The test results above were obtained by comparing these files.
test_dmp.zip

JoshData · 2024-03-24T13:49:16Z

Thanks. I've spent some time trying to figure out what's happening and haven't figured it out yet.

beqabeqa473 · 2024-04-09T11:17:19Z

Hello @JoshData

Please update official dmp-stl submodule, where my pr was merged.
After your update submodule to the latest commit, i will close this pr.

beqabeqa473 · 2024-04-09T11:23:04Z

Also, please revert your changes, as they are slowing down diffing

…h for bytestring comparisons In #28, it was found that when using a C++ std::string type with a signed character type ('char') for the DMP library the cast to and from an unsigned integer type didn't round-trip correctly in the DMP library. This occurs when the number of unique lines in the input texts exceeds the number of positive values in the signed type. I think. We use std::string, based on char, which is signed, when comparing Python bytestrings. (Regular strings use std::wstring, which is based on wchar_t, which I think is unsigned on all platforms.) I tried changing the C++ string type used for bytestring comparisons to wstring which fixed the cast bug and also ensured that there is space to store more than 256 unique lines in the input, as with string diffs. But performance of wstring comparisons was an order of magnitude (or more) slower. I couldn't figure out why. The upstream change is: ``` From 673dfc9d4bcb8cd0234894f5a245e9824abf5fbc Mon Sep 17 00:00:00 2001 From: Beka Gozalishvili <beqaprogger@gmail.com> Date: Fri, 9 Feb 2024 20:37:19 +0400 Subject: [PATCH] Fix buffer overflow in toLines function which was causing a fatal crash --- diff_match_patch.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/diff_match_patch.h b/diff_match_patch.h index 47531c9..9d184bc 100644 --- a/diff_match_patch.h +++ b/diff_match_patch.h @@ -682,7 +682,8 @@ class diff_match_patch { for (typename Diffs::iterator cur_diff = diffs.begin(); cur_diff != diffs.end(); ++cur_diff) { string_t text; for (int y = 0; y < (int)(*cur_diff).text.length(); y++) { - const LinePtr& lp = lineArray[static_cast<size_t>((*cur_diff).text[y])]; + typedef typename std::make_unsigned<typename string_t::value_type>::type unsigned_value_type; + const LinePtr& lp = lineArray[static_cast<size_t>(static_cast<unsigned_value_type>((*cur_diff).text[y]))]; text.append(lp.first, lp.second); } (*cur_diff).text.swap(text); ```

JoshData · 2024-04-09T18:47:16Z

Done.

Danstiv · 2024-04-15T14:28:46Z

Hello @JoshData,
Are you planning to release on pypi?

JoshData · 2024-04-23T15:41:55Z

Thanks for the reminder. I just published a release. Hopefully it all works OK now!

update dmp submodule to fix a buffer overflow

3b44f50

JoshData force-pushed the main branch from b71fc28 to 137ecf7 Compare March 9, 2024 15:51

JoshData force-pushed the main branch from 137ecf7 to 32d2c92 Compare April 9, 2024 15:33

beqabeqa473 closed this Apr 9, 2024

beqabeqa473 deleted the dmp_buffer_overflow_fix branch April 9, 2024 19:05

Danstiv mentioned this pull request May 7, 2024

Refactor DiffMatchPatch diff handler nvaccess/nvda#16483

Merged

5 tasks

Conversation

beqabeqa473 commented Feb 9, 2024

Uh oh!

JoshData commented Feb 13, 2024

Uh oh!

beqabeqa473 commented Feb 13, 2024 via email

Uh oh!

JoshData commented Feb 14, 2024

Uh oh!

beqabeqa473 commented Feb 14, 2024 via email

Uh oh!

beqabeqa473 commented Feb 17, 2024

Uh oh!

JoshData commented Feb 17, 2024

Uh oh!

beqabeqa473 commented Feb 18, 2024 via email

Uh oh!

Danstiv commented Feb 18, 2024

Uh oh!

beqabeqa473 commented Feb 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JoshData commented Mar 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JoshData commented Mar 6, 2024

Uh oh!

Danstiv commented Mar 6, 2024

Uh oh!

JoshData commented Mar 9, 2024

Uh oh!

Danstiv commented Mar 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fix by @beqabeqa473

Fix by @JoshData

Uh oh!

JoshData commented Mar 24, 2024

Uh oh!

beqabeqa473 commented Apr 9, 2024

Uh oh!

beqabeqa473 commented Apr 9, 2024

Uh oh!

JoshData commented Apr 9, 2024

Uh oh!

Danstiv commented Apr 15, 2024

Uh oh!

JoshData commented Apr 23, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

beqabeqa473 commented Feb 24, 2024 •

edited

Loading

JoshData commented Mar 4, 2024 •

edited

Loading

Danstiv commented Mar 9, 2024 •

edited

Loading