Skip to content

update dmp submodule to fix a buffer overflow#28

Closed
beqabeqa473 wants to merge 1 commit into
JoshData:mainfrom
beqabeqa473:dmp_buffer_overflow_fix
Closed

update dmp submodule to fix a buffer overflow#28
beqabeqa473 wants to merge 1 commit into
JoshData:mainfrom
beqabeqa473:dmp_buffer_overflow_fix

Conversation

@beqabeqa473

Copy link
Copy Markdown

There was a problem in dmp project, where it was possible to get a fatal crash, when converting diff back to lines because of heap overflow.
It is fixed in this submodule.

@JoshData

Copy link
Copy Markdown
Owner

It looks like the commit isn't in the upstream repo, is that right?

@beqabeqa473

beqabeqa473 commented Feb 13, 2024 via email

Copy link
Copy Markdown
Author

@JoshData

Copy link
Copy Markdown
Owner

It would be difficult for me to review your changes to know whether I should accept them here, and I'm not sure if it is meaningful to have a submodule point to a commit that isn't in the repository that it points to. So for both reasons I don't think I can merge this PR as is.

Is there a similar fix in Google's original DMP library, which the repository that the submodule points to is based on?

@beqabeqa473

beqabeqa473 commented Feb 14, 2024 via email

Copy link
Copy Markdown
Author

@beqabeqa473

Copy link
Copy Markdown
Author

I made a pr again diff-match-patch -cpp-stl but have not hope this to be merged unfortunately because of inactivity of author.

We have two ways, you can review and merge this pr as is, or fork my fork to your account to be sure nothing will happen and make a commit with that fork.

This issue is very serious, and makes whole application to crash.

fast-diff-match-patch is used in nvda screenreader and users of this screenreader are suffering from this issue.
it is possible to downgrade to diffmatchpatch-python, but as testing revealed, it is 4 to 10 times slower than fast-diff-match-patch.

@JoshData

Copy link
Copy Markdown
Owner

Got it. Ok let me find time to review what you did and figure out how to best include it.

Do you have a test case that I can run to see the crash?

@beqabeqa473

beqabeqa473 commented Feb 18, 2024 via email

Copy link
Copy Markdown
Author

@Danstiv

Danstiv commented Feb 18, 2024

Copy link
Copy Markdown

Hello @JoshData
You can use this test script to see the problem.

Test text files contains some commits from NVDA repo, searching for diff before crashing takes a while (20 seconds on my machine).

@beqabeqa473

beqabeqa473 commented Feb 24, 2024

Copy link
Copy Markdown
Author
  hello @JoshData  By the chance, did you have a possibility to look at pr?

@JoshData

JoshData commented Mar 4, 2024

Copy link
Copy Markdown
Owner

I've had a chance to look now. I was able to reproduce the crash with @Danstiv's example (thanks for that!).

In DMP's method of doing a line-by-line diff, it maps each unique line to an integer, stores the integer as a character in a new string, and then does a diff over those fake characters. The DMP library doesn't check that the character type of the string can hold integers for as many unique lines that occurs. In this library, when comparing byte strings, it runs DMP using 'char' as the string's underlying character type, which means it won't be able to store more than ~255 unique lines. In the example, there are ~45,000 unique lines. The issue seems to be that since 'char' is signed, it doesn't correctly round-trip all the integers.

Rather than patching the DMP library, I think the right thing to do would be for me to avoid using the char type and always use a larger unsigned integer, i.e. wchar_t, the type used when comparing Python strings. This will also ensure that more unique lines can be represented in the string's character type, which would stay true to the DMP library's intent.

I'll try to get a fix done by the end of next weekend.

Edit: A work-around would be to do a comparison using Python strings (or unicode strings in Py2) rather than byte strings.

JoshData added a commit that referenced this pull request Mar 6, 2024
In #28, it was found that using a signed character type ('char') for the DMP library, the cast to and from an unsigned integer type doesn't round-trip correctly in the DMP library. This occurs when the number of unique lines in the input texts exceeds the number of positive values in the signed type. I think. We use std::string, based on char, when comparing Python bytestrings. (Regular strings use std::wstring, which is based on wchar_t, which I think is unsigned on all platforms.)

Rather than patch the library to correct the cast, I'm changing this library to use wstring even for bytestring comparisons. This not only fixes the cast bug, but it also ensures that there is space to store more than 256 unique lines in the input.

The original suggested fix to the upstream library was:

```
From 673dfc9d4bcb8cd0234894f5a245e9824abf5fbc Mon Sep 17 00:00:00 2001
From: Beka Gozalishvili <beqaprogger@gmail.com>
Date: Fri, 9 Feb 2024 20:37:19 +0400
Subject: [PATCH] Fix buffer overflow in toLines function which was causing a
 fatal crash

---
 diff_match_patch.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/diff_match_patch.h b/diff_match_patch.h
index 47531c9..9d184bc 100644
--- a/diff_match_patch.h
+++ b/diff_match_patch.h
@@ -682,7 +682,8 @@ class diff_match_patch {
     for (typename Diffs::iterator cur_diff = diffs.begin(); cur_diff != diffs.end(); ++cur_diff) {
       string_t text;
       for (int y = 0; y < (int)(*cur_diff).text.length(); y++) {
-        const LinePtr& lp = lineArray[static_cast<size_t>((*cur_diff).text[y])];
+          typedef typename std::make_unsigned<typename string_t::value_type>::type unsigned_value_type;
+          const LinePtr& lp = lineArray[static_cast<size_t>(static_cast<unsigned_value_type>((*cur_diff).text[y]))];
         text.append(lp.first, lp.second);
       }
       (*cur_diff).text.swap(text);
```
@JoshData

JoshData commented Mar 6, 2024

Copy link
Copy Markdown
Owner

I pushed a fix to the main branch. Hopefully this fixes it. If you can give it a try before I post a release that would be appreciated. (I am trying to get some pre-built packages to test ahead of time.)

@Danstiv

Danstiv commented Mar 6, 2024

Copy link
Copy Markdown

@JoshData,
Unfortunately, another incorrect behavior is now observed.
When running on the same data the function hangs probably forever, I waited for about two minutes but didn't get the diff.

JoshData added a commit that referenced this pull request Mar 9, 2024
In #28, it was found that using a signed character type ('char') for the DMP library, the cast to and from an unsigned integer type doesn't round-trip correctly in the DMP library. This occurs when the number of unique lines in the input texts exceeds the number of positive values in the signed type. I think. We use std::string, based on char, when comparing Python bytestrings. (Regular strings use std::wstring, which is based on wchar_t, which I think is unsigned on all platforms.)

Rather than patch the library to correct the cast, I'm changing this library to use wstring even for bytestring comparisons. This not only fixes the cast bug, but it also ensures that there is space to store more than 256 unique lines in the input.

The original suggested fix to the upstream library was:

```
From 673dfc9d4bcb8cd0234894f5a245e9824abf5fbc Mon Sep 17 00:00:00 2001
From: Beka Gozalishvili <beqaprogger@gmail.com>
Date: Fri, 9 Feb 2024 20:37:19 +0400
Subject: [PATCH] Fix buffer overflow in toLines function which was causing a
 fatal crash

---
 diff_match_patch.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/diff_match_patch.h b/diff_match_patch.h
index 47531c9..9d184bc 100644
--- a/diff_match_patch.h
+++ b/diff_match_patch.h
@@ -682,7 +682,8 @@ class diff_match_patch {
     for (typename Diffs::iterator cur_diff = diffs.begin(); cur_diff != diffs.end(); ++cur_diff) {
       string_t text;
       for (int y = 0; y < (int)(*cur_diff).text.length(); y++) {
-        const LinePtr& lp = lineArray[static_cast<size_t>((*cur_diff).text[y])];
+          typedef typename std::make_unsigned<typename string_t::value_type>::type unsigned_value_type;
+          const LinePtr& lp = lineArray[static_cast<size_t>(static_cast<unsigned_value_type>((*cur_diff).text[y]))];
         text.append(lp.first, lp.second);
       }
       (*cur_diff).text.swap(text);
```
@JoshData

JoshData commented Mar 9, 2024

Copy link
Copy Markdown
Owner

It takes 14 minutes to complete for me. I'm not sure this is incorrect behavior since these files are quite large. To check that it isn't related to this PR, I changed the test to read the files as string (rather than bytes) and I also reverted to an earlier version of the library - same time.

JoshData added a commit that referenced this pull request Mar 9, 2024
In #28, it was found that using a signed character type ('char') for the DMP library, the cast to and from an unsigned integer type doesn't round-trip correctly in the DMP library. This occurs when the number of unique lines in the input texts exceeds the number of positive values in the signed type. I think. We use std::string, based on char, when comparing Python bytestrings. (Regular strings use std::wstring, which is based on wchar_t, which I think is unsigned on all platforms.)

Rather than patch the library to correct the cast, I'm changing this library to use wstring even for bytestring comparisons. This not only fixes the cast bug, but it also ensures that there is space to store more than 256 unique lines in the input.

The original suggested fix to the upstream library was:

```
From 673dfc9d4bcb8cd0234894f5a245e9824abf5fbc Mon Sep 17 00:00:00 2001
From: Beka Gozalishvili <beqaprogger@gmail.com>
Date: Fri, 9 Feb 2024 20:37:19 +0400
Subject: [PATCH] Fix buffer overflow in toLines function which was causing a
 fatal crash

---
 diff_match_patch.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/diff_match_patch.h b/diff_match_patch.h
index 47531c9..9d184bc 100644
--- a/diff_match_patch.h
+++ b/diff_match_patch.h
@@ -682,7 +682,8 @@ class diff_match_patch {
     for (typename Diffs::iterator cur_diff = diffs.begin(); cur_diff != diffs.end(); ++cur_diff) {
       string_t text;
       for (int y = 0; y < (int)(*cur_diff).text.length(); y++) {
-        const LinePtr& lp = lineArray[static_cast<size_t>((*cur_diff).text[y])];
+          typedef typename std::make_unsigned<typename string_t::value_type>::type unsigned_value_type;
+          const LinePtr& lp = lineArray[static_cast<size_t>(static_cast<unsigned_value_type>((*cur_diff).text[y]))];
         text.append(lp.first, lp.second);
       }
       (*cur_diff).text.swap(text);
```
@Danstiv

Danstiv commented Mar 9, 2024

Copy link
Copy Markdown

@JoshData,
But with the initial fix the processing is much faster.

Fix by @beqabeqa473

Danstiv@Danstiv-PC UCRT64 /d/temp/dmp
$ time py -3.11 test_dmp.py
Before diff
after diff

real    0m14.218s
user    0m0.000s
sys     0m0.000s

Fix by @JoshData

Danstiv@Danstiv-PC UCRT64 /d/temp/dmp
$ time py -3.11 test_dmp.py
Before diff
after diff

real    6m44.171s
user    0m0.015s
sys     0m0.046s

It seems your fix has caused some performance degradation.

I'm also attaching new files for testing, collected from the NVDA screen reader when running git log.
The test results above were obtained by comparing these files.
test_dmp.zip

@JoshData

Copy link
Copy Markdown
Owner

Thanks. I've spent some time trying to figure out what's happening and haven't figured it out yet.

@beqabeqa473

Copy link
Copy Markdown
Author

Hello @JoshData

Please update official dmp-stl submodule, where my pr was merged.
After your update submodule to the latest commit, i will close this pr.

@beqabeqa473

Copy link
Copy Markdown
Author

Also, please revert your changes, as they are slowing down diffing

JoshData added a commit that referenced this pull request Apr 9, 2024
…h for bytestring comparisons

In #28, it was found that when using a C++ std::string type with a signed character type ('char') for the DMP library the cast to and from an unsigned integer type didn't round-trip correctly in the DMP library. This occurs when the number of unique lines in the input texts exceeds the number of positive values in the signed type. I think. We use std::string, based on char, which is signed, when comparing Python bytestrings. (Regular strings use std::wstring, which is based on wchar_t, which I think is unsigned on all platforms.)

I tried changing the C++ string type used for bytestring comparisons to wstring which fixed the cast bug and also ensured that there is space to store more than 256 unique lines in the input, as with string diffs. But performance of wstring comparisons was an order of magnitude (or more) slower. I couldn't figure out why.

The upstream change is:

```
From 673dfc9d4bcb8cd0234894f5a245e9824abf5fbc Mon Sep 17 00:00:00 2001
From: Beka Gozalishvili <beqaprogger@gmail.com>
Date: Fri, 9 Feb 2024 20:37:19 +0400
Subject: [PATCH] Fix buffer overflow in toLines function which was causing a
 fatal crash

---
 diff_match_patch.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/diff_match_patch.h b/diff_match_patch.h
index 47531c9..9d184bc 100644
--- a/diff_match_patch.h
+++ b/diff_match_patch.h
@@ -682,7 +682,8 @@ class diff_match_patch {
     for (typename Diffs::iterator cur_diff = diffs.begin(); cur_diff != diffs.end(); ++cur_diff) {
       string_t text;
       for (int y = 0; y < (int)(*cur_diff).text.length(); y++) {
-        const LinePtr& lp = lineArray[static_cast<size_t>((*cur_diff).text[y])];
+          typedef typename std::make_unsigned<typename string_t::value_type>::type unsigned_value_type;
+          const LinePtr& lp = lineArray[static_cast<size_t>(static_cast<unsigned_value_type>((*cur_diff).text[y]))];
         text.append(lp.first, lp.second);
       }
       (*cur_diff).text.swap(text);
```
@JoshData

JoshData commented Apr 9, 2024

Copy link
Copy Markdown
Owner

Done.

@beqabeqa473 beqabeqa473 closed this Apr 9, 2024
@beqabeqa473 beqabeqa473 deleted the dmp_buffer_overflow_fix branch April 9, 2024 19:05
@Danstiv

Danstiv commented Apr 15, 2024

Copy link
Copy Markdown

Hello @JoshData,
Are you planning to release on pypi?

@JoshData

Copy link
Copy Markdown
Owner

Thanks for the reminder. I just published a release. Hopefully it all works OK now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants