update dmp submodule to fix a buffer overflow#28
Conversation
|
It looks like the commit isn't in the upstream repo, is that right? |
|
Yes, and i am not sure if they will accept pr as last commit was 8 years ago
…On 2/13/24, Joshua Tauberer ***@***.***> wrote:
It looks like the commit isn't in the upstream repo, is that right?
--
Reply to this email directly or view it on GitHub:
#28 (comment)
You are receiving this because you authored the thread.
Message ID: ***@***.***>
--
with best regards Beqa Gozalishvili
Tell: +995593454005
Email: ***@***.***
Web: https://gozaltech.org
Skype: beqabeqa473
Telegram: https://t.me/gozaltech
facebook: https://facebook.com/gozaltech
twitter: https://twitter.com/beqabeqa473
Instagram: https://instagram.com/beqa.gozalishvili
|
|
It would be difficult for me to review your changes to know whether I should accept them here, and I'm not sure if it is meaningful to have a submodule point to a commit that isn't in the repository that it points to. So for both reasons I don't think I can merge this PR as is. Is there a similar fix in Google's original DMP library, which the repository that the submodule points to is based on? |
|
I am not sure. I will try to make a pr to dmp-cpp-stl but i am not
sure it will be merged unfortunately. I
…On 2/14/24, Joshua Tauberer ***@***.***> wrote:
It would be difficult for me to review your changes to know whether I should
accept them here, and I'm not sure if it is meaningful to have a submodule
point to a commit that isn't in the repository that it points to. So for
both reasons I don't think I can merge this PR as is.
Is there a similar fix in Google's original DMP library, which the
repository that the submodule points to is based on?
--
Reply to this email directly or view it on GitHub:
#28 (comment)
You are receiving this because you authored the thread.
Message ID: ***@***.***>
--
with best regards Beqa Gozalishvili
Tell: +995593454005
Email: ***@***.***
Web: https://gozaltech.org
Skype: beqabeqa473
Telegram: https://t.me/gozaltech
facebook: https://facebook.com/gozaltech
twitter: https://twitter.com/beqabeqa473
Instagram: https://instagram.com/beqa.gozalishvili
|
|
I made a pr again diff-match-patch -cpp-stl but have not hope this to be merged unfortunately because of inactivity of author. We have two ways, you can review and merge this pr as is, or fork my fork to your account to be sure nothing will happen and make a commit with that fork. This issue is very serious, and makes whole application to crash. fast-diff-match-patch is used in nvda screenreader and users of this screenreader are suffering from this issue. |
|
Got it. Ok let me find time to review what you did and figure out how to best include it. Do you have a test case that I can run to see the crash? |
|
Unfortunately, just using dmp without nvda no, i don't have.
It is possible to reliably get a crash when you try to pass large data
to it, like git log -P 10000 in nvaccess/nvda repository, with nvda
running.
…On 2/18/24, Joshua Tauberer ***@***.***> wrote:
Got it. Ok let me find time to review what you did and figure out how to
best include it.
Do you have a test case that I can run to see the crash?
--
Reply to this email directly or view it on GitHub:
#28 (comment)
You are receiving this because you authored the thread.
Message ID: ***@***.***>
--
with best regards Beqa Gozalishvili
Tell: +995593454005
Email: ***@***.***
Web: https://gozaltech.org
Skype: beqabeqa473
Telegram: https://t.me/gozaltech
facebook: https://facebook.com/gozaltech
twitter: https://twitter.com/beqabeqa473
Instagram: https://instagram.com/beqa.gozalishvili
|
|
Hello @JoshData Test text files contains some commits from NVDA repo, searching for diff before crashing takes a while (20 seconds on my machine). |
|
|
I've had a chance to look now. I was able to reproduce the crash with @Danstiv's example (thanks for that!). In DMP's method of doing a line-by-line diff, it maps each unique line to an integer, stores the integer as a character in a new string, and then does a diff over those fake characters. The DMP library doesn't check that the character type of the string can hold integers for as many unique lines that occurs. In this library, when comparing byte strings, it runs DMP using 'char' as the string's underlying character type, which means it won't be able to store more than ~255 unique lines. In the example, there are ~45,000 unique lines. The issue seems to be that since 'char' is signed, it doesn't correctly round-trip all the integers. Rather than patching the DMP library, I think the right thing to do would be for me to avoid using the char type and always use a larger unsigned integer, i.e. wchar_t, the type used when comparing Python strings. This will also ensure that more unique lines can be represented in the string's character type, which would stay true to the DMP library's intent. I'll try to get a fix done by the end of next weekend. Edit: A work-around would be to do a comparison using Python strings (or unicode strings in Py2) rather than byte strings. |
In #28, it was found that using a signed character type ('char') for the DMP library, the cast to and from an unsigned integer type doesn't round-trip correctly in the DMP library. This occurs when the number of unique lines in the input texts exceeds the number of positive values in the signed type. I think. We use std::string, based on char, when comparing Python bytestrings. (Regular strings use std::wstring, which is based on wchar_t, which I think is unsigned on all platforms.) Rather than patch the library to correct the cast, I'm changing this library to use wstring even for bytestring comparisons. This not only fixes the cast bug, but it also ensures that there is space to store more than 256 unique lines in the input. The original suggested fix to the upstream library was: ``` From 673dfc9d4bcb8cd0234894f5a245e9824abf5fbc Mon Sep 17 00:00:00 2001 From: Beka Gozalishvili <beqaprogger@gmail.com> Date: Fri, 9 Feb 2024 20:37:19 +0400 Subject: [PATCH] Fix buffer overflow in toLines function which was causing a fatal crash --- diff_match_patch.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/diff_match_patch.h b/diff_match_patch.h index 47531c9..9d184bc 100644 --- a/diff_match_patch.h +++ b/diff_match_patch.h @@ -682,7 +682,8 @@ class diff_match_patch { for (typename Diffs::iterator cur_diff = diffs.begin(); cur_diff != diffs.end(); ++cur_diff) { string_t text; for (int y = 0; y < (int)(*cur_diff).text.length(); y++) { - const LinePtr& lp = lineArray[static_cast<size_t>((*cur_diff).text[y])]; + typedef typename std::make_unsigned<typename string_t::value_type>::type unsigned_value_type; + const LinePtr& lp = lineArray[static_cast<size_t>(static_cast<unsigned_value_type>((*cur_diff).text[y]))]; text.append(lp.first, lp.second); } (*cur_diff).text.swap(text); ```
|
I pushed a fix to the main branch. Hopefully this fixes it. If you can give it a try before I post a release that would be appreciated. (I am trying to get some pre-built packages to test ahead of time.) |
|
@JoshData, |
In #28, it was found that using a signed character type ('char') for the DMP library, the cast to and from an unsigned integer type doesn't round-trip correctly in the DMP library. This occurs when the number of unique lines in the input texts exceeds the number of positive values in the signed type. I think. We use std::string, based on char, when comparing Python bytestrings. (Regular strings use std::wstring, which is based on wchar_t, which I think is unsigned on all platforms.) Rather than patch the library to correct the cast, I'm changing this library to use wstring even for bytestring comparisons. This not only fixes the cast bug, but it also ensures that there is space to store more than 256 unique lines in the input. The original suggested fix to the upstream library was: ``` From 673dfc9d4bcb8cd0234894f5a245e9824abf5fbc Mon Sep 17 00:00:00 2001 From: Beka Gozalishvili <beqaprogger@gmail.com> Date: Fri, 9 Feb 2024 20:37:19 +0400 Subject: [PATCH] Fix buffer overflow in toLines function which was causing a fatal crash --- diff_match_patch.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/diff_match_patch.h b/diff_match_patch.h index 47531c9..9d184bc 100644 --- a/diff_match_patch.h +++ b/diff_match_patch.h @@ -682,7 +682,8 @@ class diff_match_patch { for (typename Diffs::iterator cur_diff = diffs.begin(); cur_diff != diffs.end(); ++cur_diff) { string_t text; for (int y = 0; y < (int)(*cur_diff).text.length(); y++) { - const LinePtr& lp = lineArray[static_cast<size_t>((*cur_diff).text[y])]; + typedef typename std::make_unsigned<typename string_t::value_type>::type unsigned_value_type; + const LinePtr& lp = lineArray[static_cast<size_t>(static_cast<unsigned_value_type>((*cur_diff).text[y]))]; text.append(lp.first, lp.second); } (*cur_diff).text.swap(text); ```
|
It takes 14 minutes to complete for me. I'm not sure this is incorrect behavior since these files are quite large. To check that it isn't related to this PR, I changed the test to read the files as string (rather than bytes) and I also reverted to an earlier version of the library - same time. |
In #28, it was found that using a signed character type ('char') for the DMP library, the cast to and from an unsigned integer type doesn't round-trip correctly in the DMP library. This occurs when the number of unique lines in the input texts exceeds the number of positive values in the signed type. I think. We use std::string, based on char, when comparing Python bytestrings. (Regular strings use std::wstring, which is based on wchar_t, which I think is unsigned on all platforms.) Rather than patch the library to correct the cast, I'm changing this library to use wstring even for bytestring comparisons. This not only fixes the cast bug, but it also ensures that there is space to store more than 256 unique lines in the input. The original suggested fix to the upstream library was: ``` From 673dfc9d4bcb8cd0234894f5a245e9824abf5fbc Mon Sep 17 00:00:00 2001 From: Beka Gozalishvili <beqaprogger@gmail.com> Date: Fri, 9 Feb 2024 20:37:19 +0400 Subject: [PATCH] Fix buffer overflow in toLines function which was causing a fatal crash --- diff_match_patch.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/diff_match_patch.h b/diff_match_patch.h index 47531c9..9d184bc 100644 --- a/diff_match_patch.h +++ b/diff_match_patch.h @@ -682,7 +682,8 @@ class diff_match_patch { for (typename Diffs::iterator cur_diff = diffs.begin(); cur_diff != diffs.end(); ++cur_diff) { string_t text; for (int y = 0; y < (int)(*cur_diff).text.length(); y++) { - const LinePtr& lp = lineArray[static_cast<size_t>((*cur_diff).text[y])]; + typedef typename std::make_unsigned<typename string_t::value_type>::type unsigned_value_type; + const LinePtr& lp = lineArray[static_cast<size_t>(static_cast<unsigned_value_type>((*cur_diff).text[y]))]; text.append(lp.first, lp.second); } (*cur_diff).text.swap(text); ```
|
@JoshData, Fix by @beqabeqa473Fix by @JoshDataIt seems your fix has caused some performance degradation. I'm also attaching new files for testing, collected from the NVDA screen reader when running git log. |
|
Thanks. I've spent some time trying to figure out what's happening and haven't figured it out yet. |
|
Hello @JoshData Please update official dmp-stl submodule, where my pr was merged. |
|
Also, please revert your changes, as they are slowing down diffing |
…h for bytestring comparisons In #28, it was found that when using a C++ std::string type with a signed character type ('char') for the DMP library the cast to and from an unsigned integer type didn't round-trip correctly in the DMP library. This occurs when the number of unique lines in the input texts exceeds the number of positive values in the signed type. I think. We use std::string, based on char, which is signed, when comparing Python bytestrings. (Regular strings use std::wstring, which is based on wchar_t, which I think is unsigned on all platforms.) I tried changing the C++ string type used for bytestring comparisons to wstring which fixed the cast bug and also ensured that there is space to store more than 256 unique lines in the input, as with string diffs. But performance of wstring comparisons was an order of magnitude (or more) slower. I couldn't figure out why. The upstream change is: ``` From 673dfc9d4bcb8cd0234894f5a245e9824abf5fbc Mon Sep 17 00:00:00 2001 From: Beka Gozalishvili <beqaprogger@gmail.com> Date: Fri, 9 Feb 2024 20:37:19 +0400 Subject: [PATCH] Fix buffer overflow in toLines function which was causing a fatal crash --- diff_match_patch.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/diff_match_patch.h b/diff_match_patch.h index 47531c9..9d184bc 100644 --- a/diff_match_patch.h +++ b/diff_match_patch.h @@ -682,7 +682,8 @@ class diff_match_patch { for (typename Diffs::iterator cur_diff = diffs.begin(); cur_diff != diffs.end(); ++cur_diff) { string_t text; for (int y = 0; y < (int)(*cur_diff).text.length(); y++) { - const LinePtr& lp = lineArray[static_cast<size_t>((*cur_diff).text[y])]; + typedef typename std::make_unsigned<typename string_t::value_type>::type unsigned_value_type; + const LinePtr& lp = lineArray[static_cast<size_t>(static_cast<unsigned_value_type>((*cur_diff).text[y]))]; text.append(lp.first, lp.second); } (*cur_diff).text.swap(text); ```
|
Done. |
|
Hello @JoshData, |
|
Thanks for the reminder. I just published a release. Hopefully it all works OK now! |
There was a problem in dmp project, where it was possible to get a fatal crash, when converting diff back to lines because of heap overflow.
It is fixed in this submodule.