bpo-29456: bugs in unicodedata.normalize: u1176, u11a7 and u11c3#1958
Conversation
|
Okay, I added some tests for the issue. |
| if (i < len && | ||
| TBase <= PyUnicode_READ(kind, data, i) && | ||
| PyUnicode_READ(kind, data, i) <= (TBase+TCount)) { | ||
| TBase < PyUnicode_READ(kind, data, i) && |
There was a problem hiding this comment.
Are you sure this should be < rather than <=?
There was a problem hiding this comment.
Yes.
That code determines PyUnicode_READ(kind, data, i) is a trailing(final) consonant while TBase(0x11A7) is the last Vowel in Hangul (Hangul Jamo).
So < is correct rather than <=.
There was a problem hiding this comment.
Thanks! And after checking (which I should have done before leaving my comment), I see that this agrees with section 3.12 of (version 10 of ) the standard.
Still, Python eyes are rather used to seeing half-open ranges, so anything other than lower <= value < high looks surprising. Is it worth adding a comment explaining what's going on?
There was a problem hiding this comment.
Okay, I'll add some comments.
There was a problem hiding this comment.
I've just added some comments. Is it enough?
There was a problem hiding this comment.
Let me give a supplement:
Before Unicode 4.1.0 (draft), here is: TBase <= code <= TBase+TCount
see: http://www.unicode.org/reports/tr15/tr15-24.html#hangul_composition
After Unicode 4.1.0, here is TBase < code < TBase+TCount, which in line with the latest version (Unicode 10.0)
see: http://www.unicode.org/reports/tr15/tr15-25.html#hangul_composition
This change happened in 2005.
|
I think it can be merged. Is there anything I need to do? |
|
Hello? |
|
@Pusnow |
|
Done, thank you for response. |
|
I closed and reopened the PR to force to reschedule a test on AppVeyor: it just started a new job, https://ci.appveyor.com/project/python/cpython/build/3.8build17701 |
|
Thanks @Pusnow for the PR, and @zhangyangyu for merging it 🌮🎉.. I'm working now to backport this PR to: 2.7, 3.6, 3.7. |
|
GH-7702 is a backport of this pull request to the 3.7 branch. |
…ythonGH-1958) Hangul composition check boundaries are wrong for the second character ([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3) instead of [0x11A7, 0x11C3]). (cherry picked from commit d134809) Co-authored-by: Wonsup Yoon <pusnow@me.com>
|
GH-7703 is a backport of this pull request to the 3.6 branch. |
…ythonGH-1958) Hangul composition check boundaries are wrong for the second character ([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3) instead of [0x11A7, 0x11C3]). (cherry picked from commit d134809) Co-authored-by: Wonsup Yoon <pusnow@me.com>
|
Sorry, @Pusnow and @zhangyangyu, I could not cleanly backport this to |
…u11c3 (pythonGH-1958) Hangul composition check boundaries are wrong for the second character ([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3) instead of [0x11A7, 0x11C3]).. (cherry picked from commit d134809) Co-authored-by: Wonsup Yoon <pusnow@me.com>
|
GH-7704 is a backport of this pull request to the 2.7 branch. |
https://bugs.python.org/issue29456