Skip to content

bpo-29456: bugs in unicodedata.normalize: u1176, u11a7 and u11c3#1958

Merged
zhangyangyu merged 6 commits into
python:masterfrom
Pusnow:fix-issue-29456
Jun 15, 2018
Merged

bpo-29456: bugs in unicodedata.normalize: u1176, u11a7 and u11c3#1958
zhangyangyu merged 6 commits into
python:masterfrom
Pusnow:fix-issue-29456

Conversation

@Pusnow

@Pusnow Pusnow commented Jun 5, 2017

Copy link
Copy Markdown
Contributor

@corona10

Copy link
Copy Markdown
Member

@Pusnow
I am not a committer of this library.
But here is a one thing I want to review.
Can you add test codes about your changing?
You can add your test cases in here.

Thank you.

@Pusnow

Pusnow commented Jul 29, 2017

Copy link
Copy Markdown
Contributor Author

Okay, I added some tests for the issue.

Comment thread Modules/unicodedata.c
if (i < len &&
TBase <= PyUnicode_READ(kind, data, i) &&
PyUnicode_READ(kind, data, i) <= (TBase+TCount)) {
TBase < PyUnicode_READ(kind, data, i) &&

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this should be < rather than <=?

@Pusnow Pusnow Aug 2, 2017

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.
That code determines PyUnicode_READ(kind, data, i) is a trailing(final) consonant while TBase(0x11A7) is the last Vowel in Hangul (Hangul Jamo).
So < is correct rather than <=.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! And after checking (which I should have done before leaving my comment), I see that this agrees with section 3.12 of (version 10 of ) the standard.

Still, Python eyes are rather used to seeing half-open ranges, so anything other than lower <= value < high looks surprising. Is it worth adding a comment explaining what's going on?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll add some comments.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just added some comments. Is it enough?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Yes, that's helpful.

@ghost ghost Aug 10, 2017

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me give a supplement:

Before Unicode 4.1.0 (draft), here is: TBase <= code <= TBase+TCount
see: http://www.unicode.org/reports/tr15/tr15-24.html#hangul_composition

After Unicode 4.1.0, here is TBase < code < TBase+TCount, which in line with the latest version (Unicode 10.0)
see: http://www.unicode.org/reports/tr15/tr15-25.html#hangul_composition

This change happened in 2005.

@Pusnow

Pusnow commented Aug 7, 2017

Copy link
Copy Markdown
Contributor Author

I think it can be merged. Is there anything I need to do?

@Pusnow

Pusnow commented Aug 24, 2017

Copy link
Copy Markdown
Contributor Author

Hello?

@corona10

Copy link
Copy Markdown
Member

@Pusnow
There should be a Misc/NEWS.d entry for this change using blurb.
See https://devguide.python.org/committing/#what-s-new-and-news-entries

@Pusnow

Pusnow commented Aug 24, 2017

Copy link
Copy Markdown
Contributor Author

Done, thank you for response.

@vstinner

Copy link
Copy Markdown
Member

I closed and reopened the PR to force to reschedule a test on AppVeyor: it just started a new job, https://ci.appveyor.com/project/python/cpython/build/3.8build17701

@zhangyangyu zhangyangyu merged commit d134809 into python:master Jun 15, 2018
@miss-islington

Copy link
Copy Markdown
Contributor

Thanks @Pusnow for the PR, and @zhangyangyu for merging it 🌮🎉.. I'm working now to backport this PR to: 2.7, 3.6, 3.7.
🐍🍒⛏🤖

@bedevere-bot

Copy link
Copy Markdown

GH-7702 is a backport of this pull request to the 3.7 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jun 15, 2018
…ythonGH-1958)

Hangul composition check boundaries are wrong for the second character
([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3)
instead of [0x11A7, 0x11C3]).
(cherry picked from commit d134809)

Co-authored-by: Wonsup Yoon <pusnow@me.com>
@bedevere-bot

Copy link
Copy Markdown

GH-7703 is a backport of this pull request to the 3.6 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jun 15, 2018
…ythonGH-1958)

Hangul composition check boundaries are wrong for the second character
([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3)
instead of [0x11A7, 0x11C3]).
(cherry picked from commit d134809)

Co-authored-by: Wonsup Yoon <pusnow@me.com>
@miss-islington

Copy link
Copy Markdown
Contributor

Sorry, @Pusnow and @zhangyangyu, I could not cleanly backport this to 2.7 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker d134809cd3764c6a634eab7bb8995e3e2eff14d5 2.7

miss-islington added a commit that referenced this pull request Jun 15, 2018
…H-1958)

Hangul composition check boundaries are wrong for the second character
([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3)
instead of [0x11A7, 0x11C3]).
(cherry picked from commit d134809)

Co-authored-by: Wonsup Yoon <pusnow@me.com>
zhangyangyu pushed a commit to zhangyangyu/cpython that referenced this pull request Jun 15, 2018
…u11c3 (pythonGH-1958)

Hangul composition check boundaries are wrong for the second character
([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3)
instead of [0x11A7, 0x11C3])..
(cherry picked from commit d134809)

Co-authored-by: Wonsup Yoon <pusnow@me.com>
@bedevere-bot

Copy link
Copy Markdown

GH-7704 is a backport of this pull request to the 2.7 branch.

miss-islington added a commit that referenced this pull request Jun 15, 2018
…H-1958)

Hangul composition check boundaries are wrong for the second character
([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3)
instead of [0x11A7, 0x11C3]).
(cherry picked from commit d134809)

Co-authored-by: Wonsup Yoon <pusnow@me.com>
zhangyangyu added a commit that referenced this pull request Jun 15, 2018
…H-1958) (GH-7704)

Hangul composition check boundaries are wrong for the second character
([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3)
instead of [0x11A7, 0x11C3])..
(cherry picked from commit d134809)

Co-authored-by: Wonsup Yoon <pusnow@me.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants