feat: Support Unicode 17.0.0 #157
Manishearth
left a comment
Oh, also, could you add comments on the added code explaining their relationship with the rules? Especially the out-of-state-machine stuff around emoji.
But I can still land if you'd just prefer to land this, this is correct as is as far as I can tell.
@Manishearth Done: 3a391bf, hope it is 100% correct ✅
Thank you so much!
And published a new version
@Manishearth Awesome, thank you so much! By the way, I see you didn't update the changelog in Readme (and it's also lacking updates for the
Ah, hadn't pushed the tag. Opened #159 for the changelog.
This PR updates Unicode support to version 17.0.0 and introduces fixes to word boundary detection, particularly for emoji and Zero Width Joiner (ZWJ) handling. The changes enhance Unicode compliance and fix edge cases in grapheme and word segmentation logic.
Unicode version update:
- Updated `UNICODE_VERSION` in `scripts/unicode.py` from 16.0.0 to 17.0.0.
- Regenerated the `src/tables.rs` and `tests/testdata/mod.rs` files using Unicode version 17.0.0.

Word boundary and emoji handling improvements:
- Added `next_significant_is_emoji` to `UWordBounds` in `src/word.rs`, which checks if the next significant character is an emoji, skipping over `Extend` and `Format` characters.
- Updated `UWordBounds` to skip ZWJ characters that are followed by an emoji, using the `next_significant_is_emoji` method.
- Refactored `UWordBounds` to move the handling of emoji characters to a later match case, ensuring correct state transitions for emoji. This fixed the failing `test_words` test.

Grapheme segmentation fix:
- Fixed `GraphemeCursor` (in `src/grapheme.rs`) to update `incb_linker_count` instead of `ris_count`. This fixed the failing `test_grapheme` test.

Supersedes #156
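
For readers who want a feel for the `next_significant_is_emoji` check described in the summary, here is a minimal standalone sketch. It is not the code from this PR: the `Cat` enum and the `classify` function are hypothetical stand-ins for the crate's generated tables, the code point ranges are rough approximations chosen for illustration, and the function is written as a free function rather than a method on `UWordBounds`. Treating a ZWJ as significant is also an assumption here, since the summary only mentions skipping `Extend` and `Format`.

```rust
/// Hypothetical, heavily simplified word-break categories.
/// The real crate derives these from generated Unicode tables.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Cat {
    Extend,
    Format,
    Zwj,
    Emoji,
    Other,
}

/// Hypothetical classifier covering only a handful of code points.
fn classify(c: char) -> Cat {
    match c {
        '\u{200D}' => Cat::Zwj,                              // zero width joiner
        '\u{0300}'..='\u{036F}' | '\u{FE0F}' => Cat::Extend, // combining marks, VS16
        '\u{00AD}' | '\u{2060}' => Cat::Format,              // soft hyphen, word joiner
        '\u{1F300}'..='\u{1FAFF}' => Cat::Emoji,             // crude emoji range, illustration only
        _ => Cat::Other,
    }
}

/// Returns true if the first character of `rest` that is neither
/// Extend nor Format is classified as an emoji.
fn next_significant_is_emoji(rest: &str) -> bool {
    rest.chars()
        .map(classify)
        .find(|cat| !matches!(cat, Cat::Extend | Cat::Format))
        == Some(Cat::Emoji)
}
```

Under these assumptions, a caller sitting on a ZWJ would pass the remainder of the string after the ZWJ and skip the joiner only when the function returns true, which mirrors the behavior the summary attributes to `UWordBounds`.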