feat: Support Unicode 17.0.0 #157

Merged
Manishearth merged 4 commits into unicode-rs:master from Martin005:unicode-17
Mar 24, 2026

@Martin005
Contributor

@Martin005 Martin005 commented Mar 24, 2026

This PR updates Unicode support to version 17.0.0 and introduces fixes to word boundary detection, particularly for emoji and Zero Width Joiner (ZWJ) handling. The changes enhance Unicode compliance and fix edge cases in grapheme and word segmentation logic.

Unicode version update:

  • Updated the UNICODE_VERSION in scripts/unicode.py from 16.0.0 to 17.0.0.
  • Generated src/tables.rs and tests/testdata/mod.rs files using the Unicode version 17.0.0.

Word boundary and emoji handling improvements:

  • Added a new method next_significant_is_emoji to UWordBounds in src/word.rs, which checks if the next significant character is an emoji, skipping over Extend and Format characters.
  • Modified the double-ended iterator for UWordBounds to skip ZWJ characters that are followed by an emoji using the next_significant_is_emoji method.
  • Adjusted the state machine in UWordBounds to move the handling of emoji characters to a later match case, ensuring correct state transitions for emoji.
  • Together, these changes fix the previously failing test_words test.
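
As a rough illustration of the lookahead described above, here is a minimal, self-contained sketch. It is not the code from src/word.rs: the `WordCat` enum and the `toy_word_cat` classifier are hypothetical stand-ins that cover only a handful of code points (the crate uses generated tables), and the free function takes a `&str` rather than being a method on UWordBounds. Treating ZWJ as a significant character here is also an assumption.

```rust
/// Toy word-break categories. Hypothetical: the real crate derives its
/// categories from tables generated out of the Unicode data files.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WordCat {
    Extend,
    Format,
    Zwj,
    Emoji, // stands in for Extended_Pictographic
    Other,
}

/// Hypothetical stand-in classifier that knows only a few code points.
fn toy_word_cat(c: char) -> WordCat {
    match c {
        '\u{200D}' => WordCat::Zwj,
        // Variation selector-16 and the combining diacritical marks.
        '\u{FE0F}' | '\u{0300}'..='\u{036F}' => WordCat::Extend,
        // Soft hyphen and word joiner.
        '\u{00AD}' | '\u{2060}' => WordCat::Format,
        // Boy, girl, man, woman, and the Emoticons block.
        '\u{1F466}'..='\u{1F469}' | '\u{1F600}'..='\u{1F64F}' => WordCat::Emoji,
        _ => WordCat::Other,
    }
}

/// Returns true if the first character of `rest` that is neither Extend
/// nor Format is an emoji. An empty or all-skipped input yields false.
fn next_significant_is_emoji(rest: &str) -> bool {
    rest.chars()
        .map(toy_word_cat)
        .find(|cat| !matches!(cat, WordCat::Extend | WordCat::Format))
        == Some(WordCat::Emoji)
}
```

In this sketch, a caller sitting on a ZWJ would pass the text that follows it and suppress the break when the function returns true. How the real iterator wires this into its forward and backward state machines is not shown here.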

Grapheme segmentation fix:

  • Corrected the logic in GraphemeCursor (in src/grapheme.rs) to update incb_linker_count instead of ris_count. This fixed the failing test_grapheme test.
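
To show why mixing up the two counters matters, the sketch below keeps them as separate cached values that answer different questions: one counts trailing regional indicators (relevant to flag pairing), the other counts Indic conjunct linkers (relevant to keeping conjuncts together). Everything in it is an assumption for illustration: `ToyCursorState`, the helper functions, and the tiny character sets are hypothetical, and only the field names `ris_count` and `incb_linker_count` come from the PR description. It does not reproduce the actual GraphemeCursor logic.

```rust
/// Hypothetical, simplified stand-in for the two cached counters.
#[derive(Debug, Default, PartialEq, Eq)]
struct ToyCursorState {
    ris_count: Option<usize>,
    incb_linker_count: Option<usize>,
}

/// Regional indicator symbols occupy U+1F1E6..=U+1F1FF.
fn is_regional_indicator(c: char) -> bool {
    ('\u{1F1E6}'..='\u{1F1FF}').contains(&c)
}

/// Toy InCB=Linker check: only the Devanagari virama.
fn is_incb_linker(c: char) -> bool {
    c == '\u{094D}'
}

/// Toy InCB=Extend check: only ZWJ and the Devanagari nukta.
fn is_incb_extend(c: char) -> bool {
    matches!(c, '\u{200D}' | '\u{093C}')
}

/// Caches how many regional indicators end `before`.
fn record_ris(state: &mut ToyCursorState, before: &str) {
    let n = before
        .chars()
        .rev()
        .take_while(|&c| is_regional_indicator(c))
        .count();
    state.ris_count = Some(n);
}

/// Caches how many linkers sit in the trailing run of linker/extend
/// characters at the end of `before`. Writing this value into
/// `ris_count` by mistake would be the class of bug described above.
fn record_incb_linkers(state: &mut ToyCursorState, before: &str) {
    let n = before
        .chars()
        .rev()
        .take_while(|&c| is_incb_linker(c) || is_incb_extend(c))
        .filter(|&c| is_incb_linker(c))
        .count();
    state.incb_linker_count = Some(n);
}
```

Each helper writes only its own field, so a later regional-indicator lookup cannot clobber a linker count cached earlier, and vice versa.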

Supersedes #156

Member

@Manishearth Manishearth left a comment

Oh, also, could you add comments on the added code explaining their relationship with the rules? Especially the out-of-state-machine stuff around emoji.

But I can still land if you'd just prefer to land this, this is correct as is as far as I can tell.

@Martin005
Contributor Author

> Oh, also, could you add comments on the added code explaining their relationship with the rules? Especially the out-of-state-machine stuff around emoji.
>
> But I can still land if you'd just prefer to land this, this is correct as is as far as I can tell.

@Manishearth Done: 3a391bf, hope it is 100% correct ✅

@Manishearth Manishearth merged commit 13862d8 into unicode-rs:master Mar 24, 2026
2 checks passed
@Manishearth
Member

Thank you so much!

@Martin005 Martin005 deleted the unicode-17 branch March 24, 2026 16:20
@Manishearth
Member

And published a new version

@Martin005
Contributor Author

Martin005 commented Mar 24, 2026

@Manishearth Awesome, thank you so much! By the way, I see you didn't update the changelog in Readme (and it's also lacking updates for the 1.12.0 version). And haven't created a tag for 1.13.0.

@Manishearth
Member

Ah, hadn't pushed the tag. Opened #159 for the changelog.
