[Feature #19908] Update to Unicode 15.1.0 #12798
Merged
ima1zumi merged 3 commits into ruby:master on Mar 18, 2025
1b0a49a to
b3ec885
Compare
kou
reviewed
Feb 24, 2025
| * reline 0.6.0 | ||
| * readline 0.0.4 | ||
| * fiddle 1.1.6 | ||
|
|
Member
Could you revert a needless change?
tool/enc-unicode.rb
Outdated
Comment on lines
+169
to
+181
| elsif /^(\h+)(?:\.\.(\h+))?\s*;\s*(\w(?>[\w\s;]*\w)?)/ =~ line | ||
| $2 ? cps.concat(($1.to_i(16)..$2.to_i(16)).to_a) : cps.push($1.to_i(16)) | ||
| current = $3.gsub(/\W+/, '_') |
Member
Could you add a comment showing an example of a matched line, together with its $1/$2/$3 values, so that this case is easier to understand?
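For readers following the thread, here is a minimal sketch of what the pattern above captures. The two sample lines are written in the style of DerivedCoreProperties.txt and are illustrative; `parse_property_line` is a hypothetical helper for this note and is not part of tool/enc-unicode.rb.

```ruby
# Pattern quoted from the diff above.
PATTERN = /^(\h+)(?:\.\.(\h+))?\s*;\s*(\w(?>[\w\s;]*\w)?)/

# Hypothetical helper (not in tool/enc-unicode.rb): returns
# [code points, normalized property name], or nil when the line does not match.
def parse_property_line(line)
  return nil unless PATTERN =~ line
  first, last, name = $1, $2, $3
  # $2 is only set for "XXXX..YYYY" ranges.
  cps = last ? (first.to_i(16)..last.to_i(16)).to_a : [first.to_i(16)]
  [cps, name.gsub(/\W+/, '_')]
end

# Illustrative sample lines (assumed format of DerivedCoreProperties.txt).
range_line  = "0915..0939    ; InCB; Consonant # Lo  [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA"
single_line = "094D          ; InCB; Linker # Mn       DEVANAGARI SIGN VIRAMA"

range_cps, range_name = parse_property_line(range_line)
# $1 = "0915", $2 = "0939", $3 = "InCB; Consonant"
puts "#{range_name}: #{range_cps.size} code points"

single_cps, single_name = parse_property_line(single_line)
# $1 = "094D", $2 = nil, $3 = "InCB; Linker"
puts "#{single_name}: #{single_cps.size} code point"
```

The `gsub(/\W+/, '_')` step is what turns `InCB; Consonant` into the `InCB_Consonant` name used later in the regparse diff.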
b3ec885 to
162d060
Compare
ima1zumi
commented
Feb 24, 2025
Comment on lines
+6092
to
+6134
| /* conjunctCluster := \p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+ */ | ||
| { | ||
| Node **CC_list = core_alts + 6; /* size: 2 */ | ||
| R_ERR(create_property_node(CC_list+0, env, "InCB_Consonant")); | ||
|
|
||
| { | ||
| Node **CC_list1 = CC_list + 1; /* size: 4 */ | ||
| // [\p{InCB=Extend} \p{InCB=Linker}]* | ||
| { | ||
| { | ||
| Node **CC_alt1 = CC_list1 + 1; | ||
| R_ERR(create_property_node(CC_alt1+1, env, "InCB_Extend")); | ||
| R_ERR(create_property_node(CC_alt1+2, env, "InCB_Linker")); | ||
|
|
||
| R_ERR(create_node_from_array(ALT, CC_alt1, CC_alt1+1)); | ||
| } | ||
|
|
||
| R_ERR(quantify_node(CC_list1+1, 0, REPEAT_INFINITE)); | ||
| } | ||
|
|
||
| // \p{InCB=Linker} | ||
| R_ERR(create_property_node(CC_list1+2, env, "InCB_Linker")); | ||
|
|
||
| // [\p{InCB=Extend} \p{InCB=Linker}]* | ||
| { | ||
| { | ||
| Node **CC_alt2 = CC_list1 + 3; | ||
| R_ERR(create_property_node(CC_alt2+1, env, "InCB_Extend")); | ||
| R_ERR(create_property_node(CC_alt2+2, env, "InCB_Linker")); | ||
|
|
||
| R_ERR(create_node_from_array(ALT, CC_alt2, CC_alt2+1)); | ||
| } | ||
|
|
||
| R_ERR(quantify_node(CC_list1+3, 0, REPEAT_INFINITE)); | ||
| } | ||
|
|
||
| // \p{InCB=Consonant} | ||
| R_ERR(create_property_node(CC_list1+4, env, "InCB_Consonant")); | ||
|
|
||
| R_ERR(create_node_from_array(LIST, CC_list1, CC_list1+1)); | ||
| R_ERR(quantify_node(CC_list1, 1, REPEAT_INFINITE)); | ||
| } | ||
| R_ERR(create_node_from_array(LIST, core_alts+5, CC_list)); | ||
| } | ||
|
|
||
| /* [^Control CR LF] */ | ||
| core_alts[5] = node_new_cclass(); | ||
| if (IS_NULL(core_alts[5])) goto err; | ||
| cc = NCCLASS(core_alts[5]); | ||
| core_alts[6] = node_new_cclass(); | ||
| if (IS_NULL(core_alts[6])) goto err; | ||
| cc = NCCLASS(core_alts[6]); |
Member
Author
This part is covered by an automatically generated test in this file based on GraphemeBreakTest.txt.
https://github.com/ruby/ruby/blob/dfc25204235079e23eadf9e0ba860c1ebcb14325/test/ruby/enc/test_grapheme_breaks.rb
https://www.unicode.org/Public/15.1.0/ucd/auxiliary/GraphemeBreakTest.txt
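As a rough illustration of how a test can be derived from GraphemeBreakTest.txt, the sketch below parses one line in that file's notation (`÷` marks a break, `×` marks no break) and compares it with `String#grapheme_clusters`. It is not the code in test_grapheme_breaks.rb; `expected_clusters` is a hypothetical helper and the sample line is written by hand in the file's format.

```ruby
# Hypothetical helper: turn one GraphemeBreakTest.txt-style line into the
# list of expected grapheme clusters.
def expected_clusters(line)
  data = line.sub(/#.*/, "").strip          # drop the trailing comment
  data.split("÷").map(&:strip).reject(&:empty?).map do |chunk|
    # Code points joined by "×" belong to the same cluster.
    chunk.split("×").map { |hex| hex.strip.to_i(16) }.pack("U*")
  end
end

# Hand-written sample in the file's notation: "a" + COMBINING DIAERESIS, then "b".
sample = "÷ 0061 × 0308 ÷ 0062 ÷\t# illustrative line"

expected = expected_clusters(sample)
actual   = expected.join.grapheme_clusters

puts(expected == actual ? "ok" : "mismatch: #{actual.inspect}")
```

Running every line of the real data file through a check like this is what makes the generated test sensitive to rule changes such as GB9c.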
162d060 to
4817ab0
Compare
tompng
reviewed
Feb 26, 2025
bcd7641 to
d07beeb
Compare
tompng
reviewed
Mar 7, 2025
40d6ec9 to
c648189
Compare
tompng
reviewed
Mar 9, 2025
nurse
approved these changes
Mar 14, 2025
49775b7 to
a66a4d3
Compare
yahonda
added a commit
to yahonda/rails
that referenced
this pull request
Mar 23, 2025
…failure with Ruby 3.5.0dev

This commit addresses the following failure with Ruby 3.5.0dev since ruby/ruby@6670926

```ruby
% ruby -v
ruby 3.5.0dev (2025-03-21T06:17:15Z master d868922ea8) +PRISM [arm64-darwin24]
% bin/test test/multibyte_chars_test.rb -n test_should_compute_grapheme_length
Running 90 tests in parallel using 10 processes
Run options: -n test_should_compute_grapheme_length --seed 52859
F
Failure:
MultibyteCharsExtrasTest#test_should_compute_grapheme_length [test/multibyte_chars_test.rb:512]:
"त्र".
Expected: 2
Actual: 1
bin/test test/multibyte_chars_test.rb:595
Finished in 0.209643s, 4.7700 runs/s, 38.1601 assertions/s.
1 runs, 8 assertions, 1 failures, 0 errors, 0 skips
%
```

According to ruby/ruby#12798, this is an expected change since Unicode 15.1.0.

> As a result, an orthographic syllable in scripts like Devanagari (e.g. “क्या”, consisting of KA + VIRAMA + YA)
> is now treated as a single extended grapheme cluster rather than split into two.

Fix rails#54794
https://bugs.ruby-lang.org/issues/19908
Summary
Unicode 15.1.0 introduced a new grapheme cluster rule to handle Indic conjuncts. In UAX #29, rule GB9c was added to prevent grapheme breaks within certain Indic consonant+virama sequences. As a result, an orthographic syllable in scripts like Devanagari (e.g. “क्या”, consisting of KA + VIRAMA + YA + vowel sign AA) is now treated as a single extended grapheme cluster rather than split into two. This aligns the default segmentation with Indic writing system expectations.
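A quick way to observe the change is to count grapheme clusters for the example word. This is a small sketch; the number printed depends on the Unicode version the running Ruby was built with (per the summary above, one cluster with this change and two before it).

```ruby
require "rbconfig"

# "क्या" spelled out by code point:
# KA (U+0915), VIRAMA (U+094D), YA (U+092F), VOWEL SIGN AA (U+093E).
word = "\u0915\u094D\u092F\u093E"
clusters = word.grapheme_clusters

puts "Unicode version: #{RbConfig::CONFIG['UNICODE_VERSION']}"
puts "code points: #{word.codepoints.map { |cp| format('U+%04X', cp) }.join(' ')}"
puts "grapheme clusters: #{clusters.size}"
```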
UAX #29: Unicode Text Segmentation
New Enumerated Property
To support this, the Unicode Character Database (UCD) added a new enumerated property, Indic_Conjunct_Break (InCB), with values Consonant, Linker, Extend, and None, derived from existing properties. For example, virama characters in certain Brahmic scripts are classified as InCB=Linker, and base consonants as InCB=Consonant. These property values are listed in the UCD data (e.g. in DerivedCoreProperties.txt).
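To make the shape of the rule concrete, here is a toy version of the conjunctCluster pattern quoted in the regparse diff, `Consonant ([Extend Linker]* Linker [Extend Linker]* Consonant)+`. The character classes are a hand-picked Devanagari subset chosen for illustration (KA..HA as Consonant, VIRAMA as Linker, ZERO WIDTH JOINER as Extend); these assignments are assumptions that should be checked against DerivedCoreProperties.txt, and the sketch does not use Ruby's real property tables.

```ruby
# Toy conjunctCluster matcher over a hand-picked subset (assumption, see above).
WHOLE_CONJUNCT = /
  \A
  [\u0915-\u0939]        # Consonant: DEVANAGARI LETTER KA..HA
  (?:
    [\u200D\u094D]*      # any run of Extend or Linker
    \u094D               # at least one Linker: DEVANAGARI SIGN VIRAMA
    [\u200D\u094D]*      # any run of Extend or Linker
    [\u0915-\u0939]      # next Consonant
  )+
  \z
/x

# Hypothetical helper for this sketch.
def conjunct?(str)
  WHOLE_CONJUNCT.match?(str)
end

puts conjunct?("\u0915\u094D\u092F")        # KA + VIRAMA + YA
puts conjunct?("\u0915\u094D\u200D\u092F")  # with ZWJ after the virama
puts conjunct?("\u0915\u092F")              # two consonants, no linker
```

The third case shows why the Linker is mandatory: two adjacent consonants without a virama do not form a conjunct and remain separate clusters.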
Impact on Ruby
This change affected Ruby’s implementation of Unicode support in several ways: