[Feature #19908] Update to Unicode 15.1.0 #12798

Merged
ima1zumi merged 3 commits into ruby:master from ima1zumi:unicode-15.1.0
Mar 18, 2025

Conversation

@ima1zumi ima1zumi commented Feb 24, 2025

https://bugs.ruby-lang.org/issues/19908

Summary

Unicode 15.1.0 introduced a new grapheme cluster rule to handle Indic conjuncts. UAX #29 added rule GB9c, which prevents grapheme cluster breaks within certain Indic consonant + virama sequences. As a result, an orthographic syllable in scripts like Devanagari (e.g. “क्या”, consisting of KA + VIRAMA + YA + vowel sign AA) is now treated as a single extended grapheme cluster rather than split into two. This aligns the default segmentation with Indic writing system expectations.

UAX #29: Unicode Text Segmentation
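
To check how a given Ruby build segments these syllables, a minimal sketch like the following can be run. It relies only on `String#grapheme_clusters`; the variable names are illustrative. Whether it reports one cluster or two depends on whether the build includes the GB9c change, so no fixed output is shown here.

```ruby
# "क्या" spelled out by code point: KA, VIRAMA, YA, VOWEL SIGN AA.
kya = "\u0915\u094D\u092F\u093E"
# "त्र" spelled out by code point: TA, VIRAMA, RA.
tra = "\u0924\u094D\u0930"

[kya, tra].each do |word|
  clusters = word.grapheme_clusters
  code_points = word.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
  # Builds with GB9c are expected to print 1 here; builds without it print 2.
  puts "#{word} (#{code_points}): #{clusters.size} grapheme cluster(s) on Ruby #{RUBY_VERSION}"
end
```

Writing the strings with `\u` escapes keeps the code point sequence explicit and avoids depending on how an editor renders the conjunct.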

New Enumerated Property

To support this, the Unicode Character Database (UCD) added a new enumerated property, Indic_Conjunct_Break (InCB), with values Consonant, Linker, Extend, and None, derived from existing properties. For example, virama characters in certain Brahmic scripts are classified as InCB=Linker, and base consonants as InCB=Consonant. These property values are listed in the UCD data (e.g. in DerivedCoreProperties.txt).

Impact on Ruby

This change affected Ruby’s implementation of Unicode support in several ways:

  • enc-unicode.rb
    • Ruby’s enc-unicode.rb script, which processes Unicode data, had to be updated to handle the new enumerated property. Previously, it expected only binary properties.
  • Regex Engine
    • Support for InCB property values was added to Ruby’s regex engine (exposing them as InCB_Linker, InCB_Consonant, etc.).
  • Grapheme Cluster Logic
    • Ruby’s grapheme cluster regex (\X) logic was updated to incorporate GB9c, ensuring that Indic conjuncts are not split in string operations.
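
As a rough illustration of the regex-engine change, the sketch below probes whether the running Ruby knows the `InCB_Linker` property name mentioned above. The helper name `incb_linker?` is hypothetical. The pattern is built with `Regexp.new` at run time so that a build without the property raises a rescuable `RegexpError` instead of failing while the file is parsed.

```ruby
# Returns true/false when the InCB_Linker property is available,
# or nil when this Ruby build does not recognize the property name.
def incb_linker?(char)
  Regexp.new('\A\p{InCB_Linker}\z').match?(char)
rescue RegexpError
  nil
end

virama = "\u094D" # DEVANAGARI SIGN VIRAMA
result = incb_linker?(virama)

if result.nil?
  puts "InCB properties are not available on Ruby #{RUBY_VERSION}"
else
  puts "U+094D matches InCB_Linker: #{result}"
end
```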

@ima1zumi ima1zumi changed the title Update to Unicode 15.1.0 [Feature #19908] Update to Unicode 15.1.0 Feb 24, 2025
* reline 0.6.0
* readline 0.0.4
* fiddle 1.1.6

Member

Could you revert a needless change?

Member Author

I have reverted.

Comment on lines +169 to +181
elsif /^(\h+)(?:\.\.(\h+))?\s*;\s*(\w(?>[\w\s;]*\w)?)/ =~ line
$2 ? cps.concat(($1.to_i(16)..$2.to_i(16)).to_a) : cps.push($1.to_i(16))
current = $3.gsub(/\W+/, '_')
Member

Could you add a comment showing an example of a matched line and the resulting $1/$2/$3, to make this case easier to understand?
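
For illustration, here is a small sketch of what the quoted pattern captures. The two sample lines are written in the style of DerivedCoreProperties.txt and are an assumption about the input format, and `PATTERN` and `parse_property_line` are hypothetical names, not part of enc-unicode.rb.

```ruby
# Pattern quoted from the diff under review.
PATTERN = /^(\h+)(?:\.\.(\h+))?\s*;\s*(\w(?>[\w\s;]*\w)?)/

# Sample lines in the style of DerivedCoreProperties.txt (assumed format).
range_line  = "0915..0939    ; InCB; Consonant # Lo  [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA"
single_line = "094D          ; InCB; Linker # Mn       DEVANAGARI SIGN VIRAMA"

# Mirrors the quoted logic: expand the code point range and normalize the name.
def parse_property_line(line)
  m = PATTERN.match(line) or return nil
  first = m[1].to_i(16)
  last = m[2] ? m[2].to_i(16) : first
  { code_points: (first..last).to_a, name: m[3].gsub(/\W+/, '_') }
end

[range_line, single_line].each do |line|
  m = PATTERN.match(line)
  puts "$1=#{m[1].inspect} $2=#{m[2].inspect} $3=#{m[3].inspect} -> #{parse_property_line(line)[:name]}"
end
# $1="0915" $2="0939" $3="InCB; Consonant" -> InCB_Consonant
# $1="094D" $2=nil $3="InCB; Linker" -> InCB_Linker
```

The third group stops at the last word character before the `#` comment, so an enumerated value such as `InCB; Consonant` is captured together with its property name and then collapsed to `InCB_Consonant` by the `gsub`.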


Comment on lines +6092 to +6134
/* conjunctCluster := \p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+ */
{
  Node **CC_list = core_alts + 6; /* size: 2 */
  R_ERR(create_property_node(CC_list+0, env, "InCB_Consonant"));

  {
    Node **CC_list1 = CC_list + 1; /* size: 4 */
    // [\p{InCB=Extend} \p{InCB=Linker}]*
    {
      {
        Node **CC_alt1 = CC_list1 + 1;
        R_ERR(create_property_node(CC_alt1+1, env, "InCB_Extend"));
        R_ERR(create_property_node(CC_alt1+2, env, "InCB_Linker"));

        R_ERR(create_node_from_array(ALT, CC_alt1, CC_alt1+1));
      }

      R_ERR(quantify_node(CC_list1+1, 0, REPEAT_INFINITE));
    }

    // \p{InCB=Linker}
    R_ERR(create_property_node(CC_list1+2, env, "InCB_Linker"));

    // [\p{InCB=Extend} \p{InCB=Linker}]*
    {
      {
        Node **CC_alt2 = CC_list1 + 3;
        R_ERR(create_property_node(CC_alt2+1, env, "InCB_Extend"));
        R_ERR(create_property_node(CC_alt2+2, env, "InCB_Linker"));

        R_ERR(create_node_from_array(ALT, CC_alt2, CC_alt2+1));
      }

      R_ERR(quantify_node(CC_list1+3, 0, REPEAT_INFINITE));
    }

    // \p{InCB=Consonant}
    R_ERR(create_property_node(CC_list1+4, env, "InCB_Consonant"));

    R_ERR(create_node_from_array(LIST, CC_list1, CC_list1+1));
    R_ERR(quantify_node(CC_list1, 1, REPEAT_INFINITE));
  }
  R_ERR(create_node_from_array(LIST, core_alts+5, CC_list));
}

/* [^Control CR LF] */
core_alts[5] = node_new_cclass();
if (IS_NULL(core_alts[5])) goto err;
cc = NCCLASS(core_alts[5]);
core_alts[6] = node_new_cclass();
if (IS_NULL(core_alts[6])) goto err;
cc = NCCLASS(core_alts[6]);
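
The `conjunctCluster` grammar in the comment above can be modeled in a few lines of Ruby. This is only a sketch for reasoning about the rule, not how the regex engine implements it: `INCB` is a small hand-copied subset of InCB values (an assumption for illustration), and `incb_signature` and `conjunct_cluster?` are hypothetical helpers. Each code point is mapped to one letter (C, L, E, or N) so that the rule can be written as an ordinary regex over that signature.

```ruby
# Hand-copied InCB values for a few code points (illustrative subset only).
INCB = {
  0x0915 => :consonant, # DEVANAGARI LETTER KA
  0x0924 => :consonant, # DEVANAGARI LETTER TA
  0x092F => :consonant, # DEVANAGARI LETTER YA
  0x0930 => :consonant, # DEVANAGARI LETTER RA
  0x094D => :linker,    # DEVANAGARI SIGN VIRAMA
  0x200D => :extend,    # ZERO WIDTH JOINER
}.freeze

# Encode a string as one letter per code point: C, L, E, or N (none/unknown).
def incb_signature(str)
  str.each_codepoint.map do |cp|
    case INCB[cp]
    when :consonant then "C"
    when :linker    then "L"
    when :extend    then "E"
    else "N"
    end
  end.join
end

# conjunctCluster := Consonant ([Extend Linker]* Linker [Extend Linker]* Consonant)+
CONJUNCT = /\AC(?:[EL]*L[EL]*C)+\z/

def conjunct_cluster?(str)
  CONJUNCT.match?(incb_signature(str))
end

puts conjunct_cluster?("\u0924\u094D\u0930")       # TA VIRAMA RA     => true
puts conjunct_cluster?("\u0915\u094D\u200D\u092F") # KA VIRAMA ZWJ YA => true
puts conjunct_cluster?("\u0915\u094D")             # KA VIRAMA        => false
puts conjunct_cluster?("\u0915\u092F")             # KA YA            => false
```

The model covers only the `conjunctCluster` production. In the full `\X` definition, trailing marks such as a vowel sign are attached by the other grapheme cluster rules, which this sketch does not attempt to reproduce.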

@ima1zumi ima1zumi marked this pull request as ready for review February 24, 2025 15:05
@ima1zumi ima1zumi marked this pull request as draft February 25, 2025 00:21
@ima1zumi ima1zumi force-pushed the unicode-15.1.0 branch 2 times, most recently from bcd7641 to d07beeb Compare March 4, 2025 11:16
@ima1zumi ima1zumi marked this pull request as ready for review March 4, 2025 11:16
@ima1zumi ima1zumi force-pushed the unicode-15.1.0 branch 2 times, most recently from 40d6ec9 to c648189 Compare March 9, 2025 11:48
@nurse nurse self-assigned this Mar 13, 2025
@ima1zumi ima1zumi merged commit 6670926 into ruby:master Mar 18, 2025
80 checks passed
yahonda added a commit to yahonda/rails that referenced this pull request Mar 23, 2025
…failure with Ruby 3.5.0dev

This commit addresses the following failure with Ruby 3.5.0dev
since ruby/ruby@6670926

```console
% ruby -v
ruby 3.5.0dev (2025-03-21T06:17:15Z master d868922ea8) +PRISM [arm64-darwin24]
% bin/test test/multibyte_chars_test.rb -n test_should_compute_grapheme_length
Running 90 tests in parallel using 10 processes
Run options: -n test_should_compute_grapheme_length --seed 52859

F

Failure:
MultibyteCharsExtrasTest#test_should_compute_grapheme_length [test/multibyte_chars_test.rb:512]:
"त्र".
Expected: 2
  Actual: 1

bin/test test/multibyte_chars_test.rb:595

Finished in 0.209643s, 4.7700 runs/s, 38.1601 assertions/s.
1 runs, 8 assertions, 1 failures, 0 errors, 0 skips
%
```

According to ruby/ruby#12798, this is an expected change since Unicode 15.1.0.

> As a result, an orthographic syllable in scripts like Devanagari (e.g. “क्या”, consisting of KA + VIRAMA + YA)
> is now treated as a single extended grapheme cluster rather than split into two.

Fix rails#54794
@ima1zumi ima1zumi deleted the unicode-15.1.0 branch April 5, 2025 16:20