
Unicode 16 #41

Merged
clipperhouse merged 2 commits into master from unicode-16 on Jan 18, 2026


@clipperhouse clipperhouse commented Jan 18, 2026

Add support for Unicode 16.

  • Pull Unicode 16 data instead of the current Go version
  • Add GB9c rule to graphemes, Indic_Conjunct_Break
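
For readers unfamiliar with GB9c: UAX #29 forbids a grapheme break before an InCB=Consonant when it is preceded by an InCB=Consonant followed by any mix of InCB=Extend and InCB=Linker that contains at least one Linker. The sketch below is a minimal standalone state machine over property values, not the code in this PR; the names `incb`, `state`, and `gb9cNoBreak` are hypothetical, and it ignores every other grapheme rule.

```go
package main

import "fmt"

// incb is a hypothetical stand-in for the Indic_Conjunct_Break property value.
type incb uint8

const (
	incbNone incb = iota
	incbConsonant
	incbExtend
	incbLinker
)

// state tracks how much of the GB9c left-hand side has been seen so far.
type state uint8

const (
	stNone      state = iota // not inside a candidate conjunct
	stConsonant              // Consonant seen, no Linker yet
	stLinker                 // Consonant seen, followed by at least one Linker
)

// gb9cNoBreak returns, for each index i, whether GB9c alone forbids a
// break immediately before props[i].
func gb9cNoBreak(props []incb) []bool {
	out := make([]bool, len(props))
	st := stNone
	for i, p := range props {
		switch p {
		case incbConsonant:
			// The rule fires only if a Linker was seen since the last Consonant.
			out[i] = st == stLinker
			st = stConsonant
		case incbExtend:
			// Extend keeps whatever state we are in.
		case incbLinker:
			if st != stNone {
				st = stLinker
			}
		default:
			st = stNone
		}
	}
	return out
}

func main() {
	// Shape of a sequence such as consonant + virama + consonant.
	fmt.Println(gb9cNoBreak([]incb{incbConsonant, incbLinker, incbConsonant}))
	// Without a Linker the rule does not apply.
	fmt.Println(gb9cNoBreak([]incb{incbConsonant, incbExtend, incbConsonant}))
}
```

The first call reports no break before the third element; the second reports no GB9c constraint anywhere, so the other grapheme rules would decide.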

Looks like ~3% perf hit, since there are now more conditionals in the graphemes logic, and the trie now requires int32.
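
One possible reading of the int32 remark, stated as an assumption rather than a description of the generated trie: if each property is stored as a one-bit flag, the thirteen non-default Grapheme_Cluster_Break values plus Extended_Pictographic use 14 bits, and three InCB values bring the total to 17, which no longer fits in 16 bits. The type and constant names below are hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

// property is a hypothetical bit-flag type; the real layout may differ.
type property int32

const (
	cr property = 1 << iota
	lf
	control
	extend
	zwj
	regionalIndicator
	prepend
	spacingMark
	l
	v
	t
	lv
	lvt
	extendedPictographic
	// Assumed additions for GB9c.
	incbConsonant
	incbLinker
	incbExtend
)

// fitsIn16 reports whether a flag value fits in an unsigned 16-bit integer.
func fitsIn16(p property) bool {
	return p >= 0 && p <= 0xFFFF
}

func main() {
	fmt.Println("bits needed:", bits.Len32(uint32(incbExtend)))
	fmt.Println("last pre-InCB flag fits in 16 bits:", fitsIn16(extendedPictographic))
	fmt.Println("last InCB flag fits in 16 bits:", fitsIn16(incbExtend))
}
```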

Removed the comparative test with uniseg. We have diverged in grapheme counts: uniseg is on Unicode 15 and we are on 16, which gives different results, so the test is no longer helpful.

Copilot AI review requested due to automatic review settings January 18, 2026 16:57

Copilot AI left a comment


Pull request overview

This pull request adds support for Unicode 16.0.0, upgrading from Unicode 15.0.0. The primary changes include updating Unicode data sources and implementing the GB9c rule for grapheme cluster boundaries, which handles Indic conjunct breaks.

Changes:

  • Updated Unicode version from 15.0.0 to 16.0.0 across all data generation and test files
  • Implemented GB9c rule with Indic_Conjunct_Break (InCB) property support for proper handling of Indic conjunct clusters
  • Added 3 new word break tests and 10 new sentence break tests from Unicode 16.0.0 test suite

Reviewed changes

Copilot reviewed 6 out of 11 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| internal/gen/main.go | Added unicodeVersion constant (16.0.0), InCB property fetching, and three-part property format parsing for DerivedCoreProperties.txt |
| graphemes/splitfunc.go | Implemented GB9c rule with incbState state machine to track Indic conjunct clusters (Consonant-Linker-Consonant patterns) |
| graphemes/trie.go | Auto-generated with Unicode 16.0.0 data, added InCB property constants (_InCBConsonant, _InCBLinker, _InCBExtend) |
| words/trie.go | Auto-generated with Unicode 16.0.0 data |
| sentences/trie.go | Auto-generated with Unicode 16.0.0 data |
| phrases/trie.go | Auto-generated with Unicode 16.0.0 data |
| words/unicode_test.go | Added 3 new test cases from Unicode 16.0.0 test suite, updated array size to 1826 |
| sentences/unicode_test.go | Added 10 new test cases from Unicode 16.0.0 test suite, updated array size to 512 |
| graphemes/unicode_test.go | Auto-generated with Unicode 16.0.0 data |
| graphemes/comparative/go.mod | Updated module version reference from v2.2.0 to v2.3.0 |
| graphemes/comparative/comparative_test.go | Removed TestGraphemeCountConsistency test (likely because implementations now differ due to Unicode version mismatch with comparison library) |
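
The "three-part property format" noted for internal/gen/main.go refers to lines in DerivedCoreProperties.txt that carry a value after the property name, such as `094D ; InCB; Linker # ...`, alongside the usual two-field lines. The following is a minimal sketch of parsing both shapes, not the generator's actual code; `entry` and `parseLine` are hypothetical names, and the sample lines are written in the file's layout for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// entry is a hypothetical record for one data line.
type entry struct {
	lo, hi   rune
	property string
	value    string // empty for two-field lines
}

// parseLine parses "XXXX[..YYYY] ; Property[; Value] # comment".
// ok is false for blank or comment-only lines.
func parseLine(line string) (e entry, ok bool, err error) {
	// Drop the trailing comment first; it may itself contain "..".
	if i := strings.IndexByte(line, '#'); i >= 0 {
		line = line[:i]
	}
	line = strings.TrimSpace(line)
	if line == "" {
		return entry{}, false, nil
	}
	fields := strings.Split(line, ";")
	if len(fields) < 2 || len(fields) > 3 {
		return entry{}, false, fmt.Errorf("unexpected field count %d", len(fields))
	}
	for i := range fields {
		fields[i] = strings.TrimSpace(fields[i])
	}
	// The first field is a single code point or an inclusive range.
	loStr, hiStr, isRange := strings.Cut(fields[0], "..")
	if !isRange {
		hiStr = loStr
	}
	lo, err := strconv.ParseUint(loStr, 16, 32)
	if err != nil {
		return entry{}, false, err
	}
	hi, err := strconv.ParseUint(hiStr, 16, 32)
	if err != nil {
		return entry{}, false, err
	}
	e = entry{lo: rune(lo), hi: rune(hi), property: fields[1]}
	if len(fields) == 3 {
		e.value = fields[2]
	}
	return e, true, nil
}

func main() {
	samples := []string{
		"0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
		"094D          ; InCB; Linker # Mn       DEVANAGARI SIGN VIRAMA",
		"# a comment-only line",
	}
	for _, s := range samples {
		e, ok, err := parseLine(s)
		fmt.Printf("%+v ok=%v err=%v\n", e, ok, err)
	}
}
```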


@clipperhouse clipperhouse merged commit a5cc697 into master Jan 18, 2026
26 checks passed
@clipperhouse clipperhouse deleted the unicode-16 branch January 18, 2026 17:05
@clipperhouse clipperhouse mentioned this pull request Jan 24, 2026