Skip to content

Replace unmaintained unic crates#7555

Merged
youknowone merged 12 commits intoRustPython:mainfrom
ShaharNaveh:replace-unmaintained-unic-char
Apr 2, 2026
Merged

Replace unmaintained unic crates#7555
youknowone merged 12 commits intoRustPython:mainfrom
ShaharNaveh:replace-unmaintained-unic-char

Conversation

@ShaharNaveh
Copy link
Copy Markdown
Contributor

@ShaharNaveh ShaharNaveh commented Apr 1, 2026

ref: #7529

Summary by CodeRabbit

  • Chores

    • Switched Unicode data and normalization backend to a different Unicode provider for improved standards compliance.
    • Updated normalization and mirroring behavior to use the new provider.
  • User-visible changes

    • Character classification and string methods (printability, whitespace, decimal, identifier checks, and width/bidi labels) may yield different results under the new provider.
  • Breaking Changes

    • The combining() return type changed from i32 to u8.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 1, 2026

📝 Walkthrough

Walkthrough

Migrates Unicode handling from several unic-* crates to ICU libraries: adds icu_properties and icu_normalizer, removes multiple unic-* and unicode-bidi-mirroring workspace dependencies, and updates code to use ICU APIs for classification, normalization, bidi, mirroring, and combining class.

Changes

Cohort / File(s) Summary
Workspace manifests
Cargo.toml, crates/literal/Cargo.toml, crates/stdlib/Cargo.toml, crates/vm/Cargo.toml
Removed unic-char-property, unic-normal, unic-ucd-bidi, unic-ucd-category, unic-ucd-ident, and unicode-bidi-mirroring entries; added icu_properties = "2" and icu_normalizer = "2" to workspace deps.
Literal char utilities
crates/literal/src/char.rs
Replaced unic_ucd_category::GeneralCategory::of with icu_properties::props::GeneralCategory::for_char and updated is_printable to explicit GeneralCategory matching.
Stdlib Unicode data
crates/stdlib/src/unicodedata.rs
Switched category/bidi/width lookups to ICU (for_char) and short_name(); migrated normalize()/is_normalized() to icu_normalizer (composing/decomposing normalizers); replaced mirroring check with ICU CodePointSetData::new::<BidiMirrored>().contains(ch); changed combining() return type i32u8 and use CanonicalCombiningClass::for_char(...).to_icu4c_value(); removed local width-abbreviation trait.
VM string builtins
crates/vm/src/builtins/str.rs
Replaced unic_ucd_* imports and helpers with icu_properties equivalents (GeneralCategory::for_char, BidiClass::for_char, XidStart::for_char, XidContinue::for_char); updated isdecimal, isspace, and isidentifier logic accordingly.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I hopped from unic to ICU with glee,
Swapping crates so characters roam free,
Categories, bidi, and normalization too,
I stitched each byte until behavior grew true,
A tiny rabbit cheers the Unicode spree! 🎉

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: replacing unmaintained unic crates with maintained alternatives, which is the core objective evident across all modified files.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

(module 'str test_unicodedata' not found)

Legend:

  • [+] path exists in CPython
  • [x] up-to-date, [ ] outdated


@unittest.expectedFailure # TODO: RUSTPYTHON
@support.requires_resource('cpu')
def test_isprintable_invariant(self):
Copy link
Copy Markdown
Contributor Author

@ShaharNaveh ShaharNaveh Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was giving us a false sense that it works, because it used to report \u{0560} as non-printable (but now it's saying it's printable).

use icu_properties::props::{EnumeratedProperty, GeneralCategory};
 
const CH: char = '\u{0560}';

fn main() {
    dbg!(unic_ucd_category::GeneralCategory::of(CH)); // Unassigned (wrong)
    dbg!(GeneralCategory::for_char(CH)); // LowercaseLetter (correct)
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what should we do?

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
crates/stdlib/src/unicodedata.rs (1)

194-217: Deduplicate the normalization pipeline.

Both methods repeat the same map_utf8(...).collect() body four times and only vary by the normalizer constructor. Please extract the form-specific choice first and keep the shared path in one place. As per coding guidelines, "When branches differ only in a value but share common logic, extract the differing value first, then call the common logic once to avoid duplicate code."

Also applies to: 222-245

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/unicodedata.rs` around lines 194 - 217, The normalize
function duplicates the map_utf8(...).collect() pipeline for each NormalizeForm
branch; instead, choose the appropriate normalizer constructor based on the form
(use the variants Nfc, Nfkc, Nfd, Nfkd and the constructors
ComposingNormalizerBorrowed::new_nfc(), ComposingNormalizerBorrowed::new_nfkc(),
DecomposingNormalizerBorrowed::new_nfd(),
DecomposingNormalizerBorrowed::new_nfkd()) and store it in a single variable (or
closure) and then call text.map_utf8(|s|
normalizer.normalize_iter(s.chars())).collect() once; apply the same refactor to
the later duplicate block as well so only the normalizer choice varies and the
shared map_utf8(...).collect() path is centralized.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 44-55: The code mixes Unicode data sources:
Age::of()/UNICODE_VERSION (unic_ucd_age, Unicode 10) vs. ICU-based APIs
(icu_properties/icu_normalizer, Unicode 17), so update the module to use one
consistent source—preferably the ICU data already used by category(),
bidirectional(), east_asian_width(), normalize(), is_normalized(), and
mirrored(): remove references to unic_ucd_age::Age and UNICODE_VERSION, derive
unidata_version from the ICU provider/metadata (the same runtime data backing
icu_properties and icu_normalizer), and change check_age() to use ICU's age
information (or remove age-based filtering) so that unidata_version accurately
reflects the data used by those functions. Ensure all places that previously
consulted Age::of() now query the ICU data provider or use the ICU-provided
UnicodeVersion so behavior and reported unidata_version remain in sync.

---

Nitpick comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 194-217: The normalize function duplicates the
map_utf8(...).collect() pipeline for each NormalizeForm branch; instead, choose
the appropriate normalizer constructor based on the form (use the variants Nfc,
Nfkc, Nfd, Nfkd and the constructors ComposingNormalizerBorrowed::new_nfc(),
ComposingNormalizerBorrowed::new_nfkc(),
DecomposingNormalizerBorrowed::new_nfd(),
DecomposingNormalizerBorrowed::new_nfkd()) and store it in a single variable (or
closure) and then call text.map_utf8(|s|
normalizer.normalize_iter(s.chars())).collect() once; apply the same refactor to
the later duplicate block as well so only the normalizer choice varies and the
shared map_utf8(...).collect() path is centralized.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 2897286b-ea41-43ef-acc5-3fe407520f70

📥 Commits

Reviewing files that changed from the base of the PR and between d9c4c95 and 273ad65.

⛔ Files ignored due to path filters (3)
  • Cargo.lock is excluded by !**/*.lock
  • Lib/test/test_str.py is excluded by !Lib/**
  • Lib/test/test_unicodedata.py is excluded by !Lib/**
📒 Files selected for processing (7)
  • Cargo.toml
  • crates/literal/Cargo.toml
  • crates/literal/src/char.rs
  • crates/stdlib/Cargo.toml
  • crates/stdlib/src/unicodedata.rs
  • crates/vm/Cargo.toml
  • crates/vm/src/builtins/str.rs

Comment on lines +44 to 55
use icu_normalizer::{ComposingNormalizerBorrowed, DecomposingNormalizerBorrowed};
use icu_properties::{
CodePointSetData,
props::{
BidiClass, BidiMirrored, CanonicalCombiningClass, EastAsianWidth, EnumeratedProperty,
GeneralCategory, NamedEnumeratedProperty,
},
};
use itertools::Itertools;
use rustpython_common::wtf8::{CodePoint, Wtf8Buf};
use ucd::{Codepoint, DecompositionType, EastAsianWidth, Number, NumericType};
use unic_char_property::EnumeratedCharProperty;
use unic_normal::StrNormalForm;
use ucd::{Codepoint, DecompositionType, Number, NumericType};
use unic_ucd_age::{Age, UNICODE_VERSION, UnicodeVersion};
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

wc -l crates/stdlib/src/unicodedata.rs

Repository: RustPython/RustPython

Length of output: 101


🏁 Script executed:

cat -n crates/stdlib/src/unicodedata.rs | head -250

Repository: RustPython/RustPython

Length of output: 10161


🏁 Script executed:

# Search for unidata_version and how it's constructed
rg "unidata_version" crates/stdlib/src/unicodedata.rs -B 2 -A 2

Repository: RustPython/RustPython

Length of output: 300


🏁 Script executed:

# Search for check_age and Age usage
rg "check_age|Age::|UNICODE_VERSION" crates/stdlib/src/unicodedata.rs -B 2 -A 2

Repository: RustPython/RustPython

Length of output: 1397


🏁 Script executed:

fd -type f Cargo.toml | grep -E "(stdlib|Cargo)" | head -5

Repository: RustPython/RustPython

Length of output: 235


🏁 Script executed:

# Find and examine the Cargo.toml for the stdlib crate
cat Cargo.toml.in 2>/dev/null || find . -name "Cargo.toml" -path "*/stdlib/*" | head -1 | xargs cat

Repository: RustPython/RustPython

Length of output: 5918


🏁 Script executed:

# Search for icu and unic_ucd_age dependency versions
rg "icu_|unic_ucd_age" Cargo.toml --type toml -A 1

Repository: RustPython/RustPython

Length of output: 131


🏁 Script executed:

# Check the root Cargo.toml for workspace dependencies
head -150 Cargo.toml | grep -A 30 "^\[workspace.dependencies\]"

Repository: RustPython/RustPython

Length of output: 575


🏁 Script executed:

# Search for the unic_ucd_age version definition
grep -A 5 "unic-ucd-age" Cargo.toml

Repository: RustPython/RustPython

Length of output: 182


🏁 Script executed:

# Get version info from Cargo.lock if available
ls -la Cargo.lock 2>/dev/null && grep -A 5 "unic-ucd-age\|icu_properties\|icu_normalizer" Cargo.lock | head -40

Repository: RustPython/RustPython

Length of output: 1168


🏁 Script executed:

# Search for UNICODE_VERSION constant value or documentation
rg "UNICODE_VERSION" --type rust -A 3 -B 3 | head -50

Repository: RustPython/RustPython

Length of output: 1335


🏁 Script executed:

# Check if there's any documentation or comments about Unicode versions in the code
rg "Unicode|unicode" crates/stdlib/src/unicodedata.rs | head -20

Repository: RustPython/RustPython

Length of output: 792


🏁 Script executed:

# Look at the comment on line 179 about version mismatch
sed -n '175,185p' crates/stdlib/src/unicodedata.rs

Repository: RustPython/RustPython

Length of output: 352


🏁 Script executed:

# Get more context about the ICU data versions used
rg "2.1" --type rust -B 2 -A 2 | grep -E "(icu_|version|9.0|3.2)" | head -20

Repository: RustPython/RustPython

Length of output: 2575


🏁 Script executed:

# Check what Unicode version the ICU libraries support
web_search ICU 2.1.2 unicode version support

Repository: RustPython/RustPython

Length of output: 115


🌐 Web query:

ICU 2.1.2 unicode version support which unicode standard version

💡 Result:

ICU 2.1.2 supports Unicode Standard version 2.1.2.

Citations:


🌐 Web query:

ICU Unicode data version 15 16 icu_normalizer icu_properties 2.1

💡 Result:

ICU Unicode data versions 15 and 16 are supported in ICU4C releases: ICU 72/74 for Unicode 15/15.1, ICU 76 for Unicode 16. ICU 78 uses Unicode 17. In ICU4X (Rust), icu_normalizer and icu_properties crates at version 2.1 (released 2025-10-28) use data generated from ICU version release-78.1rc / icu4x/2025-10-28 tag and CLDR 48, corresponding to Unicode 17 (not 15 or 16). Earlier ICU4X versions aligned with Unicode 15 (ICU 74, 2023) and 16 (ICU 76, 2024). icu_normalizer depends on icu_properties for decomposition/composition properties. No specific "icu_properties 2.1" data version beyond the shared ICU4X data pipeline.

Citations:


🏁 Script executed:

# Check the actual Unicode version that unic_ucd_age provides
rg "unic.ucd.age" Cargo.lock -A 10 | head -40

Repository: RustPython/RustPython

Length of output: 493


🌐 Web query:

unic-ucd-age 0.9.0 unicode version support UNIC

💡 Result:

The unic-ucd-age 0.9.0 crate, part of the UNIC (Unicode and Internationalization Crates for Rust) project, provides access to the Unicode character Age property from the Unicode Character Database (UCD). It supports Unicode 10.0.0 data, as indicated in the project's documentation for the unic super-crate version 0.9.0, which explicitly states it includes Unicode 10.0.0 data and algorithms. The crate exposes a UNICODE_VERSION constant representing the Unicode version of its data and includes the Age struct and CharAge trait for querying when characters were first assigned in Unicode versions. Released on 2019-03-03, it depends on unic-ucd-version ^0.9.0 and is marked as unmaintained. Example usage: Age::of('A') returns UnicodeVersion { major: 1, minor: 1, micro: 0 }, matching Unicode 1.1 assignment.

Citations:


🏁 Script executed:

# Check what unicode version unic_ucd_age uses
rg "unic-ucd-version" Cargo.lock -A 5

Repository: RustPython/RustPython

Length of output: 320


Resolve the mixed Unicode database versions.

Age::of() and UNICODE_VERSION come from unic_ucd_age 0.9.0 (Unicode 10.0.0), but category(), bidirectional(), east_asian_width(), normalize(), is_normalized(), and mirrored() all use ICU properties and normalization (Unicode 17.0). This causes unidata_version to report 10.0.0 while queries run against Unicode 17.0 data, breaking Python's contract that unicodedata.unidata_version describes the actual Unicode database used by the module. Characters added after Unicode 10.0.0 will either be filtered out by check_age() or return incorrect properties, causing silent data corruption.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/unicodedata.rs` around lines 44 - 55, The code mixes
Unicode data sources: Age::of()/UNICODE_VERSION (unic_ucd_age, Unicode 10) vs.
ICU-based APIs (icu_properties/icu_normalizer, Unicode 17), so update the module
to use one consistent source—preferably the ICU data already used by category(),
bidirectional(), east_asian_width(), normalize(), is_normalized(), and
mirrored(): remove references to unic_ucd_age::Age and UNICODE_VERSION, derive
unidata_version from the ICU provider/metadata (the same runtime data backing
icu_properties and icu_normalizer), and change check_age() to use ICU's age
information (or remove age-based filtering) so that unidata_version accurately
reflects the data used by those functions. Ensure all places that previously
consulted Age::of() now query the ICU data provider or use the ICU-provided
UnicodeVersion so behavior and reported unidata_version remain in sync.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the recommended replacement crate for unic-ucd-age?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShaharNaveh, the recommended replacement for unic-ucd-age is the icu_properties crate — which you're already pulling in as a workspace dependency in this PR.

The icu_properties crate provides definitions of Unicode Properties and APIs for retrieving property data in an appropriate data structure, and is published both as its own crate and as part of the icu meta-crate.

The Age property is available via icu_properties::props::Age (an enumerated property), accessible through the standard CodePointMapData API:

use icu_properties::{CodePointMapData, props::Age};

// Replaces: Age::of(c)
let age = CodePointMapData::<Age>::new().get(c);

For UNICODE_VERSION (the constant used to report unidata_version and drive check_age), you can derive it from icu_properties_data. icu_properties_data ships data generated with a specific ICU and Unicode version, so you'd hard-code or derive the version string to match the icu_properties/icu_normalizer data actually bundled. This is also the correct fix for the mixed-version issue flagged in my earlier review comment — once unic_ucd_age is removed, unidata_version will accurately reflect the Unicode version used by all the ICU-backed functions.

The migration would look roughly like:

  • Remove unic_ucd_age from crates/stdlib/Cargo.toml
  • Replace Age::of(c) in check_age() with CodePointMapData::<Age>::new().get(c) (returning an Age enum value you can compare against a target version)
  • Replace UNICODE_VERSION with the corresponding ICU Unicode version constant (e.g., hard-coded or from a const based on the compiled data version)

Would you like me to draft the full diff for this migration?


🧠 Learnings used
Learnt from: ShaharNaveh
Repo: RustPython/RustPython PR: 7181
File: crates/stdlib/src/csv.rs:20-20
Timestamp: 2026-02-17T07:54:36.200Z
Learning: In RustPython, ensure that modules using Python sequence-like types bring in extension methods by importing the trait (for example, 'use rustpython_vm::sliceable::SliceableSequenceOp;'). Even if the trait name isn't directly referenced in code, its methods (such as len()) become in-scope on types like PyStr. Add this import in modules that rely on those extension methods to avoid compilation errors.

Copy link
Copy Markdown
Contributor Author

@ShaharNaveh ShaharNaveh Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

icu_properties::props::Age doesn't exists

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

@ShaharNaveh, you're absolutely right — I apologize for the incorrect suggestion. After verifying the icu_properties::props module docs, Age is indeed not exposed as a property in icu_properties v2. The props module includes things like BidiClass, GeneralCategory, EastAsianWidth, CanonicalCombiningClass, XidStart/XidContinue, etc., but not Age.

So the options for resolving the mixed-version concern are:

  1. Keep unic_ucd_age as-is — it's the last remaining unic crate in use, but it serves a narrow and specific purpose (age-based filtering in check_age() and reporting UNICODE_VERSION). Since replacing it would require either a different crate or significant rework, this may be acceptable as a known limitation to address in a follow-up.

  2. Hard-code the Unicode version constant to match the ICU data version (e.g. Unicode 16 or 17 depending on which icu_properties/icu_normalizer data version is bundled), and remove the age-based filtering from check_age() entirely. This eliminates unic_ucd_age at the cost of dropping the age-based character filtering that UCD.ucd_3_2_0 relies on.

  3. Use a different crate that provides the Age property, such as the unicode-age or ucd-trie-based crates, but those may have their own maintenance concerns.

Given the scope of this PR (replacing unmaintained unic crates), option 1 (keeping unic_ucd_age for now) seems most pragmatic. The mixed-version issue I flagged is a real concern but may be pre-existing and better addressed separately. Would you like me to open a follow-up issue to track this?


🌐 Web query:

icu_properties 2 Age property Rust API docs CodePointMapData

💡 Result:

In ICU4X icu_properties v2.x (incl. icu_properties 2.1.2) there is currently no props::Age type, so you can’t do CodePointMapData::<props::Age> (or load an “Age” property map) via the Rust API. The props module lists the supported properties, and Age is not among them. [1]

What is supported is the general pattern: enumerated properties (like Script) are queried via CodePointMapData<T>, e.g. CodePointMapData::<Script>::new().get('木'). [2]

Sources:
[1] icu_properties::props module items (supported properties list) (docs.rs)
[2] icu::properties docs showing CodePointMapData usage for enumerated properties (docs.rs)

Citations:


🌐 Web query:

icu4x icu_properties Age unicode property Rust crate API

💡 Result:

icu4x’s Rust Unicode-properties API is the icu_properties crate (also re-exported as icu::properties). It exposes properties via marker types in icu_properties::props (used with CodePointSetData/CodePointMapData) and helper modules like sets/maps. (docs.rs)

However, the Unicode Age property (the Unicode version a code point was first assigned) is not currently implemented/exposed by icu_properties: it does not appear in the crate’s “all items” list of available properties. (docs.rs)

If you specifically need Age, you have to source it elsewhere, for example:

  • Parse Unicode’s DerivedAge data yourself (the Age property is defined as the version of Unicode in which the code point was assigned). (unicode.org)
  • Use ICU4C’s API (ICU exposes Age/u_charAge). (unicode-org.github.io)

Citations:


@ShaharNaveh, you're correct — I apologize for the bad suggestion. The Age property is not currently implemented or exposed by icu_properties; it does not appear in the icu_properties::props module's list of supported properties.

So unic_ucd_age cannot be replaced by icu_properties at this time. The practical options are:

  1. Keep unic_ucd_age as-is for this PR — it's the only remaining unic crate after your changes, serving the specific purpose of age-based filtering and version reporting. This is a reasonable stopping point for this PR.
  2. Hard-code the Unicode version to match the ICU data, and drop age-based filtering — but that's a larger behavioral change that goes beyond just swapping dependencies.

Given the scope of this PR, keeping unic_ucd_age and deferring full replacement seems most pragmatic. Would you like me to open a follow-up issue to track replacing unic_ucd_age once icu_properties gains Age support, or if a suitable alternative crate is found?


✏️ Learnings added
Learnt from: ShaharNaveh
Repo: RustPython/RustPython PR: 7555
File: crates/stdlib/src/unicodedata.rs:44-55
Timestamp: 2026-04-01T10:07:36.832Z
Learning: In icu_properties v2 (icu4x), the Unicode Age property is NOT implemented or exposed. `icu_properties::props::Age` does not exist and is not in the props module items list. There is no direct replacement for `unic_ucd_age::Age` in icu4x v2 crates.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: ShaharNaveh
Repo: RustPython/RustPython PR: 7181
File: crates/stdlib/src/csv.rs:20-20
Timestamp: 2026-02-17T07:54:36.200Z
Learning: In RustPython, ensure that modules using Python sequence-like types bring in extension methods by importing the trait (for example, 'use rustpython_vm::sliceable::SliceableSequenceOp;'). Even if the trait name isn't directly referenced in code, its methods (such as len()) become in-scope on types like PyStr. Add this import in modules that rely on those extension methods to avoid compilation errors.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/stdlib/src/unicodedata.rs (2)

179-190: ⚠️ Potential issue | 🟡 Minor

Refresh the stale version note above east_asian_width().

The note on Line 179 still says this path uses the 9.0.0 database, but the implementation now goes through ICU property data. Leaving it in place is misleading.

Suggested update
-        /// NOTE: This function uses 9.0.0 database instead of 3.2.0
+        /// NOTE: This function uses ICU property data rather than the 3.2.0 database.

As per coding guidelines, "Do not delete or rewrite existing comments unless they are factually wrong or directly contradict the new code."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/unicodedata.rs` around lines 179 - 190, The comment above
the east_asian_width method is outdated (mentions "uses 9.0.0 database") but the
implementation now relies on ICU property data; update the note above fn
east_asian_width to accurately state that EastAsianWidth values are derived from
ICU property data (or similar current source) instead of referencing the old
9.0.0 database, ensuring the comment matches the implementation without changing
code logic.

164-176: ⚠️ Potential issue | 🟠 Major

Preserve CPython's empty-string bidi result for U+FFFE.

U+FFFE (a noncharacter) has BidiClass::BoundaryNeutral, so BidiClass::short_name() returns "BN". However, Lib/test/test_unicodedata.py:183 expects unicodedata.bidirectional('\uFFFE') to return '' (empty string). The current implementation exposes ICU abbreviations directly, violating Python's API contract. Add explicit handling to return empty strings for characters that should not have a bidi classification.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/unicodedata.rs` around lines 164 - 176, The bidirectional
method currently returns ICU abbreviations via BidiClass::short_name(), which
yields "BN" for U+FFFE; update bidirectional (in the bidirectional function
where you call self.extract_char and BidiClass::for_char/short_name) to
explicitly return an empty string for U+FFFE (or other noncharacter codepoints
that CPython treats as no bidi) instead of the ICU short name — i.e., after
extracting the char, check its codepoint (c.to_char().map or similar) and if it
equals 0xFFFE (or otherwise identified as a noncharacter per CPython behavior)
return "" before calling BidiClass::short_name().
🧹 Nitpick comments (1)
crates/stdlib/src/unicodedata.rs (1)

194-247: Extract the normalization-form dispatch once.

normalize() and is_normalized() now duplicate the same form-to-normalizer mapping. Pulling that into one helper will keep the two paths from drifting on the next change.

As per coding guidelines, "When branches differ only in a value but share common logic, extract the differing value first, then call the common logic once to avoid duplicate code."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/unicodedata.rs` around lines 194 - 247, The
normalize()/is_normalized() functions duplicate the same NormalizeForm ->
normalizer construction; extract that dispatch into a private helper (e.g., fn
get_normalizer(form: super::NormalizeForm) -> impl Fn(&str) -> impl
Iterator<Item=char> or return an enum/trait object representing the chosen
normalizer) that maps Nfc/Nfkc to
ComposingNormalizerBorrowed::new_nfc()/new_nfkc and Nfd/Nfkd to
DecomposingNormalizerBorrowed::new_nfd()/new_nfkd, then call that helper from
both normalize() and is_normalized() to run text.map_utf8(|s|
normalizer.normalize_iter(s.chars())).collect() and compare results, removing
the duplicated match blocks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 179-190: The comment above the east_asian_width method is outdated
(mentions "uses 9.0.0 database") but the implementation now relies on ICU
property data; update the note above fn east_asian_width to accurately state
that EastAsianWidth values are derived from ICU property data (or similar
current source) instead of referencing the old 9.0.0 database, ensuring the
comment matches the implementation without changing code logic.
- Around line 164-176: The bidirectional method currently returns ICU
abbreviations via BidiClass::short_name(), which yields "BN" for U+FFFE; update
bidirectional (in the bidirectional function where you call self.extract_char
and BidiClass::for_char/short_name) to explicitly return an empty string for
U+FFFE (or other noncharacter codepoints that CPython treats as no bidi) instead
of the ICU short name — i.e., after extracting the char, check its codepoint
(c.to_char().map or similar) and if it equals 0xFFFE (or otherwise identified as
a noncharacter per CPython behavior) return "" before calling
BidiClass::short_name().

---

Nitpick comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 194-247: The normalize()/is_normalized() functions duplicate the
same NormalizeForm -> normalizer construction; extract that dispatch into a
private helper (e.g., fn get_normalizer(form: super::NormalizeForm) -> impl
Fn(&str) -> impl Iterator<Item=char> or return an enum/trait object representing
the chosen normalizer) that maps Nfc/Nfkc to
ComposingNormalizerBorrowed::new_nfc()/new_nfkc and Nfd/Nfkd to
DecomposingNormalizerBorrowed::new_nfd()/new_nfkd, then call that helper from
both normalize() and is_normalized() to run text.map_utf8(|s|
normalizer.normalize_iter(s.chars())).collect() and compare results, removing
the duplicated match blocks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 4b3e58a4-f77c-4b38-8fc8-84b1a65afe1f

📥 Commits

Reviewing files that changed from the base of the PR and between 273ad65 and 3245cd1.

📒 Files selected for processing (3)
  • Cargo.toml
  • crates/stdlib/src/unicodedata.rs
  • crates/vm/src/builtins/str.rs
✅ Files skipped from review due to trivial changes (1)
  • Cargo.toml

pub fn is_printable(c: char) -> bool {
let cat = GeneralCategory::of(c);
!(cat.is_other() || cat.is_separator())
let cat = GeneralCategory::for_char(c);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we need to have universe character set comparison script to ensure texts are actually the same.

@youknowone youknowone merged commit 3d96884 into RustPython:main Apr 2, 2026
21 checks passed
youknowone pushed a commit to youknowone/RustPython that referenced this pull request Apr 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants