Replace unmaintained `unic` crates by ShaharNaveh · Pull Request #7555 · RustPython/RustPython

ShaharNaveh · 2026-04-01T09:39:18Z

Summary by CodeRabbit

Chores
- Switched Unicode data and normalization backend to a different Unicode provider for improved standards compliance.
- Updated normalization and mirroring behavior to use the new provider.
User-visible changes
- Character classification and string methods (printability, whitespace, decimal, identifier checks, and width/bidi labels) may yield different results under the new provider.
Breaking Changes
- The combining() return type changed from i32 to u8.

coderabbitai · 2026-04-01T09:39:33Z

📝 Walkthrough

Walkthrough

Migrates Unicode handling from several unic-* crates to ICU libraries: adds icu_properties and icu_normalizer, removes multiple unic-* and unicode-bidi-mirroring workspace dependencies, and updates code to use ICU APIs for classification, normalization, bidi, mirroring, and combining class.

Changes

Cohort / File(s)	Summary
Workspace manifests `Cargo.toml`, `crates/literal/Cargo.toml`, `crates/stdlib/Cargo.toml`, `crates/vm/Cargo.toml`	Removed `unic-char-property`, `unic-normal`, `unic-ucd-bidi`, `unic-ucd-category`, `unic-ucd-ident`, and `unicode-bidi-mirroring` entries; added `icu_properties = "2"` and `icu_normalizer = "2"` to workspace deps.
Literal char utilities `crates/literal/src/char.rs`	Replaced `unic_ucd_category::GeneralCategory::of` with `icu_properties::props::GeneralCategory::for_char` and updated `is_printable` to explicit `GeneralCategory` matching.
Stdlib Unicode data `crates/stdlib/src/unicodedata.rs`	Switched category/bidi/width lookups to ICU (`for_char`) and `short_name()`; migrated `normalize()`/`is_normalized()` to `icu_normalizer` (composing/decomposing normalizers); replaced mirroring check with ICU `CodePointSetData::new::<BidiMirrored>().contains(ch)`; changed `combining()` return type `i32` → `u8` and use `CanonicalCombiningClass::for_char(...).to_icu4c_value()`; removed local width-abbreviation trait.
VM string builtins `crates/vm/src/builtins/str.rs`	Replaced `unic_ucd_*` imports and helpers with `icu_properties` equivalents (`GeneralCategory::for_char`, `BidiClass::for_char`, `XidStart::for_char`, `XidContinue::for_char`); updated `isdecimal`, `isspace`, and `isidentifier` logic accordingly.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I hopped from unic to ICU with glee,
Swapping crates so characters roam free,
Categories, bidi, and normalization too,
I stitched each byte until behavior grew true,
A tiny rabbit cheers the Unicode spree! 🎉

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main change: replacing unmaintained `unic` crates with maintained alternatives, which is the core objective evident across all modified files.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-04-01T09:39:43Z

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

(module 'str test_unicodedata' not found)

Legend:

[+] path exists in CPython
[x] up-to-date, [ ] outdated

ShaharNaveh · 2026-04-01T09:43:34Z

Lib/test/test_str.py


+    @unittest.expectedFailure  # TODO: RUSTPYTHON
    @support.requires_resource('cpu')
    def test_isprintable_invariant(self):


This test was giving us a false sense that it works, because it used to report \u{0560} as non-printable (but now it's saying it's printable).

use icu_properties::props::{EnumeratedProperty, GeneralCategory}; const CH: char = '\u{0560}'; fn main() { dbg!(unic_ucd_category::GeneralCategory::of(CH)); // Unassigned (wrong) dbg!(GeneralCategory::for_char(CH)); // LowercaseLetter (correct) }

what should we do?

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

crates/stdlib/src/unicodedata.rs (1)
194-217: Deduplicate the normalization pipeline.

Both methods repeat the same map_utf8(...).collect() body four times and only vary by the normalizer constructor. Please extract the form-specific choice first and keep the shared path in one place. As per coding guidelines, "When branches differ only in a value but share common logic, extract the differing value first, then call the common logic once to avoid duplicate code."

Also applies to: 222-245
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/unicodedata.rs` around lines 194 - 217, The normalize
function duplicates the map_utf8(...).collect() pipeline for each NormalizeForm
branch; instead, choose the appropriate normalizer constructor based on the form
(use the variants Nfc, Nfkc, Nfd, Nfkd and the constructors
ComposingNormalizerBorrowed::new_nfc(), ComposingNormalizerBorrowed::new_nfkc(),
DecomposingNormalizerBorrowed::new_nfd(),
DecomposingNormalizerBorrowed::new_nfkd()) and store it in a single variable (or
closure) and then call text.map_utf8(|s|
normalizer.normalize_iter(s.chars())).collect() once; apply the same refactor to
the later duplicate block as well so only the normalizer choice varies and the
shared map_utf8(...).collect() path is centralized.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 44-55: The code mixes Unicode data sources:
Age::of()/UNICODE_VERSION (unic_ucd_age, Unicode 10) vs. ICU-based APIs
(icu_properties/icu_normalizer, Unicode 17), so update the module to use one
consistent source—preferably the ICU data already used by category(),
bidirectional(), east_asian_width(), normalize(), is_normalized(), and
mirrored(): remove references to unic_ucd_age::Age and UNICODE_VERSION, derive
unidata_version from the ICU provider/metadata (the same runtime data backing
icu_properties and icu_normalizer), and change check_age() to use ICU's age
information (or remove age-based filtering) so that unidata_version accurately
reflects the data used by those functions. Ensure all places that previously
consulted Age::of() now query the ICU data provider or use the ICU-provided
UnicodeVersion so behavior and reported unidata_version remain in sync.

---

Nitpick comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 194-217: The normalize function duplicates the
map_utf8(...).collect() pipeline for each NormalizeForm branch; instead, choose
the appropriate normalizer constructor based on the form (use the variants Nfc,
Nfkc, Nfd, Nfkd and the constructors ComposingNormalizerBorrowed::new_nfc(),
ComposingNormalizerBorrowed::new_nfkc(),
DecomposingNormalizerBorrowed::new_nfd(),
DecomposingNormalizerBorrowed::new_nfkd()) and store it in a single variable (or
closure) and then call text.map_utf8(|s|
normalizer.normalize_iter(s.chars())).collect() once; apply the same refactor to
the later duplicate block as well so only the normalizer choice varies and the
shared map_utf8(...).collect() path is centralized.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 2897286b-ea41-43ef-acc5-3fe407520f70

📥 Commits

Reviewing files that changed from the base of the PR and between d9c4c95 and 273ad65.

⛔ Files ignored due to path filters (3)

Cargo.lock is excluded by !**/*.lock
Lib/test/test_str.py is excluded by !Lib/**
Lib/test/test_unicodedata.py is excluded by !Lib/**

📒 Files selected for processing (7)

Cargo.toml
crates/literal/Cargo.toml
crates/literal/src/char.rs
crates/stdlib/Cargo.toml
crates/stdlib/src/unicodedata.rs
crates/vm/Cargo.toml
crates/vm/src/builtins/str.rs

coderabbitai · 2026-04-01T09:59:47Z

crates/stdlib/src/unicodedata.rs

+    use icu_normalizer::{ComposingNormalizerBorrowed, DecomposingNormalizerBorrowed};
+    use icu_properties::{
+        CodePointSetData,
+        props::{
+            BidiClass, BidiMirrored, CanonicalCombiningClass, EastAsianWidth, EnumeratedProperty,
+            GeneralCategory, NamedEnumeratedProperty,
+        },
+    };
    use itertools::Itertools;
    use rustpython_common::wtf8::{CodePoint, Wtf8Buf};
-    use ucd::{Codepoint, DecompositionType, EastAsianWidth, Number, NumericType};
-    use unic_char_property::EnumeratedCharProperty;
-    use unic_normal::StrNormalForm;
+    use ucd::{Codepoint, DecompositionType, Number, NumericType};
    use unic_ucd_age::{Age, UNICODE_VERSION, UnicodeVersion};


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

wc -l crates/stdlib/src/unicodedata.rs

Repository: RustPython/RustPython

Length of output: 101

🏁 Script executed:

cat -n crates/stdlib/src/unicodedata.rs | head -250

Repository: RustPython/RustPython

Length of output: 10161

🏁 Script executed:

# Search for unidata_version and how it's constructed rg "unidata_version" crates/stdlib/src/unicodedata.rs -B 2 -A 2

Repository: RustPython/RustPython

Length of output: 300

🏁 Script executed:

# Search for check_age and Age usage rg "check_age|Age::|UNICODE_VERSION" crates/stdlib/src/unicodedata.rs -B 2 -A 2

Repository: RustPython/RustPython

Length of output: 1397

🏁 Script executed:

fd -type f Cargo.toml | grep -E "(stdlib|Cargo)" | head -5

Repository: RustPython/RustPython

Length of output: 235

🏁 Script executed:

# Find and examine the Cargo.toml for the stdlib crate cat Cargo.toml.in 2>/dev/null || find . -name "Cargo.toml" -path "*/stdlib/*" | head -1 | xargs cat

Repository: RustPython/RustPython

Length of output: 5918

🏁 Script executed:

# Search for icu and unic_ucd_age dependency versions rg "icu_|unic_ucd_age" Cargo.toml --type toml -A 1

Repository: RustPython/RustPython

Length of output: 131

🏁 Script executed:

# Check the root Cargo.toml for workspace dependencies head -150 Cargo.toml | grep -A 30 "^\[workspace.dependencies\]"

Repository: RustPython/RustPython

Length of output: 575

🏁 Script executed:

# Search for the unic_ucd_age version definition grep -A 5 "unic-ucd-age" Cargo.toml

Repository: RustPython/RustPython

Length of output: 182

🏁 Script executed:

# Get version info from Cargo.lock if available ls -la Cargo.lock 2>/dev/null && grep -A 5 "unic-ucd-age\|icu_properties\|icu_normalizer" Cargo.lock | head -40

Repository: RustPython/RustPython

Length of output: 1168

🏁 Script executed:

# Search for UNICODE_VERSION constant value or documentation rg "UNICODE_VERSION" --type rust -A 3 -B 3 | head -50

Repository: RustPython/RustPython

Length of output: 1335

🏁 Script executed:

# Check if there's any documentation or comments about Unicode versions in the code rg "Unicode|unicode" crates/stdlib/src/unicodedata.rs | head -20

Repository: RustPython/RustPython

Length of output: 792

🏁 Script executed:

# Look at the comment on line 179 about version mismatch sed -n '175,185p' crates/stdlib/src/unicodedata.rs

Repository: RustPython/RustPython

Length of output: 352

🏁 Script executed:

# Get more context about the ICU data versions used rg "2.1" --type rust -B 2 -A 2 | grep -E "(icu_|version|9.0|3.2)" | head -20

Repository: RustPython/RustPython

Length of output: 2575

🏁 Script executed:

# Check what Unicode version the ICU libraries support web_search ICU 2.1.2 unicode version support

Repository: RustPython/RustPython

Length of output: 115

🌐 Web query:

ICU 2.1.2 unicode version support which unicode standard version

💡 Result:

ICU 2.1.2 supports Unicode Standard version 2.1.2.

Citations:

1: https://icu.unicode.org/processes/release/tasks/versions

2: https://unicode-org.github.io/icu/processes/release/tasks/versions.html

3: http://www.unicode.org/versions/components-2.1.2.html

🌐 Web query:

ICU Unicode data version 15 16 icu_normalizer icu_properties 2.1

💡 Result:

ICU Unicode data versions 15 and 16 are supported in ICU4C releases: ICU 72/74 for Unicode 15/15.1, ICU 76 for Unicode 16. ICU 78 uses Unicode 17. In ICU4X (Rust), icu_normalizer and icu_properties crates at version 2.1 (released 2025-10-28) use data generated from ICU version release-78.1rc / icu4x/2025-10-28 tag and CLDR 48, corresponding to Unicode 17 (not 15 or 16). Earlier ICU4X versions aligned with Unicode 15 (ICU 74, 2023) and 16 (ICU 76, 2024). icu_normalizer depends on icu_properties for decomposition/composition properties. No specific "icu_properties 2.1" data version beyond the shared ICU4X data pipeline.

Citations:

1: https://github.com/unicode-org/icu/releases

2: https://unicode-org.github.io/icu/download/76.html

3: https://unicode-org.github.io/icu/download/

4: https://crates.io/crates/icu_normalizer/2.1.1/dependencies

5: https://crates.io/crates/icu_normalizer_data/1.5.0

6: https://unicode-org.github.io/icu4x/rustdoc/icu_properties_data/index.html

7: https://chromium.googlesource.com/external/github.com/unicode-org/icu4x/+/e0c2dad5b7e9d0217bf20cbbed80e285c71ddc29/CHANGELOG.md

8: https://docs.rs/crate/icu_normalizer_data/latest

9: https://blog.unicode.org/2025/05/icu4x-20-released.html

10: http://blog.unicode.org/2025/10/icu4x-21-released.html

11: http://www.icu-project.org/icu/download/78.html

12: http://blog.unicode.org/2025/05/icu4x-20-released.html

13: https://docs.rs/icu_properties_data/2.1.2/icu_properties_data/

🏁 Script executed:

# Check the actual Unicode version that unic_ucd_age provides rg "unic.ucd.age" Cargo.lock -A 10 | head -40

Repository: RustPython/RustPython

Length of output: 493

🌐 Web query:

unic-ucd-age 0.9.0 unicode version support UNIC

💡 Result:

The unic-ucd-age 0.9.0 crate, part of the UNIC (Unicode and Internationalization Crates for Rust) project, provides access to the Unicode character Age property from the Unicode Character Database (UCD). It supports Unicode 10.0.0 data, as indicated in the project's documentation for the unic super-crate version 0.9.0, which explicitly states it includes Unicode 10.0.0 data and algorithms. The crate exposes a UNICODE_VERSION constant representing the Unicode version of its data and includes the Age struct and CharAge trait for querying when characters were first assigned in Unicode versions. Released on 2019-03-03, it depends on unic-ucd-version ^0.9.0 and is marked as unmaintained. Example usage: Age::of('A') returns UnicodeVersion { major: 1, minor: 1, micro: 0 }, matching Unicode 1.1 assignment.

Citations:

1: https://docs.rs/unic-ucd-age/^0.9.0/

2: https://crates.io/crates/unic-ucd-age

3: https://lib.rs/crates/unic-ucd-age

4: https://docs.rs/unic-ucd-version/^0.9.0

5: https://docs.rs/crate/unic/latest

6: https://docs.rs/unic-ucd/^0.9.0/

7: https://docs.rs/unic/latest/unic/

🏁 Script executed:

# Check what unicode version unic_ucd_age uses rg "unic-ucd-version" Cargo.lock -A 5

Repository: RustPython/RustPython

Length of output: 320

Resolve the mixed Unicode database versions.

Age::of() and UNICODE_VERSION come from unic_ucd_age 0.9.0 (Unicode 10.0.0), but category(), bidirectional(), east_asian_width(), normalize(), is_normalized(), and mirrored() all use ICU properties and normalization (Unicode 17.0). This causes unidata_version to report 10.0.0 while queries run against Unicode 17.0 data, breaking Python's contract that unicodedata.unidata_version describes the actual Unicode database used by the module. Characters added after Unicode 10.0.0 will either be filtered out by check_age() or return incorrect properties, causing silent data corruption.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/stdlib/src/unicodedata.rs` around lines 44 - 55, The code mixes Unicode data sources: Age::of()/UNICODE_VERSION (unic_ucd_age, Unicode 10) vs. ICU-based APIs (icu_properties/icu_normalizer, Unicode 17), so update the module to use one consistent source—preferably the ICU data already used by category(), bidirectional(), east_asian_width(), normalize(), is_normalized(), and mirrored(): remove references to unic_ucd_age::Age and UNICODE_VERSION, derive unidata_version from the ICU provider/metadata (the same runtime data backing icu_properties and icu_normalizer), and change check_age() to use ICU's age information (or remove age-based filtering) so that unidata_version accurately reflects the data used by those functions. Ensure all places that previously consulted Age::of() now query the ICU data provider or use the ICU-provided UnicodeVersion so behavior and reported unidata_version remain in sync.

What is the recommended replacement crate for unic-ucd-age?

@ShaharNaveh, the recommended replacement for unic-ucd-age is the icu_properties crate — which you're already pulling in as a workspace dependency in this PR.

The icu_properties crate provides definitions of Unicode Properties and APIs for retrieving property data in an appropriate data structure, and is published both as its own crate and as part of the icu meta-crate.

The Age property is available via icu_properties::props::Age (an enumerated property), accessible through the standard CodePointMapData API:

use icu_properties::{CodePointMapData, props::Age}; // Replaces: Age::of(c) let age = CodePointMapData::<Age>::new().get(c);

For UNICODE_VERSION (the constant used to report unidata_version and drive check_age), you can derive it from icu_properties_data. icu_properties_data ships data generated with a specific ICU and Unicode version, so you'd hard-code or derive the version string to match the icu_properties/icu_normalizer data actually bundled. This is also the correct fix for the mixed-version issue flagged in my earlier review comment — once unic_ucd_age is removed, unidata_version will accurately reflect the Unicode version used by all the ICU-backed functions.

The migration would look roughly like:

Remove unic_ucd_age from crates/stdlib/Cargo.toml

Replace Age::of(c) in check_age() with CodePointMapData::<Age>::new().get(c) (returning an Age enum value you can compare against a target version)

Replace UNICODE_VERSION with the corresponding ICU Unicode version constant (e.g., hard-coded or from a const based on the compiled data version)

Would you like me to draft the full diff for this migration?

🧠 Learnings used

Learnt from: ShaharNaveh Repo: RustPython/RustPython PR: 7181 File: crates/stdlib/src/csv.rs:20-20 Timestamp: 2026-02-17T07:54:36.200Z Learning: In RustPython, ensure that modules using Python sequence-like types bring in extension methods by importing the trait (for example, 'use rustpython_vm::sliceable::SliceableSequenceOp;'). Even if the trait name isn't directly referenced in code, its methods (such as len()) become in-scope on types like PyStr. Add this import in modules that rely on those extension methods to avoid compilation errors.

icu_properties::props::Age doesn't exists

🧩 Analysis chain

@ShaharNaveh, you're absolutely right — I apologize for the incorrect suggestion. After verifying the icu_properties::props module docs, Age is indeed not exposed as a property in icu_properties v2. The props module includes things like BidiClass, GeneralCategory, EastAsianWidth, CanonicalCombiningClass, XidStart/XidContinue, etc., but not Age.

So the options for resolving the mixed-version concern are:

Keep unic_ucd_age as-is — it's the last remaining unic crate in use, but it serves a narrow and specific purpose (age-based filtering in check_age() and reporting UNICODE_VERSION). Since replacing it would require either a different crate or significant rework, this may be acceptable as a known limitation to address in a follow-up.

Hard-code the Unicode version constant to match the ICU data version (e.g. Unicode 16 or 17 depending on which icu_properties/icu_normalizer data version is bundled), and remove the age-based filtering from check_age() entirely. This eliminates unic_ucd_age at the cost of dropping the age-based character filtering that UCD.ucd_3_2_0 relies on.

Use a different crate that provides the Age property, such as the unicode-age or ucd-trie-based crates, but those may have their own maintenance concerns.

Given the scope of this PR (replacing unmaintained unic crates), option 1 (keeping unic_ucd_age for now) seems most pragmatic. The mixed-version issue I flagged is a real concern but may be pre-existing and better addressed separately. Would you like me to open a follow-up issue to track this?

🌐 Web query:

icu_properties 2 Age property Rust API docs CodePointMapData

💡 Result:

In ICU4X icu_properties v2.x (incl. icu_properties 2.1.2) there is currently no props::Age type, so you can’t do CodePointMapData::<props::Age> (or load an “Age” property map) via the Rust API. The props module lists the supported properties, and Age is not among them. [1]

What is supported is the general pattern: enumerated properties (like Script) are queried via CodePointMapData<T>, e.g. CodePointMapData::<Script>::new().get('木'). [2]

Sources:
[1] icu_properties::props module items (supported properties list) (docs.rs)
[2] icu::properties docs showing CodePointMapData usage for enumerated properties (docs.rs)

Citations:

1: https://docs.rs/icu_properties/latest/icu_properties/props/index.html

2: https://docs.rs/icu/latest/icu/properties/index.html

🌐 Web query:

icu4x icu_properties Age unicode property Rust crate API

💡 Result:

icu4x’s Rust Unicode-properties API is the icu_properties crate (also re-exported as icu::properties). It exposes properties via marker types in icu_properties::props (used with CodePointSetData/CodePointMapData) and helper modules like sets/maps. (docs.rs)

However, the Unicode Age property (the Unicode version a code point was first assigned) is not currently implemented/exposed by icu_properties: it does not appear in the crate’s “all items” list of available properties. (docs.rs)

If you specifically need Age, you have to source it elsewhere, for example:

Parse Unicode’s DerivedAge data yourself (the Age property is defined as the version of Unicode in which the code point was assigned). (unicode.org)

Use ICU4C’s API (ICU exposes Age/u_charAge). (unicode-org.github.io)

Citations:

1: https://docs.rs/icu_properties/latest/icu_properties/

2: https://docs.rs/icu_properties/latest/icu_properties/all.html

3: https://unicode.org/L2/L2009/09136-pri144-uax42-4.pdf?utm_source=openai

4: https://unicode-org.github.io/icu/userguide/strings/properties.html?utm_source=openai

@ShaharNaveh, you're correct — I apologize for the bad suggestion. The Age property is not currently implemented or exposed by icu_properties; it does not appear in the icu_properties::props module's list of supported properties.

So unic_ucd_age cannot be replaced by icu_properties at this time. The practical options are:

Keep unic_ucd_age as-is for this PR — it's the only remaining unic crate after your changes, serving the specific purpose of age-based filtering and version reporting. This is a reasonable stopping point for this PR.

Hard-code the Unicode version to match the ICU data, and drop age-based filtering — but that's a larger behavioral change that goes beyond just swapping dependencies.

Given the scope of this PR, keeping unic_ucd_age and deferring full replacement seems most pragmatic. Would you like me to open a follow-up issue to track replacing unic_ucd_age once icu_properties gains Age support, or if a suitable alternative crate is found?

✏️ Learnings added

Learnt from: ShaharNaveh Repo: RustPython/RustPython PR: 7555 File: crates/stdlib/src/unicodedata.rs:44-55 Timestamp: 2026-04-01T10:07:36.832Z Learning: In icu_properties v2 (icu4x), the Unicode Age property is NOT implemented or exposed. `icu_properties::props::Age` does not exist and is not in the props module items list. There is no direct replacement for `unic_ucd_age::Age` in icu4x v2 crates.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

🧠 Learnings used

Learnt from: ShaharNaveh Repo: RustPython/RustPython PR: 7181 File: crates/stdlib/src/csv.rs:20-20 Timestamp: 2026-02-17T07:54:36.200Z Learning: In RustPython, ensure that modules using Python sequence-like types bring in extension methods by importing the trait (for example, 'use rustpython_vm::sliceable::SliceableSequenceOp;'). Even if the trait name isn't directly referenced in code, its methods (such as len()) become in-scope on types like PyStr. Add this import in modules that rely on those extension methods to avoid compilation errors.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

crates/stdlib/src/unicodedata.rs (2)
179-190: ⚠️ Potential issue | 🟡 Minor

Refresh the stale version note above east_asian_width().

The note on Line 179 still says this path uses the 9.0.0 database, but the implementation now goes through ICU property data. Leaving it in place is misleading.
Suggested update
-        /// NOTE: This function uses 9.0.0 database instead of 3.2.0
+        /// NOTE: This function uses ICU property data rather than the 3.2.0 database.
As per coding guidelines, "Do not delete or rewrite existing comments unless they are factually wrong or directly contradict the new code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/unicodedata.rs` around lines 179 - 190, The comment above
the east_asian_width method is outdated (mentions "uses 9.0.0 database") but the
implementation now relies on ICU property data; update the note above fn
east_asian_width to accurately state that EastAsianWidth values are derived from
ICU property data (or similar current source) instead of referencing the old
9.0.0 database, ensuring the comment matches the implementation without changing
code logic.
164-176: ⚠️ Potential issue | 🟠 Major

Preserve CPython's empty-string bidi result for U+FFFE.

U+FFFE (a noncharacter) has BidiClass::BoundaryNeutral, so BidiClass::short_name() returns "BN". However, Lib/test/test_unicodedata.py:183 expects unicodedata.bidirectional('\uFFFE') to return '' (empty string). The current implementation exposes ICU abbreviations directly, violating Python's API contract. Add explicit handling to return empty strings for characters that should not have a bidi classification.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/unicodedata.rs` around lines 164 - 176, The bidirectional
method currently returns ICU abbreviations via BidiClass::short_name(), which
yields "BN" for U+FFFE; update bidirectional (in the bidirectional function
where you call self.extract_char and BidiClass::for_char/short_name) to
explicitly return an empty string for U+FFFE (or other noncharacter codepoints
that CPython treats as no bidi) instead of the ICU short name — i.e., after
extracting the char, check its codepoint (c.to_char().map or similar) and if it
equals 0xFFFE (or otherwise identified as a noncharacter per CPython behavior)
return "" before calling BidiClass::short_name().

🧹 Nitpick comments (1)

crates/stdlib/src/unicodedata.rs (1)
194-247: Extract the normalization-form dispatch once.

normalize() and is_normalized() now duplicate the same form-to-normalizer mapping. Pulling that into one helper will keep the two paths from drifting on the next change.

As per coding guidelines, "When branches differ only in a value but share common logic, extract the differing value first, then call the common logic once to avoid duplicate code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/unicodedata.rs` around lines 194 - 247, The
normalize()/is_normalized() functions duplicate the same NormalizeForm ->
normalizer construction; extract that dispatch into a private helper (e.g., fn
get_normalizer(form: super::NormalizeForm) -> impl Fn(&str) -> impl
Iterator<Item=char> or return an enum/trait object representing the chosen
normalizer) that maps Nfc/Nfkc to
ComposingNormalizerBorrowed::new_nfc()/new_nfkc and Nfd/Nfkd to
DecomposingNormalizerBorrowed::new_nfd()/new_nfkd, then call that helper from
both normalize() and is_normalized() to run text.map_utf8(|s|
normalizer.normalize_iter(s.chars())).collect() and compare results, removing
the duplicated match blocks.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 179-190: The comment above the east_asian_width method is outdated
(mentions "uses 9.0.0 database") but the implementation now relies on ICU
property data; update the note above fn east_asian_width to accurately state
that EastAsianWidth values are derived from ICU property data (or similar
current source) instead of referencing the old 9.0.0 database, ensuring the
comment matches the implementation without changing code logic.
- Around line 164-176: The bidirectional method currently returns ICU
abbreviations via BidiClass::short_name(), which yields "BN" for U+FFFE; update
bidirectional (in the bidirectional function where you call self.extract_char
and BidiClass::for_char/short_name) to explicitly return an empty string for
U+FFFE (or other noncharacter codepoints that CPython treats as no bidi) instead
of the ICU short name — i.e., after extracting the char, check its codepoint
(c.to_char().map or similar) and if it equals 0xFFFE (or otherwise identified as
a noncharacter per CPython behavior) return "" before calling
BidiClass::short_name().

---

Nitpick comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 194-247: The normalize()/is_normalized() functions duplicate the
same NormalizeForm -> normalizer construction; extract that dispatch into a
private helper (e.g., fn get_normalizer(form: super::NormalizeForm) -> impl
Fn(&str) -> impl Iterator<Item=char> or return an enum/trait object representing
the chosen normalizer) that maps Nfc/Nfkc to
ComposingNormalizerBorrowed::new_nfc()/new_nfkc and Nfd/Nfkd to
DecomposingNormalizerBorrowed::new_nfd()/new_nfkd, then call that helper from
both normalize() and is_normalized() to run text.map_utf8(|s|
normalizer.normalize_iter(s.chars())).collect() and compare results, removing
the duplicated match blocks.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 4b3e58a4-f77c-4b38-8fc8-84b1a65afe1f

📥 Commits

Reviewing files that changed from the base of the PR and between 273ad65 and 3245cd1.

📒 Files selected for processing (3)

Cargo.toml
crates/stdlib/src/unicodedata.rs
crates/vm/src/builtins/str.rs

✅ Files skipped from review due to trivial changes (1)

Cargo.toml

youknowone · 2026-04-02T16:42:57Z

crates/literal/src/char.rs

 pub fn is_printable(c: char) -> bool {
-    let cat = GeneralCategory::of(c);
-    !(cat.is_other() || cat.is_separator())
+    let cat = GeneralCategory::for_char(c);


I feel like we need to have universe character set comparison script to ensure texts are actually the same.

ShaharNaveh added 9 commits April 1, 2026 02:47

Use maintained crates

6abe80a

Fix literal

36c3886

Fix some of vm

23335bd

Fix vm

de8cd0a

Fix stdlib

15b4a82

Migrate more at stdlib

4789fc0

Use shortform

b0583ab

Mark failing test

47dadaf

Fix test marks

273ad65

ShaharNaveh commented Apr 1, 2026

View reviewed changes

ShaharNaveh added 3 commits April 1, 2026 11:44

cargo shear

5a776f5

Clippy

0cc47eb

clippy

3245cd1

coderabbitai bot reviewed Apr 1, 2026

View reviewed changes

ShaharNaveh mentioned this pull request Apr 1, 2026

Replace unmaintained unic-ucd-category crate with icu_properties astral-sh/ruff#24344

Merged

youknowone approved these changes Apr 2, 2026

View reviewed changes

youknowone merged commit 3d96884 into RustPython:main Apr 2, 2026
21 checks passed

youknowone pushed a commit to youknowone/RustPython that referenced this pull request Apr 5, 2026

Replace unmaintained unic crates (RustPython#7555)

2c88743

Conversation

ShaharNaveh commented Apr 1, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

github-actions bot commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📦 Library Dependencies

Uh oh!

ShaharNaveh Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

youknowone Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ShaharNaveh Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

ShaharNaveh Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

youknowone Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ShaharNaveh commented Apr 1, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 1, 2026 •

edited

Loading

github-actions bot commented Apr 1, 2026 •

edited

Loading

ShaharNaveh Apr 1, 2026 •

edited

Loading

coderabbitai bot Apr 1, 2026 •

edited

Loading

ShaharNaveh Apr 1, 2026 •

edited

Loading