Replace unmaintained unic crates#7555
Conversation
📝 WalkthroughWalkthroughMigrates Unicode handling from several Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
📦 Library DependenciesThe following Lib/ modules were modified. Here are their dependencies: (module 'str test_unicodedata' not found) Legend:
|
|
|
||
| @unittest.expectedFailure # TODO: RUSTPYTHON | ||
| @support.requires_resource('cpu') | ||
| def test_isprintable_invariant(self): |
There was a problem hiding this comment.
This test was giving us a false sense that it works, because it used to report \u{0560} as non-printable (but now it's saying it's printable).
use icu_properties::props::{EnumeratedProperty, GeneralCategory};
const CH: char = '\u{0560}';
fn main() {
dbg!(unic_ucd_category::GeneralCategory::of(CH)); // Unassigned (wrong)
dbg!(GeneralCategory::for_char(CH)); // LowercaseLetter (correct)
}There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
crates/stdlib/src/unicodedata.rs (1)
194-217: Deduplicate the normalization pipeline.Both methods repeat the same
map_utf8(...).collect()body four times and only vary by the normalizer constructor. Please extract the form-specific choice first and keep the shared path in one place. As per coding guidelines, "When branches differ only in a value but share common logic, extract the differing value first, then call the common logic once to avoid duplicate code."Also applies to: 222-245
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/stdlib/src/unicodedata.rs` around lines 194 - 217, The normalize function duplicates the map_utf8(...).collect() pipeline for each NormalizeForm branch; instead, choose the appropriate normalizer constructor based on the form (use the variants Nfc, Nfkc, Nfd, Nfkd and the constructors ComposingNormalizerBorrowed::new_nfc(), ComposingNormalizerBorrowed::new_nfkc(), DecomposingNormalizerBorrowed::new_nfd(), DecomposingNormalizerBorrowed::new_nfkd()) and store it in a single variable (or closure) and then call text.map_utf8(|s| normalizer.normalize_iter(s.chars())).collect() once; apply the same refactor to the later duplicate block as well so only the normalizer choice varies and the shared map_utf8(...).collect() path is centralized.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 44-55: The code mixes Unicode data sources:
Age::of()/UNICODE_VERSION (unic_ucd_age, Unicode 10) vs. ICU-based APIs
(icu_properties/icu_normalizer, Unicode 17), so update the module to use one
consistent source—preferably the ICU data already used by category(),
bidirectional(), east_asian_width(), normalize(), is_normalized(), and
mirrored(): remove references to unic_ucd_age::Age and UNICODE_VERSION, derive
unidata_version from the ICU provider/metadata (the same runtime data backing
icu_properties and icu_normalizer), and change check_age() to use ICU's age
information (or remove age-based filtering) so that unidata_version accurately
reflects the data used by those functions. Ensure all places that previously
consulted Age::of() now query the ICU data provider or use the ICU-provided
UnicodeVersion so behavior and reported unidata_version remain in sync.
---
Nitpick comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 194-217: The normalize function duplicates the
map_utf8(...).collect() pipeline for each NormalizeForm branch; instead, choose
the appropriate normalizer constructor based on the form (use the variants Nfc,
Nfkc, Nfd, Nfkd and the constructors ComposingNormalizerBorrowed::new_nfc(),
ComposingNormalizerBorrowed::new_nfkc(),
DecomposingNormalizerBorrowed::new_nfd(),
DecomposingNormalizerBorrowed::new_nfkd()) and store it in a single variable (or
closure) and then call text.map_utf8(|s|
normalizer.normalize_iter(s.chars())).collect() once; apply the same refactor to
the later duplicate block as well so only the normalizer choice varies and the
shared map_utf8(...).collect() path is centralized.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 2897286b-ea41-43ef-acc5-3fe407520f70
⛔ Files ignored due to path filters (3)
Cargo.lockis excluded by!**/*.lockLib/test/test_str.pyis excluded by!Lib/**Lib/test/test_unicodedata.pyis excluded by!Lib/**
📒 Files selected for processing (7)
Cargo.tomlcrates/literal/Cargo.tomlcrates/literal/src/char.rscrates/stdlib/Cargo.tomlcrates/stdlib/src/unicodedata.rscrates/vm/Cargo.tomlcrates/vm/src/builtins/str.rs
| use icu_normalizer::{ComposingNormalizerBorrowed, DecomposingNormalizerBorrowed}; | ||
| use icu_properties::{ | ||
| CodePointSetData, | ||
| props::{ | ||
| BidiClass, BidiMirrored, CanonicalCombiningClass, EastAsianWidth, EnumeratedProperty, | ||
| GeneralCategory, NamedEnumeratedProperty, | ||
| }, | ||
| }; | ||
| use itertools::Itertools; | ||
| use rustpython_common::wtf8::{CodePoint, Wtf8Buf}; | ||
| use ucd::{Codepoint, DecompositionType, EastAsianWidth, Number, NumericType}; | ||
| use unic_char_property::EnumeratedCharProperty; | ||
| use unic_normal::StrNormalForm; | ||
| use ucd::{Codepoint, DecompositionType, Number, NumericType}; | ||
| use unic_ucd_age::{Age, UNICODE_VERSION, UnicodeVersion}; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
wc -l crates/stdlib/src/unicodedata.rsRepository: RustPython/RustPython
Length of output: 101
🏁 Script executed:
cat -n crates/stdlib/src/unicodedata.rs | head -250Repository: RustPython/RustPython
Length of output: 10161
🏁 Script executed:
# Search for unidata_version and how it's constructed
rg "unidata_version" crates/stdlib/src/unicodedata.rs -B 2 -A 2Repository: RustPython/RustPython
Length of output: 300
🏁 Script executed:
# Search for check_age and Age usage
rg "check_age|Age::|UNICODE_VERSION" crates/stdlib/src/unicodedata.rs -B 2 -A 2Repository: RustPython/RustPython
Length of output: 1397
🏁 Script executed:
fd -type f Cargo.toml | grep -E "(stdlib|Cargo)" | head -5Repository: RustPython/RustPython
Length of output: 235
🏁 Script executed:
# Find and examine the Cargo.toml for the stdlib crate
cat Cargo.toml.in 2>/dev/null || find . -name "Cargo.toml" -path "*/stdlib/*" | head -1 | xargs catRepository: RustPython/RustPython
Length of output: 5918
🏁 Script executed:
# Search for icu and unic_ucd_age dependency versions
rg "icu_|unic_ucd_age" Cargo.toml --type toml -A 1Repository: RustPython/RustPython
Length of output: 131
🏁 Script executed:
# Check the root Cargo.toml for workspace dependencies
head -150 Cargo.toml | grep -A 30 "^\[workspace.dependencies\]"Repository: RustPython/RustPython
Length of output: 575
🏁 Script executed:
# Search for the unic_ucd_age version definition
grep -A 5 "unic-ucd-age" Cargo.tomlRepository: RustPython/RustPython
Length of output: 182
🏁 Script executed:
# Get version info from Cargo.lock if available
ls -la Cargo.lock 2>/dev/null && grep -A 5 "unic-ucd-age\|icu_properties\|icu_normalizer" Cargo.lock | head -40Repository: RustPython/RustPython
Length of output: 1168
🏁 Script executed:
# Search for UNICODE_VERSION constant value or documentation
rg "UNICODE_VERSION" --type rust -A 3 -B 3 | head -50Repository: RustPython/RustPython
Length of output: 1335
🏁 Script executed:
# Check if there's any documentation or comments about Unicode versions in the code
rg "Unicode|unicode" crates/stdlib/src/unicodedata.rs | head -20Repository: RustPython/RustPython
Length of output: 792
🏁 Script executed:
# Look at the comment on line 179 about version mismatch
sed -n '175,185p' crates/stdlib/src/unicodedata.rsRepository: RustPython/RustPython
Length of output: 352
🏁 Script executed:
# Get more context about the ICU data versions used
rg "2.1" --type rust -B 2 -A 2 | grep -E "(icu_|version|9.0|3.2)" | head -20Repository: RustPython/RustPython
Length of output: 2575
🏁 Script executed:
# Check what Unicode version the ICU libraries support
web_search ICU 2.1.2 unicode version supportRepository: RustPython/RustPython
Length of output: 115
🌐 Web query:
ICU 2.1.2 unicode version support which unicode standard version
💡 Result:
ICU 2.1.2 supports Unicode Standard version 2.1.2.
Citations:
- 1: https://icu.unicode.org/processes/release/tasks/versions
- 2: https://unicode-org.github.io/icu/processes/release/tasks/versions.html
- 3: http://www.unicode.org/versions/components-2.1.2.html
🌐 Web query:
ICU Unicode data version 15 16 icu_normalizer icu_properties 2.1
💡 Result:
ICU Unicode data versions 15 and 16 are supported in ICU4C releases: ICU 72/74 for Unicode 15/15.1, ICU 76 for Unicode 16. ICU 78 uses Unicode 17. In ICU4X (Rust), icu_normalizer and icu_properties crates at version 2.1 (released 2025-10-28) use data generated from ICU version release-78.1rc / icu4x/2025-10-28 tag and CLDR 48, corresponding to Unicode 17 (not 15 or 16). Earlier ICU4X versions aligned with Unicode 15 (ICU 74, 2023) and 16 (ICU 76, 2024). icu_normalizer depends on icu_properties for decomposition/composition properties. No specific "icu_properties 2.1" data version beyond the shared ICU4X data pipeline.
Citations:
- 1: https://github.com/unicode-org/icu/releases
- 2: https://unicode-org.github.io/icu/download/76.html
- 3: https://unicode-org.github.io/icu/download/
- 4: https://crates.io/crates/icu_normalizer/2.1.1/dependencies
- 5: https://crates.io/crates/icu_normalizer_data/1.5.0
- 6: https://unicode-org.github.io/icu4x/rustdoc/icu_properties_data/index.html
- 7: https://chromium.googlesource.com/external/github.com/unicode-org/icu4x/+/e0c2dad5b7e9d0217bf20cbbed80e285c71ddc29/CHANGELOG.md
- 8: https://docs.rs/crate/icu_normalizer_data/latest
- 9: https://blog.unicode.org/2025/05/icu4x-20-released.html
- 10: http://blog.unicode.org/2025/10/icu4x-21-released.html
- 11: http://www.icu-project.org/icu/download/78.html
- 12: http://blog.unicode.org/2025/05/icu4x-20-released.html
- 13: https://docs.rs/icu_properties_data/2.1.2/icu_properties_data/
🏁 Script executed:
# Check the actual Unicode version that unic_ucd_age provides
rg "unic.ucd.age" Cargo.lock -A 10 | head -40Repository: RustPython/RustPython
Length of output: 493
🌐 Web query:
unic-ucd-age 0.9.0 unicode version support UNIC
💡 Result:
The unic-ucd-age 0.9.0 crate, part of the UNIC (Unicode and Internationalization Crates for Rust) project, provides access to the Unicode character Age property from the Unicode Character Database (UCD). It supports Unicode 10.0.0 data, as indicated in the project's documentation for the unic super-crate version 0.9.0, which explicitly states it includes Unicode 10.0.0 data and algorithms. The crate exposes a UNICODE_VERSION constant representing the Unicode version of its data and includes the Age struct and CharAge trait for querying when characters were first assigned in Unicode versions. Released on 2019-03-03, it depends on unic-ucd-version ^0.9.0 and is marked as unmaintained. Example usage: Age::of('A') returns UnicodeVersion { major: 1, minor: 1, micro: 0 }, matching Unicode 1.1 assignment.
Citations:
- 1: https://docs.rs/unic-ucd-age/^0.9.0/
- 2: https://crates.io/crates/unic-ucd-age
- 3: https://lib.rs/crates/unic-ucd-age
- 4: https://docs.rs/unic-ucd-version/^0.9.0
- 5: https://docs.rs/crate/unic/latest
- 6: https://docs.rs/unic-ucd/^0.9.0/
- 7: https://docs.rs/unic/latest/unic/
🏁 Script executed:
# Check what unicode version unic_ucd_age uses
rg "unic-ucd-version" Cargo.lock -A 5Repository: RustPython/RustPython
Length of output: 320
Resolve the mixed Unicode database versions.
Age::of() and UNICODE_VERSION come from unic_ucd_age 0.9.0 (Unicode 10.0.0), but category(), bidirectional(), east_asian_width(), normalize(), is_normalized(), and mirrored() all use ICU properties and normalization (Unicode 17.0). This causes unidata_version to report 10.0.0 while queries run against Unicode 17.0 data, breaking Python's contract that unicodedata.unidata_version describes the actual Unicode database used by the module. Characters added after Unicode 10.0.0 will either be filtered out by check_age() or return incorrect properties, causing silent data corruption.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/stdlib/src/unicodedata.rs` around lines 44 - 55, The code mixes
Unicode data sources: Age::of()/UNICODE_VERSION (unic_ucd_age, Unicode 10) vs.
ICU-based APIs (icu_properties/icu_normalizer, Unicode 17), so update the module
to use one consistent source—preferably the ICU data already used by category(),
bidirectional(), east_asian_width(), normalize(), is_normalized(), and
mirrored(): remove references to unic_ucd_age::Age and UNICODE_VERSION, derive
unidata_version from the ICU provider/metadata (the same runtime data backing
icu_properties and icu_normalizer), and change check_age() to use ICU's age
information (or remove age-based filtering) so that unidata_version accurately
reflects the data used by those functions. Ensure all places that previously
consulted Age::of() now query the ICU data provider or use the ICU-provided
UnicodeVersion so behavior and reported unidata_version remain in sync.
There was a problem hiding this comment.
What is the recommended replacement crate for unic-ucd-age?
There was a problem hiding this comment.
@ShaharNaveh, the recommended replacement for unic-ucd-age is the icu_properties crate — which you're already pulling in as a workspace dependency in this PR.
The icu_properties crate provides definitions of Unicode Properties and APIs for retrieving property data in an appropriate data structure, and is published both as its own crate and as part of the icu meta-crate.
The Age property is available via icu_properties::props::Age (an enumerated property), accessible through the standard CodePointMapData API:
use icu_properties::{CodePointMapData, props::Age};
// Replaces: Age::of(c)
let age = CodePointMapData::<Age>::new().get(c);For UNICODE_VERSION (the constant used to report unidata_version and drive check_age), you can derive it from icu_properties_data. icu_properties_data ships data generated with a specific ICU and Unicode version, so you'd hard-code or derive the version string to match the icu_properties/icu_normalizer data actually bundled. This is also the correct fix for the mixed-version issue flagged in my earlier review comment — once unic_ucd_age is removed, unidata_version will accurately reflect the Unicode version used by all the ICU-backed functions.
The migration would look roughly like:
- Remove
unic_ucd_agefromcrates/stdlib/Cargo.toml - Replace
Age::of(c)incheck_age()withCodePointMapData::<Age>::new().get(c)(returning anAgeenum value you can compare against a target version) - Replace
UNICODE_VERSIONwith the corresponding ICU Unicode version constant (e.g., hard-coded or from aconstbased on the compiled data version)
Would you like me to draft the full diff for this migration?
🧠 Learnings used
Learnt from: ShaharNaveh
Repo: RustPython/RustPython PR: 7181
File: crates/stdlib/src/csv.rs:20-20
Timestamp: 2026-02-17T07:54:36.200Z
Learning: In RustPython, ensure that modules using Python sequence-like types bring in extension methods by importing the trait (for example, 'use rustpython_vm::sliceable::SliceableSequenceOp;'). Even if the trait name isn't directly referenced in code, its methods (such as len()) become in-scope on types like PyStr. Add this import in modules that rely on those extension methods to avoid compilation errors.
There was a problem hiding this comment.
icu_properties::props::Age doesn't exists
There was a problem hiding this comment.
🧩 Analysis chain
@ShaharNaveh, you're absolutely right — I apologize for the incorrect suggestion. After verifying the icu_properties::props module docs, Age is indeed not exposed as a property in icu_properties v2. The props module includes things like BidiClass, GeneralCategory, EastAsianWidth, CanonicalCombiningClass, XidStart/XidContinue, etc., but not Age.
So the options for resolving the mixed-version concern are:
-
Keep
unic_ucd_ageas-is — it's the last remaininguniccrate in use, but it serves a narrow and specific purpose (age-based filtering incheck_age()and reportingUNICODE_VERSION). Since replacing it would require either a different crate or significant rework, this may be acceptable as a known limitation to address in a follow-up. -
Hard-code the Unicode version constant to match the ICU data version (e.g. Unicode 16 or 17 depending on which
icu_properties/icu_normalizerdata version is bundled), and remove the age-based filtering fromcheck_age()entirely. This eliminatesunic_ucd_ageat the cost of dropping the age-based character filtering thatUCD.ucd_3_2_0relies on. -
Use a different crate that provides the Age property, such as the
unicode-ageorucd-trie-based crates, but those may have their own maintenance concerns.
Given the scope of this PR (replacing unmaintained unic crates), option 1 (keeping unic_ucd_age for now) seems most pragmatic. The mixed-version issue I flagged is a real concern but may be pre-existing and better addressed separately. Would you like me to open a follow-up issue to track this?
🌐 Web query:
icu_properties 2 Age property Rust API docs CodePointMapData
💡 Result:
In ICU4X icu_properties v2.x (incl. icu_properties 2.1.2) there is currently no props::Age type, so you can’t do CodePointMapData::<props::Age> (or load an “Age” property map) via the Rust API. The props module lists the supported properties, and Age is not among them. [1]
What is supported is the general pattern: enumerated properties (like Script) are queried via CodePointMapData<T>, e.g. CodePointMapData::<Script>::new().get('木'). [2]
Sources:
[1] icu_properties::props module items (supported properties list) (docs.rs)
[2] icu::properties docs showing CodePointMapData usage for enumerated properties (docs.rs)
Citations:
- 1: https://docs.rs/icu_properties/latest/icu_properties/props/index.html
- 2: https://docs.rs/icu/latest/icu/properties/index.html
🌐 Web query:
icu4x icu_properties Age unicode property Rust crate API
💡 Result:
icu4x’s Rust Unicode-properties API is the icu_properties crate (also re-exported as icu::properties). It exposes properties via marker types in icu_properties::props (used with CodePointSetData/CodePointMapData) and helper modules like sets/maps. (docs.rs)
However, the Unicode Age property (the Unicode version a code point was first assigned) is not currently implemented/exposed by icu_properties: it does not appear in the crate’s “all items” list of available properties. (docs.rs)
If you specifically need Age, you have to source it elsewhere, for example:
- Parse Unicode’s DerivedAge data yourself (the
Ageproperty is defined as the version of Unicode in which the code point was assigned). (unicode.org) - Use ICU4C’s API (ICU exposes
Age/u_charAge). (unicode-org.github.io)
Citations:
- 1: https://docs.rs/icu_properties/latest/icu_properties/
- 2: https://docs.rs/icu_properties/latest/icu_properties/all.html
- 3: https://unicode.org/L2/L2009/09136-pri144-uax42-4.pdf?utm_source=openai
- 4: https://unicode-org.github.io/icu/userguide/strings/properties.html?utm_source=openai
@ShaharNaveh, you're correct — I apologize for the bad suggestion. The Age property is not currently implemented or exposed by icu_properties; it does not appear in the icu_properties::props module's list of supported properties.
So unic_ucd_age cannot be replaced by icu_properties at this time. The practical options are:
- Keep
unic_ucd_ageas-is for this PR — it's the only remaininguniccrate after your changes, serving the specific purpose of age-based filtering and version reporting. This is a reasonable stopping point for this PR. - Hard-code the Unicode version to match the ICU data, and drop age-based filtering — but that's a larger behavioral change that goes beyond just swapping dependencies.
Given the scope of this PR, keeping unic_ucd_age and deferring full replacement seems most pragmatic. Would you like me to open a follow-up issue to track replacing unic_ucd_age once icu_properties gains Age support, or if a suitable alternative crate is found?
✏️ Learnings added
Learnt from: ShaharNaveh
Repo: RustPython/RustPython PR: 7555
File: crates/stdlib/src/unicodedata.rs:44-55
Timestamp: 2026-04-01T10:07:36.832Z
Learning: In icu_properties v2 (icu4x), the Unicode Age property is NOT implemented or exposed. `icu_properties::props::Age` does not exist and is not in the props module items list. There is no direct replacement for `unic_ucd_age::Age` in icu4x v2 crates.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: ShaharNaveh
Repo: RustPython/RustPython PR: 7181
File: crates/stdlib/src/csv.rs:20-20
Timestamp: 2026-02-17T07:54:36.200Z
Learning: In RustPython, ensure that modules using Python sequence-like types bring in extension methods by importing the trait (for example, 'use rustpython_vm::sliceable::SliceableSequenceOp;'). Even if the trait name isn't directly referenced in code, its methods (such as len()) become in-scope on types like PyStr. Add this import in modules that rely on those extension methods to avoid compilation errors.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
crates/stdlib/src/unicodedata.rs (2)
179-190:⚠️ Potential issue | 🟡 MinorRefresh the stale version note above
east_asian_width().The note on Line 179 still says this path uses the 9.0.0 database, but the implementation now goes through ICU property data. Leaving it in place is misleading.
Suggested update
- /// NOTE: This function uses 9.0.0 database instead of 3.2.0 + /// NOTE: This function uses ICU property data rather than the 3.2.0 database.As per coding guidelines, "Do not delete or rewrite existing comments unless they are factually wrong or directly contradict the new code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/stdlib/src/unicodedata.rs` around lines 179 - 190, The comment above the east_asian_width method is outdated (mentions "uses 9.0.0 database") but the implementation now relies on ICU property data; update the note above fn east_asian_width to accurately state that EastAsianWidth values are derived from ICU property data (or similar current source) instead of referencing the old 9.0.0 database, ensuring the comment matches the implementation without changing code logic.
164-176:⚠️ Potential issue | 🟠 MajorPreserve CPython's empty-string bidi result for U+FFFE.
U+FFFE (a noncharacter) has
BidiClass::BoundaryNeutral, soBidiClass::short_name()returns"BN". However,Lib/test/test_unicodedata.py:183expectsunicodedata.bidirectional('\uFFFE')to return''(empty string). The current implementation exposes ICU abbreviations directly, violating Python's API contract. Add explicit handling to return empty strings for characters that should not have a bidi classification.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/stdlib/src/unicodedata.rs` around lines 164 - 176, The bidirectional method currently returns ICU abbreviations via BidiClass::short_name(), which yields "BN" for U+FFFE; update bidirectional (in the bidirectional function where you call self.extract_char and BidiClass::for_char/short_name) to explicitly return an empty string for U+FFFE (or other noncharacter codepoints that CPython treats as no bidi) instead of the ICU short name — i.e., after extracting the char, check its codepoint (c.to_char().map or similar) and if it equals 0xFFFE (or otherwise identified as a noncharacter per CPython behavior) return "" before calling BidiClass::short_name().
🧹 Nitpick comments (1)
crates/stdlib/src/unicodedata.rs (1)
194-247: Extract the normalization-form dispatch once.
normalize()andis_normalized()now duplicate the same form-to-normalizer mapping. Pulling that into one helper will keep the two paths from drifting on the next change.As per coding guidelines, "When branches differ only in a value but share common logic, extract the differing value first, then call the common logic once to avoid duplicate code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/stdlib/src/unicodedata.rs` around lines 194 - 247, The normalize()/is_normalized() functions duplicate the same NormalizeForm -> normalizer construction; extract that dispatch into a private helper (e.g., fn get_normalizer(form: super::NormalizeForm) -> impl Fn(&str) -> impl Iterator<Item=char> or return an enum/trait object representing the chosen normalizer) that maps Nfc/Nfkc to ComposingNormalizerBorrowed::new_nfc()/new_nfkc and Nfd/Nfkd to DecomposingNormalizerBorrowed::new_nfd()/new_nfkd, then call that helper from both normalize() and is_normalized() to run text.map_utf8(|s| normalizer.normalize_iter(s.chars())).collect() and compare results, removing the duplicated match blocks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 179-190: The comment above the east_asian_width method is outdated
(mentions "uses 9.0.0 database") but the implementation now relies on ICU
property data; update the note above fn east_asian_width to accurately state
that EastAsianWidth values are derived from ICU property data (or similar
current source) instead of referencing the old 9.0.0 database, ensuring the
comment matches the implementation without changing code logic.
- Around line 164-176: The bidirectional method currently returns ICU
abbreviations via BidiClass::short_name(), which yields "BN" for U+FFFE; update
bidirectional (in the bidirectional function where you call self.extract_char
and BidiClass::for_char/short_name) to explicitly return an empty string for
U+FFFE (or other noncharacter codepoints that CPython treats as no bidi) instead
of the ICU short name — i.e., after extracting the char, check its codepoint
(c.to_char().map or similar) and if it equals 0xFFFE (or otherwise identified as
a noncharacter per CPython behavior) return "" before calling
BidiClass::short_name().
---
Nitpick comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 194-247: The normalize()/is_normalized() functions duplicate the
same NormalizeForm -> normalizer construction; extract that dispatch into a
private helper (e.g., fn get_normalizer(form: super::NormalizeForm) -> impl
Fn(&str) -> impl Iterator<Item=char> or return an enum/trait object representing
the chosen normalizer) that maps Nfc/Nfkc to
ComposingNormalizerBorrowed::new_nfc()/new_nfkc and Nfd/Nfkd to
DecomposingNormalizerBorrowed::new_nfd()/new_nfkd, then call that helper from
both normalize() and is_normalized() to run text.map_utf8(|s|
normalizer.normalize_iter(s.chars())).collect() and compare results, removing
the duplicated match blocks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 4b3e58a4-f77c-4b38-8fc8-84b1a65afe1f
📒 Files selected for processing (3)
Cargo.tomlcrates/stdlib/src/unicodedata.rscrates/vm/src/builtins/str.rs
✅ Files skipped from review due to trivial changes (1)
- Cargo.toml
| pub fn is_printable(c: char) -> bool { | ||
| let cat = GeneralCategory::of(c); | ||
| !(cat.is_other() || cat.is_separator()) | ||
| let cat = GeneralCategory::for_char(c); |
There was a problem hiding this comment.
I feel like we need to have universe character set comparison script to ensure texts are actually the same.
ref: #7529
Summary by CodeRabbit
Chores
User-visible changes
Breaking Changes