Implement locale-aware 'n' format specifier for int, float, complex#7350
Implement locale-aware 'n' format specifier for int, float, complex#7350youknowone merged 3 commits intoRustPython:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds locale-aware "n" formatting: new public Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant Builtin as Builtin.__format__
participant Spec as FormatSpec.parse / has_locale_format
participant Locale as get_locale_info()
participant Formatter as FormatSpec::format_*_locale
Caller->>Builtin: __format__(value, spec)
Builtin->>Spec: parse(spec)
alt parse error
Spec-->>Builtin: Err
Builtin-->>Caller: raise PyException
else parsed
Spec->>Spec: has_locale_format()?
alt locale requested
Builtin->>Locale: get_locale_info()
Locale-->>Builtin: LocaleInfo
Builtin->>Formatter: format_*_locale(value, LocaleInfo)
Formatter-->>Builtin: formatted string / Err
Builtin-->>Caller: return or raise PyException
else default path
Builtin->>Spec: format_* (non-locale)
Spec-->>Builtin: formatted string / Err
Builtin-->>Caller: return or raise PyException
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Code has been automatically formatted The code in this PR has been formatted using:
git pull origin locale-n |
📦 Library DependenciesThe following Lib/ modules were modified. Here are their dependencies: [x] test: cpython/Lib/test/test_format.py (TODO: 6) dependencies: dependent tests: (no tests depend on format) [x] lib: cpython/Lib/types.py dependencies:
dependent tests: (52 tests)
Legend:
|
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/common/src/format.rs`:
- Around line 640-690: The magnitude components are being formatted with
sign/width/alignment from the original spec (via magnitude_format.format_float),
causing duplicated signs and inner padding; also the real-part suppression
currently hides a zero real part even when an explicit complex format type was
requested. Fix by building magnitude_format from self but explicitly clear sign,
width, align, fill, zero_pad and alternate (so magnitude_format has no
sign/width/alignment) before calling magnitude_format.format_float /
Self::apply_locale_formatting, and change the re suppression condition to only
skip the real part when num.re == 0.0 && !num.re.is_sign_negative() AND the
original FormatSpec did not have an explicit format type (e.g., check
self.format_type.is_none() or the equivalent flag used to detect an explicit
type) so typed complex formatting still shows the real component; use the
existing symbols magnitude_format, format_float, apply_locale_formatting, and
the re_str logic to implement these changes.
- Around line 526-536: apply_locale_formatting currently passes the entire
pre-decimal chunk (magnitude_str split by '.') to insert_locale_grouping which
corrupts exponent or non-digit suffixes like "1e+06"; update
apply_locale_formatting to isolate the contiguous digit sequence (allowing an
optional leading sign) from the integer part, call Self::insert_locale_grouping
only on that digit sequence, then rebuild the integer part by concatenating the
grouped digits with any trailing non-digit suffix (e.g., exponent markers), and
finally reattach the decimal point/frac part using locale.decimal_point; refer
to apply_locale_formatting, Self::insert_locale_grouping, magnitude_str, and
locale.decimal_point to locate where to implement this change.
In `@crates/vm/src/format.rs`:
- Around line 49-67: parse_grouping currently always appends a trailing 0
(meaning "repeat last") even when the C sequence terminated with CHAR_MAX
(meaning "stop grouping entirely"). Modify parse_grouping to distinguish the
terminator: iterate over grouping (pointer grouping) until you reach 0 or
libc::c_char::MAX, keep the bytes you read in result, then if the terminator was
a NUL (0) append a trailing 0 to indicate "repeat last", but if the terminator
was CHAR_MAX do not append anything (stop grouping). Update the doc comment to
reflect that CHAR_MAX means "stop grouping" and NUL means "repeat last". Use the
existing symbols parse_grouping, grouping, result, ptr, libc::c_char::MAX and
the checked terminator to implement this behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: b91f907e-b731-492a-a941-15d354c29a80
⛔ Files ignored due to path filters (1)
Lib/test/test_format.pyis excluded by!Lib/**
📒 Files selected for processing (5)
crates/common/src/format.rscrates/vm/src/builtins/complex.rscrates/vm/src/builtins/float.rscrates/vm/src/builtins/int.rscrates/vm/src/format.rs
crates/common/src/format.rs
Outdated
| let magnitude_format = FormatSpec { | ||
| format_type: Some(format_type), | ||
| precision: Some(if precision == 0 { 1 } else { precision }), | ||
| ..*self | ||
| }; | ||
|
|
||
| let re_str = if num.re == 0.0 && !num.re.is_sign_negative() { | ||
| None | ||
| } else { | ||
| let re_sign = if num.re.is_sign_negative() && !num.re.is_nan() { | ||
| "-" | ||
| } else { | ||
| "" | ||
| }; | ||
| let re_mag = magnitude_format.format_float(num.re.abs())?; | ||
| let re_grouped = Self::apply_locale_formatting(re_mag, locale); | ||
| Some(format!("{re_sign}{re_grouped}")) | ||
| }; | ||
|
|
||
| let im_sign = if num.im.is_sign_negative() && !num.im.is_nan() { | ||
| "-" | ||
| } else { | ||
| "" | ||
| }; | ||
| let im_mag = magnitude_format.format_float(num.im.abs())?; | ||
| let im_grouped = Self::apply_locale_formatting(im_mag, locale); | ||
|
|
||
| let has_re = re_str.is_some(); | ||
| let formatted = if let Some(re_str) = re_str { | ||
| let sep = if num.im >= 0.0 || num.im.is_nan() { | ||
| "+" | ||
| } else { | ||
| "" | ||
| }; | ||
| format!("({re_str}{sep}{im_sign}{im_grouped}j)") | ||
| } else { | ||
| format!("{im_sign}{im_grouped}j") | ||
| }; | ||
|
|
||
| let format_sign = self.sign.unwrap_or(FormatSign::Minus); | ||
| let sign_str = match format_sign { | ||
| FormatSign::Plus => "+", | ||
| FormatSign::Minus => "", | ||
| FormatSign::MinusOrSpace => " ", | ||
| }; | ||
|
|
||
| self.format_sign_and_align( | ||
| &AsciiStr::new(&formatted), | ||
| if has_re { "" } else { sign_str }, | ||
| FormatAlign::Right, | ||
| ) |
There was a problem hiding this comment.
format_complex_locale is formatting component magnitudes with sign/width logic.
Using format_float here injects per-component sign/alignment/width, then signs are added again in this function. That can yield invalid strings (-+..., ++...) and inner padding artifacts.
Also, the real-part suppression condition drops 0 real part even with an explicit type, which diverges from the existing typed complex formatting flow.
💡 Suggested fix
- let re_str = if num.re == 0.0 && !num.re.is_sign_negative() {
+ let re_str = if num.re == 0.0 && !num.re.is_sign_negative() && self.format_type.is_none() {
None
} else {
let re_sign = if num.re.is_sign_negative() && !num.re.is_nan() {
"-"
} else {
""
};
- let re_mag = magnitude_format.format_float(num.re.abs())?;
+ let re_mag = magnitude_format.format_complex_float(num.re.abs())?;
let re_grouped = Self::apply_locale_formatting(re_mag, locale);
Some(format!("{re_sign}{re_grouped}"))
};
@@
- let im_mag = magnitude_format.format_float(num.im.abs())?;
+ let im_mag = magnitude_format.format_complex_float(num.im.abs())?;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/common/src/format.rs` around lines 640 - 690, The magnitude components
are being formatted with sign/width/alignment from the original spec (via
magnitude_format.format_float), causing duplicated signs and inner padding; also
the real-part suppression currently hides a zero real part even when an explicit
complex format type was requested. Fix by building magnitude_format from self
but explicitly clear sign, width, align, fill, zero_pad and alternate (so
magnitude_format has no sign/width/alignment) before calling
magnitude_format.format_float / Self::apply_locale_formatting, and change the re
suppression condition to only skip the real part when num.re == 0.0 &&
!num.re.is_sign_negative() AND the original FormatSpec did not have an explicit
format type (e.g., check self.format_type.is_none() or the equivalent flag used
to detect an explicit type) so typed complex formatting still shows the real
component; use the existing symbols magnitude_format, format_float,
apply_locale_formatting, and the re_str logic to implement these changes.
| /// Parse C `lconv.grouping` into a `Vec<u8>`. | ||
| /// Reads bytes until 0 or CHAR_MAX, then appends 0 (meaning "repeat last group"). | ||
| #[cfg(unix)] | ||
| unsafe fn parse_grouping(grouping: *const libc::c_char) -> Vec<u8> { | ||
| let mut result = Vec::new(); | ||
| if grouping.is_null() { | ||
| return result; | ||
| } | ||
| unsafe { | ||
| let mut ptr = grouping; | ||
| while ![0, libc::c_char::MAX].contains(&*ptr) { | ||
| result.push(*ptr as u8); | ||
| ptr = ptr.add(1); | ||
| } | ||
| } | ||
| if !result.is_empty() { | ||
| result.push(0); | ||
| } | ||
| result |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
C POSIX localeconv grouping lconv.grouping CHAR_MAX 0 terminator semantics
💡 Result:
In struct lconv (from localeconv()), grouping/mon_grouping is a char * that points to a sequence of byte values (often treated like an unsigned-byte array) describing how to group digits to the left of the radix point:
grouping[0]= size of the rightmost group;grouping[1]= next group to the left; etc. [1]- If an element is
0, it means repeat the previous group size for all remaining more-significant digits. Because0is also'\0', this is typically where the sequence ends (e.g."\3\0"means “group by 3 forever”). [1] - If an element is
CHAR_MAX, it means stop grouping: no further separators are inserted for any more-significant digits. [1] - If
groupingis the empty string (""), that means no grouping information / no grouping (implementation/locale dependent, but common). [2]
Notes:
- Some platforms use
-1(in achar) instead ofCHAR_MAXdue tocharsignedness; robust code often compares after casting tounsigned char. [3]
Sources: Linux locale(7) description of grouping semantics (including 0 and CHAR_MAX) [1]; NetBSD localeconv(3) notes on CHAR_MAX as “unavailable” in some fields/locales [2]; Gnulib portability note about -1 vs CHAR_MAX on some platforms [3].
[1] (man7.org) [2] (man.netbsd.org) [3] (gnu.org)
Citations:
- 1: https://man7.org/linux/man-pages/man7/locale.7.html?utm_source=openai
- 2: https://man.netbsd.org/localeconv.3?utm_source=openai
- 3: https://www.gnu.org/software/gnulib/manual/html_node/localeconv.html?utm_source=openai
🏁 Script executed:
# First, let's check if there are any clippy or format issues with the current implementation
cd crates/vm/src && head -70 format.rs | tail -25Repository: RustPython/RustPython
Length of output: 674
🏁 Script executed:
cd crates/vm && grep -n "parse_grouping\|get_locale_info" src/format.rs | head -20Repository: RustPython/RustPython
Length of output: 296
🏁 Script executed:
# Verify the suggested fix more carefully by checking the actual current state
cd crates/vm/src && cat -n format.rs | sed -n '49,68p'Repository: RustPython/RustPython
Length of output: 761
parse_grouping conflates CHAR_MAX with repeat-last behavior, violating C/POSIX semantics.
When the C grouping sequence ends with CHAR_MAX, it means "stop grouping entirely". The current implementation appends 0 unconditionally, which incorrectly translates this to "repeat previous group forever". Additionally, the doc comment is misleading.
Suggested fix
unsafe fn parse_grouping(grouping: *const libc::c_char) -> Vec<u8> {
let mut result = Vec::new();
if grouping.is_null() {
return result;
}
- unsafe {
- let mut ptr = grouping;
- while ![0, libc::c_char::MAX].contains(&*ptr) {
- result.push(*ptr as u8);
- ptr = ptr.add(1);
- }
- }
- if !result.is_empty() {
+ let mut ptr = grouping;
+ let mut repeat_last = false;
+ while !ptr.is_null() {
+ let v = *ptr;
+ if v == 0 {
+ repeat_last = true;
+ break;
+ }
+ if v == libc::c_char::MAX {
+ repeat_last = false;
+ break;
+ }
+ result.push(v as u8);
+ ptr = ptr.add(1);
+ }
+ if repeat_last && !result.is_empty() {
result.push(0);
}
result
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/vm/src/format.rs` around lines 49 - 67, parse_grouping currently
always appends a trailing 0 (meaning "repeat last") even when the C sequence
terminated with CHAR_MAX (meaning "stop grouping entirely"). Modify
parse_grouping to distinguish the terminator: iterate over grouping (pointer
grouping) until you reach 0 or libc::c_char::MAX, keep the bytes you read in
result, then if the terminator was a NUL (0) append a trailing 0 to indicate
"repeat last", but if the terminator was CHAR_MAX do not append anything (stop
grouping). Update the doc comment to reflect that CHAR_MAX means "stop grouping"
and NUL means "repeat last". Use the existing symbols parse_grouping, grouping,
result, ptr, libc::c_char::MAX and the checked terminator to implement this
behavior.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
crates/common/src/format.rs (1)
525-536:⚠️ Potential issue | 🟠 Major
apply_locale_formattingstill corrupts exponent notation for numbers without decimal points.The previous review flagged this issue and a fix was suggested, but the current code doesn't include that fix. When
format_generalproduces output like"1e+06"(no decimal point),splitn(2, '.')yields the entire string asint_part, andinsert_locale_groupingwill corrupt it by inserting separators into non-digit characters.💡 Suggested fix: handle exponent notation
fn apply_locale_formatting(magnitude_str: String, locale: &LocaleInfo) -> String { + // Separate exponent part before processing + let (mantissa, exponent) = match magnitude_str.find(|c| c == 'e' || c == 'E') { + Some(pos) => (&magnitude_str[..pos], Some(&magnitude_str[pos..])), + None => (magnitude_str.as_str(), None), + }; + - let mut parts = magnitude_str.splitn(2, '.'); + let mut parts = mantissa.splitn(2, '.'); let int_part = parts.next().unwrap(); let grouped = Self::insert_locale_grouping(int_part, locale); - if let Some(frac_part) = parts.next() { + let result = if let Some(frac_part) = parts.next() { format!("{grouped}{}{frac_part}", locale.decimal_point) } else { grouped + }; + + match exponent { + Some(exp) => format!("{result}{exp}"), + None => result, } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/common/src/format.rs` around lines 525 - 536, apply_locale_formatting currently corrupts exponent notation like "1e+06" because it applies insert_locale_grouping to the whole int_part; fix by detecting an exponent ('e' or 'E') in the original magnitude_str (or int_part) and split off the exponent suffix before grouping, then only call insert_locale_grouping on the mantissa integer portion (preserving any leading sign), finally reassemble as "{grouped_mantissa}{optional_decimal_point}{frac_part}{exponent_suffix}" so exponents remain untouched; update apply_locale_formatting to locate 'e'/'E', separate exponent, preserve sign, call insert_locale_grouping on the numeric digits, and then concatenate the exponent back.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/common/src/format.rs`:
- Line 564: The padding is computed using AsciiStr.char_len() (byte length),
which miscounts multi-byte locale separators in magnitude_str from
apply_locale_formatting(); update format_sign_and_align and its caller site so
magnitude_str is passed/treated as a Unicode &str (not AsciiStr) and compute
visual length with a Unicode-aware count (e.g., s.chars().count() or
UnicodeSegmentation::graphemes(true).count()) when calculating width/fill;
replace references to AsciiStr::new(&magnitude_str) and char_len() in
format_sign_and_align with the Unicode-aware length and adjust any
types/signatures that assumed AsciiStr accordingly.
---
Duplicate comments:
In `@crates/common/src/format.rs`:
- Around line 525-536: apply_locale_formatting currently corrupts exponent
notation like "1e+06" because it applies insert_locale_grouping to the whole
int_part; fix by detecting an exponent ('e' or 'E') in the original
magnitude_str (or int_part) and split off the exponent suffix before grouping,
then only call insert_locale_grouping on the mantissa integer portion
(preserving any leading sign), finally reassemble as
"{grouped_mantissa}{optional_decimal_point}{frac_part}{exponent_suffix}" so
exponents remain untouched; update apply_locale_formatting to locate 'e'/'E',
separate exponent, preserve sign, call insert_locale_grouping on the numeric
digits, and then concatenate the exponent back.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 38611da9-2a7f-42db-803e-f50a9b839612
⛔ Files ignored due to path filters (1)
Lib/test/test_types.pyis excluded by!Lib/**
📒 Files selected for processing (1)
crates/common/src/format.rs
| }, | ||
| }; | ||
|
|
||
| self.format_sign_and_align(&AsciiStr::new(&magnitude_str), sign_str, FormatAlign::Right) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the AsciiStr implementation
rg -n "impl.*AsciiStr" crates/common/src/format.rs -A 30 | head -100Repository: RustPython/RustPython
Length of output: 1208
🏁 Script executed:
# Check the specific lines mentioned in the comment
sed -n '560,570p' crates/common/src/format.rsRepository: RustPython/RustPython
Length of output: 362
🏁 Script executed:
# Also check lines 604 and 656
sed -n '600,610p' crates/common/src/format.rs
sed -n '652,662p' crates/common/src/format.rsRepository: RustPython/RustPython
Length of output: 774
🏁 Script executed:
# Look for the char_len method definition
rg -n "fn char_len" crates/common/src/format.rs -A 5Repository: RustPython/RustPython
Length of output: 305
🏁 Script executed:
# Check how locale separators are inserted into the magnitude_str
rg -n "thousands_sep\|decimal_point" crates/common/src/format.rs -B 2 -A 2Repository: RustPython/RustPython
Length of output: 47
🏁 Script executed:
# Look for locale definition and separator types
rg -n "struct.*Locale\|decimal_point\|thousands_sep" crates/common/src/ -A 3Repository: RustPython/RustPython
Length of output: 47
🏁 Script executed:
# Search for Locale struct and locale-related code
find crates -name "*.rs" | xargs grep -l "locale\|Locale" | head -5Repository: RustPython/RustPython
Length of output: 204
🏁 Script executed:
# Search more broadly for where locale is used in format.rs
rg -n "locale\." crates/common/src/format.rs -B 2 -A 2Repository: RustPython/RustPython
Length of output: 1144
🏁 Script executed:
# Look for how magnitude_str is constructed
rg -n "magnitude_str.*format\|let.*magnitude_str" crates/common/src/format.rs -B 2 -A 2 | head -50Repository: RustPython/RustPython
Length of output: 47
🏁 Script executed:
# Find LocaleInfo definition
rg -n "struct LocaleInfo\|^pub struct.*Locale" crates/common/src/format.rs -A 10Repository: RustPython/RustPython
Length of output: 47
🏁 Script executed:
# Check the complete insert_locale_grouping function
sed -n '482,535p' crates/common/src/format.rsRepository: RustPython/RustPython
Length of output: 1970
🏁 Script executed:
# Look at the format_float_locale function to see how it constructs magnitude_str
sed -n '567,600p' crates/common/src/format.rsRepository: RustPython/RustPython
Length of output: 1303
🏁 Script executed:
# Check the locale module to understand the locale structure
head -100 crates/stdlib/src/locale.rsRepository: RustPython/RustPython
Length of output: 3356
🏁 Script executed:
# Find format_sign_and_align implementation
rg -n "fn format_sign_and_align" crates/common/src/format.rs -A 20Repository: RustPython/RustPython
Length of output: 851
🏁 Script executed:
# Check how char_len is used in format_sign_and_align
sed -n '900,1000p' crates/common/src/format.rs | grep -n "char_len\|width"Repository: RustPython/RustPython
Length of output: 184
🏁 Script executed:
# Find the complete format_sign_and_align function
rg -n "fn format_sign_and_align" crates/common/src/format.rs -A 30Repository: RustPython/RustPython
Length of output: 1253
🏁 Script executed:
# Let's look for where LocaleInfo is defined
rg -n "struct LocaleInfo\|pub.*LocaleInfo" crates/common/src/ -A 5Repository: RustPython/RustPython
Length of output: 47
🏁 Script executed:
# Check if there are other implementations of CharLen
rg -n "impl CharLen" crates/common/src/format.rsRepository: RustPython/RustPython
Length of output: 101
🏁 Script executed:
# Let's look for a Unicode-aware version or other CharLen implementations
rg -n "impl.*CharLen" crates/ -A 3Repository: RustPython/RustPython
Length of output: 1007
🏁 Script executed:
# Check if locales can actually have non-ASCII separators
rg -n "thousands_sep\|decimal_point" crates/stdlib/src/locale.rs -B 2 -A 2 | head -30Repository: RustPython/RustPython
Length of output: 47
🏁 Script executed:
# Let me check the full context - look for where apply_locale_formatting is called and how its result is used
sed -n '567,610p' crates/common/src/format.rsRepository: RustPython/RustPython
Length of output: 1569
🏁 Script executed:
# Let's verify that the magnitude_str can contain non-ASCII characters
# Check what types are used for thousands_sep and decimal_point
rg -n "pub struct LocaleInfo\|thousands_sep:\|decimal_point:" crates/common/src/ -A 2Repository: RustPython/RustPython
Length of output: 47
🏁 Script executed:
# Let's trace through what happens - check if the string from apply_locale_formatting
# is used directly or if there's any normalization
sed -n '527,545p' crates/common/src/format.rsRepository: RustPython/RustPython
Length of output: 687
🏁 Script executed:
# Check how sign_str.len() is used - does it account for multi-byte characters?
sed -n '985,990p' crates/common/src/format.rsRepository: RustPython/RustPython
Length of output: 331
🏁 Script executed:
# Find the LocaleInfo struct definition
rg -n "^pub struct LocaleInfo\|^struct LocaleInfo" crates/common/src/format.rs -A 10Repository: RustPython/RustPython
Length of output: 47
🏁 Script executed:
# Check where LocaleInfo is created/initialized
rg -n "LocaleInfo {" crates/common/src/format.rs -B 2 -A 5Repository: RustPython/RustPython
Length of output: 152
🏁 Script executed:
# Let's trace how locale.thousands_sep is set - look at the locale module more carefully
sed -n '1,150p' crates/stdlib/src/locale.rs | grep -A 5 -B 5 "thousands_sep"Repository: RustPython/RustPython
Length of output: 559
🏁 Script executed:
# Search for LocaleInfo definition differently
rg "LocaleInfo" crates/common/src/format.rs | head -5Repository: RustPython/RustPython
Length of output: 312
🏁 Script executed:
# Get context around the first LocaleInfo reference
rg -n "LocaleInfo" crates/common/src/format.rs | head -1Repository: RustPython/RustPython
Length of output: 91
🏁 Script executed:
# Look for where LocaleInfo is defined - broader search
rg -n "struct LocaleInfo" crates/ -A 10Repository: RustPython/RustPython
Length of output: 772
🏁 Script executed:
# Now let's verify: check what sign_str can be and if it has the same byte-length issue
sed -n '590,605p' crates/common/src/format.rsRepository: RustPython/RustPython
Length of output: 611
🏁 Script executed:
# Check the similar code in format_int_locale and format_complex_locale for the same issue
sed -n '620,650p' crates/common/src/format.rsRepository: RustPython/RustPython
Length of output: 1283
🏁 Script executed:
# Let's also verify if there are other CharLen implementations that correctly handle Unicode
sed -n '1520,1535p' crates/vm/src/builtins/str.rsRepository: RustPython/RustPython
Length of output: 429
🏁 Script executed:
# Final verification: check if French locale actually uses non-breaking space
python3 << 'EOF'
import locale
# Check what the French locale uses for thousands separator
# Note: this is for documentation purposes - we know from Unicode that:
# - \u{00A0} is non-breaking space (NBSP) - 2 bytes in UTF-8
# - French locale typically uses NBSP as thousands separator
# - This is confirmed by CPython behavior
print("UTF-8 byte length of non-breaking space: ", len("\u00A0".encode('utf-8')))
print("Character count of '1\u00A0234': ", len("1\u00A02 34"))
print("Byte length of '1\u00A0234': ", len("1\u00A0234".encode('utf-8')))
EOFRepository: RustPython/RustPython
Length of output: 166
🏁 Script executed:
# More direct verification
python3 << 'EOF'
nbsp_string = "1\u00A0234" # 1, NBSP, 2, 3, 4
print(f"String: {repr(nbsp_string)}")
print(f"Character count: {len(nbsp_string)}")
print(f"Byte count: {len(nbsp_string.encode('utf-8'))}")
print(f"Difference: {len(nbsp_string.encode('utf-8')) - len(nbsp_string)} bytes extra")
EOFRepository: RustPython/RustPython
Length of output: 142
AsciiStr returns byte length instead of character count, causing incorrect width padding with non-ASCII locale separators.
In format_sign_and_align() (line 988), the width calculation uses magnitude_str.char_len() which AsciiStr implements to return self.inner.len() (byte length). However, magnitude_str from apply_locale_formatting() can contain multi-byte UTF-8 characters from locale separators—for example, French locales use non-breaking space (\u{00A0}, 2 bytes in UTF-8) as the thousands separator. This causes the fill character calculation to be off by 1+ bytes when formatting with width specs like "{:15n}".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/common/src/format.rs` at line 564, The padding is computed using
AsciiStr.char_len() (byte length), which miscounts multi-byte locale separators
in magnitude_str from apply_locale_formatting(); update format_sign_and_align
and its caller site so magnitude_str is passed/treated as a Unicode &str (not
AsciiStr) and compute visual length with a Unicode-aware count (e.g.,
s.chars().count() or UnicodeSegmentation::graphemes(true).count()) when
calculating width/fill; replace references to AsciiStr::new(&magnitude_str) and
char_len() in format_sign_and_align with the Unicode-aware length and adjust any
types/signatures that assumed AsciiStr accordingly.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (3)
crates/common/src/format.rs (2)
564-564:⚠️ Potential issue | 🟡 MinorLocale width/alignment uses byte length instead of character count.
At Line 564, Line 604, and Line 656, locale-formatted strings are wrapped as
AsciiStr.AsciiStr::char_len()(Line 1040) uses byte length, so width padding is wrong for multibyte separators (e.g., NBSP thousands separator).💡 Proposed direction
- self.format_sign_and_align(&AsciiStr::new(&magnitude_str), sign_str, FormatAlign::Right) + self.format_sign_and_align(&UnicodeStr::new(&magnitude_str), sign_str, FormatAlign::Right)// Add alongside AsciiStr struct UnicodeStr<'a> { inner: &'a str, } impl<'a> UnicodeStr<'a> { const fn new(inner: &'a str) -> Self { Self { inner } } } impl CharLen for UnicodeStr<'_> { fn char_len(&self) -> usize { self.inner.chars().count() } } impl Deref for UnicodeStr<'_> { type Target = str; fn deref(&self) -> &Self::Target { self.inner } }#!/bin/bash # Verify current locale-format callsites and CharLen behavior. rg -n "AsciiStr::new\\(&magnitude_str\\)" crates/common/src/format.rs sed -n '1038,1042p' crates/common/src/format.rsAlso applies to: 604-604, 656-656
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/common/src/format.rs` at line 564, Locale-formatted strings are being wrapped with AsciiStr (via AsciiStr::new(&magnitude_str)) but AsciiStr::char_len() returns byte length, breaking width/alignment for multibyte separators; introduce a UnicodeStr<'a> wrapper (with new(inner: &'a str)), implement CharLen for it using self.inner.chars().count(), and Deref<Target=str> so it behaves like &str, then replace the locale-related AsciiStr::new(...) callsites (the calls into format_sign_and_align / format_* that pass &AsciiStr::new(&magnitude_str)) with &UnicodeStr::new(&magnitude_str) and add any necessary use/imports.
525-536:⚠️ Potential issue | 🟠 MajorGroup only mantissa digits before locale insertion.
At Line 527, the whole pre-
.chunk is grouped. Scientific notation without a decimal point (e.g.,1e+06) gets corrupted by separator insertion. This also affects callers at Line 591 and Line 650.💡 Proposed fix
fn apply_locale_formatting(magnitude_str: String, locale: &LocaleInfo) -> String { - let mut parts = magnitude_str.splitn(2, '.'); - let int_part = parts.next().unwrap(); - let grouped = Self::insert_locale_grouping(int_part, locale); - - if let Some(frac_part) = parts.next() { - format!("{grouped}{}{frac_part}", locale.decimal_point) - } else { - grouped - } + let exp_pos = magnitude_str.find(|c| c == 'e' || c == 'E'); + let (mantissa, exponent) = exp_pos + .map(|i| (&magnitude_str[..i], Some(&magnitude_str[i..]))) + .unwrap_or((magnitude_str.as_str(), None)); + + let (int_part, frac_part) = mantissa + .split_once('.') + .map_or((mantissa, None), |(i, f)| (i, Some(f))); + + let grouped = if int_part.chars().all(|c| c.is_ascii_digit()) { + Self::insert_locale_grouping(int_part, locale) + } else { + int_part.to_owned() + }; + + let mut out = if let Some(frac_part) = frac_part { + format!("{grouped}{}{frac_part}", locale.decimal_point) + } else { + grouped + }; + if let Some(exponent) = exponent { + out.push_str(exponent); + } + out }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/common/src/format.rs` around lines 525 - 536, The current apply_locale_formatting function applies grouping to the entire pre-decimal chunk, corrupting scientific exponents like "1e+06"; update apply_locale_formatting to detect and preserve any exponent suffix (look for 'e' or 'E' plus the rest) by splitting magnitude_str into mantissa, optional fractional part, and optional exponent, apply Self::insert_locale_grouping only to the mantissa's integer digits, then recombine mantissa (with locale.decimal_point and frac if present) and append the untouched exponent; ensure magnitude_str, insert_locale_grouping, and locale.decimal_point are used accordingly so exponents remain unmodified (this change will also fix the callers that rely on apply_locale_formatting).crates/vm/src/format.rs (1)
49-67:⚠️ Potential issue | 🟠 MajorPreserve
CHAR_MAXsemantics inparse_grouping.At Line 50 and Line 64, the implementation always appends
0after parsing non-empty grouping data. If the C sequence ends withCHAR_MAX, this incorrectly changes behavior from “stop grouping” to “repeat last group forever.”💡 Proposed fix
/// Parse C `lconv.grouping` into a `Vec<u8>`. -/// Reads bytes until 0 or CHAR_MAX, then appends 0 (meaning "repeat last group"). +/// Reads bytes until 0 or CHAR_MAX. +/// - 0 means "repeat last group size" +/// - CHAR_MAX means "stop grouping" #[cfg(unix)] unsafe fn parse_grouping(grouping: *const libc::c_char) -> Vec<u8> { let mut result = Vec::new(); if grouping.is_null() { return result; } - unsafe { - let mut ptr = grouping; - while ![0, libc::c_char::MAX].contains(&*ptr) { - result.push(*ptr as u8); - ptr = ptr.add(1); - } + let mut ptr = grouping; + let mut repeat_last = false; + loop { + let v = *ptr; + if v == 0 { + repeat_last = true; + break; + } + if v == libc::c_char::MAX { + break; + } + result.push(v as u8); + ptr = ptr.add(1); } - if !result.is_empty() { + if repeat_last && !result.is_empty() { result.push(0); } result }In POSIX/C `localeconv()`, for `lconv.grouping`, what are the semantics of terminators `0` and `CHAR_MAX`, and should `CHAR_MAX` stop further grouping rather than repeat the previous size?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/format.rs` around lines 49 - 67, The parse_grouping function currently always appends a 0 terminator after collecting bytes, which incorrectly changes semantics when the original C sequence ended with CHAR_MAX (meaning "stop grouping"); modify parse_grouping to detect which terminator stopped the loop (0 vs CHAR_MAX) by inspecting the terminating byte at the stop condition and only append a 0 to result if the terminator was 0; reference the unsafe fn parse_grouping(grouping: *const libc::c_char), the loop that reads *ptr, and libc::c_char::MAX/0 to implement this conditional behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/common/src/format.rs`:
- Around line 607-657: format_complex_locale skips the same complex-format flag
validation performed in format_complex, allowing invalid '=' alignment and
zero-padding to pass; fix by enforcing the same checks before formatting: in
format_complex_locale (inside impl for FormatSpec) replicate or call the same
validation logic used by format_complex to return
Err(FormatSpecError::AlignmentFlag) when the alignment is '='
(FormatAlign::AfterSign) and Err(FormatSpecError::ZeroPadding) when
zero-fill/zero-padding rules are violated (e.g., fill == '0' with a sign-aware
width), or better yet extract that logic into a small helper (e.g.,
validate_complex_format_flags(&self) -> Result<(), FormatSpecError>) and invoke
it at the start of format_complex_locale so the later calls to
format_complex_re_im and format_sign_and_align behave consistently.
---
Duplicate comments:
In `@crates/common/src/format.rs`:
- Line 564: Locale-formatted strings are being wrapped with AsciiStr (via
AsciiStr::new(&magnitude_str)) but AsciiStr::char_len() returns byte length,
breaking width/alignment for multibyte separators; introduce a UnicodeStr<'a>
wrapper (with new(inner: &'a str)), implement CharLen for it using
self.inner.chars().count(), and Deref<Target=str> so it behaves like &str, then
replace the locale-related AsciiStr::new(...) callsites (the calls into
format_sign_and_align / format_* that pass &AsciiStr::new(&magnitude_str)) with
&UnicodeStr::new(&magnitude_str) and add any necessary use/imports.
- Around line 525-536: The current apply_locale_formatting function applies
grouping to the entire pre-decimal chunk, corrupting scientific exponents like
"1e+06"; update apply_locale_formatting to detect and preserve any exponent
suffix (look for 'e' or 'E' plus the rest) by splitting magnitude_str into
mantissa, optional fractional part, and optional exponent, apply
Self::insert_locale_grouping only to the mantissa's integer digits, then
recombine mantissa (with locale.decimal_point and frac if present) and append
the untouched exponent; ensure magnitude_str, insert_locale_grouping, and
locale.decimal_point are used accordingly so exponents remain unmodified (this
change will also fix the callers that rely on apply_locale_formatting).
In `@crates/vm/src/format.rs`:
- Around line 49-67: The parse_grouping function currently always appends a 0
terminator after collecting bytes, which incorrectly changes semantics when the
original C sequence ended with CHAR_MAX (meaning "stop grouping"); modify
parse_grouping to detect which terminator stopped the loop (0 vs CHAR_MAX) by
inspecting the terminating byte at the stop condition and only append a 0 to
result if the terminator was 0; reference the unsafe fn parse_grouping(grouping:
*const libc::c_char), the loop that reads *ptr, and libc::c_char::MAX/0 to
implement this conditional behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 70a4d2f2-7c14-41b2-aa12-3176ce3a29d5
⛔ Files ignored due to path filters (2)
Lib/test/test_format.pyis excluded by!Lib/**Lib/test/test_types.pyis excluded by!Lib/**
📒 Files selected for processing (5)
crates/common/src/format.rscrates/vm/src/builtins/complex.rscrates/vm/src/builtins/float.rscrates/vm/src/builtins/int.rscrates/vm/src/format.rs
| /// Format a complex number with locale-aware 'n' format. | ||
| pub fn format_complex_locale( | ||
| &self, | ||
| num: &Complex64, | ||
| locale: &LocaleInfo, | ||
| ) -> Result<String, FormatSpecError> { | ||
| // Reuse format_complex_re_im with 'g' type to get the base formatted parts, | ||
| // then apply locale grouping. This matches CPython's format_complex_internal: | ||
| // 'n' → 'g', add_parens=0, skip_re=0. | ||
| let locale_spec = FormatSpec { | ||
| format_type: Some(FormatType::GeneralFormat(Case::Lower)), | ||
| ..*self | ||
| }; | ||
| let (formatted_re, formatted_im) = locale_spec.format_complex_re_im(num)?; | ||
|
|
||
| // Apply locale grouping to both parts | ||
| let grouped_re = if formatted_re.is_empty() { | ||
| formatted_re | ||
| } else { | ||
| // Split sign from magnitude, apply grouping, recombine | ||
| let (sign, mag) = if formatted_re.starts_with('-') | ||
| || formatted_re.starts_with('+') | ||
| || formatted_re.starts_with(' ') | ||
| { | ||
| formatted_re.split_at(1) | ||
| } else { | ||
| ("", formatted_re.as_str()) | ||
| }; | ||
| format!( | ||
| "{sign}{}", | ||
| Self::apply_locale_formatting(mag.to_string(), locale) | ||
| ) | ||
| }; | ||
|
|
||
| // formatted_im is like "+1234j" or "-1234j" or "1234j" | ||
| // Split sign, magnitude, and 'j' suffix | ||
| let im_str = &formatted_im; | ||
| let (im_sign, im_rest) = if im_str.starts_with('+') || im_str.starts_with('-') { | ||
| im_str.split_at(1) | ||
| } else { | ||
| ("", im_str.as_str()) | ||
| }; | ||
| let im_mag = im_rest.strip_suffix('j').unwrap_or(im_rest); | ||
| let im_grouped = Self::apply_locale_formatting(im_mag.to_string(), locale); | ||
| let grouped_im = format!("{im_sign}{im_grouped}j"); | ||
|
|
||
| // No parentheses for 'n' format (CPython: add_parens=0) | ||
| let magnitude_str = format!("{grouped_re}{grouped_im}"); | ||
|
|
||
| self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "", FormatAlign::Right) | ||
| } |
There was a problem hiding this comment.
format_complex_locale bypasses complex-format flag validation.
At Line 656, the locale path returns formatted output without enforcing the same checks as format_complex (Line 864-Line 870). = alignment and zero-padding should raise AlignmentFlag and ZeroPadding consistently.
💡 Proposed fix
// No parentheses for 'n' format (CPython: add_parens=0)
let magnitude_str = format!("{grouped_re}{grouped_im}");
-
- self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "", FormatAlign::Right)
+ if let Some(FormatAlign::AfterSign) = &self.align {
+ return Err(FormatSpecError::AlignmentFlag);
+ }
+ match &self.fill.unwrap_or(' '.into()).to_char() {
+ Some('0') => Err(FormatSpecError::ZeroPadding),
+ _ => self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "", FormatAlign::Right),
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/common/src/format.rs` around lines 607 - 657, format_complex_locale
skips the same complex-format flag validation performed in format_complex,
allowing invalid '=' alignment and zero-padding to pass; fix by enforcing the
same checks before formatting: in format_complex_locale (inside impl for
FormatSpec) replicate or call the same validation logic used by format_complex
to return Err(FormatSpecError::AlignmentFlag) when the alignment is '='
(FormatAlign::AfterSign) and Err(FormatSpecError::ZeroPadding) when
zero-fill/zero-padding rules are violated (e.g., fill == '0' with a sign-aware
width), or better yet extract that logic into a small helper (e.g.,
validate_complex_format_flags(&self) -> Result<(), FormatSpecError>) and invoke
it at the start of format_complex_locale so the later calls to
format_complex_re_im and format_sign_and_align behave consistently.
There was a problem hiding this comment.
♻️ Duplicate comments (4)
crates/common/src/format.rs (3)
525-536:⚠️ Potential issue | 🟠 MajorDon’t run grouping across exponent text in
apply_locale_formatting.At Line 527-Line 533, grouping is applied to everything before
.; exponent forms like1e+06can be altered incorrectly.Suggested fix
fn apply_locale_formatting(magnitude_str: String, locale: &LocaleInfo) -> String { - let mut parts = magnitude_str.splitn(2, '.'); - let int_part = parts.next().unwrap(); - let grouped = Self::insert_locale_grouping(int_part, locale); - - if let Some(frac_part) = parts.next() { - format!("{grouped}{}{frac_part}", locale.decimal_point) - } else { - grouped - } + let exp_pos = magnitude_str.find(|c| c == 'e' || c == 'E'); + let (mantissa, exponent) = if let Some(pos) = exp_pos { + (&magnitude_str[..pos], Some(&magnitude_str[pos..])) + } else { + (magnitude_str.as_str(), None) + }; + let (int_part, frac_part) = mantissa + .split_once('.') + .map_or((mantissa, None), |(i, f)| (i, Some(f))); + + let grouped = if int_part.chars().all(|c| c.is_ascii_digit()) { + Self::insert_locale_grouping(int_part, locale) + } else { + int_part.to_owned() + }; + + let mut out = if let Some(frac_part) = frac_part { + format!("{grouped}{}{frac_part}", locale.decimal_point) + } else { + grouped + }; + if let Some(exponent) = exponent { + out.push_str(exponent); + } + out }For Python format specifier 'n', should locale grouping/decimal substitution affect exponent parts (e.g., '1e+06') or only the mantissa?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/common/src/format.rs` around lines 525 - 536, The current apply_locale_formatting applies grouping to the whole pre-decimal text and can corrupt exponent text like "1e+06"; update apply_locale_formatting to first split magnitude_str into mantissa and exponent by searching for 'e' or 'E' (preserve the original exponent sign and digits), then perform the existing splitn('.') and call insert_locale_grouping only on the mantissa's integer part and replace the decimal point only in the mantissa, finally reattach the exponent unchanged; reference apply_locale_formatting and insert_locale_grouping so the mantissa/exponent split and reassembly are done before grouping/decimal replacement.
607-657:⚠️ Potential issue | 🟠 MajorLocale complex path should enforce the same invalid-flag checks as
format_complex.
format_complex_localecurrently skips the=-alignment and zero-padding validation present informat_complex, so invalid specs can slip through.Suggested fix
pub fn format_complex_locale( &self, num: &Complex64, locale: &LocaleInfo, ) -> Result<String, FormatSpecError> { + if let Some(FormatAlign::AfterSign) = &self.align { + return Err(FormatSpecError::AlignmentFlag); + } + if matches!(self.fill.unwrap_or(' '.into()).to_char(), Some('0')) { + return Err(FormatSpecError::ZeroPadding); + } + // Reuse format_complex_re_im with 'g' type to get the base formatted parts, // then apply locale grouping. This matches CPython's format_complex_internal: // 'n' → 'g', add_parens=0, skip_re=0.#!/bin/bash # Compare validation behavior in locale/non-locale complex formatting rg -n "fn format_complex_locale|fn format_complex|AlignmentFlag|ZeroPadding" crates/common/src/format.rs -A20 -B10🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/common/src/format.rs` around lines 607 - 657, format_complex_locale skips the same validation performed by format_complex for illegal '=' alignment and zero-padding flags, allowing invalid format specs; update format_complex_locale to run the identical validation path as format_complex (either call the same validation helper or inline the same checks) before proceeding—specifically validate AlignmentFlag '=' and zero-padding combinations on the FormatSpec used (see FormatSpec, format_complex_re_im, format_complex, format_sign_and_align) and return the same FormatSpecError when flags are invalid so locale handling cannot bypass the checks.
564-565:⚠️ Potential issue | 🟡 MinorLocale width alignment is byte-counted, so multibyte separators can mispad.
These locale paths pass
AsciiStr, whosechar_len()is byte length; locale separators can be multibyte UTF-8, so width/fill becomes inaccurate.In Rust, does str::len() return byte length or Unicode scalar count, and what is the recommended way to compute character count for width-sensitive formatting?Also applies to: 604-605, 656-657
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/common/src/format.rs` around lines 564 - 565, The code calls format_sign_and_align(&AsciiStr::new(&magnitude_str), sign_str, FormatAlign::Right) and relies on AsciiStr/char_len() (which yields byte length) for width calculations, causing mispadding when locale separators are multibyte UTF‑8; change the logic to use a Unicode-aware display width (e.g., use unicode_width::UnicodeWidthStr::width on &str or compute grapheme clusters if needed) instead of byte length, update the call sites that construct AsciiStr (including other occurrences around format_sign_and_align) to pass a &str or a precomputed display width, and modify format_sign_and_align to accept and use the unicode-width result (or accept a width parameter) so padding/fill is based on displayed character width rather than bytes.crates/vm/src/format.rs (1)
49-68:⚠️ Potential issue | 🟠 Major
parse_groupingcurrently conflates0andCHAR_MAXterminators.At Line 64-Line 66,
0is appended for any non-empty sequence, which incorrectly turnsCHAR_MAX(“stop grouping”) into “repeat previous group”.Suggested fix
unsafe fn parse_grouping(grouping: *const libc::c_char) -> Vec<u8> { let mut result = Vec::new(); if grouping.is_null() { return result; } - unsafe { - let mut ptr = grouping; - while ![0, libc::c_char::MAX].contains(&*ptr) { - result.push(*ptr as u8); - ptr = ptr.add(1); - } - } - if !result.is_empty() { + let mut ptr = grouping; + let mut repeat_last = false; + loop { + let v = unsafe { *ptr }; + if v == 0 { + repeat_last = true; + break; + } + if v == libc::c_char::MAX { + break; + } + result.push(v as u8); + ptr = unsafe { ptr.add(1) }; + } + if repeat_last && !result.is_empty() { result.push(0); } result }In POSIX/C localeconv().grouping semantics, what is the difference between a 0 byte and CHAR_MAX terminator, and should they be handled differently when parsing grouping rules?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/format.rs` around lines 49 - 68, The parse_grouping function conflates CHAR_MAX and NUL terminators by always appending a 0 byte for non-empty results; change it so the code inspects the terminator byte at the stopping pointer and only push a trailing 0 (the "repeat last group" marker) when the terminator was NUL (0), not when it was libc::c_char::MAX; update unsafe fn parse_grouping to stop on either terminator as before but after the loop check *ptr (or the last read byte) and conditionally push 0 only if that terminator == 0, leaving CHAR_MAX-terminated sequences without the appended 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@crates/common/src/format.rs`:
- Around line 525-536: The current apply_locale_formatting applies grouping to
the whole pre-decimal text and can corrupt exponent text like "1e+06"; update
apply_locale_formatting to first split magnitude_str into mantissa and exponent
by searching for 'e' or 'E' (preserve the original exponent sign and digits),
then perform the existing splitn('.') and call insert_locale_grouping only on
the mantissa's integer part and replace the decimal point only in the mantissa,
finally reattach the exponent unchanged; reference apply_locale_formatting and
insert_locale_grouping so the mantissa/exponent split and reassembly are done
before grouping/decimal replacement.
- Around line 607-657: format_complex_locale skips the same validation performed
by format_complex for illegal '=' alignment and zero-padding flags, allowing
invalid format specs; update format_complex_locale to run the identical
validation path as format_complex (either call the same validation helper or
inline the same checks) before proceeding—specifically validate AlignmentFlag
'=' and zero-padding combinations on the FormatSpec used (see FormatSpec,
format_complex_re_im, format_complex, format_sign_and_align) and return the same
FormatSpecError when flags are invalid so locale handling cannot bypass the
checks.
- Around line 564-565: The code calls
format_sign_and_align(&AsciiStr::new(&magnitude_str), sign_str,
FormatAlign::Right) and relies on AsciiStr/char_len() (which yields byte length)
for width calculations, causing mispadding when locale separators are multibyte
UTF‑8; change the logic to use a Unicode-aware display width (e.g., use
unicode_width::UnicodeWidthStr::width on &str or compute grapheme clusters if
needed) instead of byte length, update the call sites that construct AsciiStr
(including other occurrences around format_sign_and_align) to pass a &str or a
precomputed display width, and modify format_sign_and_align to accept and use
the unicode-width result (or accept a width parameter) so padding/fill is based
on displayed character width rather than bytes.
In `@crates/vm/src/format.rs`:
- Around line 49-68: The parse_grouping function conflates CHAR_MAX and NUL
terminators by always appending a 0 byte for non-empty results; change it so the
code inspects the terminator byte at the stopping pointer and only push a
trailing 0 (the "repeat last group" marker) when the terminator was NUL (0), not
when it was libc::c_char::MAX; update unsafe fn parse_grouping to stop on either
terminator as before but after the loop check *ptr (or the last read byte) and
conditionally push 0 only if that terminator == 0, leaving CHAR_MAX-terminated
sequences without the appended 0.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 025562ee-f1c0-478c-8018-769534ac57c0
⛔ Files ignored due to path filters (2)
Lib/test/test_format.pyis excluded by!Lib/**Lib/test/test_types.pyis excluded by!Lib/**
📒 Files selected for processing (5)
crates/common/src/format.rscrates/vm/src/builtins/complex.rscrates/vm/src/builtins/float.rscrates/vm/src/builtins/int.rscrates/vm/src/format.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- crates/vm/src/builtins/int.rs
- crates/vm/src/builtins/complex.rs
Add LocaleInfo struct and locale-aware formatting methods to FormatSpec. The 'n' format type now reads thousands_sep, decimal_point, and grouping from C localeconv() and applies proper locale-based number grouping. Remove @unittest.skip from test_format.test_locale.
Rewrite format_complex_locale to reuse format_complex_re_im, matching formatter_unicode.c: add_parens=0 and skip_re=0 for 'n' type. Remove @expectedfailure from test_float__format__locale and test_int__format__locale in test_types.py.
There was a problem hiding this comment.
♻️ Duplicate comments (4)
crates/common/src/format.rs (3)
564-564:⚠️ Potential issue | 🟡 MinorWidth/padding can be wrong for multibyte locale separators.
Lines 564, 604, and 653 pass locale-formatted strings through
AsciiStr;char_len()is byte-based, so width calculation drifts whenthousands_sep/decimal_pointare multibyte UTF-8.💡 Proposed fix
- self.format_sign_and_align(&AsciiStr::new(&magnitude_str), sign_str, FormatAlign::Right) + self.format_sign_and_align(magnitude_str.as_str(), sign_str, FormatAlign::Right)And update
format_sign_and_alignto compute Unicode character count for&strinput:- fn format_sign_and_align<T>(&self, magnitude_str: &T, sign_str: &str, default_align: FormatAlign) -> Result<String, FormatSpecError> - where - T: CharLen + Deref<Target = str>, + fn format_sign_and_align( + &self, + magnitude_str: &str, + sign_str: &str, + default_align: FormatAlign, + ) -> Result<String, FormatSpecError> { let align = self.align.unwrap_or(default_align); - let num_chars = magnitude_str.char_len(); + let num_chars = magnitude_str.chars().count();Also applies to: 604-604, 653-653
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/common/src/format.rs` at line 564, The code incorrectly wraps locale-formatted strings with AsciiStr and uses byte-based char_len(), causing wrong width/padding when thousands_sep/decimal_point are multibyte; update the call sites (the three places passing AsciiStr instances) to pass plain &str (the formatted magnitude strings and any sign_str inputs) into format_sign_and_align, and change format_sign_and_align to accept &str instead of AsciiStr and compute widths using Unicode character count (e.g., s.chars().count()) when calculating padding/alignment so multibyte separators are measured correctly.
525-536:⚠️ Potential issue | 🟠 Major
apply_locale_formattingcan group non-digit mantissas (e.g., scientific notation).Line 529 groups the full pre-
.chunk. Values like1e+06,inf, ornancan be transformed incorrectly (for example,1e,+06).💡 Proposed fix
fn apply_locale_formatting(magnitude_str: String, locale: &LocaleInfo) -> String { - let mut parts = magnitude_str.splitn(2, '.'); - let int_part = parts.next().unwrap(); - let grouped = Self::insert_locale_grouping(int_part, locale); - - if let Some(frac_part) = parts.next() { - format!("{grouped}{}{frac_part}", locale.decimal_point) - } else { - grouped - } + let exp_pos = magnitude_str.find(|c| c == 'e' || c == 'E'); + let (mantissa, exponent) = if let Some(pos) = exp_pos { + (&magnitude_str[..pos], Some(&magnitude_str[pos..])) + } else { + (magnitude_str.as_str(), None) + }; + let (int_part, frac_part) = mantissa + .split_once('.') + .map_or((mantissa, None), |(i, f)| (i, Some(f))); + + let grouped = if int_part.chars().all(|c| c.is_ascii_digit()) { + Self::insert_locale_grouping(int_part, locale) + } else { + int_part.to_owned() + }; + + let mut out = if let Some(frac_part) = frac_part { + format!("{grouped}{}{frac_part}", locale.decimal_point) + } else { + grouped + }; + if let Some(exponent) = exponent { + out.push_str(exponent); + } + out }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/common/src/format.rs` around lines 525 - 536, apply_locale_formatting currently passes the entire pre-decimal chunk to insert_locale_grouping which corrupts non-digit mantissas like "1e+06", "inf", or "nan"; fix by isolating an optional leading sign and the contiguous digit sequence in int_part, call insert_locale_grouping only on that digit sequence (or on "0" when no digits), then reassemble sign + grouped_digits + trailing_suffix (the remaining non-digit chars) before attaching the fractional part and decimal_point; update apply_locale_formatting to parse int_part into (sign, digit_run, suffix) and use insert_locale_grouping on digit_run so symbols like apply_locale_formatting, insert_locale_grouping, magnitude_str, and LocaleInfo are the only points changed.
650-654:⚠️ Potential issue | 🟠 MajorLocale complex formatting skips complex-specific flag validation.
Lines 653-654 directly format output, but this path does not enforce the same
AlignmentFlag/ZeroPaddingchecks used informat_complex(Lines 861-867). Invalid specs can pass silently here.💡 Proposed fix
// No parentheses for 'n' format (CPython: add_parens=0) let magnitude_str = format!("{grouped_re}{grouped_im}"); - - self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "", FormatAlign::Right) + if let Some(FormatAlign::AfterSign) = &self.align { + return Err(FormatSpecError::AlignmentFlag); + } + match &self.fill.unwrap_or(' '.into()).to_char() { + Some('0') => Err(FormatSpecError::ZeroPadding), + _ => self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "", FormatAlign::Right), + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/common/src/format.rs` around lines 650 - 654, The locale-specific branch builds magnitude_str and calls self.format_sign_and_align without validating complex-number-specific flags, allowing invalid AlignmentFlag/ZeroPadding specs to pass; add the same validation used in format_complex (the checks around AlignmentFlag and ZeroPadding in the format_complex path) before formatting here or refactor to call the shared validator, then only proceed to call self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "", FormatAlign::Right) if the validation passes so invalid specs are rejected consistently.crates/vm/src/format.rs (1)
49-67:⚠️ Potential issue | 🟠 Major
parse_groupingcurrently treatsCHAR_MAXlike repeat-last grouping.At Lines 64-66, a trailing
0is appended for any non-empty sequence. That makesCHAR_MAX-terminated patterns behave as “repeat last group forever” instead of “stop further grouping”.💡 Proposed fix
/// Parse C `lconv.grouping` into a `Vec<u8>`. -/// Reads bytes until 0 or CHAR_MAX, then appends 0 (meaning "repeat last group"). +/// Reads bytes until 0 or CHAR_MAX. +/// - 0 means "repeat last group" +/// - CHAR_MAX means "stop further grouping" #[cfg(unix)] unsafe fn parse_grouping(grouping: *const libc::c_char) -> Vec<u8> { let mut result = Vec::new(); if grouping.is_null() { return result; } - unsafe { - let mut ptr = grouping; - while ![0, libc::c_char::MAX].contains(&*ptr) { - result.push(*ptr as u8); - ptr = ptr.add(1); - } - } - if !result.is_empty() { + let mut ptr = grouping; + let mut repeat_last = false; + loop { + let value = unsafe { *ptr }; + if value == 0 { + repeat_last = true; + break; + } + if value == libc::c_char::MAX { + break; + } + result.push(value as u8); + ptr = unsafe { ptr.add(1) }; + } + if repeat_last && !result.is_empty() { result.push(0); } result }In POSIX/C localeconv() semantics for lconv.grouping, what do terminators 0 and CHAR_MAX mean, and should they be handled differently when converting grouping rules?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/format.rs` around lines 49 - 67, parse_grouping currently always appends a trailing 0 (meaning "repeat last group") even when the original C terminator was CHAR_MAX (meaning "stop grouping"); modify parse_grouping to inspect the terminator byte after the loop and only append 0 if the terminator was actually 0, and do not append anything when the terminator was libc::c_char::MAX. Locate the unsafe fn parse_grouping(grouping: *const libc::c_char) -> Vec<u8>, keep the loop that advances ptr while *ptr is not 0 or libc::c_char::MAX, then check the byte at ptr: if *ptr == 0 { if !result.is_empty() { result.push(0); } } else /* CHAR_MAX */ { /* leave result as-is */ } and return result.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@crates/common/src/format.rs`:
- Line 564: The code incorrectly wraps locale-formatted strings with AsciiStr
and uses byte-based char_len(), causing wrong width/padding when
thousands_sep/decimal_point are multibyte; update the call sites (the three
places passing AsciiStr instances) to pass plain &str (the formatted magnitude
strings and any sign_str inputs) into format_sign_and_align, and change
format_sign_and_align to accept &str instead of AsciiStr and compute widths
using Unicode character count (e.g., s.chars().count()) when calculating
padding/alignment so multibyte separators are measured correctly.
- Around line 525-536: apply_locale_formatting currently passes the entire
pre-decimal chunk to insert_locale_grouping which corrupts non-digit mantissas
like "1e+06", "inf", or "nan"; fix by isolating an optional leading sign and the
contiguous digit sequence in int_part, call insert_locale_grouping only on that
digit sequence (or on "0" when no digits), then reassemble sign + grouped_digits
+ trailing_suffix (the remaining non-digit chars) before attaching the
fractional part and decimal_point; update apply_locale_formatting to parse
int_part into (sign, digit_run, suffix) and use insert_locale_grouping on
digit_run so symbols like apply_locale_formatting, insert_locale_grouping,
magnitude_str, and LocaleInfo are the only points changed.
- Around line 650-654: The locale-specific branch builds magnitude_str and calls
self.format_sign_and_align without validating complex-number-specific flags,
allowing invalid AlignmentFlag/ZeroPadding specs to pass; add the same
validation used in format_complex (the checks around AlignmentFlag and
ZeroPadding in the format_complex path) before formatting here or refactor to
call the shared validator, then only proceed to call
self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "",
FormatAlign::Right) if the validation passes so invalid specs are rejected
consistently.
In `@crates/vm/src/format.rs`:
- Around line 49-67: parse_grouping currently always appends a trailing 0
(meaning "repeat last group") even when the original C terminator was CHAR_MAX
(meaning "stop grouping"); modify parse_grouping to inspect the terminator byte
after the loop and only append 0 if the terminator was actually 0, and do not
append anything when the terminator was libc::c_char::MAX. Locate the unsafe fn
parse_grouping(grouping: *const libc::c_char) -> Vec<u8>, keep the loop that
advances ptr while *ptr is not 0 or libc::c_char::MAX, then check the byte at
ptr: if *ptr == 0 { if !result.is_empty() { result.push(0); } } else /* CHAR_MAX
*/ { /* leave result as-is */ } and return result.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 87132815-7bf6-459b-a039-b60fd04f64a4
⛔ Files ignored due to path filters (2)
Lib/test/test_format.pyis excluded by!Lib/**Lib/test/test_types.pyis excluded by!Lib/**
📒 Files selected for processing (5)
crates/common/src/format.rscrates/vm/src/builtins/complex.rscrates/vm/src/builtins/float.rscrates/vm/src/builtins/int.rscrates/vm/src/format.rs
…ustPython#7350) * Implement locale-aware 'n' format specifier for int, float, complex Add LocaleInfo struct and locale-aware formatting methods to FormatSpec. The 'n' format type now reads thousands_sep, decimal_point, and grouping from C localeconv() and applies proper locale-based number grouping. Remove @unittest.skip from test_format.test_locale. * Fix complex 'n' format and remove locale expectedFailure markers Rewrite format_complex_locale to reuse format_complex_re_im, matching formatter_unicode.c: add_parens=0 and skip_re=0 for 'n' type. Remove @expectedfailure from test_float__format__locale and test_int__format__locale in test_types.py. * Auto-format: cargo fmt --all --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Add LocaleInfo struct and locale-aware formatting methods to FormatSpec. The 'n' format type now reads thousands_sep, decimal_point, and grouping from C localeconv() and applies proper locale-based number grouping. Remove @unittest.skip from test_format.test_locale.
Summary by CodeRabbit