Skip to content

Implement locale-aware 'n' format specifier for int, float, complex#7350

Merged
youknowone merged 3 commits intoRustPython:mainfrom
youknowone:locale-n
Mar 5, 2026
Merged

Implement locale-aware 'n' format specifier for int, float, complex#7350
youknowone merged 3 commits intoRustPython:mainfrom
youknowone:locale-n

Conversation

@youknowone
Copy link
Copy Markdown
Member

@youknowone youknowone commented Mar 4, 2026

Add LocaleInfo struct and locale-aware formatting methods to FormatSpec. The 'n' format type now reads thousands_sep, decimal_point, and grouping from C localeconv() and applies proper locale-based number grouping. Remove @unittest.skip from test_format.test_locale.

Summary by CodeRabbit

  • New Features
    • Locale-aware number formatting for integers, floats, and complex numbers, respecting thousands separator, decimal point, and grouping with sensible platform fallbacks.
  • Bug Fixes
    • Format-spec parse errors are now raised as proper Python exceptions, improving clarity and consistency of formatting error reporting.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds locale-aware "n" formatting: new public LocaleInfo, locale-capable FormatSpec methods and helpers, platform-specific get_locale_info() (Unix via localeconv()), and updates builtins int/float/complex __format__ to map parse errors to Python exceptions and call locale-aware formatters when requested.

Changes

Cohort / File(s) Summary
Core Locale Formatting
crates/common/src/format.rs
Add public LocaleInfo and locale-aware FormatSpec APIs: has_locale_format, format_int_locale, format_float_locale, format_complex_locale; add internal helpers insert_locale_grouping and apply_locale_formatting.
Locale Info Retrieval
crates/vm/src/format.rs
Add get_locale_info() (Unix: uses libc::localeconv() and parse_grouping() to build LocaleInfo; non-Unix: returns defaults); add parse_grouping() (unsafe, Unix-only).
Builtin Formatting Call Sites
crates/vm/src/builtins/int.rs, crates/vm/src/builtins/float.rs, crates/vm/src/builtins/complex.rs
__format__ implementations now convert FormatSpec::parse errors into Python exceptions and branch to locale-aware formatting: if has_locale_format() use get_locale_info()format_*_locale, otherwise use existing non-locale format_*. Error-to-exception mapping adjusted.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant Builtin as Builtin.__format__
  participant Spec as FormatSpec.parse / has_locale_format
  participant Locale as get_locale_info()
  participant Formatter as FormatSpec::format_*_locale

  Caller->>Builtin: __format__(value, spec)
  Builtin->>Spec: parse(spec)
  alt parse error
    Spec-->>Builtin: Err
    Builtin-->>Caller: raise PyException
  else parsed
    Spec->>Spec: has_locale_format()?
    alt locale requested
      Builtin->>Locale: get_locale_info()
      Locale-->>Builtin: LocaleInfo
      Builtin->>Formatter: format_*_locale(value, LocaleInfo)
      Formatter-->>Builtin: formatted string / Err
      Builtin-->>Caller: return or raise PyException
    else default path
      Builtin->>Spec: format_* (non-locale)
      Spec-->>Builtin: formatted string / Err
      Builtin-->>Caller: return or raise PyException
    end
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐇
I hop through digits, dot and comma bright,
I tuck in groups by moonlit locale light,
Unix whispers separators soft and sweet,
Floats, ints, complex line up neat on their feet,
I leave a carrot-shaped comma for tonight.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: implementing locale-aware 'n' format specifier support for int, float, and complex number types.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 4, 2026

Code has been automatically formatted

The code in this PR has been formatted using:

  • cargo fmt --all
    Please pull the latest changes before pushing again:
git pull origin locale-n

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 4, 2026

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[x] test: cpython/Lib/test/test_format.py (TODO: 6)

dependencies:

dependent tests: (no tests depend on format)

[x] lib: cpython/Lib/types.py
[ ] test: cpython/Lib/test/test_types.py (TODO: 8)

dependencies:

  • types

dependent tests: (52 tests)

  • types: test_annotationlib test_ast test_asyncgen test_asyncio test_builtin test_call test_code test_collections test_compile test_compiler_assemble test_coroutines test_decorators test_descr test_dis test_doctest test_dtrace test_dynamicclassattribute test_email test_enum test_exception_group test_fstring test_funcattrs test_generators test_genericalias test_hmac test_importlib test_inspect test_listcomps test_marshal test_monitoring test_opcache test_os test_positional_only_arg test_pprint test_pyclbr test_pydoc test_raise test_string test_subclassinit test_subprocess test_tempfile test_threading test_trace test_traceback test_type_aliases test_type_annotations test_type_params test_types test_typing test_unittest test_xml_etree test_xml_etree_c

Legend:

  • [+] path exists in CPython
  • [x] up-to-date, [ ] outdated

@youknowone youknowone marked this pull request as ready for review March 5, 2026 00:14
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/common/src/format.rs`:
- Around line 640-690: The magnitude components are being formatted with
sign/width/alignment from the original spec (via magnitude_format.format_float),
causing duplicated signs and inner padding; also the real-part suppression
currently hides a zero real part even when an explicit complex format type was
requested. Fix by building magnitude_format from self but explicitly clear sign,
width, align, fill, zero_pad and alternate (so magnitude_format has no
sign/width/alignment) before calling magnitude_format.format_float /
Self::apply_locale_formatting, and change the re suppression condition to only
skip the real part when num.re == 0.0 && !num.re.is_sign_negative() AND the
original FormatSpec did not have an explicit format type (e.g., check
self.format_type.is_none() or the equivalent flag used to detect an explicit
type) so typed complex formatting still shows the real component; use the
existing symbols magnitude_format, format_float, apply_locale_formatting, and
the re_str logic to implement these changes.
- Around line 526-536: apply_locale_formatting currently passes the entire
pre-decimal chunk (magnitude_str split by '.') to insert_locale_grouping which
corrupts exponent or non-digit suffixes like "1e+06"; update
apply_locale_formatting to isolate the contiguous digit sequence (allowing an
optional leading sign) from the integer part, call Self::insert_locale_grouping
only on that digit sequence, then rebuild the integer part by concatenating the
grouped digits with any trailing non-digit suffix (e.g., exponent markers), and
finally reattach the decimal point/frac part using locale.decimal_point; refer
to apply_locale_formatting, Self::insert_locale_grouping, magnitude_str, and
locale.decimal_point to locate where to implement this change.

In `@crates/vm/src/format.rs`:
- Around line 49-67: parse_grouping currently always appends a trailing 0
(meaning "repeat last") even when the C sequence terminated with CHAR_MAX
(meaning "stop grouping entirely"). Modify parse_grouping to distinguish the
terminator: iterate over grouping (pointer grouping) until you reach 0 or
libc::c_char::MAX, keep the bytes you read in result, then if the terminator was
a NUL (0) append a trailing 0 to indicate "repeat last", but if the terminator
was CHAR_MAX do not append anything (stop grouping). Update the doc comment to
reflect that CHAR_MAX means "stop grouping" and NUL means "repeat last". Use the
existing symbols parse_grouping, grouping, result, ptr, libc::c_char::MAX and
the checked terminator to implement this behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: b91f907e-b731-492a-a941-15d354c29a80

📥 Commits

Reviewing files that changed from the base of the PR and between 5c29074 and 8ab371e.

⛔ Files ignored due to path filters (1)
  • Lib/test/test_format.py is excluded by !Lib/**
📒 Files selected for processing (5)
  • crates/common/src/format.rs
  • crates/vm/src/builtins/complex.rs
  • crates/vm/src/builtins/float.rs
  • crates/vm/src/builtins/int.rs
  • crates/vm/src/format.rs

Comment on lines +640 to +690
let magnitude_format = FormatSpec {
format_type: Some(format_type),
precision: Some(if precision == 0 { 1 } else { precision }),
..*self
};

let re_str = if num.re == 0.0 && !num.re.is_sign_negative() {
None
} else {
let re_sign = if num.re.is_sign_negative() && !num.re.is_nan() {
"-"
} else {
""
};
let re_mag = magnitude_format.format_float(num.re.abs())?;
let re_grouped = Self::apply_locale_formatting(re_mag, locale);
Some(format!("{re_sign}{re_grouped}"))
};

let im_sign = if num.im.is_sign_negative() && !num.im.is_nan() {
"-"
} else {
""
};
let im_mag = magnitude_format.format_float(num.im.abs())?;
let im_grouped = Self::apply_locale_formatting(im_mag, locale);

let has_re = re_str.is_some();
let formatted = if let Some(re_str) = re_str {
let sep = if num.im >= 0.0 || num.im.is_nan() {
"+"
} else {
""
};
format!("({re_str}{sep}{im_sign}{im_grouped}j)")
} else {
format!("{im_sign}{im_grouped}j")
};

let format_sign = self.sign.unwrap_or(FormatSign::Minus);
let sign_str = match format_sign {
FormatSign::Plus => "+",
FormatSign::Minus => "",
FormatSign::MinusOrSpace => " ",
};

self.format_sign_and_align(
&AsciiStr::new(&formatted),
if has_re { "" } else { sign_str },
FormatAlign::Right,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

format_complex_locale is formatting component magnitudes with sign/width logic.

Using format_float here injects per-component sign/alignment/width, then signs are added again in this function. That can yield invalid strings (-+..., ++...) and inner padding artifacts.

Also, the real-part suppression condition drops 0 real part even with an explicit type, which diverges from the existing typed complex formatting flow.

💡 Suggested fix
-        let re_str = if num.re == 0.0 && !num.re.is_sign_negative() {
+        let re_str = if num.re == 0.0 && !num.re.is_sign_negative() && self.format_type.is_none() {
             None
         } else {
             let re_sign = if num.re.is_sign_negative() && !num.re.is_nan() {
                 "-"
             } else {
                 ""
             };
-            let re_mag = magnitude_format.format_float(num.re.abs())?;
+            let re_mag = magnitude_format.format_complex_float(num.re.abs())?;
             let re_grouped = Self::apply_locale_formatting(re_mag, locale);
             Some(format!("{re_sign}{re_grouped}"))
         };
@@
-        let im_mag = magnitude_format.format_float(num.im.abs())?;
+        let im_mag = magnitude_format.format_complex_float(num.im.abs())?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` around lines 640 - 690, The magnitude components
are being formatted with sign/width/alignment from the original spec (via
magnitude_format.format_float), causing duplicated signs and inner padding; also
the real-part suppression currently hides a zero real part even when an explicit
complex format type was requested. Fix by building magnitude_format from self
but explicitly clear sign, width, align, fill, zero_pad and alternate (so
magnitude_format has no sign/width/alignment) before calling
magnitude_format.format_float / Self::apply_locale_formatting, and change the re
suppression condition to only skip the real part when num.re == 0.0 &&
!num.re.is_sign_negative() AND the original FormatSpec did not have an explicit
format type (e.g., check self.format_type.is_none() or the equivalent flag used
to detect an explicit type) so typed complex formatting still shows the real
component; use the existing symbols magnitude_format, format_float,
apply_locale_formatting, and the re_str logic to implement these changes.

Comment on lines +49 to +67
/// Parse C `lconv.grouping` into a `Vec<u8>`.
/// Reads bytes until 0 or CHAR_MAX, then appends 0 (meaning "repeat last group").
#[cfg(unix)]
unsafe fn parse_grouping(grouping: *const libc::c_char) -> Vec<u8> {
let mut result = Vec::new();
if grouping.is_null() {
return result;
}
unsafe {
let mut ptr = grouping;
while ![0, libc::c_char::MAX].contains(&*ptr) {
result.push(*ptr as u8);
ptr = ptr.add(1);
}
}
if !result.is_empty() {
result.push(0);
}
result
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

C POSIX localeconv grouping lconv.grouping CHAR_MAX 0 terminator semantics

💡 Result:

In struct lconv (from localeconv()), grouping/mon_grouping is a char * that points to a sequence of byte values (often treated like an unsigned-byte array) describing how to group digits to the left of the radix point:

  • grouping[0] = size of the rightmost group; grouping[1] = next group to the left; etc. [1]
  • If an element is 0, it means repeat the previous group size for all remaining more-significant digits. Because 0 is also '\0', this is typically where the sequence ends (e.g. "\3\0" means “group by 3 forever”). [1]
  • If an element is CHAR_MAX, it means stop grouping: no further separators are inserted for any more-significant digits. [1]
  • If grouping is the empty string (""), that means no grouping information / no grouping (implementation/locale dependent, but common). [2]

Notes:

  • Some platforms use -1 (in a char) instead of CHAR_MAX due to char signedness; robust code often compares after casting to unsigned char. [3]

Sources: Linux locale(7) description of grouping semantics (including 0 and CHAR_MAX) [1]; NetBSD localeconv(3) notes on CHAR_MAX as “unavailable” in some fields/locales [2]; Gnulib portability note about -1 vs CHAR_MAX on some platforms [3].

[1] (man7.org) [2] (man.netbsd.org) [3] (gnu.org)

Citations:


🏁 Script executed:

# First, let's check if there are any clippy or format issues with the current implementation
cd crates/vm/src && head -70 format.rs | tail -25

Repository: RustPython/RustPython

Length of output: 674


🏁 Script executed:

cd crates/vm && grep -n "parse_grouping\|get_locale_info" src/format.rs | head -20

Repository: RustPython/RustPython

Length of output: 296


🏁 Script executed:

# Verify the suggested fix more carefully by checking the actual current state
cd crates/vm/src && cat -n format.rs | sed -n '49,68p'

Repository: RustPython/RustPython

Length of output: 761


parse_grouping conflates CHAR_MAX with repeat-last behavior, violating C/POSIX semantics.

When the C grouping sequence ends with CHAR_MAX, it means "stop grouping entirely". The current implementation appends 0 unconditionally, which incorrectly translates this to "repeat previous group forever". Additionally, the doc comment is misleading.

Suggested fix
 unsafe fn parse_grouping(grouping: *const libc::c_char) -> Vec<u8> {
     let mut result = Vec::new();
     if grouping.is_null() {
         return result;
     }
-    unsafe {
-        let mut ptr = grouping;
-        while ![0, libc::c_char::MAX].contains(&*ptr) {
-            result.push(*ptr as u8);
-            ptr = ptr.add(1);
-        }
-    }
-    if !result.is_empty() {
+    let mut ptr = grouping;
+    let mut repeat_last = false;
+    while !ptr.is_null() {
+        let v = *ptr;
+        if v == 0 {
+            repeat_last = true;
+            break;
+        }
+        if v == libc::c_char::MAX {
+            repeat_last = false;
+            break;
+        }
+        result.push(v as u8);
+        ptr = ptr.add(1);
+    }
+    if repeat_last && !result.is_empty() {
         result.push(0);
     }
     result
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/format.rs` around lines 49 - 67, parse_grouping currently
always appends a trailing 0 (meaning "repeat last") even when the C sequence
terminated with CHAR_MAX (meaning "stop grouping entirely"). Modify
parse_grouping to distinguish the terminator: iterate over grouping (pointer
grouping) until you reach 0 or libc::c_char::MAX, keep the bytes you read in
result, then if the terminator was a NUL (0) append a trailing 0 to indicate
"repeat last", but if the terminator was CHAR_MAX do not append anything (stop
grouping). Update the doc comment to reflect that CHAR_MAX means "stop grouping"
and NUL means "repeat last". Use the existing symbols parse_grouping, grouping,
result, ptr, libc::c_char::MAX and the checked terminator to implement this
behavior.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
crates/common/src/format.rs (1)

525-536: ⚠️ Potential issue | 🟠 Major

apply_locale_formatting still corrupts exponent notation for numbers without decimal points.

The previous review flagged this issue and a fix was suggested, but the current code doesn't include that fix. When format_general produces output like "1e+06" (no decimal point), splitn(2, '.') yields the entire string as int_part, and insert_locale_grouping will corrupt it by inserting separators into non-digit characters.

💡 Suggested fix: handle exponent notation
 fn apply_locale_formatting(magnitude_str: String, locale: &LocaleInfo) -> String {
+    // Separate exponent part before processing
+    let (mantissa, exponent) = match magnitude_str.find(|c| c == 'e' || c == 'E') {
+        Some(pos) => (&magnitude_str[..pos], Some(&magnitude_str[pos..])),
+        None => (magnitude_str.as_str(), None),
+    };
+
-    let mut parts = magnitude_str.splitn(2, '.');
+    let mut parts = mantissa.splitn(2, '.');
     let int_part = parts.next().unwrap();
     let grouped = Self::insert_locale_grouping(int_part, locale);

-    if let Some(frac_part) = parts.next() {
+    let result = if let Some(frac_part) = parts.next() {
         format!("{grouped}{}{frac_part}", locale.decimal_point)
     } else {
         grouped
+    };
+
+    match exponent {
+        Some(exp) => format!("{result}{exp}"),
+        None => result,
     }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` around lines 525 - 536, apply_locale_formatting
currently corrupts exponent notation like "1e+06" because it applies
insert_locale_grouping to the whole int_part; fix by detecting an exponent ('e'
or 'E') in the original magnitude_str (or int_part) and split off the exponent
suffix before grouping, then only call insert_locale_grouping on the mantissa
integer portion (preserving any leading sign), finally reassemble as
"{grouped_mantissa}{optional_decimal_point}{frac_part}{exponent_suffix}" so
exponents remain untouched; update apply_locale_formatting to locate 'e'/'E',
separate exponent, preserve sign, call insert_locale_grouping on the numeric
digits, and then concatenate the exponent back.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/common/src/format.rs`:
- Line 564: The padding is computed using AsciiStr.char_len() (byte length),
which miscounts multi-byte locale separators in magnitude_str from
apply_locale_formatting(); update format_sign_and_align and its caller site so
magnitude_str is passed/treated as a Unicode &str (not AsciiStr) and compute
visual length with a Unicode-aware count (e.g., s.chars().count() or
UnicodeSegmentation::graphemes(true).count()) when calculating width/fill;
replace references to AsciiStr::new(&magnitude_str) and char_len() in
format_sign_and_align with the Unicode-aware length and adjust any
types/signatures that assumed AsciiStr accordingly.

---

Duplicate comments:
In `@crates/common/src/format.rs`:
- Around line 525-536: apply_locale_formatting currently corrupts exponent
notation like "1e+06" because it applies insert_locale_grouping to the whole
int_part; fix by detecting an exponent ('e' or 'E') in the original
magnitude_str (or int_part) and split off the exponent suffix before grouping,
then only call insert_locale_grouping on the mantissa integer portion
(preserving any leading sign), finally reassemble as
"{grouped_mantissa}{optional_decimal_point}{frac_part}{exponent_suffix}" so
exponents remain untouched; update apply_locale_formatting to locate 'e'/'E',
separate exponent, preserve sign, call insert_locale_grouping on the numeric
digits, and then concatenate the exponent back.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 38611da9-2a7f-42db-803e-f50a9b839612

📥 Commits

Reviewing files that changed from the base of the PR and between 8ab371e and be787c6.

⛔ Files ignored due to path filters (1)
  • Lib/test/test_types.py is excluded by !Lib/**
📒 Files selected for processing (1)
  • crates/common/src/format.rs

},
};

self.format_sign_and_align(&AsciiStr::new(&magnitude_str), sign_str, FormatAlign::Right)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the AsciiStr implementation
rg -n "impl.*AsciiStr" crates/common/src/format.rs -A 30 | head -100

Repository: RustPython/RustPython

Length of output: 1208


🏁 Script executed:

# Check the specific lines mentioned in the comment
sed -n '560,570p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 362


🏁 Script executed:

# Also check lines 604 and 656
sed -n '600,610p' crates/common/src/format.rs
sed -n '652,662p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 774


🏁 Script executed:

# Look for the char_len method definition
rg -n "fn char_len" crates/common/src/format.rs -A 5

Repository: RustPython/RustPython

Length of output: 305


🏁 Script executed:

# Check how locale separators are inserted into the magnitude_str
rg -n "thousands_sep\|decimal_point" crates/common/src/format.rs -B 2 -A 2

Repository: RustPython/RustPython

Length of output: 47


🏁 Script executed:

# Look for locale definition and separator types
rg -n "struct.*Locale\|decimal_point\|thousands_sep" crates/common/src/ -A 3

Repository: RustPython/RustPython

Length of output: 47


🏁 Script executed:

# Search for Locale struct and locale-related code
find crates -name "*.rs" | xargs grep -l "locale\|Locale" | head -5

Repository: RustPython/RustPython

Length of output: 204


🏁 Script executed:

# Search more broadly for where locale is used in format.rs
rg -n "locale\." crates/common/src/format.rs -B 2 -A 2

Repository: RustPython/RustPython

Length of output: 1144


🏁 Script executed:

# Look for how magnitude_str is constructed
rg -n "magnitude_str.*format\|let.*magnitude_str" crates/common/src/format.rs -B 2 -A 2 | head -50

Repository: RustPython/RustPython

Length of output: 47


🏁 Script executed:

# Find LocaleInfo definition
rg -n "struct LocaleInfo\|^pub struct.*Locale" crates/common/src/format.rs -A 10

Repository: RustPython/RustPython

Length of output: 47


🏁 Script executed:

# Check the complete insert_locale_grouping function
sed -n '482,535p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 1970


🏁 Script executed:

# Look at the format_float_locale function to see how it constructs magnitude_str
sed -n '567,600p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 1303


🏁 Script executed:

# Check the locale module to understand the locale structure
head -100 crates/stdlib/src/locale.rs

Repository: RustPython/RustPython

Length of output: 3356


🏁 Script executed:

# Find format_sign_and_align implementation
rg -n "fn format_sign_and_align" crates/common/src/format.rs -A 20

Repository: RustPython/RustPython

Length of output: 851


🏁 Script executed:

# Check how char_len is used in format_sign_and_align
sed -n '900,1000p' crates/common/src/format.rs | grep -n "char_len\|width"

Repository: RustPython/RustPython

Length of output: 184


🏁 Script executed:

# Find the complete format_sign_and_align function
rg -n "fn format_sign_and_align" crates/common/src/format.rs -A 30

Repository: RustPython/RustPython

Length of output: 1253


🏁 Script executed:

# Let's look for where LocaleInfo is defined
rg -n "struct LocaleInfo\|pub.*LocaleInfo" crates/common/src/ -A 5

Repository: RustPython/RustPython

Length of output: 47


🏁 Script executed:

# Check if there are other implementations of CharLen
rg -n "impl CharLen" crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 101


🏁 Script executed:

# Let's look for a Unicode-aware version or other CharLen implementations
rg -n "impl.*CharLen" crates/ -A 3

Repository: RustPython/RustPython

Length of output: 1007


🏁 Script executed:

# Check if locales can actually have non-ASCII separators
rg -n "thousands_sep\|decimal_point" crates/stdlib/src/locale.rs -B 2 -A 2 | head -30

Repository: RustPython/RustPython

Length of output: 47


🏁 Script executed:

# Let me check the full context - look for where apply_locale_formatting is called and how its result is used
sed -n '567,610p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 1569


🏁 Script executed:

# Let's verify that the magnitude_str can contain non-ASCII characters
# Check what types are used for thousands_sep and decimal_point
rg -n "pub struct LocaleInfo\|thousands_sep:\|decimal_point:" crates/common/src/ -A 2

Repository: RustPython/RustPython

Length of output: 47


🏁 Script executed:

# Let's trace through what happens - check if the string from apply_locale_formatting
# is used directly or if there's any normalization
sed -n '527,545p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 687


🏁 Script executed:

# Check how sign_str.len() is used - does it account for multi-byte characters?
sed -n '985,990p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 331


🏁 Script executed:

# Find the LocaleInfo struct definition
rg -n "^pub struct LocaleInfo\|^struct LocaleInfo" crates/common/src/format.rs -A 10

Repository: RustPython/RustPython

Length of output: 47


🏁 Script executed:

# Check where LocaleInfo is created/initialized
rg -n "LocaleInfo {" crates/common/src/format.rs -B 2 -A 5

Repository: RustPython/RustPython

Length of output: 152


🏁 Script executed:

# Let's trace how locale.thousands_sep is set - look at the locale module more carefully
sed -n '1,150p' crates/stdlib/src/locale.rs | grep -A 5 -B 5 "thousands_sep"

Repository: RustPython/RustPython

Length of output: 559


🏁 Script executed:

# Search for LocaleInfo definition differently
rg "LocaleInfo" crates/common/src/format.rs | head -5

Repository: RustPython/RustPython

Length of output: 312


🏁 Script executed:

# Get context around the first LocaleInfo reference
rg -n "LocaleInfo" crates/common/src/format.rs | head -1

Repository: RustPython/RustPython

Length of output: 91


🏁 Script executed:

# Look for where LocaleInfo is defined - broader search
rg -n "struct LocaleInfo" crates/ -A 10

Repository: RustPython/RustPython

Length of output: 772


🏁 Script executed:

# Now let's verify: check what sign_str can be and if it has the same byte-length issue
sed -n '590,605p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 611


🏁 Script executed:

# Check the similar code in format_int_locale and format_complex_locale for the same issue
sed -n '620,650p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 1283


🏁 Script executed:

# Let's also verify if there are other CharLen implementations that correctly handle Unicode
sed -n '1520,1535p' crates/vm/src/builtins/str.rs

Repository: RustPython/RustPython

Length of output: 429


🏁 Script executed:

# Final verification: check if French locale actually uses non-breaking space
python3 << 'EOF'
import locale
# Check what the French locale uses for thousands separator
# Note: this is for documentation purposes - we know from Unicode that:
# - \u{00A0} is non-breaking space (NBSP) - 2 bytes in UTF-8
# - French locale typically uses NBSP as thousands separator
# - This is confirmed by CPython behavior

print("UTF-8 byte length of non-breaking space: ", len("\u00A0".encode('utf-8')))
print("Character count of '1\u00A0234': ", len("1\u00A02 34"))
print("Byte length of '1\u00A0234': ", len("1\u00A0234".encode('utf-8')))
EOF

Repository: RustPython/RustPython

Length of output: 166


🏁 Script executed:

# More direct verification
python3 << 'EOF'
nbsp_string = "1\u00A0234"  # 1, NBSP, 2, 3, 4
print(f"String: {repr(nbsp_string)}")
print(f"Character count: {len(nbsp_string)}")
print(f"Byte count: {len(nbsp_string.encode('utf-8'))}")
print(f"Difference: {len(nbsp_string.encode('utf-8')) - len(nbsp_string)} bytes extra")
EOF

Repository: RustPython/RustPython

Length of output: 142


AsciiStr returns byte length instead of character count, causing incorrect width padding with non-ASCII locale separators.

In format_sign_and_align() (line 988), the width calculation uses magnitude_str.char_len() which AsciiStr implements to return self.inner.len() (byte length). However, magnitude_str from apply_locale_formatting() can contain multi-byte UTF-8 characters from locale separators—for example, French locales use non-breaking space (\u{00A0}, 2 bytes in UTF-8) as the thousands separator. This causes the fill character calculation to be off by 1+ bytes when formatting with width specs like "{:15n}".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` at line 564, The padding is computed using
AsciiStr.char_len() (byte length), which miscounts multi-byte locale separators
in magnitude_str from apply_locale_formatting(); update format_sign_and_align
and its caller site so magnitude_str is passed/treated as a Unicode &str (not
AsciiStr) and compute visual length with a Unicode-aware count (e.g.,
s.chars().count() or UnicodeSegmentation::graphemes(true).count()) when
calculating width/fill; replace references to AsciiStr::new(&magnitude_str) and
char_len() in format_sign_and_align with the Unicode-aware length and adjust any
types/signatures that assumed AsciiStr accordingly.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
crates/common/src/format.rs (2)

564-564: ⚠️ Potential issue | 🟡 Minor

Locale width/alignment uses byte length instead of character count.

At Line 564, Line 604, and Line 656, locale-formatted strings are wrapped as AsciiStr. AsciiStr::char_len() (Line 1040) uses byte length, so width padding is wrong for multibyte separators (e.g., NBSP thousands separator).

💡 Proposed direction
- self.format_sign_and_align(&AsciiStr::new(&magnitude_str), sign_str, FormatAlign::Right)
+ self.format_sign_and_align(&UnicodeStr::new(&magnitude_str), sign_str, FormatAlign::Right)
// Add alongside AsciiStr
struct UnicodeStr<'a> {
    inner: &'a str,
}

impl<'a> UnicodeStr<'a> {
    const fn new(inner: &'a str) -> Self {
        Self { inner }
    }
}

impl CharLen for UnicodeStr<'_> {
    fn char_len(&self) -> usize {
        self.inner.chars().count()
    }
}

impl Deref for UnicodeStr<'_> {
    type Target = str;
    fn deref(&self) -> &Self::Target {
        self.inner
    }
}
#!/bin/bash
# Verify current locale-format callsites and CharLen behavior.
rg -n "AsciiStr::new\\(&magnitude_str\\)" crates/common/src/format.rs
sed -n '1038,1042p' crates/common/src/format.rs

Also applies to: 604-604, 656-656

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` at line 564, Locale-formatted strings are being
wrapped with AsciiStr (via AsciiStr::new(&magnitude_str)) but
AsciiStr::char_len() returns byte length, breaking width/alignment for multibyte
separators; introduce a UnicodeStr<'a> wrapper (with new(inner: &'a str)),
implement CharLen for it using self.inner.chars().count(), and Deref<Target=str>
so it behaves like &str, then replace the locale-related AsciiStr::new(...)
callsites (the calls into format_sign_and_align / format_* that pass
&AsciiStr::new(&magnitude_str)) with &UnicodeStr::new(&magnitude_str) and add
any necessary use/imports.

525-536: ⚠️ Potential issue | 🟠 Major

Group only mantissa digits before locale insertion.

At Line 527, the whole pre-. chunk is grouped. Scientific notation without a decimal point (e.g., 1e+06) gets corrupted by separator insertion. This also affects callers at Line 591 and Line 650.

💡 Proposed fix
 fn apply_locale_formatting(magnitude_str: String, locale: &LocaleInfo) -> String {
-    let mut parts = magnitude_str.splitn(2, '.');
-    let int_part = parts.next().unwrap();
-    let grouped = Self::insert_locale_grouping(int_part, locale);
-
-    if let Some(frac_part) = parts.next() {
-        format!("{grouped}{}{frac_part}", locale.decimal_point)
-    } else {
-        grouped
-    }
+    let exp_pos = magnitude_str.find(|c| c == 'e' || c == 'E');
+    let (mantissa, exponent) = exp_pos
+        .map(|i| (&magnitude_str[..i], Some(&magnitude_str[i..])))
+        .unwrap_or((magnitude_str.as_str(), None));
+
+    let (int_part, frac_part) = mantissa
+        .split_once('.')
+        .map_or((mantissa, None), |(i, f)| (i, Some(f)));
+
+    let grouped = if int_part.chars().all(|c| c.is_ascii_digit()) {
+        Self::insert_locale_grouping(int_part, locale)
+    } else {
+        int_part.to_owned()
+    };
+
+    let mut out = if let Some(frac_part) = frac_part {
+        format!("{grouped}{}{frac_part}", locale.decimal_point)
+    } else {
+        grouped
+    };
+    if let Some(exponent) = exponent {
+        out.push_str(exponent);
+    }
+    out
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` around lines 525 - 536, The current
apply_locale_formatting function applies grouping to the entire pre-decimal
chunk, corrupting scientific exponents like "1e+06"; update
apply_locale_formatting to detect and preserve any exponent suffix (look for 'e'
or 'E' plus the rest) by splitting magnitude_str into mantissa, optional
fractional part, and optional exponent, apply Self::insert_locale_grouping only
to the mantissa's integer digits, then recombine mantissa (with
locale.decimal_point and frac if present) and append the untouched exponent;
ensure magnitude_str, insert_locale_grouping, and locale.decimal_point are used
accordingly so exponents remain unmodified (this change will also fix the
callers that rely on apply_locale_formatting).
crates/vm/src/format.rs (1)

49-67: ⚠️ Potential issue | 🟠 Major

Preserve CHAR_MAX semantics in parse_grouping.

At Line 50 and Line 64, the implementation always appends 0 after parsing non-empty grouping data. If the C sequence ends with CHAR_MAX, this incorrectly changes behavior from “stop grouping” to “repeat last group forever.”

💡 Proposed fix
 /// Parse C `lconv.grouping` into a `Vec<u8>`.
-/// Reads bytes until 0 or CHAR_MAX, then appends 0 (meaning "repeat last group").
+/// Reads bytes until 0 or CHAR_MAX.
+/// - 0 means "repeat last group size"
+/// - CHAR_MAX means "stop grouping"
 #[cfg(unix)]
 unsafe fn parse_grouping(grouping: *const libc::c_char) -> Vec<u8> {
     let mut result = Vec::new();
     if grouping.is_null() {
         return result;
     }
-    unsafe {
-        let mut ptr = grouping;
-        while ![0, libc::c_char::MAX].contains(&*ptr) {
-            result.push(*ptr as u8);
-            ptr = ptr.add(1);
-        }
+    let mut ptr = grouping;
+    let mut repeat_last = false;
+    loop {
+        let v = *ptr;
+        if v == 0 {
+            repeat_last = true;
+            break;
+        }
+        if v == libc::c_char::MAX {
+            break;
+        }
+        result.push(v as u8);
+        ptr = ptr.add(1);
     }
-    if !result.is_empty() {
+    if repeat_last && !result.is_empty() {
         result.push(0);
     }
     result
 }
In POSIX/C `localeconv()`, for `lconv.grouping`, what are the semantics of terminators `0` and `CHAR_MAX`, and should `CHAR_MAX` stop further grouping rather than repeat the previous size?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/format.rs` around lines 49 - 67, The parse_grouping function
currently always appends a 0 terminator after collecting bytes, which
incorrectly changes semantics when the original C sequence ended with CHAR_MAX
(meaning "stop grouping"); modify parse_grouping to detect which terminator
stopped the loop (0 vs CHAR_MAX) by inspecting the terminating byte at the stop
condition and only append a 0 to result if the terminator was 0; reference the
unsafe fn parse_grouping(grouping: *const libc::c_char), the loop that reads
*ptr, and libc::c_char::MAX/0 to implement this conditional behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/common/src/format.rs`:
- Around line 607-657: format_complex_locale skips the same complex-format flag
validation performed in format_complex, allowing invalid '=' alignment and
zero-padding to pass; fix by enforcing the same checks before formatting: in
format_complex_locale (inside impl for FormatSpec) replicate or call the same
validation logic used by format_complex to return
Err(FormatSpecError::AlignmentFlag) when the alignment is '='
(FormatAlign::AfterSign) and Err(FormatSpecError::ZeroPadding) when
zero-fill/zero-padding rules are violated (e.g., fill == '0' with a sign-aware
width), or better yet extract that logic into a small helper (e.g.,
validate_complex_format_flags(&self) -> Result<(), FormatSpecError>) and invoke
it at the start of format_complex_locale so the later calls to
format_complex_re_im and format_sign_and_align behave consistently.

---

Duplicate comments:
In `@crates/common/src/format.rs`:
- Line 564: Locale-formatted strings are being wrapped with AsciiStr (via
AsciiStr::new(&magnitude_str)) but AsciiStr::char_len() returns byte length,
breaking width/alignment for multibyte separators; introduce a UnicodeStr<'a>
wrapper (with new(inner: &'a str)), implement CharLen for it using
self.inner.chars().count(), and Deref<Target=str> so it behaves like &str, then
replace the locale-related AsciiStr::new(...) callsites (the calls into
format_sign_and_align / format_* that pass &AsciiStr::new(&magnitude_str)) with
&UnicodeStr::new(&magnitude_str) and add any necessary use/imports.
- Around line 525-536: The current apply_locale_formatting function applies
grouping to the entire pre-decimal chunk, corrupting scientific exponents like
"1e+06"; update apply_locale_formatting to detect and preserve any exponent
suffix (look for 'e' or 'E' plus the rest) by splitting magnitude_str into
mantissa, optional fractional part, and optional exponent, apply
Self::insert_locale_grouping only to the mantissa's integer digits, then
recombine mantissa (with locale.decimal_point and frac if present) and append
the untouched exponent; ensure magnitude_str, insert_locale_grouping, and
locale.decimal_point are used accordingly so exponents remain unmodified (this
change will also fix the callers that rely on apply_locale_formatting).

In `@crates/vm/src/format.rs`:
- Around line 49-67: The parse_grouping function currently always appends a 0
terminator after collecting bytes, which incorrectly changes semantics when the
original C sequence ended with CHAR_MAX (meaning "stop grouping"); modify
parse_grouping to detect which terminator stopped the loop (0 vs CHAR_MAX) by
inspecting the terminating byte at the stop condition and only append a 0 to
result if the terminator was 0; reference the unsafe fn parse_grouping(grouping:
*const libc::c_char), the loop that reads *ptr, and libc::c_char::MAX/0 to
implement this conditional behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 70a4d2f2-7c14-41b2-aa12-3176ce3a29d5

📥 Commits

Reviewing files that changed from the base of the PR and between be787c6 and 092bbda.

⛔ Files ignored due to path filters (2)
  • Lib/test/test_format.py is excluded by !Lib/**
  • Lib/test/test_types.py is excluded by !Lib/**
📒 Files selected for processing (5)
  • crates/common/src/format.rs
  • crates/vm/src/builtins/complex.rs
  • crates/vm/src/builtins/float.rs
  • crates/vm/src/builtins/int.rs
  • crates/vm/src/format.rs

Comment on lines +607 to +657
/// Format a complex number with locale-aware 'n' format.
pub fn format_complex_locale(
&self,
num: &Complex64,
locale: &LocaleInfo,
) -> Result<String, FormatSpecError> {
// Reuse format_complex_re_im with 'g' type to get the base formatted parts,
// then apply locale grouping. This matches CPython's format_complex_internal:
// 'n' → 'g', add_parens=0, skip_re=0.
let locale_spec = FormatSpec {
format_type: Some(FormatType::GeneralFormat(Case::Lower)),
..*self
};
let (formatted_re, formatted_im) = locale_spec.format_complex_re_im(num)?;

// Apply locale grouping to both parts
let grouped_re = if formatted_re.is_empty() {
formatted_re
} else {
// Split sign from magnitude, apply grouping, recombine
let (sign, mag) = if formatted_re.starts_with('-')
|| formatted_re.starts_with('+')
|| formatted_re.starts_with(' ')
{
formatted_re.split_at(1)
} else {
("", formatted_re.as_str())
};
format!(
"{sign}{}",
Self::apply_locale_formatting(mag.to_string(), locale)
)
};

// formatted_im is like "+1234j" or "-1234j" or "1234j"
// Split sign, magnitude, and 'j' suffix
let im_str = &formatted_im;
let (im_sign, im_rest) = if im_str.starts_with('+') || im_str.starts_with('-') {
im_str.split_at(1)
} else {
("", im_str.as_str())
};
let im_mag = im_rest.strip_suffix('j').unwrap_or(im_rest);
let im_grouped = Self::apply_locale_formatting(im_mag.to_string(), locale);
let grouped_im = format!("{im_sign}{im_grouped}j");

// No parentheses for 'n' format (CPython: add_parens=0)
let magnitude_str = format!("{grouped_re}{grouped_im}");

self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "", FormatAlign::Right)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

format_complex_locale bypasses complex-format flag validation.

At Line 656, the locale path returns formatted output without enforcing the same checks as format_complex (Line 864-Line 870). = alignment and zero-padding should raise AlignmentFlag and ZeroPadding consistently.

💡 Proposed fix
     // No parentheses for 'n' format (CPython: add_parens=0)
     let magnitude_str = format!("{grouped_re}{grouped_im}");
-
-    self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "", FormatAlign::Right)
+    if let Some(FormatAlign::AfterSign) = &self.align {
+        return Err(FormatSpecError::AlignmentFlag);
+    }
+    match &self.fill.unwrap_or(' '.into()).to_char() {
+        Some('0') => Err(FormatSpecError::ZeroPadding),
+        _ => self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "", FormatAlign::Right),
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` around lines 607 - 657, format_complex_locale
skips the same complex-format flag validation performed in format_complex,
allowing invalid '=' alignment and zero-padding to pass; fix by enforcing the
same checks before formatting: in format_complex_locale (inside impl for
FormatSpec) replicate or call the same validation logic used by format_complex
to return Err(FormatSpecError::AlignmentFlag) when the alignment is '='
(FormatAlign::AfterSign) and Err(FormatSpecError::ZeroPadding) when
zero-fill/zero-padding rules are violated (e.g., fill == '0' with a sign-aware
width), or better yet extract that logic into a small helper (e.g.,
validate_complex_format_flags(&self) -> Result<(), FormatSpecError>) and invoke
it at the start of format_complex_locale so the later calls to
format_complex_re_im and format_sign_and_align behave consistently.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (4)
crates/common/src/format.rs (3)

525-536: ⚠️ Potential issue | 🟠 Major

Don’t run grouping across exponent text in apply_locale_formatting.

At Line 527-Line 533, grouping is applied to everything before .; exponent forms like 1e+06 can be altered incorrectly.

Suggested fix
 fn apply_locale_formatting(magnitude_str: String, locale: &LocaleInfo) -> String {
-    let mut parts = magnitude_str.splitn(2, '.');
-    let int_part = parts.next().unwrap();
-    let grouped = Self::insert_locale_grouping(int_part, locale);
-
-    if let Some(frac_part) = parts.next() {
-        format!("{grouped}{}{frac_part}", locale.decimal_point)
-    } else {
-        grouped
-    }
+    let exp_pos = magnitude_str.find(|c| c == 'e' || c == 'E');
+    let (mantissa, exponent) = if let Some(pos) = exp_pos {
+        (&magnitude_str[..pos], Some(&magnitude_str[pos..]))
+    } else {
+        (magnitude_str.as_str(), None)
+    };
+    let (int_part, frac_part) = mantissa
+        .split_once('.')
+        .map_or((mantissa, None), |(i, f)| (i, Some(f)));
+
+    let grouped = if int_part.chars().all(|c| c.is_ascii_digit()) {
+        Self::insert_locale_grouping(int_part, locale)
+    } else {
+        int_part.to_owned()
+    };
+
+    let mut out = if let Some(frac_part) = frac_part {
+        format!("{grouped}{}{frac_part}", locale.decimal_point)
+    } else {
+        grouped
+    };
+    if let Some(exponent) = exponent {
+        out.push_str(exponent);
+    }
+    out
 }
For Python format specifier 'n', should locale grouping/decimal substitution affect exponent parts (e.g., '1e+06') or only the mantissa?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` around lines 525 - 536, The current
apply_locale_formatting applies grouping to the whole pre-decimal text and can
corrupt exponent text like "1e+06"; update apply_locale_formatting to first
split magnitude_str into mantissa and exponent by searching for 'e' or 'E'
(preserve the original exponent sign and digits), then perform the existing
splitn('.') and call insert_locale_grouping only on the mantissa's integer part
and replace the decimal point only in the mantissa, finally reattach the
exponent unchanged; reference apply_locale_formatting and insert_locale_grouping
so the mantissa/exponent split and reassembly are done before grouping/decimal
replacement.

607-657: ⚠️ Potential issue | 🟠 Major

Locale complex path should enforce the same invalid-flag checks as format_complex.

format_complex_locale currently skips the =-alignment and zero-padding validation present in format_complex, so invalid specs can slip through.

Suggested fix
 pub fn format_complex_locale(
     &self,
     num: &Complex64,
     locale: &LocaleInfo,
 ) -> Result<String, FormatSpecError> {
+    if let Some(FormatAlign::AfterSign) = &self.align {
+        return Err(FormatSpecError::AlignmentFlag);
+    }
+    if matches!(self.fill.unwrap_or(' '.into()).to_char(), Some('0')) {
+        return Err(FormatSpecError::ZeroPadding);
+    }
+
     // Reuse format_complex_re_im with 'g' type to get the base formatted parts,
     // then apply locale grouping. This matches CPython's format_complex_internal:
     // 'n' → 'g', add_parens=0, skip_re=0.
#!/bin/bash
# Compare validation behavior in locale/non-locale complex formatting
rg -n "fn format_complex_locale|fn format_complex|AlignmentFlag|ZeroPadding" crates/common/src/format.rs -A20 -B10
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` around lines 607 - 657, format_complex_locale
skips the same validation performed by format_complex for illegal '=' alignment
and zero-padding flags, allowing invalid format specs; update
format_complex_locale to run the identical validation path as format_complex
(either call the same validation helper or inline the same checks) before
proceeding—specifically validate AlignmentFlag '=' and zero-padding combinations
on the FormatSpec used (see FormatSpec, format_complex_re_im, format_complex,
format_sign_and_align) and return the same FormatSpecError when flags are
invalid so locale handling cannot bypass the checks.

564-565: ⚠️ Potential issue | 🟡 Minor

Locale width alignment is byte-counted, so multibyte separators can mispad.

These locale paths pass AsciiStr, whose char_len() is byte length; locale separators can be multibyte UTF-8, so width/fill becomes inaccurate.

In Rust, does str::len() return byte length or Unicode scalar count, and what is the recommended way to compute character count for width-sensitive formatting?

Also applies to: 604-605, 656-657

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` around lines 564 - 565, The code calls
format_sign_and_align(&AsciiStr::new(&magnitude_str), sign_str,
FormatAlign::Right) and relies on AsciiStr/char_len() (which yields byte length)
for width calculations, causing mispadding when locale separators are multibyte
UTF‑8; change the logic to use a Unicode-aware display width (e.g., use
unicode_width::UnicodeWidthStr::width on &str or compute grapheme clusters if
needed) instead of byte length, update the call sites that construct AsciiStr
(including other occurrences around format_sign_and_align) to pass a &str or a
precomputed display width, and modify format_sign_and_align to accept and use
the unicode-width result (or accept a width parameter) so padding/fill is based
on displayed character width rather than bytes.
crates/vm/src/format.rs (1)

49-68: ⚠️ Potential issue | 🟠 Major

parse_grouping currently conflates 0 and CHAR_MAX terminators.

At Line 64-Line 66, 0 is appended for any non-empty sequence, which incorrectly turns CHAR_MAX (“stop grouping”) into “repeat previous group”.

Suggested fix
 unsafe fn parse_grouping(grouping: *const libc::c_char) -> Vec<u8> {
     let mut result = Vec::new();
     if grouping.is_null() {
         return result;
     }
-    unsafe {
-        let mut ptr = grouping;
-        while ![0, libc::c_char::MAX].contains(&*ptr) {
-            result.push(*ptr as u8);
-            ptr = ptr.add(1);
-        }
-    }
-    if !result.is_empty() {
+    let mut ptr = grouping;
+    let mut repeat_last = false;
+    loop {
+        let v = unsafe { *ptr };
+        if v == 0 {
+            repeat_last = true;
+            break;
+        }
+        if v == libc::c_char::MAX {
+            break;
+        }
+        result.push(v as u8);
+        ptr = unsafe { ptr.add(1) };
+    }
+    if repeat_last && !result.is_empty() {
         result.push(0);
     }
     result
 }
In POSIX/C localeconv().grouping semantics, what is the difference between a 0 byte and CHAR_MAX terminator, and should they be handled differently when parsing grouping rules?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/format.rs` around lines 49 - 68, The parse_grouping function
conflates CHAR_MAX and NUL terminators by always appending a 0 byte for
non-empty results; change it so the code inspects the terminator byte at the
stopping pointer and only push a trailing 0 (the "repeat last group" marker)
when the terminator was NUL (0), not when it was libc::c_char::MAX; update
unsafe fn parse_grouping to stop on either terminator as before but after the
loop check *ptr (or the last read byte) and conditionally push 0 only if that
terminator == 0, leaving CHAR_MAX-terminated sequences without the appended 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@crates/common/src/format.rs`:
- Around line 525-536: The current apply_locale_formatting applies grouping to
the whole pre-decimal text and can corrupt exponent text like "1e+06"; update
apply_locale_formatting to first split magnitude_str into mantissa and exponent
by searching for 'e' or 'E' (preserve the original exponent sign and digits),
then perform the existing splitn('.') and call insert_locale_grouping only on
the mantissa's integer part and replace the decimal point only in the mantissa,
finally reattach the exponent unchanged; reference apply_locale_formatting and
insert_locale_grouping so the mantissa/exponent split and reassembly are done
before grouping/decimal replacement.
- Around line 607-657: format_complex_locale skips the same validation performed
by format_complex for illegal '=' alignment and zero-padding flags, allowing
invalid format specs; update format_complex_locale to run the identical
validation path as format_complex (either call the same validation helper or
inline the same checks) before proceeding—specifically validate AlignmentFlag
'=' and zero-padding combinations on the FormatSpec used (see FormatSpec,
format_complex_re_im, format_complex, format_sign_and_align) and return the same
FormatSpecError when flags are invalid so locale handling cannot bypass the
checks.
- Around line 564-565: The code calls
format_sign_and_align(&AsciiStr::new(&magnitude_str), sign_str,
FormatAlign::Right) and relies on AsciiStr/char_len() (which yields byte length)
for width calculations, causing mispadding when locale separators are multibyte
UTF‑8; change the logic to use a Unicode-aware display width (e.g., use
unicode_width::UnicodeWidthStr::width on &str or compute grapheme clusters if
needed) instead of byte length, update the call sites that construct AsciiStr
(including other occurrences around format_sign_and_align) to pass a &str or a
precomputed display width, and modify format_sign_and_align to accept and use
the unicode-width result (or accept a width parameter) so padding/fill is based
on displayed character width rather than bytes.

In `@crates/vm/src/format.rs`:
- Around line 49-68: The parse_grouping function conflates CHAR_MAX and NUL
terminators by always appending a 0 byte for non-empty results; change it so the
code inspects the terminator byte at the stopping pointer and only push a
trailing 0 (the "repeat last group" marker) when the terminator was NUL (0), not
when it was libc::c_char::MAX; update unsafe fn parse_grouping to stop on either
terminator as before but after the loop check *ptr (or the last read byte) and
conditionally push 0 only if that terminator == 0, leaving CHAR_MAX-terminated
sequences without the appended 0.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 025562ee-f1c0-478c-8018-769534ac57c0

📥 Commits

Reviewing files that changed from the base of the PR and between 092bbda and c92c2d1.

⛔ Files ignored due to path filters (2)
  • Lib/test/test_format.py is excluded by !Lib/**
  • Lib/test/test_types.py is excluded by !Lib/**
📒 Files selected for processing (5)
  • crates/common/src/format.rs
  • crates/vm/src/builtins/complex.rs
  • crates/vm/src/builtins/float.rs
  • crates/vm/src/builtins/int.rs
  • crates/vm/src/format.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • crates/vm/src/builtins/int.rs
  • crates/vm/src/builtins/complex.rs

Add LocaleInfo struct and locale-aware formatting methods to FormatSpec.
The 'n' format type now reads thousands_sep, decimal_point, and grouping
from C localeconv() and applies proper locale-based number grouping.
Remove @unittest.skip from test_format.test_locale.
Rewrite format_complex_locale to reuse format_complex_re_im, matching
formatter_unicode.c: add_parens=0 and skip_re=0 for 'n' type.
Remove @expectedfailure from test_float__format__locale and
test_int__format__locale in test_types.py.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (4)
crates/common/src/format.rs (3)

564-564: ⚠️ Potential issue | 🟡 Minor

Width/padding can be wrong for multibyte locale separators.

Lines 564, 604, and 653 pass locale-formatted strings through AsciiStr; char_len() is byte-based, so width calculation drifts when thousands_sep / decimal_point are multibyte UTF-8.

💡 Proposed fix
- self.format_sign_and_align(&AsciiStr::new(&magnitude_str), sign_str, FormatAlign::Right)
+ self.format_sign_and_align(magnitude_str.as_str(), sign_str, FormatAlign::Right)

And update format_sign_and_align to compute Unicode character count for &str input:

- fn format_sign_and_align<T>(&self, magnitude_str: &T, sign_str: &str, default_align: FormatAlign) -> Result<String, FormatSpecError>
- where
-     T: CharLen + Deref<Target = str>,
+ fn format_sign_and_align(
+     &self,
+     magnitude_str: &str,
+     sign_str: &str,
+     default_align: FormatAlign,
+ ) -> Result<String, FormatSpecError>
 {
     let align = self.align.unwrap_or(default_align);
-    let num_chars = magnitude_str.char_len();
+    let num_chars = magnitude_str.chars().count();

Also applies to: 604-604, 653-653

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` at line 564, The code incorrectly wraps
locale-formatted strings with AsciiStr and uses byte-based char_len(), causing
wrong width/padding when thousands_sep/decimal_point are multibyte; update the
call sites (the three places passing AsciiStr instances) to pass plain &str (the
formatted magnitude strings and any sign_str inputs) into format_sign_and_align,
and change format_sign_and_align to accept &str instead of AsciiStr and compute
widths using Unicode character count (e.g., s.chars().count()) when calculating
padding/alignment so multibyte separators are measured correctly.

525-536: ⚠️ Potential issue | 🟠 Major

apply_locale_formatting can group non-digit mantissas (e.g., scientific notation).

Line 529 groups the full pre-. chunk. Values like 1e+06, inf, or nan can be transformed incorrectly (for example, 1e,+06).

💡 Proposed fix
 fn apply_locale_formatting(magnitude_str: String, locale: &LocaleInfo) -> String {
-    let mut parts = magnitude_str.splitn(2, '.');
-    let int_part = parts.next().unwrap();
-    let grouped = Self::insert_locale_grouping(int_part, locale);
-
-    if let Some(frac_part) = parts.next() {
-        format!("{grouped}{}{frac_part}", locale.decimal_point)
-    } else {
-        grouped
-    }
+    let exp_pos = magnitude_str.find(|c| c == 'e' || c == 'E');
+    let (mantissa, exponent) = if let Some(pos) = exp_pos {
+        (&magnitude_str[..pos], Some(&magnitude_str[pos..]))
+    } else {
+        (magnitude_str.as_str(), None)
+    };
+    let (int_part, frac_part) = mantissa
+        .split_once('.')
+        .map_or((mantissa, None), |(i, f)| (i, Some(f)));
+
+    let grouped = if int_part.chars().all(|c| c.is_ascii_digit()) {
+        Self::insert_locale_grouping(int_part, locale)
+    } else {
+        int_part.to_owned()
+    };
+
+    let mut out = if let Some(frac_part) = frac_part {
+        format!("{grouped}{}{frac_part}", locale.decimal_point)
+    } else {
+        grouped
+    };
+    if let Some(exponent) = exponent {
+        out.push_str(exponent);
+    }
+    out
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` around lines 525 - 536, apply_locale_formatting
currently passes the entire pre-decimal chunk to insert_locale_grouping which
corrupts non-digit mantissas like "1e+06", "inf", or "nan"; fix by isolating an
optional leading sign and the contiguous digit sequence in int_part, call
insert_locale_grouping only on that digit sequence (or on "0" when no digits),
then reassemble sign + grouped_digits + trailing_suffix (the remaining non-digit
chars) before attaching the fractional part and decimal_point; update
apply_locale_formatting to parse int_part into (sign, digit_run, suffix) and use
insert_locale_grouping on digit_run so symbols like apply_locale_formatting,
insert_locale_grouping, magnitude_str, and LocaleInfo are the only points
changed.

650-654: ⚠️ Potential issue | 🟠 Major

Locale complex formatting skips complex-specific flag validation.

Lines 653-654 directly format output, but this path does not enforce the same AlignmentFlag / ZeroPadding checks used in format_complex (Lines 861-867). Invalid specs can pass silently here.

💡 Proposed fix
         // No parentheses for 'n' format (CPython: add_parens=0)
         let magnitude_str = format!("{grouped_re}{grouped_im}");
-
-        self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "", FormatAlign::Right)
+        if let Some(FormatAlign::AfterSign) = &self.align {
+            return Err(FormatSpecError::AlignmentFlag);
+        }
+        match &self.fill.unwrap_or(' '.into()).to_char() {
+            Some('0') => Err(FormatSpecError::ZeroPadding),
+            _ => self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "", FormatAlign::Right),
+        }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` around lines 650 - 654, The locale-specific
branch builds magnitude_str and calls self.format_sign_and_align without
validating complex-number-specific flags, allowing invalid
AlignmentFlag/ZeroPadding specs to pass; add the same validation used in
format_complex (the checks around AlignmentFlag and ZeroPadding in the
format_complex path) before formatting here or refactor to call the shared
validator, then only proceed to call
self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "",
FormatAlign::Right) if the validation passes so invalid specs are rejected
consistently.
crates/vm/src/format.rs (1)

49-67: ⚠️ Potential issue | 🟠 Major

parse_grouping currently treats CHAR_MAX like repeat-last grouping.

At Lines 64-66, a trailing 0 is appended for any non-empty sequence. That makes CHAR_MAX-terminated patterns behave as “repeat last group forever” instead of “stop further grouping”.

💡 Proposed fix
 /// Parse C `lconv.grouping` into a `Vec<u8>`.
-/// Reads bytes until 0 or CHAR_MAX, then appends 0 (meaning "repeat last group").
+/// Reads bytes until 0 or CHAR_MAX.
+/// - 0 means "repeat last group"
+/// - CHAR_MAX means "stop further grouping"
 #[cfg(unix)]
 unsafe fn parse_grouping(grouping: *const libc::c_char) -> Vec<u8> {
     let mut result = Vec::new();
     if grouping.is_null() {
         return result;
     }
-    unsafe {
-        let mut ptr = grouping;
-        while ![0, libc::c_char::MAX].contains(&*ptr) {
-            result.push(*ptr as u8);
-            ptr = ptr.add(1);
-        }
-    }
-    if !result.is_empty() {
+    let mut ptr = grouping;
+    let mut repeat_last = false;
+    loop {
+        let value = unsafe { *ptr };
+        if value == 0 {
+            repeat_last = true;
+            break;
+        }
+        if value == libc::c_char::MAX {
+            break;
+        }
+        result.push(value as u8);
+        ptr = unsafe { ptr.add(1) };
+    }
+    if repeat_last && !result.is_empty() {
         result.push(0);
     }
     result
 }
In POSIX/C localeconv() semantics for lconv.grouping, what do terminators 0 and CHAR_MAX mean, and should they be handled differently when converting grouping rules?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/format.rs` around lines 49 - 67, parse_grouping currently
always appends a trailing 0 (meaning "repeat last group") even when the original
C terminator was CHAR_MAX (meaning "stop grouping"); modify parse_grouping to
inspect the terminator byte after the loop and only append 0 if the terminator
was actually 0, and do not append anything when the terminator was
libc::c_char::MAX. Locate the unsafe fn parse_grouping(grouping: *const
libc::c_char) -> Vec<u8>, keep the loop that advances ptr while *ptr is not 0 or
libc::c_char::MAX, then check the byte at ptr: if *ptr == 0 { if
!result.is_empty() { result.push(0); } } else /* CHAR_MAX */ { /* leave result
as-is */ } and return result.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@crates/common/src/format.rs`:
- Line 564: The code incorrectly wraps locale-formatted strings with AsciiStr
and uses byte-based char_len(), causing wrong width/padding when
thousands_sep/decimal_point are multibyte; update the call sites (the three
places passing AsciiStr instances) to pass plain &str (the formatted magnitude
strings and any sign_str inputs) into format_sign_and_align, and change
format_sign_and_align to accept &str instead of AsciiStr and compute widths
using Unicode character count (e.g., s.chars().count()) when calculating
padding/alignment so multibyte separators are measured correctly.
- Around line 525-536: apply_locale_formatting currently passes the entire
pre-decimal chunk to insert_locale_grouping which corrupts non-digit mantissas
like "1e+06", "inf", or "nan"; fix by isolating an optional leading sign and the
contiguous digit sequence in int_part, call insert_locale_grouping only on that
digit sequence (or on "0" when no digits), then reassemble sign + grouped_digits
+ trailing_suffix (the remaining non-digit chars) before attaching the
fractional part and decimal_point; update apply_locale_formatting to parse
int_part into (sign, digit_run, suffix) and use insert_locale_grouping on
digit_run so symbols like apply_locale_formatting, insert_locale_grouping,
magnitude_str, and LocaleInfo are the only points changed.
- Around line 650-654: The locale-specific branch builds magnitude_str and calls
self.format_sign_and_align without validating complex-number-specific flags,
allowing invalid AlignmentFlag/ZeroPadding specs to pass; add the same
validation used in format_complex (the checks around AlignmentFlag and
ZeroPadding in the format_complex path) before formatting here or refactor to
call the shared validator, then only proceed to call
self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "",
FormatAlign::Right) if the validation passes so invalid specs are rejected
consistently.

In `@crates/vm/src/format.rs`:
- Around line 49-67: parse_grouping currently always appends a trailing 0
(meaning "repeat last group") even when the original C terminator was CHAR_MAX
(meaning "stop grouping"); modify parse_grouping to inspect the terminator byte
after the loop and only append 0 if the terminator was actually 0, and do not
append anything when the terminator was libc::c_char::MAX. Locate the unsafe fn
parse_grouping(grouping: *const libc::c_char) -> Vec<u8>, keep the loop that
advances ptr while *ptr is not 0 or libc::c_char::MAX, then check the byte at
ptr: if *ptr == 0 { if !result.is_empty() { result.push(0); } } else /* CHAR_MAX
*/ { /* leave result as-is */ } and return result.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 87132815-7bf6-459b-a039-b60fd04f64a4

📥 Commits

Reviewing files that changed from the base of the PR and between c92c2d1 and c585541.

⛔ Files ignored due to path filters (2)
  • Lib/test/test_format.py is excluded by !Lib/**
  • Lib/test/test_types.py is excluded by !Lib/**
📒 Files selected for processing (5)
  • crates/common/src/format.rs
  • crates/vm/src/builtins/complex.rs
  • crates/vm/src/builtins/float.rs
  • crates/vm/src/builtins/int.rs
  • crates/vm/src/format.rs

@youknowone youknowone merged commit 3b91466 into RustPython:main Mar 5, 2026
14 checks passed
@youknowone youknowone deleted the locale-n branch March 5, 2026 16:09
youknowone added a commit to youknowone/RustPython that referenced this pull request Mar 22, 2026
…ustPython#7350)

* Implement locale-aware 'n' format specifier for int, float, complex

Add LocaleInfo struct and locale-aware formatting methods to FormatSpec.
The 'n' format type now reads thousands_sep, decimal_point, and grouping
from C localeconv() and applies proper locale-based number grouping.
Remove @unittest.skip from test_format.test_locale.

* Fix complex 'n' format and remove locale expectedFailure markers

Rewrite format_complex_locale to reuse format_complex_re_im, matching
formatter_unicode.c: add_parens=0 and skip_re=0 for 'n' type.
Remove @expectedfailure from test_float__format__locale and
test_int__format__locale in test_types.py.

* Auto-format: cargo fmt --all

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant