Implement locale-aware 'n' format specifier for int, float, complex by youknowone · Pull Request #7350 · RustPython/RustPython

youknowone · 2026-03-04T16:48:33Z

Add LocaleInfo struct and locale-aware formatting methods to FormatSpec. The 'n' format type now reads thousands_sep, decimal_point, and grouping from C localeconv() and applies proper locale-based number grouping. Remove @unittest.skip from test_format.test_locale.

Summary by CodeRabbit

New Features
- Locale-aware number formatting for integers, floats, and complex numbers, respecting thousands separator, decimal point, and grouping with sensible platform fallbacks.
Bug Fixes
- Format-spec parse errors are now raised as proper Python exceptions, improving clarity and consistency of formatting error reporting.

coderabbitai · 2026-03-04T16:48:43Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds locale-aware "n" formatting: new public LocaleInfo, locale-capable FormatSpec methods and helpers, platform-specific get_locale_info() (Unix via localeconv()), and updates builtins int/float/complex __format__ to map parse errors to Python exceptions and call locale-aware formatters when requested.

Changes

Cohort / File(s)	Summary
Core Locale Formatting `crates/common/src/format.rs`	Add public `LocaleInfo` and locale-aware `FormatSpec` APIs: `has_locale_format`, `format_int_locale`, `format_float_locale`, `format_complex_locale`; add internal helpers `insert_locale_grouping` and `apply_locale_formatting`.
Locale Info Retrieval `crates/vm/src/format.rs`	Add `get_locale_info()` (Unix: uses `libc::localeconv()` and `parse_grouping()` to build `LocaleInfo`; non-Unix: returns defaults); add `parse_grouping()` (unsafe, Unix-only).
Builtin Formatting Call Sites `crates/vm/src/builtins/int.rs`, `crates/vm/src/builtins/float.rs`, `crates/vm/src/builtins/complex.rs`	`__format__` implementations now convert `FormatSpec::parse` errors into Python exceptions and branch to locale-aware formatting: if `has_locale_format()` use `get_locale_info()` → `format__locale`, otherwise use existing non-locale `format_`. Error-to-exception mapping adjusted.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant Builtin as Builtin.__format__
  participant Spec as FormatSpec.parse / has_locale_format
  participant Locale as get_locale_info()
  participant Formatter as FormatSpec::format_*_locale

  Caller->>Builtin: __format__(value, spec)
  Builtin->>Spec: parse(spec)
  alt parse error
    Spec-->>Builtin: Err
    Builtin-->>Caller: raise PyException
  else parsed
    Spec->>Spec: has_locale_format()?
    alt locale requested
      Builtin->>Locale: get_locale_info()
      Locale-->>Builtin: LocaleInfo
      Builtin->>Formatter: format_*_locale(value, LocaleInfo)
      Formatter-->>Builtin: formatted string / Err
      Builtin-->>Caller: return or raise PyException
    else default path
      Builtin->>Spec: format_* (non-locale)
      Spec-->>Builtin: formatted string / Err
      Builtin-->>Caller: return or raise PyException
    end
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Implement PyComplex's __format__ function #5900 — earlier changes extending formatter and VM paths for complex/number formatting; touches similar format internals.
Fix number __format__ #6930 — overlapping edits to builtins __format__ implementations for int/float/complex.

Poem

🐇
I hop through digits, dot and comma bright,
I tuck in groups by moonlit locale light,
Unix whispers separators soft and sweet,
Floats, ints, complex line up neat on their feet,
I leave a carrot-shaped comma for tonight.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically summarizes the main change: implementing locale-aware 'n' format specifier support for int, float, and complex number types.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-03-04T16:49:01Z

Code has been automatically formatted

The code in this PR has been formatted using:

cargo fmt --all
Please pull the latest changes before pushing again:

git pull origin locale-n

github-actions · 2026-03-04T16:50:03Z

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[x] test: cpython/Lib/test/test_format.py (TODO: 6)

dependencies:

dependent tests: (no tests depend on format)

[x] lib: cpython/Lib/types.py
[ ] test: cpython/Lib/test/test_types.py (TODO: 8)

dependencies:

types

dependent tests: (52 tests)

types: test_annotationlib test_ast test_asyncgen test_asyncio test_builtin test_call test_code test_collections test_compile test_compiler_assemble test_coroutines test_decorators test_descr test_dis test_doctest test_dtrace test_dynamicclassattribute test_email test_enum test_exception_group test_fstring test_funcattrs test_generators test_genericalias test_hmac test_importlib test_inspect test_listcomps test_marshal test_monitoring test_opcache test_os test_positional_only_arg test_pprint test_pyclbr test_pydoc test_raise test_string test_subclassinit test_subprocess test_tempfile test_threading test_trace test_traceback test_type_aliases test_type_annotations test_type_params test_types test_typing test_unittest test_xml_etree test_xml_etree_c

Legend:

[+] path exists in CPython
[x] up-to-date, [ ] outdated

chatgpt-codex-connector · 2026-03-05T00:14:55Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/common/src/format.rs`:
- Around line 640-690: The magnitude components are being formatted with
sign/width/alignment from the original spec (via magnitude_format.format_float),
causing duplicated signs and inner padding; also the real-part suppression
currently hides a zero real part even when an explicit complex format type was
requested. Fix by building magnitude_format from self but explicitly clear sign,
width, align, fill, zero_pad and alternate (so magnitude_format has no
sign/width/alignment) before calling magnitude_format.format_float /
Self::apply_locale_formatting, and change the re suppression condition to only
skip the real part when num.re == 0.0 && !num.re.is_sign_negative() AND the
original FormatSpec did not have an explicit format type (e.g., check
self.format_type.is_none() or the equivalent flag used to detect an explicit
type) so typed complex formatting still shows the real component; use the
existing symbols magnitude_format, format_float, apply_locale_formatting, and
the re_str logic to implement these changes.
- Around line 526-536: apply_locale_formatting currently passes the entire
pre-decimal chunk (magnitude_str split by '.') to insert_locale_grouping which
corrupts exponent or non-digit suffixes like "1e+06"; update
apply_locale_formatting to isolate the contiguous digit sequence (allowing an
optional leading sign) from the integer part, call Self::insert_locale_grouping
only on that digit sequence, then rebuild the integer part by concatenating the
grouped digits with any trailing non-digit suffix (e.g., exponent markers), and
finally reattach the decimal point/frac part using locale.decimal_point; refer
to apply_locale_formatting, Self::insert_locale_grouping, magnitude_str, and
locale.decimal_point to locate where to implement this change.

In `@crates/vm/src/format.rs`:
- Around line 49-67: parse_grouping currently always appends a trailing 0
(meaning "repeat last") even when the C sequence terminated with CHAR_MAX
(meaning "stop grouping entirely"). Modify parse_grouping to distinguish the
terminator: iterate over grouping (pointer grouping) until you reach 0 or
libc::c_char::MAX, keep the bytes you read in result, then if the terminator was
a NUL (0) append a trailing 0 to indicate "repeat last", but if the terminator
was CHAR_MAX do not append anything (stop grouping). Update the doc comment to
reflect that CHAR_MAX means "stop grouping" and NUL means "repeat last". Use the
existing symbols parse_grouping, grouping, result, ptr, libc::c_char::MAX and
the checked terminator to implement this behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: b91f907e-b731-492a-a941-15d354c29a80

📥 Commits

Reviewing files that changed from the base of the PR and between 5c29074 and 8ab371e.

⛔ Files ignored due to path filters (1)

Lib/test/test_format.py is excluded by !Lib/**

📒 Files selected for processing (5)

crates/common/src/format.rs
crates/vm/src/builtins/complex.rs
crates/vm/src/builtins/float.rs
crates/vm/src/builtins/int.rs
crates/vm/src/format.rs

crates/common/src/format.rs

coderabbitai · 2026-03-05T00:24:37Z

crates/common/src/format.rs

+        let magnitude_format = FormatSpec {
+            format_type: Some(format_type),
+            precision: Some(if precision == 0 { 1 } else { precision }),
+            ..*self
+        };
+
+        let re_str = if num.re == 0.0 && !num.re.is_sign_negative() {
+            None
+        } else {
+            let re_sign = if num.re.is_sign_negative() && !num.re.is_nan() {
+                "-"
+            } else {
+                ""
+            };
+            let re_mag = magnitude_format.format_float(num.re.abs())?;
+            let re_grouped = Self::apply_locale_formatting(re_mag, locale);
+            Some(format!("{re_sign}{re_grouped}"))
+        };
+
+        let im_sign = if num.im.is_sign_negative() && !num.im.is_nan() {
+            "-"
+        } else {
+            ""
+        };
+        let im_mag = magnitude_format.format_float(num.im.abs())?;
+        let im_grouped = Self::apply_locale_formatting(im_mag, locale);
+
+        let has_re = re_str.is_some();
+        let formatted = if let Some(re_str) = re_str {
+            let sep = if num.im >= 0.0 || num.im.is_nan() {
+                "+"
+            } else {
+                ""
+            };
+            format!("({re_str}{sep}{im_sign}{im_grouped}j)")
+        } else {
+            format!("{im_sign}{im_grouped}j")
+        };
+
+        let format_sign = self.sign.unwrap_or(FormatSign::Minus);
+        let sign_str = match format_sign {
+            FormatSign::Plus => "+",
+            FormatSign::Minus => "",
+            FormatSign::MinusOrSpace => " ",
+        };
+
+        self.format_sign_and_align(
+            &AsciiStr::new(&formatted),
+            if has_re { "" } else { sign_str },
+            FormatAlign::Right,
+        )


⚠️ Potential issue | 🔴 Critical

format_complex_locale is formatting component magnitudes with sign/width logic.

Using format_float here injects per-component sign/alignment/width, then signs are added again in this function. That can yield invalid strings (-+..., ++...) and inner padding artifacts.

Also, the real-part suppression condition drops 0 real part even with an explicit type, which diverges from the existing typed complex formatting flow.

💡 Suggested fix

- let re_str = if num.re == 0.0 && !num.re.is_sign_negative() { + let re_str = if num.re == 0.0 && !num.re.is_sign_negative() && self.format_type.is_none() { None } else { let re_sign = if num.re.is_sign_negative() && !num.re.is_nan() { "-" } else { "" }; - let re_mag = magnitude_format.format_float(num.re.abs())?; + let re_mag = magnitude_format.format_complex_float(num.re.abs())?; let re_grouped = Self::apply_locale_formatting(re_mag, locale); Some(format!("{re_sign}{re_grouped}")) }; @@ - let im_mag = magnitude_format.format_float(num.im.abs())?; + let im_mag = magnitude_format.format_complex_float(num.im.abs())?;

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/common/src/format.rs` around lines 640 - 690, The magnitude components are being formatted with sign/width/alignment from the original spec (via magnitude_format.format_float), causing duplicated signs and inner padding; also the real-part suppression currently hides a zero real part even when an explicit complex format type was requested. Fix by building magnitude_format from self but explicitly clear sign, width, align, fill, zero_pad and alternate (so magnitude_format has no sign/width/alignment) before calling magnitude_format.format_float / Self::apply_locale_formatting, and change the re suppression condition to only skip the real part when num.re == 0.0 && !num.re.is_sign_negative() AND the original FormatSpec did not have an explicit format type (e.g., check self.format_type.is_none() or the equivalent flag used to detect an explicit type) so typed complex formatting still shows the real component; use the existing symbols magnitude_format, format_float, apply_locale_formatting, and the re_str logic to implement these changes.

coderabbitai · 2026-03-05T00:24:37Z

crates/vm/src/format.rs

+/// Parse C `lconv.grouping` into a `Vec<u8>`.
+/// Reads bytes until 0 or CHAR_MAX, then appends 0 (meaning "repeat last group").
+#[cfg(unix)]
+unsafe fn parse_grouping(grouping: *const libc::c_char) -> Vec<u8> {
+    let mut result = Vec::new();
+    if grouping.is_null() {
+        return result;
+    }
+    unsafe {
+        let mut ptr = grouping;
+        while ![0, libc::c_char::MAX].contains(&*ptr) {
+            result.push(*ptr as u8);
+            ptr = ptr.add(1);
+        }
+    }
+    if !result.is_empty() {
+        result.push(0);
+    }
+    result


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

C POSIX localeconv grouping lconv.grouping CHAR_MAX 0 terminator semantics

💡 Result:

In struct lconv (from localeconv()), grouping/mon_grouping is a char * that points to a sequence of byte values (often treated like an unsigned-byte array) describing how to group digits to the left of the radix point:

grouping[0] = size of the rightmost group; grouping[1] = next group to the left; etc. [1]

If an element is 0, it means repeat the previous group size for all remaining more-significant digits. Because 0 is also '\0', this is typically where the sequence ends (e.g. "\3\0" means “group by 3 forever”). [1]

If an element is CHAR_MAX, it means stop grouping: no further separators are inserted for any more-significant digits. [1]

If grouping is the empty string (""), that means no grouping information / no grouping (implementation/locale dependent, but common). [2]

Notes:

Some platforms use -1 (in a char) instead of CHAR_MAX due to char signedness; robust code often compares after casting to unsigned char. [3]

Sources: Linux locale(7) description of grouping semantics (including 0 and CHAR_MAX) [1]; NetBSD localeconv(3) notes on CHAR_MAX as “unavailable” in some fields/locales [2]; Gnulib portability note about -1 vs CHAR_MAX on some platforms [3].

[1] (man7.org) [2] (man.netbsd.org) [3] (gnu.org)

Citations:

1: https://man7.org/linux/man-pages/man7/locale.7.html?utm_source=openai

2: https://man.netbsd.org/localeconv.3?utm_source=openai

3: https://www.gnu.org/software/gnulib/manual/html_node/localeconv.html?utm_source=openai

🏁 Script executed:

# First, let's check if there are any clippy or format issues with the current implementation cd crates/vm/src && head -70 format.rs | tail -25

Repository: RustPython/RustPython

Length of output: 674

🏁 Script executed:

cd crates/vm && grep -n "parse_grouping\|get_locale_info" src/format.rs | head -20

Repository: RustPython/RustPython

Length of output: 296

🏁 Script executed:

# Verify the suggested fix more carefully by checking the actual current state cd crates/vm/src && cat -n format.rs | sed -n '49,68p'

Repository: RustPython/RustPython

Length of output: 761

parse_grouping conflates CHAR_MAX with repeat-last behavior, violating C/POSIX semantics.

When the C grouping sequence ends with CHAR_MAX, it means "stop grouping entirely". The current implementation appends 0 unconditionally, which incorrectly translates this to "repeat previous group forever". Additionally, the doc comment is misleading.

Suggested fix

unsafe fn parse_grouping(grouping: *const libc::c_char) -> Vec<u8> { let mut result = Vec::new(); if grouping.is_null() { return result; } - unsafe { - let mut ptr = grouping; - while ![0, libc::c_char::MAX].contains(&*ptr) { - result.push(*ptr as u8); - ptr = ptr.add(1); - } - } - if !result.is_empty() { + let mut ptr = grouping; + let mut repeat_last = false; + while !ptr.is_null() { + let v = *ptr; + if v == 0 { + repeat_last = true; + break; + } + if v == libc::c_char::MAX { + repeat_last = false; + break; + } + result.push(v as u8); + ptr = ptr.add(1); + } + if repeat_last && !result.is_empty() { result.push(0); } result }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/format.rs` around lines 49 - 67, parse_grouping currently always appends a trailing 0 (meaning "repeat last") even when the C sequence terminated with CHAR_MAX (meaning "stop grouping entirely"). Modify parse_grouping to distinguish the terminator: iterate over grouping (pointer grouping) until you reach 0 or libc::c_char::MAX, keep the bytes you read in result, then if the terminator was a NUL (0) append a trailing 0 to indicate "repeat last", but if the terminator was CHAR_MAX do not append anything (stop grouping). Update the doc comment to reflect that CHAR_MAX means "stop grouping" and NUL means "repeat last". Use the existing symbols parse_grouping, grouping, result, ptr, libc::c_char::MAX and the checked terminator to implement this behavior.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

crates/common/src/format.rs (1)

525-536: ⚠️ Potential issue | 🟠 Major

apply_locale_formatting still corrupts exponent notation for numbers without decimal points.

The previous review flagged this issue and a fix was suggested, but the current code doesn't include that fix. When format_general produces output like "1e+06" (no decimal point), splitn(2, '.') yields the entire string as int_part, and insert_locale_grouping will corrupt it by inserting separators into non-digit characters.

💡 Suggested fix: handle exponent notation

 fn apply_locale_formatting(magnitude_str: String, locale: &LocaleInfo) -> String {
+    // Separate exponent part before processing
+    let (mantissa, exponent) = match magnitude_str.find(|c| c == 'e' || c == 'E') {
+        Some(pos) => (&magnitude_str[..pos], Some(&magnitude_str[pos..])),
+        None => (magnitude_str.as_str(), None),
+    };
+
-    let mut parts = magnitude_str.splitn(2, '.');
+    let mut parts = mantissa.splitn(2, '.');
     let int_part = parts.next().unwrap();
     let grouped = Self::insert_locale_grouping(int_part, locale);

-    if let Some(frac_part) = parts.next() {
+    let result = if let Some(frac_part) = parts.next() {
         format!("{grouped}{}{frac_part}", locale.decimal_point)
     } else {
         grouped
+    };
+
+    match exponent {
+        Some(exp) => format!("{result}{exp}"),
+        None => result,
     }
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` around lines 525 - 536, apply_locale_formatting
currently corrupts exponent notation like "1e+06" because it applies
insert_locale_grouping to the whole int_part; fix by detecting an exponent ('e'
or 'E') in the original magnitude_str (or int_part) and split off the exponent
suffix before grouping, then only call insert_locale_grouping on the mantissa
integer portion (preserving any leading sign), finally reassemble as
"{grouped_mantissa}{optional_decimal_point}{frac_part}{exponent_suffix}" so
exponents remain untouched; update apply_locale_formatting to locate 'e'/'E',
separate exponent, preserve sign, call insert_locale_grouping on the numeric
digits, and then concatenate the exponent back.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/common/src/format.rs`:
- Line 564: The padding is computed using AsciiStr.char_len() (byte length),
which miscounts multi-byte locale separators in magnitude_str from
apply_locale_formatting(); update format_sign_and_align and its caller site so
magnitude_str is passed/treated as a Unicode &str (not AsciiStr) and compute
visual length with a Unicode-aware count (e.g., s.chars().count() or
UnicodeSegmentation::graphemes(true).count()) when calculating width/fill;
replace references to AsciiStr::new(&magnitude_str) and char_len() in
format_sign_and_align with the Unicode-aware length and adjust any
types/signatures that assumed AsciiStr accordingly.

---

Duplicate comments:
In `@crates/common/src/format.rs`:
- Around line 525-536: apply_locale_formatting currently corrupts exponent
notation like "1e+06" because it applies insert_locale_grouping to the whole
int_part; fix by detecting an exponent ('e' or 'E') in the original
magnitude_str (or int_part) and split off the exponent suffix before grouping,
then only call insert_locale_grouping on the mantissa integer portion
(preserving any leading sign), finally reassemble as
"{grouped_mantissa}{optional_decimal_point}{frac_part}{exponent_suffix}" so
exponents remain untouched; update apply_locale_formatting to locate 'e'/'E',
separate exponent, preserve sign, call insert_locale_grouping on the numeric
digits, and then concatenate the exponent back.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 38611da9-2a7f-42db-803e-f50a9b839612

📥 Commits

Reviewing files that changed from the base of the PR and between 8ab371e and be787c6.

⛔ Files ignored due to path filters (1)

Lib/test/test_types.py is excluded by !Lib/**

📒 Files selected for processing (1)

crates/common/src/format.rs

coderabbitai · 2026-03-05T11:41:51Z

crates/common/src/format.rs

+            },
+        };
+
+        self.format_sign_and_align(&AsciiStr::new(&magnitude_str), sign_str, FormatAlign::Right)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the AsciiStr implementation rg -n "impl.*AsciiStr" crates/common/src/format.rs -A 30 | head -100

Repository: RustPython/RustPython

Length of output: 1208

🏁 Script executed:

# Check the specific lines mentioned in the comment sed -n '560,570p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 362

🏁 Script executed:

# Also check lines 604 and 656 sed -n '600,610p' crates/common/src/format.rs sed -n '652,662p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 774

🏁 Script executed:

# Look for the char_len method definition rg -n "fn char_len" crates/common/src/format.rs -A 5

Repository: RustPython/RustPython

Length of output: 305

🏁 Script executed:

# Check how locale separators are inserted into the magnitude_str rg -n "thousands_sep\|decimal_point" crates/common/src/format.rs -B 2 -A 2

Repository: RustPython/RustPython

Length of output: 47

🏁 Script executed:

# Look for locale definition and separator types rg -n "struct.*Locale\|decimal_point\|thousands_sep" crates/common/src/ -A 3

Repository: RustPython/RustPython

Length of output: 47

🏁 Script executed:

# Search for Locale struct and locale-related code find crates -name "*.rs" | xargs grep -l "locale\|Locale" | head -5

Repository: RustPython/RustPython

Length of output: 204

🏁 Script executed:

# Search more broadly for where locale is used in format.rs rg -n "locale\." crates/common/src/format.rs -B 2 -A 2

Repository: RustPython/RustPython

Length of output: 1144

🏁 Script executed:

# Look for how magnitude_str is constructed rg -n "magnitude_str.*format\|let.*magnitude_str" crates/common/src/format.rs -B 2 -A 2 | head -50

Repository: RustPython/RustPython

Length of output: 47

🏁 Script executed:

# Find LocaleInfo definition rg -n "struct LocaleInfo\|^pub struct.*Locale" crates/common/src/format.rs -A 10

Repository: RustPython/RustPython

Length of output: 47

🏁 Script executed:

# Check the complete insert_locale_grouping function sed -n '482,535p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 1970

🏁 Script executed:

# Look at the format_float_locale function to see how it constructs magnitude_str sed -n '567,600p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 1303

🏁 Script executed:

# Check the locale module to understand the locale structure head -100 crates/stdlib/src/locale.rs

Repository: RustPython/RustPython

Length of output: 3356

🏁 Script executed:

# Find format_sign_and_align implementation rg -n "fn format_sign_and_align" crates/common/src/format.rs -A 20

Repository: RustPython/RustPython

Length of output: 851

🏁 Script executed:

# Check how char_len is used in format_sign_and_align sed -n '900,1000p' crates/common/src/format.rs | grep -n "char_len\|width"

Repository: RustPython/RustPython

Length of output: 184

🏁 Script executed:

# Find the complete format_sign_and_align function rg -n "fn format_sign_and_align" crates/common/src/format.rs -A 30

Repository: RustPython/RustPython

Length of output: 1253

🏁 Script executed:

# Let's look for where LocaleInfo is defined rg -n "struct LocaleInfo\|pub.*LocaleInfo" crates/common/src/ -A 5

Repository: RustPython/RustPython

Length of output: 47

🏁 Script executed:

# Check if there are other implementations of CharLen rg -n "impl CharLen" crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 101

🏁 Script executed:

# Let's look for a Unicode-aware version or other CharLen implementations rg -n "impl.*CharLen" crates/ -A 3

Repository: RustPython/RustPython

Length of output: 1007

🏁 Script executed:

# Check if locales can actually have non-ASCII separators rg -n "thousands_sep\|decimal_point" crates/stdlib/src/locale.rs -B 2 -A 2 | head -30

Repository: RustPython/RustPython

Length of output: 47

🏁 Script executed:

# Let me check the full context - look for where apply_locale_formatting is called and how its result is used sed -n '567,610p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 1569

🏁 Script executed:

# Let's verify that the magnitude_str can contain non-ASCII characters # Check what types are used for thousands_sep and decimal_point rg -n "pub struct LocaleInfo\|thousands_sep:\|decimal_point:" crates/common/src/ -A 2

Repository: RustPython/RustPython

Length of output: 47

🏁 Script executed:

# Let's trace through what happens - check if the string from apply_locale_formatting # is used directly or if there's any normalization sed -n '527,545p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 687

🏁 Script executed:

# Check how sign_str.len() is used - does it account for multi-byte characters? sed -n '985,990p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 331

🏁 Script executed:

# Find the LocaleInfo struct definition rg -n "^pub struct LocaleInfo\|^struct LocaleInfo" crates/common/src/format.rs -A 10

Repository: RustPython/RustPython

Length of output: 47

🏁 Script executed:

# Check where LocaleInfo is created/initialized rg -n "LocaleInfo {" crates/common/src/format.rs -B 2 -A 5

Repository: RustPython/RustPython

Length of output: 152

🏁 Script executed:

# Let's trace how locale.thousands_sep is set - look at the locale module more carefully sed -n '1,150p' crates/stdlib/src/locale.rs | grep -A 5 -B 5 "thousands_sep"

Repository: RustPython/RustPython

Length of output: 559

🏁 Script executed:

# Search for LocaleInfo definition differently rg "LocaleInfo" crates/common/src/format.rs | head -5

Repository: RustPython/RustPython

Length of output: 312

🏁 Script executed:

# Get context around the first LocaleInfo reference rg -n "LocaleInfo" crates/common/src/format.rs | head -1

Repository: RustPython/RustPython

Length of output: 91

🏁 Script executed:

# Look for where LocaleInfo is defined - broader search rg -n "struct LocaleInfo" crates/ -A 10

Repository: RustPython/RustPython

Length of output: 772

🏁 Script executed:

# Now let's verify: check what sign_str can be and if it has the same byte-length issue sed -n '590,605p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 611

🏁 Script executed:

# Check the similar code in format_int_locale and format_complex_locale for the same issue sed -n '620,650p' crates/common/src/format.rs

Repository: RustPython/RustPython

Length of output: 1283

🏁 Script executed:

# Let's also verify if there are other CharLen implementations that correctly handle Unicode sed -n '1520,1535p' crates/vm/src/builtins/str.rs

Repository: RustPython/RustPython

Length of output: 429

🏁 Script executed:

# Final verification: check if French locale actually uses non-breaking space python3 << 'EOF' import locale # Check what the French locale uses for thousands separator # Note: this is for documentation purposes - we know from Unicode that: # - \u{00A0} is non-breaking space (NBSP) - 2 bytes in UTF-8 # - French locale typically uses NBSP as thousands separator # - This is confirmed by CPython behavior print("UTF-8 byte length of non-breaking space: ", len("\u00A0".encode('utf-8'))) print("Character count of '1\u00A0234': ", len("1\u00A02 34")) print("Byte length of '1\u00A0234': ", len("1\u00A0234".encode('utf-8'))) EOF

Repository: RustPython/RustPython

Length of output: 166

🏁 Script executed:

# More direct verification python3 << 'EOF' nbsp_string = "1\u00A0234" # 1, NBSP, 2, 3, 4 print(f"String: {repr(nbsp_string)}") print(f"Character count: {len(nbsp_string)}") print(f"Byte count: {len(nbsp_string.encode('utf-8'))}") print(f"Difference: {len(nbsp_string.encode('utf-8')) - len(nbsp_string)} bytes extra") EOF

Repository: RustPython/RustPython

Length of output: 142

AsciiStr returns byte length instead of character count, causing incorrect width padding with non-ASCII locale separators.

In format_sign_and_align() (line 988), the width calculation uses magnitude_str.char_len() which AsciiStr implements to return self.inner.len() (byte length). However, magnitude_str from apply_locale_formatting() can contain multi-byte UTF-8 characters from locale separators—for example, French locales use non-breaking space (\u{00A0}, 2 bytes in UTF-8) as the thousands separator. This causes the fill character calculation to be off by 1+ bytes when formatting with width specs like "{:15n}".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/common/src/format.rs` at line 564, The padding is computed using AsciiStr.char_len() (byte length), which miscounts multi-byte locale separators in magnitude_str from apply_locale_formatting(); update format_sign_and_align and its caller site so magnitude_str is passed/treated as a Unicode &str (not AsciiStr) and compute visual length with a Unicode-aware count (e.g., s.chars().count() or UnicodeSegmentation::graphemes(true).count()) when calculating width/fill; replace references to AsciiStr::new(&magnitude_str) and char_len() in format_sign_and_align with the Unicode-aware length and adjust any types/signatures that assumed AsciiStr accordingly.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (3)

crates/common/src/format.rs (2)

564-564: ⚠️ Potential issue | 🟡 Minor

Locale width/alignment uses byte length instead of character count.

At Line 564, Line 604, and Line 656, locale-formatted strings are wrapped as AsciiStr. AsciiStr::char_len() (Line 1040) uses byte length, so width padding is wrong for multibyte separators (e.g., NBSP thousands separator).

💡 Proposed direction

- self.format_sign_and_align(&AsciiStr::new(&magnitude_str), sign_str, FormatAlign::Right)
+ self.format_sign_and_align(&UnicodeStr::new(&magnitude_str), sign_str, FormatAlign::Right)

// Add alongside AsciiStr
struct UnicodeStr<'a> {
    inner: &'a str,
}

impl<'a> UnicodeStr<'a> {
    const fn new(inner: &'a str) -> Self {
        Self { inner }
    }
}

impl CharLen for UnicodeStr<'_> {
    fn char_len(&self) -> usize {
        self.inner.chars().count()
    }
}

impl Deref for UnicodeStr<'_> {
    type Target = str;
    fn deref(&self) -> &Self::Target {
        self.inner
    }
}

#!/bin/bash
# Verify current locale-format callsites and CharLen behavior.
rg -n "AsciiStr::new\\(&magnitude_str\\)" crates/common/src/format.rs
sed -n '1038,1042p' crates/common/src/format.rs

Also applies to: 604-604, 656-656

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` at line 564, Locale-formatted strings are being
wrapped with AsciiStr (via AsciiStr::new(&magnitude_str)) but
AsciiStr::char_len() returns byte length, breaking width/alignment for multibyte
separators; introduce a UnicodeStr<'a> wrapper (with new(inner: &'a str)),
implement CharLen for it using self.inner.chars().count(), and Deref<Target=str>
so it behaves like &str, then replace the locale-related AsciiStr::new(...)
callsites (the calls into format_sign_and_align / format_* that pass
&AsciiStr::new(&magnitude_str)) with &UnicodeStr::new(&magnitude_str) and add
any necessary use/imports.

525-536: ⚠️ Potential issue | 🟠 Major

Group only mantissa digits before locale insertion.

At Line 527, the whole pre-. chunk is grouped. Scientific notation without a decimal point (e.g., 1e+06) gets corrupted by separator insertion. This also affects callers at Line 591 and Line 650.

💡 Proposed fix

 fn apply_locale_formatting(magnitude_str: String, locale: &LocaleInfo) -> String {
-    let mut parts = magnitude_str.splitn(2, '.');
-    let int_part = parts.next().unwrap();
-    let grouped = Self::insert_locale_grouping(int_part, locale);
-
-    if let Some(frac_part) = parts.next() {
-        format!("{grouped}{}{frac_part}", locale.decimal_point)
-    } else {
-        grouped
-    }
+    let exp_pos = magnitude_str.find(|c| c == 'e' || c == 'E');
+    let (mantissa, exponent) = exp_pos
+        .map(|i| (&magnitude_str[..i], Some(&magnitude_str[i..])))
+        .unwrap_or((magnitude_str.as_str(), None));
+
+    let (int_part, frac_part) = mantissa
+        .split_once('.')
+        .map_or((mantissa, None), |(i, f)| (i, Some(f)));
+
+    let grouped = if int_part.chars().all(|c| c.is_ascii_digit()) {
+        Self::insert_locale_grouping(int_part, locale)
+    } else {
+        int_part.to_owned()
+    };
+
+    let mut out = if let Some(frac_part) = frac_part {
+        format!("{grouped}{}{frac_part}", locale.decimal_point)
+    } else {
+        grouped
+    };
+    if let Some(exponent) = exponent {
+        out.push_str(exponent);
+    }
+    out
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` around lines 525 - 536, The current
apply_locale_formatting function applies grouping to the entire pre-decimal
chunk, corrupting scientific exponents like "1e+06"; update
apply_locale_formatting to detect and preserve any exponent suffix (look for 'e'
or 'E' plus the rest) by splitting magnitude_str into mantissa, optional
fractional part, and optional exponent, apply Self::insert_locale_grouping only
to the mantissa's integer digits, then recombine mantissa (with
locale.decimal_point and frac if present) and append the untouched exponent;
ensure magnitude_str, insert_locale_grouping, and locale.decimal_point are used
accordingly so exponents remain unmodified (this change will also fix the
callers that rely on apply_locale_formatting).

crates/vm/src/format.rs (1)

49-67: ⚠️ Potential issue | 🟠 Major

Preserve CHAR_MAX semantics in parse_grouping.

At Line 50 and Line 64, the implementation always appends 0 after parsing non-empty grouping data. If the C sequence ends with CHAR_MAX, this incorrectly changes behavior from “stop grouping” to “repeat last group forever.”

💡 Proposed fix

 /// Parse C `lconv.grouping` into a `Vec<u8>`.
-/// Reads bytes until 0 or CHAR_MAX, then appends 0 (meaning "repeat last group").
+/// Reads bytes until 0 or CHAR_MAX.
+/// - 0 means "repeat last group size"
+/// - CHAR_MAX means "stop grouping"
 #[cfg(unix)]
 unsafe fn parse_grouping(grouping: *const libc::c_char) -> Vec<u8> {
     let mut result = Vec::new();
     if grouping.is_null() {
         return result;
     }
-    unsafe {
-        let mut ptr = grouping;
-        while ![0, libc::c_char::MAX].contains(&*ptr) {
-            result.push(*ptr as u8);
-            ptr = ptr.add(1);
-        }
+    let mut ptr = grouping;
+    let mut repeat_last = false;
+    loop {
+        let v = *ptr;
+        if v == 0 {
+            repeat_last = true;
+            break;
+        }
+        if v == libc::c_char::MAX {
+            break;
+        }
+        result.push(v as u8);
+        ptr = ptr.add(1);
     }
-    if !result.is_empty() {
+    if repeat_last && !result.is_empty() {
         result.push(0);
     }
     result
 }

In POSIX/C `localeconv()`, for `lconv.grouping`, what are the semantics of terminators `0` and `CHAR_MAX`, and should `CHAR_MAX` stop further grouping rather than repeat the previous size?

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/format.rs` around lines 49 - 67, The parse_grouping function
currently always appends a 0 terminator after collecting bytes, which
incorrectly changes semantics when the original C sequence ended with CHAR_MAX
(meaning "stop grouping"); modify parse_grouping to detect which terminator
stopped the loop (0 vs CHAR_MAX) by inspecting the terminating byte at the stop
condition and only append a 0 to result if the terminator was 0; reference the
unsafe fn parse_grouping(grouping: *const libc::c_char), the loop that reads
*ptr, and libc::c_char::MAX/0 to implement this conditional behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/common/src/format.rs`:
- Around line 607-657: format_complex_locale skips the same complex-format flag
validation performed in format_complex, allowing invalid '=' alignment and
zero-padding to pass; fix by enforcing the same checks before formatting: in
format_complex_locale (inside impl for FormatSpec) replicate or call the same
validation logic used by format_complex to return
Err(FormatSpecError::AlignmentFlag) when the alignment is '='
(FormatAlign::AfterSign) and Err(FormatSpecError::ZeroPadding) when
zero-fill/zero-padding rules are violated (e.g., fill == '0' with a sign-aware
width), or better yet extract that logic into a small helper (e.g.,
validate_complex_format_flags(&self) -> Result<(), FormatSpecError>) and invoke
it at the start of format_complex_locale so the later calls to
format_complex_re_im and format_sign_and_align behave consistently.

---

Duplicate comments:
In `@crates/common/src/format.rs`:
- Line 564: Locale-formatted strings are being wrapped with AsciiStr (via
AsciiStr::new(&magnitude_str)) but AsciiStr::char_len() returns byte length,
breaking width/alignment for multibyte separators; introduce a UnicodeStr<'a>
wrapper (with new(inner: &'a str)), implement CharLen for it using
self.inner.chars().count(), and Deref<Target=str> so it behaves like &str, then
replace the locale-related AsciiStr::new(...) callsites (the calls into
format_sign_and_align / format_* that pass &AsciiStr::new(&magnitude_str)) with
&UnicodeStr::new(&magnitude_str) and add any necessary use/imports.
- Around line 525-536: The current apply_locale_formatting function applies
grouping to the entire pre-decimal chunk, corrupting scientific exponents like
"1e+06"; update apply_locale_formatting to detect and preserve any exponent
suffix (look for 'e' or 'E' plus the rest) by splitting magnitude_str into
mantissa, optional fractional part, and optional exponent, apply
Self::insert_locale_grouping only to the mantissa's integer digits, then
recombine mantissa (with locale.decimal_point and frac if present) and append
the untouched exponent; ensure magnitude_str, insert_locale_grouping, and
locale.decimal_point are used accordingly so exponents remain unmodified (this
change will also fix the callers that rely on apply_locale_formatting).

In `@crates/vm/src/format.rs`:
- Around line 49-67: The parse_grouping function currently always appends a 0
terminator after collecting bytes, which incorrectly changes semantics when the
original C sequence ended with CHAR_MAX (meaning "stop grouping"); modify
parse_grouping to detect which terminator stopped the loop (0 vs CHAR_MAX) by
inspecting the terminating byte at the stop condition and only append a 0 to
result if the terminator was 0; reference the unsafe fn parse_grouping(grouping:
*const libc::c_char), the loop that reads *ptr, and libc::c_char::MAX/0 to
implement this conditional behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 70a4d2f2-7c14-41b2-aa12-3176ce3a29d5

📥 Commits

Reviewing files that changed from the base of the PR and between be787c6 and 092bbda.

⛔ Files ignored due to path filters (2)

Lib/test/test_format.py is excluded by !Lib/**
Lib/test/test_types.py is excluded by !Lib/**

📒 Files selected for processing (5)

crates/common/src/format.rs
crates/vm/src/builtins/complex.rs
crates/vm/src/builtins/float.rs
crates/vm/src/builtins/int.rs
crates/vm/src/format.rs

coderabbitai · 2026-03-05T11:50:06Z

crates/common/src/format.rs

+    /// Format a complex number with locale-aware 'n' format.
+    pub fn format_complex_locale(
+        &self,
+        num: &Complex64,
+        locale: &LocaleInfo,
+    ) -> Result<String, FormatSpecError> {
+        // Reuse format_complex_re_im with 'g' type to get the base formatted parts,
+        // then apply locale grouping. This matches CPython's format_complex_internal:
+        // 'n' → 'g', add_parens=0, skip_re=0.
+        let locale_spec = FormatSpec {
+            format_type: Some(FormatType::GeneralFormat(Case::Lower)),
+            ..*self
+        };
+        let (formatted_re, formatted_im) = locale_spec.format_complex_re_im(num)?;
+
+        // Apply locale grouping to both parts
+        let grouped_re = if formatted_re.is_empty() {
+            formatted_re
+        } else {
+            // Split sign from magnitude, apply grouping, recombine
+            let (sign, mag) = if formatted_re.starts_with('-')
+                || formatted_re.starts_with('+')
+                || formatted_re.starts_with(' ')
+            {
+                formatted_re.split_at(1)
+            } else {
+                ("", formatted_re.as_str())
+            };
+            format!(
+                "{sign}{}",
+                Self::apply_locale_formatting(mag.to_string(), locale)
+            )
+        };
+
+        // formatted_im is like "+1234j" or "-1234j" or "1234j"
+        // Split sign, magnitude, and 'j' suffix
+        let im_str = &formatted_im;
+        let (im_sign, im_rest) = if im_str.starts_with('+') || im_str.starts_with('-') {
+            im_str.split_at(1)
+        } else {
+            ("", im_str.as_str())
+        };
+        let im_mag = im_rest.strip_suffix('j').unwrap_or(im_rest);
+        let im_grouped = Self::apply_locale_formatting(im_mag.to_string(), locale);
+        let grouped_im = format!("{im_sign}{im_grouped}j");
+
+        // No parentheses for 'n' format (CPython: add_parens=0)
+        let magnitude_str = format!("{grouped_re}{grouped_im}");
+
+        self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "", FormatAlign::Right)
+    }


⚠️ Potential issue | 🟠 Major

format_complex_locale bypasses complex-format flag validation.

At Line 656, the locale path returns formatted output without enforcing the same checks as format_complex (Line 864-Line 870). = alignment and zero-padding should raise AlignmentFlag and ZeroPadding consistently.

💡 Proposed fix

// No parentheses for 'n' format (CPython: add_parens=0) let magnitude_str = format!("{grouped_re}{grouped_im}"); - - self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "", FormatAlign::Right) + if let Some(FormatAlign::AfterSign) = &self.align { + return Err(FormatSpecError::AlignmentFlag); + } + match &self.fill.unwrap_or(' '.into()).to_char() { + Some('0') => Err(FormatSpecError::ZeroPadding), + _ => self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "", FormatAlign::Right), + }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/common/src/format.rs` around lines 607 - 657, format_complex_locale skips the same complex-format flag validation performed in format_complex, allowing invalid '=' alignment and zero-padding to pass; fix by enforcing the same checks before formatting: in format_complex_locale (inside impl for FormatSpec) replicate or call the same validation logic used by format_complex to return Err(FormatSpecError::AlignmentFlag) when the alignment is '=' (FormatAlign::AfterSign) and Err(FormatSpecError::ZeroPadding) when zero-fill/zero-padding rules are violated (e.g., fill == '0' with a sign-aware width), or better yet extract that logic into a small helper (e.g., validate_complex_format_flags(&self) -> Result<(), FormatSpecError>) and invoke it at the start of format_complex_locale so the later calls to format_complex_re_im and format_sign_and_align behave consistently.

coderabbitai

♻️ Duplicate comments (4)

crates/common/src/format.rs (3)

525-536: ⚠️ Potential issue | 🟠 Major

Don’t run grouping across exponent text in apply_locale_formatting.

At Line 527-Line 533, grouping is applied to everything before .; exponent forms like 1e+06 can be altered incorrectly.

Suggested fix

 fn apply_locale_formatting(magnitude_str: String, locale: &LocaleInfo) -> String {
-    let mut parts = magnitude_str.splitn(2, '.');
-    let int_part = parts.next().unwrap();
-    let grouped = Self::insert_locale_grouping(int_part, locale);
-
-    if let Some(frac_part) = parts.next() {
-        format!("{grouped}{}{frac_part}", locale.decimal_point)
-    } else {
-        grouped
-    }
+    let exp_pos = magnitude_str.find(|c| c == 'e' || c == 'E');
+    let (mantissa, exponent) = if let Some(pos) = exp_pos {
+        (&magnitude_str[..pos], Some(&magnitude_str[pos..]))
+    } else {
+        (magnitude_str.as_str(), None)
+    };
+    let (int_part, frac_part) = mantissa
+        .split_once('.')
+        .map_or((mantissa, None), |(i, f)| (i, Some(f)));
+
+    let grouped = if int_part.chars().all(|c| c.is_ascii_digit()) {
+        Self::insert_locale_grouping(int_part, locale)
+    } else {
+        int_part.to_owned()
+    };
+
+    let mut out = if let Some(frac_part) = frac_part {
+        format!("{grouped}{}{frac_part}", locale.decimal_point)
+    } else {
+        grouped
+    };
+    if let Some(exponent) = exponent {
+        out.push_str(exponent);
+    }
+    out
 }

For Python format specifier 'n', should locale grouping/decimal substitution affect exponent parts (e.g., '1e+06') or only the mantissa?

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` around lines 525 - 536, The current
apply_locale_formatting applies grouping to the whole pre-decimal text and can
corrupt exponent text like "1e+06"; update apply_locale_formatting to first
split magnitude_str into mantissa and exponent by searching for 'e' or 'E'
(preserve the original exponent sign and digits), then perform the existing
splitn('.') and call insert_locale_grouping only on the mantissa's integer part
and replace the decimal point only in the mantissa, finally reattach the
exponent unchanged; reference apply_locale_formatting and insert_locale_grouping
so the mantissa/exponent split and reassembly are done before grouping/decimal
replacement.

607-657: ⚠️ Potential issue | 🟠 Major

Locale complex path should enforce the same invalid-flag checks as format_complex.

format_complex_locale currently skips the =-alignment and zero-padding validation present in format_complex, so invalid specs can slip through.

Suggested fix

 pub fn format_complex_locale(
     &self,
     num: &Complex64,
     locale: &LocaleInfo,
 ) -> Result<String, FormatSpecError> {
+    if let Some(FormatAlign::AfterSign) = &self.align {
+        return Err(FormatSpecError::AlignmentFlag);
+    }
+    if matches!(self.fill.unwrap_or(' '.into()).to_char(), Some('0')) {
+        return Err(FormatSpecError::ZeroPadding);
+    }
+
     // Reuse format_complex_re_im with 'g' type to get the base formatted parts,
     // then apply locale grouping. This matches CPython's format_complex_internal:
     // 'n' → 'g', add_parens=0, skip_re=0.

#!/bin/bash
# Compare validation behavior in locale/non-locale complex formatting
rg -n "fn format_complex_locale|fn format_complex|AlignmentFlag|ZeroPadding" crates/common/src/format.rs -A20 -B10

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` around lines 607 - 657, format_complex_locale
skips the same validation performed by format_complex for illegal '=' alignment
and zero-padding flags, allowing invalid format specs; update
format_complex_locale to run the identical validation path as format_complex
(either call the same validation helper or inline the same checks) before
proceeding—specifically validate AlignmentFlag '=' and zero-padding combinations
on the FormatSpec used (see FormatSpec, format_complex_re_im, format_complex,
format_sign_and_align) and return the same FormatSpecError when flags are
invalid so locale handling cannot bypass the checks.

564-565: ⚠️ Potential issue | 🟡 Minor

Locale width alignment is byte-counted, so multibyte separators can mispad.

These locale paths pass AsciiStr, whose char_len() is byte length; locale separators can be multibyte UTF-8, so width/fill becomes inaccurate.

In Rust, does str::len() return byte length or Unicode scalar count, and what is the recommended way to compute character count for width-sensitive formatting?

Also applies to: 604-605, 656-657

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` around lines 564 - 565, The code calls
format_sign_and_align(&AsciiStr::new(&magnitude_str), sign_str,
FormatAlign::Right) and relies on AsciiStr/char_len() (which yields byte length)
for width calculations, causing mispadding when locale separators are multibyte
UTF‑8; change the logic to use a Unicode-aware display width (e.g., use
unicode_width::UnicodeWidthStr::width on &str or compute grapheme clusters if
needed) instead of byte length, update the call sites that construct AsciiStr
(including other occurrences around format_sign_and_align) to pass a &str or a
precomputed display width, and modify format_sign_and_align to accept and use
the unicode-width result (or accept a width parameter) so padding/fill is based
on displayed character width rather than bytes.

crates/vm/src/format.rs (1)

49-68: ⚠️ Potential issue | 🟠 Major

parse_grouping currently conflates 0 and CHAR_MAX terminators.

At Line 64-Line 66, 0 is appended for any non-empty sequence, which incorrectly turns CHAR_MAX (“stop grouping”) into “repeat previous group”.

Suggested fix

 unsafe fn parse_grouping(grouping: *const libc::c_char) -> Vec<u8> {
     let mut result = Vec::new();
     if grouping.is_null() {
         return result;
     }
-    unsafe {
-        let mut ptr = grouping;
-        while ![0, libc::c_char::MAX].contains(&*ptr) {
-            result.push(*ptr as u8);
-            ptr = ptr.add(1);
-        }
-    }
-    if !result.is_empty() {
+    let mut ptr = grouping;
+    let mut repeat_last = false;
+    loop {
+        let v = unsafe { *ptr };
+        if v == 0 {
+            repeat_last = true;
+            break;
+        }
+        if v == libc::c_char::MAX {
+            break;
+        }
+        result.push(v as u8);
+        ptr = unsafe { ptr.add(1) };
+    }
+    if repeat_last && !result.is_empty() {
         result.push(0);
     }
     result
 }

In POSIX/C localeconv().grouping semantics, what is the difference between a 0 byte and CHAR_MAX terminator, and should they be handled differently when parsing grouping rules?

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/format.rs` around lines 49 - 68, The parse_grouping function
conflates CHAR_MAX and NUL terminators by always appending a 0 byte for
non-empty results; change it so the code inspects the terminator byte at the
stopping pointer and only push a trailing 0 (the "repeat last group" marker)
when the terminator was NUL (0), not when it was libc::c_char::MAX; update
unsafe fn parse_grouping to stop on either terminator as before but after the
loop check *ptr (or the last read byte) and conditionally push 0 only if that
terminator == 0, leaving CHAR_MAX-terminated sequences without the appended 0.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@crates/common/src/format.rs`:
- Around line 525-536: The current apply_locale_formatting applies grouping to
the whole pre-decimal text and can corrupt exponent text like "1e+06"; update
apply_locale_formatting to first split magnitude_str into mantissa and exponent
by searching for 'e' or 'E' (preserve the original exponent sign and digits),
then perform the existing splitn('.') and call insert_locale_grouping only on
the mantissa's integer part and replace the decimal point only in the mantissa,
finally reattach the exponent unchanged; reference apply_locale_formatting and
insert_locale_grouping so the mantissa/exponent split and reassembly are done
before grouping/decimal replacement.
- Around line 607-657: format_complex_locale skips the same validation performed
by format_complex for illegal '=' alignment and zero-padding flags, allowing
invalid format specs; update format_complex_locale to run the identical
validation path as format_complex (either call the same validation helper or
inline the same checks) before proceeding—specifically validate AlignmentFlag
'=' and zero-padding combinations on the FormatSpec used (see FormatSpec,
format_complex_re_im, format_complex, format_sign_and_align) and return the same
FormatSpecError when flags are invalid so locale handling cannot bypass the
checks.
- Around line 564-565: The code calls
format_sign_and_align(&AsciiStr::new(&magnitude_str), sign_str,
FormatAlign::Right) and relies on AsciiStr/char_len() (which yields byte length)
for width calculations, causing mispadding when locale separators are multibyte
UTF‑8; change the logic to use a Unicode-aware display width (e.g., use
unicode_width::UnicodeWidthStr::width on &str or compute grapheme clusters if
needed) instead of byte length, update the call sites that construct AsciiStr
(including other occurrences around format_sign_and_align) to pass a &str or a
precomputed display width, and modify format_sign_and_align to accept and use
the unicode-width result (or accept a width parameter) so padding/fill is based
on displayed character width rather than bytes.

In `@crates/vm/src/format.rs`:
- Around line 49-68: The parse_grouping function conflates CHAR_MAX and NUL
terminators by always appending a 0 byte for non-empty results; change it so the
code inspects the terminator byte at the stopping pointer and only push a
trailing 0 (the "repeat last group" marker) when the terminator was NUL (0), not
when it was libc::c_char::MAX; update unsafe fn parse_grouping to stop on either
terminator as before but after the loop check *ptr (or the last read byte) and
conditionally push 0 only if that terminator == 0, leaving CHAR_MAX-terminated
sequences without the appended 0.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 025562ee-f1c0-478c-8018-769534ac57c0

📥 Commits

Reviewing files that changed from the base of the PR and between 092bbda and c92c2d1.

⛔ Files ignored due to path filters (2)

Lib/test/test_format.py is excluded by !Lib/**
Lib/test/test_types.py is excluded by !Lib/**

📒 Files selected for processing (5)

crates/common/src/format.rs
crates/vm/src/builtins/complex.rs
crates/vm/src/builtins/float.rs
crates/vm/src/builtins/int.rs
crates/vm/src/format.rs

🚧 Files skipped from review as they are similar to previous changes (2)

crates/vm/src/builtins/int.rs
crates/vm/src/builtins/complex.rs

Add LocaleInfo struct and locale-aware formatting methods to FormatSpec. The 'n' format type now reads thousands_sep, decimal_point, and grouping from C localeconv() and applies proper locale-based number grouping. Remove @unittest.skip from test_format.test_locale.

@expectedfailure

Rewrite format_complex_locale to reuse format_complex_re_im, matching formatter_unicode.c: add_parens=0 and skip_re=0 for 'n' type. Remove @expectedfailure from test_float__format__locale and test_int__format__locale in test_types.py.

coderabbitai

♻️ Duplicate comments (4)

crates/common/src/format.rs (3)

564-564: ⚠️ Potential issue | 🟡 Minor

Width/padding can be wrong for multibyte locale separators.

Lines 564, 604, and 653 pass locale-formatted strings through AsciiStr; char_len() is byte-based, so width calculation drifts when thousands_sep / decimal_point are multibyte UTF-8.

💡 Proposed fix

- self.format_sign_and_align(&AsciiStr::new(&magnitude_str), sign_str, FormatAlign::Right)
+ self.format_sign_and_align(magnitude_str.as_str(), sign_str, FormatAlign::Right)

And update format_sign_and_align to compute Unicode character count for &str input:

- fn format_sign_and_align<T>(&self, magnitude_str: &T, sign_str: &str, default_align: FormatAlign) -> Result<String, FormatSpecError>
- where
-     T: CharLen + Deref<Target = str>,
+ fn format_sign_and_align(
+     &self,
+     magnitude_str: &str,
+     sign_str: &str,
+     default_align: FormatAlign,
+ ) -> Result<String, FormatSpecError>
 {
     let align = self.align.unwrap_or(default_align);
-    let num_chars = magnitude_str.char_len();
+    let num_chars = magnitude_str.chars().count();

Also applies to: 604-604, 653-653

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` at line 564, The code incorrectly wraps
locale-formatted strings with AsciiStr and uses byte-based char_len(), causing
wrong width/padding when thousands_sep/decimal_point are multibyte; update the
call sites (the three places passing AsciiStr instances) to pass plain &str (the
formatted magnitude strings and any sign_str inputs) into format_sign_and_align,
and change format_sign_and_align to accept &str instead of AsciiStr and compute
widths using Unicode character count (e.g., s.chars().count()) when calculating
padding/alignment so multibyte separators are measured correctly.

525-536: ⚠️ Potential issue | 🟠 Major

apply_locale_formatting can group non-digit mantissas (e.g., scientific notation).

Line 529 groups the full pre-. chunk. Values like 1e+06, inf, or nan can be transformed incorrectly (for example, 1e,+06).

💡 Proposed fix

 fn apply_locale_formatting(magnitude_str: String, locale: &LocaleInfo) -> String {
-    let mut parts = magnitude_str.splitn(2, '.');
-    let int_part = parts.next().unwrap();
-    let grouped = Self::insert_locale_grouping(int_part, locale);
-
-    if let Some(frac_part) = parts.next() {
-        format!("{grouped}{}{frac_part}", locale.decimal_point)
-    } else {
-        grouped
-    }
+    let exp_pos = magnitude_str.find(|c| c == 'e' || c == 'E');
+    let (mantissa, exponent) = if let Some(pos) = exp_pos {
+        (&magnitude_str[..pos], Some(&magnitude_str[pos..]))
+    } else {
+        (magnitude_str.as_str(), None)
+    };
+    let (int_part, frac_part) = mantissa
+        .split_once('.')
+        .map_or((mantissa, None), |(i, f)| (i, Some(f)));
+
+    let grouped = if int_part.chars().all(|c| c.is_ascii_digit()) {
+        Self::insert_locale_grouping(int_part, locale)
+    } else {
+        int_part.to_owned()
+    };
+
+    let mut out = if let Some(frac_part) = frac_part {
+        format!("{grouped}{}{frac_part}", locale.decimal_point)
+    } else {
+        grouped
+    };
+    if let Some(exponent) = exponent {
+        out.push_str(exponent);
+    }
+    out
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` around lines 525 - 536, apply_locale_formatting
currently passes the entire pre-decimal chunk to insert_locale_grouping which
corrupts non-digit mantissas like "1e+06", "inf", or "nan"; fix by isolating an
optional leading sign and the contiguous digit sequence in int_part, call
insert_locale_grouping only on that digit sequence (or on "0" when no digits),
then reassemble sign + grouped_digits + trailing_suffix (the remaining non-digit
chars) before attaching the fractional part and decimal_point; update
apply_locale_formatting to parse int_part into (sign, digit_run, suffix) and use
insert_locale_grouping on digit_run so symbols like apply_locale_formatting,
insert_locale_grouping, magnitude_str, and LocaleInfo are the only points
changed.

650-654: ⚠️ Potential issue | 🟠 Major

Locale complex formatting skips complex-specific flag validation.

Lines 653-654 directly format output, but this path does not enforce the same AlignmentFlag / ZeroPadding checks used in format_complex (Lines 861-867). Invalid specs can pass silently here.

💡 Proposed fix

         // No parentheses for 'n' format (CPython: add_parens=0)
         let magnitude_str = format!("{grouped_re}{grouped_im}");
-
-        self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "", FormatAlign::Right)
+        if let Some(FormatAlign::AfterSign) = &self.align {
+            return Err(FormatSpecError::AlignmentFlag);
+        }
+        match &self.fill.unwrap_or(' '.into()).to_char() {
+            Some('0') => Err(FormatSpecError::ZeroPadding),
+            _ => self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "", FormatAlign::Right),
+        }
     }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/format.rs` around lines 650 - 654, The locale-specific
branch builds magnitude_str and calls self.format_sign_and_align without
validating complex-number-specific flags, allowing invalid
AlignmentFlag/ZeroPadding specs to pass; add the same validation used in
format_complex (the checks around AlignmentFlag and ZeroPadding in the
format_complex path) before formatting here or refactor to call the shared
validator, then only proceed to call
self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "",
FormatAlign::Right) if the validation passes so invalid specs are rejected
consistently.

crates/vm/src/format.rs (1)

49-67: ⚠️ Potential issue | 🟠 Major

parse_grouping currently treats CHAR_MAX like repeat-last grouping.

At Lines 64-66, a trailing 0 is appended for any non-empty sequence. That makes CHAR_MAX-terminated patterns behave as “repeat last group forever” instead of “stop further grouping”.

💡 Proposed fix

 /// Parse C `lconv.grouping` into a `Vec<u8>`.
-/// Reads bytes until 0 or CHAR_MAX, then appends 0 (meaning "repeat last group").
+/// Reads bytes until 0 or CHAR_MAX.
+/// - 0 means "repeat last group"
+/// - CHAR_MAX means "stop further grouping"
 #[cfg(unix)]
 unsafe fn parse_grouping(grouping: *const libc::c_char) -> Vec<u8> {
     let mut result = Vec::new();
     if grouping.is_null() {
         return result;
     }
-    unsafe {
-        let mut ptr = grouping;
-        while ![0, libc::c_char::MAX].contains(&*ptr) {
-            result.push(*ptr as u8);
-            ptr = ptr.add(1);
-        }
-    }
-    if !result.is_empty() {
+    let mut ptr = grouping;
+    let mut repeat_last = false;
+    loop {
+        let value = unsafe { *ptr };
+        if value == 0 {
+            repeat_last = true;
+            break;
+        }
+        if value == libc::c_char::MAX {
+            break;
+        }
+        result.push(value as u8);
+        ptr = unsafe { ptr.add(1) };
+    }
+    if repeat_last && !result.is_empty() {
         result.push(0);
     }
     result
 }

In POSIX/C localeconv() semantics for lconv.grouping, what do terminators 0 and CHAR_MAX mean, and should they be handled differently when converting grouping rules?

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/format.rs` around lines 49 - 67, parse_grouping currently
always appends a trailing 0 (meaning "repeat last group") even when the original
C terminator was CHAR_MAX (meaning "stop grouping"); modify parse_grouping to
inspect the terminator byte after the loop and only append 0 if the terminator
was actually 0, and do not append anything when the terminator was
libc::c_char::MAX. Locate the unsafe fn parse_grouping(grouping: *const
libc::c_char) -> Vec<u8>, keep the loop that advances ptr while *ptr is not 0 or
libc::c_char::MAX, then check the byte at ptr: if *ptr == 0 { if
!result.is_empty() { result.push(0); } } else /* CHAR_MAX */ { /* leave result
as-is */ } and return result.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@crates/common/src/format.rs`:
- Line 564: The code incorrectly wraps locale-formatted strings with AsciiStr
and uses byte-based char_len(), causing wrong width/padding when
thousands_sep/decimal_point are multibyte; update the call sites (the three
places passing AsciiStr instances) to pass plain &str (the formatted magnitude
strings and any sign_str inputs) into format_sign_and_align, and change
format_sign_and_align to accept &str instead of AsciiStr and compute widths
using Unicode character count (e.g., s.chars().count()) when calculating
padding/alignment so multibyte separators are measured correctly.
- Around line 525-536: apply_locale_formatting currently passes the entire
pre-decimal chunk to insert_locale_grouping which corrupts non-digit mantissas
like "1e+06", "inf", or "nan"; fix by isolating an optional leading sign and the
contiguous digit sequence in int_part, call insert_locale_grouping only on that
digit sequence (or on "0" when no digits), then reassemble sign + grouped_digits
+ trailing_suffix (the remaining non-digit chars) before attaching the
fractional part and decimal_point; update apply_locale_formatting to parse
int_part into (sign, digit_run, suffix) and use insert_locale_grouping on
digit_run so symbols like apply_locale_formatting, insert_locale_grouping,
magnitude_str, and LocaleInfo are the only points changed.
- Around line 650-654: The locale-specific branch builds magnitude_str and calls
self.format_sign_and_align without validating complex-number-specific flags,
allowing invalid AlignmentFlag/ZeroPadding specs to pass; add the same
validation used in format_complex (the checks around AlignmentFlag and
ZeroPadding in the format_complex path) before formatting here or refactor to
call the shared validator, then only proceed to call
self.format_sign_and_align(&AsciiStr::new(&magnitude_str), "",
FormatAlign::Right) if the validation passes so invalid specs are rejected
consistently.

In `@crates/vm/src/format.rs`:
- Around line 49-67: parse_grouping currently always appends a trailing 0
(meaning "repeat last group") even when the original C terminator was CHAR_MAX
(meaning "stop grouping"); modify parse_grouping to inspect the terminator byte
after the loop and only append 0 if the terminator was actually 0, and do not
append anything when the terminator was libc::c_char::MAX. Locate the unsafe fn
parse_grouping(grouping: *const libc::c_char) -> Vec<u8>, keep the loop that
advances ptr while *ptr is not 0 or libc::c_char::MAX, then check the byte at
ptr: if *ptr == 0 { if !result.is_empty() { result.push(0); } } else /* CHAR_MAX
*/ { /* leave result as-is */ } and return result.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 87132815-7bf6-459b-a039-b60fd04f64a4

📥 Commits

Reviewing files that changed from the base of the PR and between c92c2d1 and c585541.

⛔ Files ignored due to path filters (2)

Lib/test/test_format.py is excluded by !Lib/**
Lib/test/test_types.py is excluded by !Lib/**

📒 Files selected for processing (5)

crates/common/src/format.rs
crates/vm/src/builtins/complex.rs
crates/vm/src/builtins/float.rs
crates/vm/src/builtins/int.rs
crates/vm/src/format.rs

@expectedfailure

…ustPython#7350) * Implement locale-aware 'n' format specifier for int, float, complex Add LocaleInfo struct and locale-aware formatting methods to FormatSpec. The 'n' format type now reads thousands_sep, decimal_point, and grouping from C localeconv() and applies proper locale-based number grouping. Remove @unittest.skip from test_format.test_locale. * Fix complex 'n' format and remove locale expectedFailure markers Rewrite format_complex_locale to reuse format_complex_re_im, matching formatter_unicode.c: add_parens=0 and skip_re=0 for 'n' type. Remove @expectedfailure from test_float__format__locale and test_int__format__locale in test_types.py. * Auto-format: cargo fmt --all --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

youknowone marked this pull request as ready for review March 5, 2026 00:14

coderabbitai bot reviewed Mar 5, 2026

View reviewed changes

youknowone force-pushed the locale-n branch from be787c6 to d469948 Compare March 5, 2026 11:32

coderabbitai bot reviewed Mar 5, 2026

View reviewed changes

youknowone force-pushed the locale-n branch from 092bbda to 7269662 Compare March 5, 2026 12:25

coderabbitai bot reviewed Mar 5, 2026

View reviewed changes

youknowone added 2 commits March 6, 2026 00:05

youknowone force-pushed the locale-n branch from c92c2d1 to c585541 Compare March 5, 2026 15:05

Auto-format: cargo fmt --all

a2d3d59

coderabbitai bot reviewed Mar 5, 2026

View reviewed changes

youknowone merged commit 3b91466 into RustPython:main Mar 5, 2026
14 checks passed

youknowone deleted the locale-n branch March 5, 2026 16:09

youknowone mentioned this pull request Mar 6, 2026

baseline to start locale support for formatting :n #4609

Closed

Conversation

youknowone commented Mar 4, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

github-actions bot commented Mar 4, 2026

Uh oh!

github-actions bot commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📦 Library Dependencies

Uh oh!

chatgpt-codex-connector bot commented Mar 5, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

youknowone commented Mar 4, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 4, 2026 •

edited

Loading

github-actions bot commented Mar 4, 2026 •

edited

Loading