Format numeric constants by lkh42t · Pull Request #5972 · astral-sh/ruff

lkh42t · 2023-07-22T09:44:27Z

Summary

Format int, float and complex constants.

Test Plan

Existing snapshots.

github-actions · 2023-07-22T10:12:53Z

PR Check Results

Benchmark

Linux

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.04      9.7±0.14ms     4.2 MB/sec    1.00      9.3±0.10ms     4.4 MB/sec
formatter/numpy/ctypeslib.py               1.04  1927.2±30.10µs     8.6 MB/sec    1.00   1854.3±4.19µs     9.0 MB/sec
formatter/numpy/globals.py                 1.05    213.5±0.56µs    13.8 MB/sec    1.00    203.0±0.59µs    14.5 MB/sec
formatter/pydantic/types.py                1.03      4.1±0.01ms     6.2 MB/sec    1.00      4.0±0.02ms     6.4 MB/sec
linter/all-rules/large/dataset.py          1.00     13.2±0.23ms     3.1 MB/sec    1.00     13.2±0.18ms     3.1 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      3.3±0.04ms     5.0 MB/sec    1.00      3.3±0.04ms     5.0 MB/sec
linter/all-rules/numpy/globals.py          1.00    430.9±1.10µs     6.8 MB/sec    1.00    429.2±0.55µs     6.9 MB/sec
linter/all-rules/pydantic/types.py         1.01      5.9±0.10ms     4.3 MB/sec    1.00      5.9±0.07ms     4.3 MB/sec
linter/default-rules/large/dataset.py      1.00      6.6±0.05ms     6.2 MB/sec    1.00      6.6±0.06ms     6.2 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.01   1411.3±2.00µs    11.8 MB/sec    1.00   1399.0±3.99µs    11.9 MB/sec
linter/default-rules/numpy/globals.py      1.02    157.9±1.19µs    18.7 MB/sec    1.00    155.0±0.87µs    19.0 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.0±0.01ms     8.6 MB/sec    1.01      3.0±0.07ms     8.5 MB/sec

Windows

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.07     14.4±0.83ms     2.8 MB/sec    1.00     13.4±0.67ms     3.0 MB/sec
formatter/numpy/ctypeslib.py               1.03      2.8±0.20ms     5.9 MB/sec    1.00      2.7±0.21ms     6.1 MB/sec
formatter/numpy/globals.py                 1.07   316.7±29.17µs     9.3 MB/sec    1.00   295.9±26.09µs    10.0 MB/sec
formatter/pydantic/types.py                1.02      5.8±0.27ms     4.4 MB/sec    1.00      5.7±0.29ms     4.5 MB/sec
linter/all-rules/large/dataset.py          1.01     19.3±0.90ms     2.1 MB/sec    1.00     19.1±1.17ms     2.1 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      4.7±0.30ms     3.6 MB/sec    1.09      5.1±0.24ms     3.3 MB/sec
linter/all-rules/numpy/globals.py          1.00   593.9±46.89µs     5.0 MB/sec    1.03   610.7±32.02µs     4.8 MB/sec
linter/all-rules/pydantic/types.py         1.00      9.0±0.55ms     2.8 MB/sec    1.03      9.2±0.54ms     2.8 MB/sec
linter/default-rules/large/dataset.py      1.03     10.2±0.54ms     4.0 MB/sec    1.00      9.9±0.49ms     4.1 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.03      2.1±0.10ms     7.9 MB/sec    1.00      2.0±0.15ms     8.2 MB/sec
linter/default-rules/numpy/globals.py      1.07   246.7±18.02µs    12.0 MB/sec    1.00   230.5±14.44µs    12.8 MB/sec
linter/default-rules/pydantic/types.py     1.02      4.5±0.20ms     5.7 MB/sec    1.00      4.4±0.20ms     5.9 MB/sec

MichaReiser

Oh wow nice! Excellent work.

What I understand from the code is that the formatted now uses dynamic_text for every number constant even if it is already correctly formatted. This is, unfortunately, somewhat expensive because every dynamic_text allocates a String when writing the text.

Would you be interested in taking a stab at implementing an optimisation similar to what we did in normalize_string

ruff/crates/ruff_python_formatter/src/expression/string.rs

Lines 413 to 481 in 33657d3

    
           fn normalize_string(input: &str, quotes: StringQuotes) -> (Cow<str>, ContainsNewlines) { 
        
               // The normalized string if `input` is not yet normalized. 
        
               // `output` must remain empty if `input` is already normalized. 
        
               let mut output = String::new(); 
        
               // Tracks the last index of `input` that has been written to `output`. 
        
               // If `last_index` is `0` at the end, then the input is already normalized and can be returned as is. 
        
               let mut last_index = 0; 
        
               let mut newlines = ContainsNewlines::No; 
        
               let style = quotes.style; 
        
               let preferred_quote = style.as_char(); 
        
               let opposite_quote = style.invert().as_char(); 
        
               let mut chars = input.char_indices(); 
        
               while let Some((index, c)) = chars.next() { 
        
                   if c == '\r' { 
        
                       output.push_str(&input[last_index..index]); 
        
                       // Skip over the '\r' character, keep the `\n` 
        
                       if input.as_bytes().get(index + 1).copied() == Some(b'\n') { 
        
                           chars.next(); 
        
                       } 
        
                       // Replace the `\r` with a `\n` 
        
                       else { 
        
                           output.push('\n'); 
        
                       } 
        
                       last_index = index + '\r'.len_utf8(); 
        
                       newlines = ContainsNewlines::Yes; 
        
                   } else if c == '\n' { 
        
                       newlines = ContainsNewlines::Yes; 
        
                   } else if !quotes.triple { 
        
                       if c == '\\' { 
        
                           if let Some(next) = input.as_bytes().get(index + 1).copied().map(char::from) { 
        
                               #[allow(clippy::if_same_then_else)] 
        
                               if next == opposite_quote { 
        
                                   // Remove the escape by ending before the backslash and starting again with the quote 
        
                                   chars.next(); 
        
                                   output.push_str(&input[last_index..index]); 
        
                                   last_index = index + '\\'.len_utf8(); 
        
                               } else if next == preferred_quote { 
        
                                   // Quote is already escaped, skip over it. 
        
                                   chars.next(); 
        
                               } else if next == '\\' { 
        
                                   // Skip over escaped backslashes 
        
                                   chars.next(); 
        
                               } 
        
                           } 
        
                       } else if c == preferred_quote { 
        
                           // Escape the quote 
        
                           output.push_str(&input[last_index..index]); 
        
                           output.push('\\'); 
        
                           output.push(c); 
        
                           last_index = index + preferred_quote.len_utf8(); 
        
                       } 
        
                   } 
        
               } 
        
               let normalized = if last_index == 0 { 
        
                   Cow::Borrowed(input) 
        
               } else { 
        
                   output.push_str(&input[last_index..]); 
        
                   Cow::Owned(output) 
        
               }; 
        
               (normalized, newlines) 
        
           }

The trick is to use a Cow and return Cow::Borrowed if the number is already correctly formatted (in which case we can print the code from the source using source_text_slice (extremely fast). The implementation returns a Cow::Owned if it reformatted the number and it then uses dynamic_text.

The benefit of this optimisation is that we only pay the cost of allocating when formatting the file for the first time. Checking the formatting in subsequent passes (when there are no changes) takes the fast path instead.

Let me know if that's something you're interested in and if so, if you want to tackle this as part of this PR or want to do a follow up instead.

MichaReiser · 2023-07-22T10:32:13Z

crates/ruff_python_formatter/src/expression/number.rs

+
+        if content.starts_with("0x") || content.starts_with("0X") {
+            let hex = content.get(2..).unwrap().to_ascii_uppercase();
+            write!(f, [text("0x"), dynamic_text(&hex, None)])


We should pass range.start() as the second argument to all dynamic_text usages. The position is used in the generated source map and we'll use the source map for range formatting.

lkh42t · 2023-07-22T11:01:51Z

I'll implement optimization like normalize_string. Thanks!

MichaReiser

Alright. I'll assign this back to you. Feel free to ping me if you have any questions and request another review once your done.

…r is already normalized

crates/ruff_python_formatter/src/expression/number.rs

konstin · 2023-07-23T10:00:04Z

...ests/snapshots/black_compatibility@simple_cases__attribute_access_on_number_literals.py.snap

-+x = (0B1011).conjugate()
-+x = (0O777).real
+x = 123456789.123456789j.__add__((0b1011).bit_length())
+x = (0xB1ACC).conjugate()


now that we have no more trailing dots in floats we can also fix the parentheses rules for them (maybe better in a follow-up PR so that the scope of this PR doesn't get too large)

MichaReiser · 2023-07-24T06:39:38Z

crates/ruff_python_formatter/src/expression/number.rs

+    for (index, c) in chars {
+        if matches!(c, 'e' | 'E') {
+            has_exponent = true;
+            if input.as_bytes().get(index - 1).copied() == Some(b'.') {


I believe this could, at least in theory, result in indexing between two character boundaries if the text starts with [e10]. I don't think this is a problem in practice because it's impossible that the preceding character is a unicode character, because the number would then be part of an identifier. I still recommend fixing this just to be sure (e.g. by assigning `is_float on line 147)

lkh42t added 3 commits July 22, 2023 18:39

Format int constant

3a9f477

Format float constant

7a9553f

Format complex constant

081dff9

MichaReiser reviewed Jul 22, 2023

View reviewed changes

MichaReiser requested changes Jul 22, 2023

View reviewed changes

lkh42t requested a review from MichaReiser July 22, 2023 12:48

Use fast source_text_slice instead of dynamic_text when the numbe…

8aa894c

…r is already normalized

lkh42t force-pushed the format-number branch from f072f7e to 8aa894c Compare July 22, 2023 13:35

konstin approved these changes Jul 23, 2023

View reviewed changes

Simplify

3ed03fd

konstin added the formatter Related to the formatter label Jul 23, 2023

Avoid negative index

c7dd381

MichaReiser approved these changes Jul 24, 2023

View reviewed changes

MichaReiser enabled auto-merge (squash) July 24, 2023 06:58

MichaReiser mentioned this pull request Jul 24, 2023

📋 Formatter: Syntax Formatting #4798

Closed

72 tasks

MichaReiser merged commit dfa81b6 into astral-sh:main Jul 24, 2023

lkh42t deleted the format-number branch July 29, 2023 05:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Format numeric constants#5972

Format numeric constants#5972
MichaReiser merged 6 commits intoastral-sh:mainfrom
lkh42t:format-number

lkh42t commented Jul 22, 2023

Uh oh!

github-actions bot commented Jul 22, 2023 •

edited

Loading

Uh oh!

MichaReiser left a comment

Uh oh!

MichaReiser Jul 22, 2023

Uh oh!

lkh42t commented Jul 22, 2023

Uh oh!

MichaReiser left a comment

Uh oh!

Uh oh!

konstin Jul 23, 2023

Uh oh!

MichaReiser Jul 24, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	fn normalize_string(input: &str, quotes: StringQuotes) -> (Cow<str>, ContainsNewlines) {
	// The normalized string if `input` is not yet normalized.
	// `output` must remain empty if `input` is already normalized.
	let mut output = String::new();
	// Tracks the last index of `input` that has been written to `output`.
	// If `last_index` is `0` at the end, then the input is already normalized and can be returned as is.
	let mut last_index = 0;

	let mut newlines = ContainsNewlines::No;

	let style = quotes.style;
	let preferred_quote = style.as_char();
	let opposite_quote = style.invert().as_char();

	let mut chars = input.char_indices();

	while let Some((index, c)) = chars.next() {
	if c == '\r' {
	output.push_str(&input[last_index..index]);

	// Skip over the '\r' character, keep the `\n`
	if input.as_bytes().get(index + 1).copied() == Some(b'\n') {
	chars.next();
	}
	// Replace the `\r` with a `\n`
	else {
	output.push('\n');
	}

	last_index = index + '\r'.len_utf8();
	newlines = ContainsNewlines::Yes;
	} else if c == '\n' {
	newlines = ContainsNewlines::Yes;
	} else if !quotes.triple {
	if c == '\\' {
	if let Some(next) = input.as_bytes().get(index + 1).copied().map(char::from) {
	#[allow(clippy::if_same_then_else)]
	if next == opposite_quote {
	// Remove the escape by ending before the backslash and starting again with the quote
	chars.next();
	output.push_str(&input[last_index..index]);
	last_index = index + '\\'.len_utf8();
	} else if next == preferred_quote {
	// Quote is already escaped, skip over it.
	chars.next();
	} else if next == '\\' {
	// Skip over escaped backslashes
	chars.next();
	}
	}
	} else if c == preferred_quote {
	// Escape the quote
	output.push_str(&input[last_index..index]);
	output.push('\\');
	output.push(c);
	last_index = index + preferred_quote.len_utf8();
	}
	}
	}

	let normalized = if last_index == 0 {
	Cow::Borrowed(input)
	} else {
	output.push_str(&input[last_index..]);
	Cow::Owned(output)
	};

	(normalized, newlines)
	}

Conversation

lkh42t commented Jul 22, 2023

Summary

Test Plan

Uh oh!

github-actions bot commented Jul 22, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Check Results

Benchmark

Linux

Windows

Uh oh!

MichaReiser left a comment

Choose a reason for hiding this comment

Uh oh!

MichaReiser Jul 22, 2023

Choose a reason for hiding this comment

Uh oh!

lkh42t commented Jul 22, 2023

Uh oh!

MichaReiser left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

konstin Jul 23, 2023

Choose a reason for hiding this comment

Uh oh!

MichaReiser Jul 24, 2023

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions bot commented Jul 22, 2023 •

edited

Loading