Start tracking quoting style in the AST by AlexWaygood · Pull Request #10298 · astral-sh/ruff

AlexWaygood · 2024-03-08T13:28:53Z

Summary

This PR modifies our AST so that nodes for string literals, bytes literals and f-strings all retain the following information:

The quoting style used (double or single quotes)
Whether the string is triple-quoted or not
Whether the string is raw or not

This PR is a followup to #10256. Like with that PR, this PR does not, in itself, fix any bugs. However, it means that we will have the necessary information to preserve quoting style and rawness of strings in the ExprGenerator in a followup PR, which will allow us to provide a fix for #7799.

The PR should be easiest to review commit-by-commit:

The first commit makes the necessary modifications for tracking this information in ast::StringLiteral nodes
The second commit does the same for ast::BytesLiteral nodes
The third commit does the same for ast::FString nodes

The information is recorded on the AST nodes using a bitflag field on each node, similarly to how we recorded the information on Tok::String, Tok::FStringStart and Tok::FStringMiddle tokens in #10298. Rather than reusing the bitflag I used for the tokens, however, I decided to create a custom bitflag for each AST node.

Using different bitflags for each node allows us to make invalid states unrepresentable: it is valid to set a u prefix on a string literal, but not on a bytes literal or an f-string. It also allows us to have better debug representations for each AST node modified in this PR.

Test Plan

cargo test

AlexWaygood · 2024-03-08T13:35:33Z

Performance is neutral on all benchmarks according to Codspeed.

crates/ruff_python_parser/src/token.rs

github-actions · 2024-03-08T13:47:00Z

`ruff-ecosystem` results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

MichaReiser

Nice. It would be nice to use the new string flags in the formatter to avoid parsing them from source. But that's nothing for this PR.

ruff/crates/ruff_python_formatter/src/string/mod.rs

Lines 121 to 144 in 72bf1c2

    
           pub(crate) fn parse(input: &str) -> StringPrefix { 
        
               let chars = input.chars(); 
        
               let mut prefix = StringPrefix::empty(); 
        
               for c in chars { 
        
                   let flag = match c { 
        
                       'u' | 'U' => StringPrefix::UNICODE, 
        
                       'f' | 'F' => StringPrefix::F_STRING, 
        
                       'b' | 'B' => StringPrefix::BYTE, 
        
                       'r' => StringPrefix::RAW, 
        
                       'R' => StringPrefix::RAW_UPPER, 
        
                       '\'' | '"' => break, 
        
                       c => { 
        
                           unreachable!( 
        
                               "Unexpected character '{c}' terminating the prefix of a string literal" 
        
                           ); 
        
                       } 
        
                   }; 
        
                   prefix |= flag; 
        
               } 
        
               prefix 
        
           }

I posted a solution that avoids the need to change FStringStart. I would prefer if we keep it as is to reduce the diff size.

crates/ruff_python_ast/src/nodes.rs

MichaReiser · 2024-03-08T13:44:03Z

crates/ruff_python_ast/src/nodes.rs

+    }
+
+    /// Does the string have a `u` or `U` prefix?
+    pub const fn is_u_string(&self) -> bool {


Nit,

Suggested change

pub const fn is_u_string(&self) -> bool {

pub const fn is_unicode(&self) -> bool {

Potentially without the string postfix because that's given by the enclosing type. E.g it's also is_triple_quoted and not is_triple_quoted_string. E.g let's compare some call-sites: string_literal.is_unicode_string() seems repetivie in comparison to string_literal.is_unicode()

Same response as #10298 (comment)

crates/ruff_python_ast/src/node.rs

MichaReiser · 2024-03-08T13:47:05Z

crates/ruff_python_codegen/src/stylist.rs

+impl From<ruff_python_ast::str::QuoteStyle> for Quote {
+    fn from(value: ruff_python_ast::str::QuoteStyle) -> Self {
        match value {
-            ruff_python_parser::QuoteStyle::Double => Self::Double,
-            ruff_python_parser::QuoteStyle::Single => Self::Single,
+            ruff_python_ast::str::QuoteStyle::Double => Self::Double,
+            ruff_python_ast::str::QuoteStyle::Single => Self::Single,
        }
    }
 }


Can we replace the Quote type with QuoteStyle? I would prefer to minimize the number of Quotelike types we have.

We could do, but it would require making the ruff_python_literal crate have a dependency on the ruff_python_ast crate (or vice versa), due to this:

ruff/crates/ruff_python_codegen/src/stylist.rs

Lines 123 to 130 in 965adbe

impl From<Quote> for StrQuote {

fn from(val: Quote) -> Self {

match val {

Quote::Single => StrQuote::Single,

Quote::Double => StrQuote::Double,

}

}

}

If we used ruff_python_ast::str::QuoteStyle there instead of ruff_python_codegen::stylist::Quote, we wouldn't be able to have the implementation of that trait in the ruff_python_codegen crate, since the QuoteStyle enum comes from the ruff_python_literal crate. ruff_python_literal does not currently depend on ruff_python_ast (nor vice versa).

crates/ruff_python_parser/src/token.rs

MichaReiser · 2024-03-08T13:57:33Z

crates/ruff_linter/src/rules/flake8_simplify/rules/ast_expr.rs

+        .with_value(capital_env_var.into_boxed_str())
+        .with_prefix({
+            if env_var.is_unicode() {
+                StringLiteralPrefix::UString


Nit: I think I would prefer Unicode, it avoids the string repetition

I dislike Unicode, because all strings in Python are unicode... the u prefix does nothing at runtime, it's purely there for Python 2 compatibility. (In fact, it was originally removed in Python 3.0, but they added it back in a later Python 3 version because it was so difficult to write code that worked on both Python 2 and Python 3 if the prefix was gone in Python 3.)

That's fair. But it is the unicode flag ;) Can you do a quick check on what we call it in other places in the code base? It would be nice if it is consistent.

I did a grep -- there's a pyupgrade rule called UnicodeKindPrefix, but other than that, I can't see much use of the term Unicode across the codebase to describe this prefix...

I'd still rather keep this as-is -- I personally would find it really confusing to refer to these strings as "unicode strings", given that all strings are unicode strings, even without the prefix

crates/ruff_linter/src/rules/flake8_simplify/rules/ast_expr.rs

This PR modifies our AST so that nodes for string literals, bytes literals and f-strings all retain the following information: - The quoting style used (double or single quotes) - Whether the string is triple-quoted or not - Whether the string is raw or not This PR is a followup to astral-sh#10256. Like with that PR, this PR does not, in itself, fix any bugs. However, it means that we will have the necessary information to preserve quoting style and rawness of strings in the `ExprGenerator` in a followup PR, which will allow us to provide a fix for astral-sh#7799. The information is recorded on the AST nodes using a bitflag field on each node, similarly to how we recorded the information on `Tok::String`, `Tok::FStringStart` and `Tok::FStringMiddle` tokens in astral-sh#10298. Rather than reusing the bitflag I used for the tokens, however, I decided to create a custom bitflag for each AST node. Using different bitflags for each node allows us to make invalid states unrepresentable: it is valid to set a `u` prefix on a string literal, but not on a bytes literal or an f-string. It also allows us to have better debug representations for each AST node modified in this PR.

AlexWaygood added the internal An internal refactor or improvement label Mar 8, 2024

AlexWaygood requested a review from MichaReiser as a code owner March 8, 2024 13:28

AlexWaygood commented Mar 8, 2024

View reviewed changes

crates/ruff_python_parser/src/token.rs Outdated Show resolved Hide resolved

MichaReiser approved these changes Mar 8, 2024

View reviewed changes

AlexWaygood force-pushed the quotestyle-ast branch 8 times, most recently from c8bdaf0 to 7ab7c5a Compare March 8, 2024 18:41

AlexWaygood added 3 commits March 8, 2024 18:45

Start tracking quoting style in StringLiteral nodes

3978112

Start tracking quoting style in BytesLiteral nodes

98f11e2

Track quoting style in f-string nodes

703dcf3

AlexWaygood force-pushed the quotestyle-ast branch from 7ab7c5a to 703dcf3 Compare March 8, 2024 18:45

AlexWaygood merged commit 1d97f27 into astral-sh:main Mar 8, 2024

AlexWaygood deleted the quotestyle-ast branch March 8, 2024 19:11

AlexWaygood mentioned this pull request Mar 9, 2024

Track casing of r-string prefixes in the tokenizer and AST #10314

Merged

AlexWaygood mentioned this pull request Mar 11, 2024

Preserve multiline strings in code generator #10000

Closed

dhruvmanila mentioned this pull request Mar 18, 2024

Re-write Q001-003 to use string flags #10456

Open

AlexWaygood mentioned this pull request Mar 20, 2024

Simplify formatting of strings by using flags from the AST nodes #10489

Merged

	pub(crate) fn parse(input: &str) -> StringPrefix {
	let chars = input.chars();
	let mut prefix = StringPrefix::empty();

	for c in chars {
	let flag = match c {
	'u' \| 'U' => StringPrefix::UNICODE,
	'f' \| 'F' => StringPrefix::F_STRING,
	'b' \| 'B' => StringPrefix::BYTE,
	'r' => StringPrefix::RAW,
	'R' => StringPrefix::RAW_UPPER,
	'\'' \| '"' => break,
	c => {
	unreachable!(
	"Unexpected character '{c}' terminating the prefix of a string literal"
	);
	}
	};

	prefix \|= flag;
	}

	prefix
	}

	pub const fn is_u_string(&self) -> bool {
	pub const fn is_unicode(&self) -> bool {

	impl From<Quote> for StrQuote {
	fn from(val: Quote) -> Self {
	match val {
	Quote::Single => StrQuote::Single,
	Quote::Double => StrQuote::Double,
	}
	}
	}

Conversation

AlexWaygood commented Mar 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Plan

Uh oh!

AlexWaygood commented Mar 8, 2024

Uh oh!

Uh oh!

github-actions bot commented Mar 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

ruff-ecosystem results

Linter (stable)

Linter (preview)

Formatter (stable)

Formatter (preview)

Uh oh!

MichaReiser left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AlexWaygood commented Mar 8, 2024 •

edited

Loading

github-actions bot commented Mar 8, 2024 •

edited

Loading

`ruff-ecosystem` results