Skip to content

[refurb] Count codepoints not bytes for slice-to-remove-prefix-or-suffix (FURB188)#13631

Merged
MichaReiser merged 9 commits intoastral-sh:mainfrom
dylwil3:affix-unicode
Oct 7, 2024
Merged

[refurb] Count codepoints not bytes for slice-to-remove-prefix-or-suffix (FURB188)#13631
MichaReiser merged 9 commits intoastral-sh:mainfrom
dylwil3:affix-unicode

Conversation

@dylwil3
Copy link
Copy Markdown
Collaborator

@dylwil3 dylwil3 commented Oct 4, 2024

This PR fixes the calculation of string length for the purposes of verifying when to suggest removeprefix/removesuffix (FURB188). Before, we used text_len which was counting bytes rather than codepoints (chars) and therefore disagreed with Python's len for non-ASCII text.

Closes #13620

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Oct 4, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@dscorbett
Copy link
Copy Markdown

Another subtlety worth testing is strings with surrogates. In Python, each surrogate counts as 1 and surrogate pairs are not special so they count as 2; for example, len("\ud800\udc00") == 2 whereas len("\U00010000") == 1. I don’t know how Ruff distinguishes those two example strings or if chars().count() matches what Python does.

@dylwil3
Copy link
Copy Markdown
Collaborator Author

dylwil3 commented Oct 4, 2024

Another subtlety worth testing is strings with surrogates. In Python, each surrogate counts as 1 and surrogate pairs are not special so they count as 2; for example, len("\ud800\udc00") == 2 whereas len("\U00010000") == 1. I don’t know how Ruff distinguishes those two example strings or if chars().count() matches what Python does.

TIL @dscorbett - neat! Added a test for this, and it appears to be handled correctly (I think this happens in the guts of the parser, so by the time I'm looking at string_val here, the subtleties have been smoothed out).

@dscorbett
Copy link
Copy Markdown

I think the reason it works is that Ruff’s representation of a Python string as a Rust string replaces surrogates with replacement characters. That is fine for counting the code points but could be a problem for other rules.

Copy link
Copy Markdown
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks. I only have two nit comments.

Comment on lines +338 to +339
.and_then(ast::Int::as_u32)
.and_then(|x| usize::try_from(x).ok())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest converting to a u64 considering that you have to use usize::try_from anyways (for 32 bit platforms)

Suggested change
.and_then(ast::Int::as_u32)
.and_then(|x| usize::try_from(x).ok())
.and_then(ast::Int::as_u64)
.and_then(|x| usize::try_from(x).ok())

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or you could consider adding a as_usize method to ast::Int

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

went with the latter

// Only support prefix removal for size at most `u32::MAX`
.and_then(ast::Int::as_u32)
.and_then(|x| usize::try_from(x).ok())
.is_some_and(|x| x == string_val.to_str().chars().count()),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.is_some_and(|x| x == string_val.to_str().chars().count()),
.is_some_and(|x| x == string_val.chars().count()),

.and_then(ast::Int::as_u32)
.is_some_and(|x| x == string_val.to_str().text_len().to_u32())
.and_then(|x| usize::try_from(x).ok())
.is_some_and(|x| x == string_val.to_str().chars().count())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.is_some_and(|x| x == string_val.to_str().chars().count())
.is_some_and(|x| x == string_val.chars().count())

@MichaReiser MichaReiser added fixes Related to suggested fixes for violations preview Related to preview mode features rule Implementing or modifying a lint rule and removed preview Related to preview mode features fixes Related to suggested fixes for violations labels Oct 7, 2024
@MichaReiser MichaReiser merged commit 14ee5db into astral-sh:main Oct 7, 2024
@dylwil3 dylwil3 deleted the affix-unicode branch October 7, 2024 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

preview Related to preview mode features rule Implementing or modifying a lint rule

Projects

None yet

Development

Successfully merging this pull request may close these issues.

FURB188 calculates lengths wrong for non-ASCII affixes

3 participants