[refurb] Parse more exotic decimal strings in verbose-decimal-constructor (FURB157)#14098
[refurb] Parse more exotic decimal strings in verbose-decimal-constructor (FURB157)#14098dylwil3 merged 6 commits intoastral-sh:mainfrom
verbose-decimal-constructor (FURB157)#14098Conversation
|
Happy to hear any suggestions around the strange mix of mutability and shadowing in the implementation. I was attempting to avoid allocating a String if we didn't have to, and to not change too much from the implementation that already existed... Also happy to delete some comments if they are contributing more to clutter than understanding. |
|
| // _after_ trimming whitespace from the string and removing all occurrences of "_". | ||
| let mut trimmed = Cow::from(str_literal.to_str().trim_whitespace()); | ||
| if memchr::memchr(b'_', trimmed.as_bytes()).is_some() { | ||
| trimmed = Cow::from(trimmed.replace('_', "")); |
There was a problem hiding this comment.
Nit: I don't think memchr gives us much here, considering that most literals are very short. We also don't have to use replace and allcoate a Cow::Owned. Instead, I would use Cow::from(trimmed.trim_start_matches('_')) for better readability (with the added benefit that it doesn't allocate)
There was a problem hiding this comment.
I'm fine not using memchr, but I do think we need to use replace because underscores may appear anywhere in the string. For example- Decimal("1_____00") is valid.
Edit: whoops, didn't see that was already addressed below
| // NB: We need not convert, e.g., `2e3` to `2000`. | ||
| // However, `2e+3` is a syntax error, so we remove | ||
| // the sign on the exponent. | ||
| rest = Cow::from(rest.replace('+', "")); |
There was a problem hiding this comment.
replace is technically O(n). Not that it matters much here but I think I would find it slightly more readable if we explicitly tested for the + sign and the expected position.
if rest[index + 1] == b'+' {
rest.into_owned().remove(index + 1)
}
The alternative is to use remove_matches which captures the semantic better than repalce with a empty string
There was a problem hiding this comment.
I like remove_matches but isn't it nightly only? https://doc.rust-lang.org/std/string/struct.String.html#method.remove_matches
There was a problem hiding this comment.
Ah dang. I guess replace is still the best option
| // 'e' was the last character in `rest`, and then the right | ||
| // hand side will be `""`. | ||
| let exponent = rest[index + 1..] | ||
| .strip_prefix('+') |
There was a problem hiding this comment.
It seems that the exponential syntax also supports negative numbers:
ruff/crates/ruff_python_parser/src/lexer.rs
Lines 1088 to 1100 in 51a9983
There was a problem hiding this comment.
Yes, you can have negative exponents. But this lint should not suggest replacing Decimal("1e-1") with Decimal(1e-1). They are not the same! Decimal offers a way to represent decimal numbers in an exact way, something that floating point representations do not support.
>>>> Decimal("1e-1") == Decimal(1e-1) # 1e-1 = 0.1 can not be represented exactly as a float
False
>>>> Decimal("5e-1") == Decimal(5e-1) # Okay, 5e-1 = 1/2 can be represented exactly
TrueI think we need to be similarly careful with large numbers, even if they are integers:
>>>> Decimal("1e22") == Decimal(1e22)
True
>>>> Decimal("1e23") == Decimal(1e23)
FalseThere was a problem hiding this comment.
Just as @sharkdp points out, the omission of negative exponents was intentional. I didn't think about large exponents though, good point!
There was a problem hiding this comment.
It's hard to figure out when this conversion is actually safe - parsing scientific notation is weird, and I can't seem to find a reference for the range where the representation is faithful. I think issues start to occur when numbers are larger than 1e16 but I'm having trouble finding a justification for that in code/documentation...
The smallest example I found so far was:
>>> Decimal(1801439850948199e1) == Decimal('1801439850948199e1')
FalseThere was a problem hiding this comment.
FURB157 should not rewrite strings to float literals. Literals with exponents are float literals, so they should not be passed to the Decimal constructor, or they will trigger decimal-from-float-literal (RUF032).
$ cat ruf032.py
from decimal import Decimal
Decimal(1e16)
$ ruff check --isolated --preview --select RUF032 ruf032.py
ruf032.py:2:9: RUF032 `Decimal()` called with float literal argument
|
1 | from decimal import Decimal
2 | Decimal(1e16)
| ^^^^ RUF032
|
= help: Use a string literal instead
Found 1 error.
No fixes available (1 hidden fix can be enabled with the `--unsafe-fixes` option).There was a problem hiding this comment.
Well that makes life easier! Thank you!
| // NB: We have already checked that there is at most one 'e'. | ||
| if !rest | ||
| .bytes() | ||
| .all(|c| c.is_ascii_digit() || c == b'e' || c == b'E') |
There was a problem hiding this comment.
Does this support 1_2_3_4? or 1.1_2_3_4?
There was a problem hiding this comment.
I think the idea above was to replace all _s, not just leading ones.
There was a problem hiding this comment.
Good catch. I don't think replacing all of them is correct because _ are not permitted in the exponent part.
There was a problem hiding this comment.
I think they are:
>>> Decimal("2e1_0") == Decimal(2e1_0)
TrueThere was a problem hiding this comment.
Understanding your own code is always the hardest. So what's not allowed is Decimal(2._1)
ruff/crates/ruff_python_parser/src/lexer.rs
Lines 1072 to 1078 in 4323512
There was a problem hiding this comment.
Yes. Leading/trailing underscores are also not supported in "plain" Python syntax: _10. But they are supported in Decimals syntax.
But I don't think there is a problem in this PR? All underscores are replaced. So if someone really has Decimal("_1_0_0__") in their codebase, we would suggest replacing that with Decimal(100), which is fine.
| // NB: We need not convert, e.g., `2e3` to `2000`. | ||
| // However, `2e+3` is a syntax error, so we remove | ||
| // the sign on the exponent. |
There was a problem hiding this comment.
I don't think the positive sign in the exponent is a syntax error?
>>> 2e+3
2000.0
>>> Decimal(2e+3)
Decimal('2000')There was a problem hiding this comment.
You're absolutely right! Dunno why I thought it was...
| 24 | Decimal("__1____000") | ||
| 25 | Decimal("2e4") | ||
| | | ||
| = help: Replace with `1000` |
There was a problem hiding this comment.
Maybe it would be nice to preserve the thousand separator in that case and suggest 1_000 instead of 1000. (If it's not too annoying to do)
There was a problem hiding this comment.
That's a good suggestion, and I think it would be a nice followup! I think one would want to trim leading/trailing underscores, and then replace every internal, contiguous sequence of underscores with a single underscore. At that point it may be worth a little refactoring to just walk through the string once and abort early as appropriate...
| Decimal(0) | ||
| Decimal("Infinity") | ||
| decimal.Decimal(0) | ||
|
|
There was a problem hiding this comment.
Could we please also include cases for Decimal("-inf") and Decimal("inf") (both should error).
|
Sorry for sending everyone down an exponential rabbit hole, and thanks @dscorbett for digging me out of it! Exponent handling has been removed since it takes us out of the world of integer literals. |
|
@dylwil3 feel free to merge at your own discretion |
FURB157 suggests replacing expressions like
Decimal("123")withDecimal(123). This PR extends the rule to cover cases where the input string toDecimalcan be easily transformed into an integer literal.For example:
Note: we do not implement the full decimal parsing logic from CPython on the grounds that certain acceptable string inputs to the
Decimalconstructor may be presumed purposeful on the part of the developer. For example, as in the linked issue,Decimal("١٢٣")is valid and equal toDecimal(123), but we do not suggest a replacement in this case.Closes #13807