Conversation
04a662a to
a7897d5
Compare
About the organization
Edit: as noted by @tertsdiepraam the quote issue was an oversight on my part About fuzzing
Edit: leak fixed in #5787 |
7b9a52e to
22e7380
Compare
|
Re: quotes. These probably depend on locale. We chose to follow the C locale until we have proper locale support, which uses |
Interesting. I only had set |
5e35d02 to
ecfba2a
Compare
|
Fixed quotes |
|
GNU testsuite comparison: |
ecfba2a to
9f34735
Compare
|
I've restructured the code to make names more explicit, and added uucode tests to the partial parsing function. |
8bbf8f1 to
54b87f4
Compare
There was a problem hiding this comment.
Hi! Thanks! I think this is mostly correct, but I think it should be implemented slightly differently. A problem with this implementation is that the base is lost. Here's GNU:
> printf %d 010b
printf: ‘010b’: value not completely converted
8
So I think this is what should happen:
- Remove the base prefix to determine the base.
- Get the digits from the front.
- If anything is left, print the error.
- Convert the digits with
from_str_radix.
This will also prevent us from trying to parse the same value multiple times, which feels a bit strange and requires us to keep both passes in sync.
Does that all make sense?
TIL I learned that coreutils inherited the "octal prefix mistake" from C.
Note that:
The code I implemented only deals with decimal numbers, as those are more complicated to parse (only one "-" optional prefix allowed, only one optional "." allowed, "-inf", "inf", "nan" — although those three aren't tied to any specific base they don't accept a base prefix).
Yes it does. I see two alternatives:
The second one makes more sense to me, I'll see what I can do and set this PR as draft in the meantime. Thanks for your feedback. |
Haha yeah and now we're stuck with it 😄
That sounds great! Thank you! |
54b87f4 to
61309c1
Compare
|
I have implemented a new number parser which can also cope with hexadecimal floats. I have put it in another module as to differentiate between the part which emits error messages and the one which only does some processing. |
d5b83d4 to
b7e6488
Compare
|
As far as I know, the only remaining issue before we can activate the printf fuzzer (in matching stderr mode) is that double quotes are escaped when printed by uutils while they are not when printed by GNU coreutils. And also |
b7e6488 to
a6f527c
Compare
|
The latest push just adds a new test for the error message corresponding to a larger single character: $ target/debug/coreutils printf %d "'abc"
printf: warning: bc: character(s) following character constant have been ignored
97% |
|
GNU testsuite comparison: |
|
The failing chown test seems unrelated to those changes and fails for me the same way locally with and without this patch. |
tertsdiepraam
left a comment
There was a problem hiding this comment.
Very nice! This is great stuff! I have some small suggestions, but this could also be merged as is in my opinion.
| ParseError::PartialMatch(v, rest) => { | ||
| if input.starts_with('\'') { | ||
| show_warning!( | ||
| "{}: character(s) following character constant have been ignored", | ||
| &rest, | ||
| ); | ||
| } else { | ||
| show_error!("{}: value not completely converted", input.quote()); | ||
| } | ||
| v | ||
| } |
There was a problem hiding this comment.
Maybe it makes sense to split these cases into two enum variants?
There was a problem hiding this comment.
I experimented with it yesterday but went back: this is a partial match after all, just like the others. The fact that the higher layers choose to print a different message has little to do with the parsing.
There was a problem hiding this comment.
That does make sense, but they could be qualified as a numeric partial match and a character partial match, which are two quite different operations. I'm mostly suggesting it because it would save us from storing but discarding rest. I'm happy with this though, so let's leave it as it is.
There was a problem hiding this comment.
Note that rest is only a &str, no copy is ever made of the character data. I think the cost is insignificant.
There was a problem hiding this comment.
Oh yeah I didn't mean a performance overhead, just an additional thing to pass around
| } | ||
| let (mut index, mut int, mut point, mut frac, mut prec, mut ended) = | ||
| (0, 0u64, false, 0u64, 0, false); | ||
| for c in rest.chars() { |
There was a problem hiding this comment.
A couple of things about this loop:
- It might be easier to use bytes instead of chars because we only care about the digits which are in ascii range.
- I think it might be nicer to split this loop into before and after the integral part, so encountering a
.will break the first loop and will go to the second. The current version is a bit hard to follow with all the mutable variables that keep track of the state (though I can't find any problems with it :) )
There was a problem hiding this comment.
About chars / bytes : sure, but we might encounter a non-ASCII char in the process and we have to deal with it atomically.
About separating the loops, let me check if it makes things simpler.
Thanks for the review!
There was a problem hiding this comment.
but we might encounter a non-ASCII
I think we won't split on that if we encounter it, because it won't be a digit or a point or something else we can use. I think it should be safe.
There was a problem hiding this comment.
To expand on this a bit. The coreutils are weird regarding utf8. The GNU versions usually do not care at all whether a string is valid. The "correct" (i.e. compatible) solution is usually to use bytes throughout and avoid using str/String altogether. Then for printing in error messages we do a to_string_lossy or something like that. It's a bit messy unfortunately.
For example, I bet that printf will happily accept something that starts with digits and then has some invalid utf8. I can trigger this in fish:
# GNU
> printf %d 10\xf0
printf: ‘10\360’: value not completely converted
10
# uutils
> printf %d 10\xf0
error: invalid UTF-8 was detected in one or more arguments
Usage: target/debug/coreutils printf FORMATSTRING [ARGUMENT]...
target/debug/coreutils printf FORMAT [ARGUMENT]...
target/debug/coreutils printf OPTION
There was a problem hiding this comment.
@tertsdiepraam Breaking the loop in two parts has indeed decreased the complexity a lot. It also made a better structure to add inline comments. Thanks for the suggestion!
There was a problem hiding this comment.
To expand on this a bit. The coreutils are weird regarding utf8. The GNU versions usually do not care at all whether a string is valid. The "correct" (i.e. compatible) solution is usually to use bytes throughout and avoid using
str/Stringaltogether. Then for printing in error messages we do ato_string_lossyor something like that. It's a bit messy unfortunately.
Agreed, but it comes back to the previous question: do we want to be compatible with the GNU version even when we go out of the scope of its intended usage, or is it enough to be compatible with the GNU version for its intended usage, and then to be more correct outside of this usage?
Anyways, as far as this parser is concerned, given that it receives a &str as input, I suggest to keep char which is more readable and should infer no performance penalty, and switch to u8 if at some point you decide to manipulate &[u8] inside of printf for example. The change would be easy and feel natural then.
There was a problem hiding this comment.
do we want to be compatible with the GNU version even when we go out of the scope of its intended usage, or is it enough to be compatible with the GNU version for its intended usage, and then to be more correct outside of this usage?
So far, the answer has been that we do indeed want to be compatible, but of course the intended usage has more priority. It's difficult to tell what even is the intended usage at this point, because the coreutils have accumulated so much functionality. I think the strength of this particular project lies in the compatibility, even though I often wish I could change (i.e. "fix") the behaviour. But I'm always happy to discuss specific cases :)
527b015 to
d14c464
Compare
|
Btw, are different parameters used for configuring Clippy? The windows running requires that I had a |
|
GNU testsuite comparison: |
d14c464 to
1ec5b85
Compare
|
The failing macos/x86_64 check seems unrelated. |
tertsdiepraam
left a comment
There was a problem hiding this comment.
Final few comments I think, just some super minor things. This is excellent and I want to merge it as soon as possible.
The parser can parse integral and floating point numbers as expected by the coreutils `printf` command.
The error messages are more compliant with GNU coreutils. Also, floating hexadecimal numbers are now supported in `printf`.
1ec5b85 to
a85a792
Compare
|
GNU testsuite comparison: |
|
Great! Thank you! |
This makes printf more compatible with its GNU coreutils counterpart.