printf compatibility by samueltardieu · Pull Request #5783 · uutils/coreutils

samueltardieu · 2024-01-04T20:23:33Z

This makes printf more compatible with its GNU coreutils counterpart.

format: %c prints the first character of a string
format: parse as many characters as possible in numbers and make the error messages more compatible with GNU coreutils

samueltardieu · 2024-01-04T22:36:10Z

About the organization

The split between the printf application and core functionality seems weird sometimes, as error messages are emitted from the core library. Maybe it would make sense if the parsing functionalities were either brought back into the printf application or returned Result objects in order to be able to format error messages in the printf application specifically.

For example, printf uses ‘XXX’ quotes which AFAICT were not present in the core library; emitting the message from there (uucore::features::format::argument) forces uucore to use those quotes.

Edit: as noted by @tertsdiepraam the quote issue was an oversight on my part

About fuzzing

The fuzzing now often fails because too many file descriptors are opened. I don't know if there is a descriptor leak in the fuzzing framework itself, or if this is due to the launch of the external GNU coreutils commands.

Edit: leak fixed in #5787

tertsdiepraam · 2024-01-04T23:56:12Z

Re: quotes. These probably depend on locale. We chose to follow the C locale until we have proper locale support, which uses '.

samueltardieu · 2024-01-05T08:17:42Z

Re: quotes. These probably depend on locale. We chose to follow the C locale until we have proper locale support, which uses '.

Interesting. I had only set LC_NUMERIC=C; with all of the locale variables set I get different results in the fuzzer, but not in the tests. I'll fix the quotes.

samueltardieu · 2024-01-05T08:22:20Z

Fixed quotes

github-actions · 2024-01-05T08:55:35Z

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)

samueltardieu · 2024-01-05T09:00:00Z

I've restructured the code to make names more explicit, and added uucore tests for the partial parsing function.

tertsdiepraam

Hi! Thanks! I think this is mostly correct, but I think it should be implemented slightly differently. A problem with this implementation is that the base is lost. Here's GNU:

> printf %d 010b
printf: ‘010b’: value not completely converted
8

So I think this is what should happen:

Remove the base prefix to determine the base.
Get the digits from the front.
If anything is left, print the error.
Convert the digits with from_str_radix.

This will also prevent us from trying to parse the same value multiple times, which feels a bit strange and requires us to keep both passes in sync.

Does that all make sense?
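
A minimal sketch of the four steps above, for unsigned integers only. The function name `parse_with_base` and its return shape are assumptions made for illustration, not the code from this PR; sign handling, floats, and overflow reporting are left out.

```rust
/// Hypothetical helper (not the PR's code): returns the converted value, if
/// any digits were found, together with the unparsed remainder.
fn parse_with_base(input: &str) -> (Option<u64>, &str) {
    // Step 1: look at the prefix to determine the base. The leading "0" of an
    // octal number is itself a valid octal digit, so it is left in place.
    let (radix, body) = match input.strip_prefix("0x").or_else(|| input.strip_prefix("0X")) {
        Some(hex) => (16, hex),
        None if input.starts_with('0') => (8, input),
        None => (10, input),
    };

    // Step 2: take the digits from the front.
    let end = body
        .find(|c: char| !c.is_digit(radix))
        .unwrap_or(body.len());
    let (digits, rest) = body.split_at(end);

    // Step 4: convert the digits exactly once, in the base found in step 1.
    // An empty digit string or an overflow yields `None` in this sketch.
    (u64::from_str_radix(digits, radix).ok(), rest)
}
```

Step 3 is left to the caller: when the returned remainder is non-empty, it would print the "value not completely converted" message. With this sketch, `parse_with_base("010b")` yields `(Some(8), "b")`, which matches the GNU output shown above.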

samueltardieu · 2024-01-08T09:32:22Z

Hi! Thanks! I think this is mostly correct, but I think it should be implemented slightly differently. A problem with this implementation is that the base is lost. Here's GNU:
> printf %d 010b
printf: ‘010b’: value not completely converted
8

TIL that coreutils inherited the "octal prefix mistake" from C.

So I think this is what should happen:

1. Remove the base prefix to determine the base.
2. Get the digits from the front.
3. If anything is left, print the error.
4. Convert the digits with `from_str_radix`.

Note that:

The "-" prefix must be extracted first, since it comes before the base prefix.
The parser has to account for the fact that some bases (dec/hex) can have a fractional part while others (bin/oct) cannot.

The code I implemented only deals with decimal numbers, as those are more complicated to parse (only one "-" optional prefix allowed, only one optional "." allowed, "-inf", "inf", "nan" — although those three aren't tied to any specific base they don't accept a base prefix).

This will also prevent us from trying to parse the same value multiple times, which feels a bit strange and requires us to keep both passes in sync.

Does that all make sense?

Yes it does. I see two alternatives:

1. Keep the parser I wrote for the decimal case. It is only called when the number could not be parsed as-is, so those situations can be considered unlikely. Handle numbers in other bases differently.
2. Write a parser accommodating all cases and return the parsed value as well as the rest of the string if it cannot be parsed.

The second one makes more sense to me, I'll see what I can do and set this PR as draft in the meantime.

Thanks for your feedback.
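
A rough sketch of what the second alternative could look like for signed decimal integers. `PartialMatch` is the variant name that appears later in this thread; the `NotNumeric` variant, the function name `parse_i64`, and the exact signature are assumptions for illustration only.

```rust
/// Error type carrying the value parsed so far and the unparsed remainder.
/// `NotNumeric` is a hypothetical variant added for this sketch.
#[derive(Debug, PartialEq)]
enum ParseError<'a, T> {
    NotNumeric,
    PartialMatch(T, &'a str),
}

/// Hypothetical parser: the "-" prefix is extracted first, then the digits.
fn parse_i64(input: &str) -> Result<i64, ParseError<'_, i64>> {
    let (negative, unsigned) = match input.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, input),
    };

    // Collect the leading ASCII digits; everything after them is the rest.
    let end = unsigned
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(unsigned.len());
    let (digits, rest) = unsigned.split_at(end);

    // No digits (or an overflow) is reported as "not numeric" in this sketch.
    let magnitude = match digits.parse::<i64>() {
        Ok(m) => m,
        Err(_) => return Err(ParseError::NotNumeric),
    };
    let value = if negative { -magnitude } else { magnitude };

    if rest.is_empty() {
        Ok(value)
    } else {
        Err(ParseError::PartialMatch(value, rest))
    }
}
```

Returning the remainder inside the error lets the application layer decide which message to print, which is the separation discussed earlier in the thread.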

tertsdiepraam · 2024-01-08T09:37:06Z

TIL that coreutils inherited the "octal prefix mistake" from C.

Haha yeah and now we're stuck with it 😄

Write a parser accommodating all cases and return the parsed value as well as the rest of the string if it cannot be parsed.

That sounds great! Thank you!

samueltardieu · 2024-01-08T14:19:33Z

I have implemented a new number parser which can also cope with hexadecimal floats. I have put it in a separate module to keep the part that emits error messages apart from the part that only does the parsing.
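
For readers unfamiliar with the format, here is a small sketch of how a hexadecimal float in C's literal syntax (hex mantissa, optional `p` binary exponent) can be evaluated. The function `parse_hex_float` is hypothetical and far stricter than the parser in this PR: no sign, no partial matches, no `inf`/`nan`.

```rust
/// Hypothetical helper: evaluates strings such as "0x1.8p1" (= 1.5 * 2^1).
fn parse_hex_float(input: &str) -> Option<f64> {
    let body = input
        .strip_prefix("0x")
        .or_else(|| input.strip_prefix("0X"))?;

    // Split off the optional binary exponent introduced by 'p' or 'P'.
    let (mantissa, exponent) = match body.split_once(|c: char| c == 'p' || c == 'P') {
        Some((m, e)) => (m, e.parse::<i32>().ok()?),
        None => (body, 0),
    };

    // Split the mantissa into integral and fractional hex digits.
    let (int_part, frac_part) = mantissa.split_once('.').unwrap_or((mantissa, ""));
    if int_part.is_empty() && frac_part.is_empty() {
        return None;
    }

    let mut value = 0.0f64;
    for c in int_part.chars() {
        value = value * 16.0 + f64::from(c.to_digit(16)?);
    }
    let mut scale = 1.0 / 16.0;
    for c in frac_part.chars() {
        value += f64::from(c.to_digit(16)?) * scale;
        scale /= 16.0;
    }

    // The exponent is a power of two, not of sixteen.
    Some(value * 2f64.powi(exponent))
}
```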

samueltardieu · 2024-01-08T14:46:49Z

As far as I know, the only remaining issue before we can activate the printf fuzzer (in matching stderr mode) is that double quotes are escaped when printed by uutils while they are not when printed by GNU coreutils. And also \c (suppress further output) doesn't seem to be honored properly.

samueltardieu · 2024-01-08T23:07:37Z

The latest push just adds a new test for the warning emitted when a character constant is followed by extra characters:

 $ target/debug/coreutils printf %d "'abc"
printf: warning: bc: character(s) following character constant have been ignored
97%
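
A sketch of the behaviour shown above: an argument starting with a single quote is converted to the code point of the character that follows, and whatever comes after that character is handed back so the caller can warn about it. The name `parse_char_constant` is an assumption for illustration.

```rust
/// Hypothetical helper: returns the code point of the first character after
/// the leading quote, plus the characters that would be ignored.
fn parse_char_constant(input: &str) -> Option<(u32, &str)> {
    let body = input.strip_prefix('\'')?;
    let mut chars = body.chars();
    let first = chars.next()?;
    // `as_str` gives the not-yet-consumed tail without copying it.
    Some((u32::from(first), chars.as_str()))
}
```

For the input `'abc` this returns `(97, "bc")`, matching the value and the warning text in the session above.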

github-actions · 2024-01-08T23:38:34Z

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)

samueltardieu · 2024-01-09T08:48:14Z

The failing chown test seems unrelated to those changes and fails for me the same way locally with and without this patch.

tertsdiepraam

Very nice! This is great stuff! I have some small suggestions, but this could also be merged as is in my opinion.

tertsdiepraam · 2024-01-09T10:09:06Z

src/uucore/src/lib/features/format/argument.rs

+                ParseError::PartialMatch(v, rest) => {
+                    if input.starts_with('\'') {
+                        show_warning!(
+                            "{}: character(s) following character constant have been ignored",
+                            &rest,
+                        );
+                    } else {
+                        show_error!("{}: value not completely converted", input.quote());
+                    }
+                    v
+                }


Maybe it makes sense to split these cases into two enum variants?

I experimented with it yesterday but went back: this is a partial match after all, just like the others. The fact that the higher layers choose to print a different message has little to do with the parsing.

That does make sense, but they could be qualified as a numeric partial match and a character partial match, which are two quite different operations. I'm mostly suggesting it because it would save us from storing but discarding rest. I'm happy with this though, so let's leave it as it is.

Note that rest is only a &str, no copy is ever made of the character data. I think the cost is insignificant.

Oh yeah I didn't mean a performance overhead, just an additional thing to pass around

tertsdiepraam · 2024-01-09T10:15:21Z

src/uucore/src/lib/features/format/num_parser.rs

+        }
+        let (mut index, mut int, mut point, mut frac, mut prec, mut ended) =
+            (0, 0u64, false, 0u64, 0, false);
+        for c in rest.chars() {


A couple of things about this loop:

It might be easier to use bytes instead of chars because we only care about the digits which are in ascii range.

I think it might be nicer to split this loop into before and after the integral part, so encountering a . will break the first loop and will go to the second. The current version is a bit hard to follow with all the mutable variables that keep track of the state (though I can't find any problems with it :) )

About chars / bytes : sure, but we might encounter a non-ASCII char in the process and we have to deal with it atomically.

About separating the loops, let me check if it makes things simpler.

Thanks for the review!

but we might encounter a non-ASCII

I think we won't split on that if we encounter it, because it won't be a digit or a point or something else we can use. I think it should be safe.

To expand on this a bit. The coreutils are weird regarding utf8. The GNU versions usually do not care at all whether a string is valid. The "correct" (i.e. compatible) solution is usually to use bytes throughout and avoid using str/String altogether. Then for printing in error messages we do a to_string_lossy or something like that. It's a bit messy unfortunately.

For example, I bet that printf will happily accept something that starts with digits and then has some invalid utf8. I can trigger this in fish:

# GNU
> printf %d 10\xf0
printf: ‘10\360’: value not completely converted
10
# uutils
> printf %d 10\xf0
error: invalid UTF-8 was detected in one or more arguments

Usage: target/debug/coreutils printf FORMATSTRING [ARGUMENT]...
       target/debug/coreutils printf FORMAT [ARGUMENT]...
       target/debug/coreutils printf OPTION
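
To illustrate the "bytes throughout, lossy conversion only for messages" approach described above, here is a sketch with a hypothetical `parse_leading_digits` function. It is not how printf is implemented in this PR, and the message uses the standard library's replacement character rather than GNU's octal escape.

```rust
/// Hypothetical helper: parses leading ASCII digits from raw bytes and builds
/// a diagnostic when something is left over.
fn parse_leading_digits(arg: &[u8]) -> (u64, Option<String>) {
    let end = arg
        .iter()
        .position(|b| !b.is_ascii_digit())
        .unwrap_or(arg.len());

    // Saturating arithmetic keeps the sketch free of overflow panics.
    let value = arg[..end].iter().fold(0u64, |acc, b| {
        acc.saturating_mul(10).saturating_add(u64::from(*b - b'0'))
    });

    // The bytes are only turned into text for the message, lossily.
    let warning = (end < arg.len()).then(|| {
        format!(
            "'{}': value not completely converted",
            String::from_utf8_lossy(arg)
        )
    });
    (value, warning)
}
```

Called with the bytes `b"10\xf0"`, this returns the value 10 together with a message in which the invalid byte is shown as U+FFFD.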

@tertsdiepraam Breaking the loop in two parts has indeed decreased the complexity a lot. It also made a better structure to add inline comments. Thanks for the suggestion!
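
A sketch of the two-loop structure, assuming a hypothetical `parse_decimal_parts` function that returns the integral value, the fractional digits as an integer, the number of fractional digits, and the unparsed rest. It iterates over `char`s, as in the PR, and does not check that at least one digit was seen.

```rust
/// Hypothetical helper showing one loop per part of the number.
fn parse_decimal_parts(input: &str) -> (u64, u64, u32, &str) {
    let mut chars = input.char_indices().peekable();
    let mut int = 0u64;
    let mut end = 0usize;

    // First loop: integral digits, stopping at the first non-digit.
    while let Some(&(i, c)) = chars.peek() {
        let Some(d) = c.to_digit(10) else { break };
        int = int.saturating_mul(10).saturating_add(u64::from(d));
        end = i + c.len_utf8();
        chars.next();
    }

    // Second loop: only entered when the first one stopped on a '.'.
    let (mut frac, mut frac_len) = (0u64, 0u32);
    if let Some(&(i, '.')) = chars.peek() {
        chars.next();
        end = i + 1;
        while let Some(&(i, c)) = chars.peek() {
            let Some(d) = c.to_digit(10) else { break };
            frac = frac.saturating_mul(10).saturating_add(u64::from(d));
            frac_len += 1;
            end = i + c.len_utf8();
            chars.next();
        }
    }

    (int, frac, frac_len, &input[end..])
}
```

Because the split point is always just after an ASCII digit or the point, a following non-ASCII character ends up whole in the returned rest.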

To expand on this a bit. The coreutils are weird regarding utf8. The GNU versions usually do not care at all whether a string is valid. The "correct" (i.e. compatible) solution is usually to use bytes throughout and avoid using str/String altogether. Then for printing in error messages we do a to_string_lossy or something like that. It's a bit messy unfortunately.

Agreed, but it comes back to the previous question: do we want to be compatible with the GNU version even when we go out of the scope of its intended usage, or is it enough to be compatible with the GNU version for its intended usage, and then to be more correct outside of this usage?

Anyway, as far as this parser is concerned, given that it receives a &str as input, I suggest keeping char, which is more readable and should incur no performance penalty, and switching to u8 if at some point you decide to manipulate &[u8] inside of printf, for example. The change would be easy and feel natural then.

do we want to be compatible with the GNU version even when we go out of the scope of its intended usage, or is it enough to be compatible with the GNU version for its intended usage, and then to be more correct outside of this usage?

So far, the answer has been that we do indeed want to be compatible, but of course the intended usage has more priority. It's difficult to tell what even is the intended usage at this point, because the coreutils have accumulated so much functionality. I think the strength of this particular project lies in the compatibility, even though I often wish I could change (i.e. "fix") the behaviour. But I'm always happy to discuss specific cases :)

samueltardieu · 2024-01-09T11:24:33Z

Btw, are different parameters used for configuring Clippy? The Windows run requires that I add a #[allow(clippy::cognitive_complexity)] to the parse function, while I cannot reproduce this on Linux.

github-actions · 2024-01-09T11:48:54Z

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)

samueltardieu · 2024-01-09T14:59:30Z

The failing macos/x86_64 check seems unrelated.

tertsdiepraam

Final few comments I think, just some super minor things. This is excellent and I want to merge it as soon as possible.

src/uucore/src/lib/features/format/num_parser.rs

The parser can parse integral and floating point numbers as expected by the coreutils `printf` command.

The error messages are more compliant with GNU coreutils. Also, floating hexadecimal numbers are now supported in `printf`.

github-actions · 2024-01-10T14:33:55Z

GNU testsuite comparison:

Skipping an intermittent issue tests/tail/inotify-dir-recreate (passes in this run but fails in the 'main' branch)

tertsdiepraam · 2024-01-10T15:33:46Z

Great! Thank you!

samueltardieu marked this pull request as draft January 4, 2024 21:22

samueltardieu force-pushed the printf-compatibility branch from 04a662a to a7897d5 Compare January 4, 2024 22:28

samueltardieu marked this pull request as ready for review January 4, 2024 22:34

samueltardieu force-pushed the printf-compatibility branch 3 times, most recently from 7b9a52e to 22e7380 Compare January 4, 2024 22:51

samueltardieu marked this pull request as draft January 5, 2024 08:17

samueltardieu force-pushed the printf-compatibility branch 2 times, most recently from 5e35d02 to ecfba2a Compare January 5, 2024 08:22

samueltardieu marked this pull request as ready for review January 5, 2024 08:22

samueltardieu marked this pull request as draft January 5, 2024 08:40

samueltardieu force-pushed the printf-compatibility branch from ecfba2a to 9f34735 Compare January 5, 2024 08:59

samueltardieu marked this pull request as ready for review January 5, 2024 08:59

samueltardieu force-pushed the printf-compatibility branch 3 times, most recently from 8bbf8f1 to 54b87f4 Compare January 5, 2024 09:46

sylvestre requested a review from tertsdiepraam January 6, 2024 22:46

tertsdiepraam reviewed Jan 8, 2024

samueltardieu marked this pull request as draft January 8, 2024 09:32

samueltardieu force-pushed the printf-compatibility branch from 54b87f4 to 61309c1 Compare January 8, 2024 14:18

samueltardieu marked this pull request as ready for review January 8, 2024 14:18

samueltardieu force-pushed the printf-compatibility branch 2 times, most recently from d5b83d4 to b7e6488 Compare January 8, 2024 14:30

samueltardieu mentioned this pull request Jan 8, 2024

Printf: Fix printf hex alternate zero #5811

Merged

samueltardieu force-pushed the printf-compatibility branch from b7e6488 to a6f527c Compare January 8, 2024 23:05

format: %c prints the first character of a string

5dfeca9

tertsdiepraam reviewed Jan 9, 2024

samueltardieu force-pushed the printf-compatibility branch 2 times, most recently from 527b015 to d14c464 Compare January 9, 2024 11:13

samueltardieu force-pushed the printf-compatibility branch from d14c464 to 1ec5b85 Compare January 9, 2024 12:22

tertsdiepraam reviewed Jan 10, 2024

samueltardieu added 2 commits January 10, 2024 14:34

format: new dedicated number parser

00cd6fa

The parser can parse integral and floating point numbers as expected by the coreutils `printf` command.

format: use the new number parser and fix the error messages

a85a792

The error messages are more compliant with GNU coreutils. Also, floating hexadecimal numbers are now supported in `printf`.

samueltardieu force-pushed the printf-compatibility branch from 1ec5b85 to a85a792 Compare January 10, 2024 13:35

tertsdiepraam merged commit 0071442 into uutils:main Jan 10, 2024

samueltardieu deleted the printf-compatibility branch January 10, 2024 17:33

tertsdiepraam mentioned this pull request Jan 17, 2024

Printf: Alternative hex fails with a 0 value #5810

Closed

tertsdiepraam mentioned this pull request Feb 7, 2024

Csplit: Add missing precision syntax #5794

Closed
