split: pass GNU tests/b-chunk.sh by zhitkoff · Pull Request #5475 · uutils/coreutils

zhitkoff · 2023-10-29T21:45:00Z

This PR adds:

* a full implementation of the ---io-blksize option
* a refactoring of how splitting into N chunks works for the --number strategies
* handling of the input size for a stdin stream, and of the file size for files under /dev, /proc, /sys and similar locations that either report a file size of 0 while actually having content, or report a size greater than the actual content

It passes GNU tests/b-chunk.sh.

github-actions · 2023-10-29T22:55:20Z

GNU testsuite comparison:

Congrats! The gnu test tests/split/b-chunk is no longer failing!

zhitkoff · 2023-10-30T19:40:52Z

@tertsdiepraam would you mind reviewing this one ?

zhitkoff · 2023-11-01T21:54:38Z

setting back to draft - even though it passes the GNU test, the implementation for ---io-blksize option is incomplete, needs more work

github-actions · 2023-11-02T22:25:10Z

GNU testsuite comparison:

Congrats! The gnu test tests/rm/rm2 is no longer failing!
Congrats! The gnu test tests/split/b-chunk is no longer failing!
Skip an intermittent issue tests/rm/rm1


github-actions · 2023-11-08T00:19:05Z

GNU testsuite comparison:

Congrats! The gnu test tests/split/b-chunk is no longer failing!

github-actions · 2023-11-08T02:34:42Z

GNU testsuite comparison:

Congrats! The gnu test tests/split/b-chunk is no longer failing!

zhitkoff · 2023-11-08T02:53:13Z

@tertsdiepraam @sylvestre this one is ready for review

sylvestre · 2023-11-08T11:12:46Z

src/uu/split/src/split.rs

            None => b'\n',
        };

+        let io_blksize: Option<usize> = if let Some(s) = matches.get_one::<String>(OPT_IO_BLKSIZE) {


What do you think about this?

    let io_blksize: Option<usize> = matches
        .get_one::<String>(OPT_IO_BLKSIZE)
        .map(|s| {
            parse_size_u64(s).ok().and_then(|n| {
                let n = n
                    .try_into()
                    .map_err(|_| SettingsError::InvalidIOBlockSize(s.to_string()))?;
                if n > OPT_IO_BLKSIZE_MAX {
                    Err(SettingsError::InvalidIOBlockSize(s.to_string()))
                } else {
                    Ok(n)
                }
            })
        })
        .transpose()?;

In this case it would "eat" a possible error returned by parse_size_u64(), so it cannot be bubbled up to from() -> Result. Also, since ok() converts it to an Option and and_then() expects an Option to be returned from the closure, it is not straightforward to return the SettingsError from either place within the nested closure. Doing return Err(SettingsError::InvalidIOBlockSize(s.to_string())) or Some(Err(SettingsError::InvalidIOBlockSize(s.to_string()))) would not work: the return applies to the closure rather than to the from() -> Result function, and wrapping in Some() to satisfy the and_then() signature leaves us with Option<Option<Result<>>> as the value returned from the map() closure instead of Option<Result<>>. We could unwrap(), but that would still eat the error from parse_size_u64().
I could be missing something though

please ignore my comment then :(
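
For reference, one way to keep the map(...).transpose() shape without losing errors is to have the closure itself return a Result, so that `?` inside the closure and transpose() outside it agree on the type. The following is only a minimal sketch, not the code that was merged: `parse_size` is a hypothetical stand-in for parse_size_u64, `IO_BLKSIZE_MAX` is an assumed value for OPT_IO_BLKSIZE_MAX (the thread does not show it), and the parse error is converted into a SettingsError rather than propagated verbatim.

```rust
use std::convert::TryFrom;

#[derive(Debug, PartialEq)]
enum SettingsError {
    InvalidIOBlockSize(String),
}

// Assumed cap for illustration; the real OPT_IO_BLKSIZE_MAX is not shown in the thread.
const IO_BLKSIZE_MAX: usize = 1 << 30;

// Hypothetical stand-in for parse_size_u64: accepts plain decimal numbers only.
fn parse_size(s: &str) -> Result<u64, std::num::ParseIntError> {
    s.parse::<u64>()
}

// Option<&str> -> Result<Option<usize>, SettingsError>
fn parse_io_blksize(arg: Option<&str>) -> Result<Option<usize>, SettingsError> {
    arg.map(|s| -> Result<usize, SettingsError> {
        let invalid = || SettingsError::InvalidIOBlockSize(s.to_string());
        // A parse failure becomes a SettingsError instead of being dropped by ok().
        let n = parse_size(s).map_err(|_| invalid())?;
        // On 32-bit targets a u64 may not fit into usize.
        let n = usize::try_from(n).map_err(|_| invalid())?;
        if n > IO_BLKSIZE_MAX {
            Err(invalid())
        } else {
            Ok(n)
        }
    })
    // Option<Result<usize, E>> -> Result<Option<usize>, E>
    .transpose()
}
```

With this shape, a caller inside a function returning Result could write `let io_blksize = parse_io_blksize(arg)?;` and get an `Option<usize>`.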

sylvestre · 2023-11-08T11:16:41Z

src/uu/split/src/split.rs

+        // empty STDIN stream,
+        // and files with true file size 0
+        // will also fit here
+        input_size = num_bytes;


What about an early exit?

Suggested change

-        input_size = num_bytes;
+        Ok(num_bytes)

sylvestre · 2023-11-08T11:17:04Z

src/uu/split/src/split.rs

+        let mut tmp_fd = File::open(Path::new(input))?;
+        let end = tmp_fd.seek(SeekFrom::End(0))?;
+        if end > 0 {
+            input_size = end;


same:

Suggested change

-            input_size = end;
+            Ok(end)

sylvestre · 2023-11-08T11:18:01Z

src/uu/split/src/split.rs

+    let read_limit: u64 = if let Some(n) = io_blksize {
+        *n
+    } else {
+        OPT_IO_BLKSIZE_MAX
+    }
+    .try_into()
+    .unwrap();


Suggested change

-    let read_limit: u64 = if let Some(n) = io_blksize {
-        *n
-    } else {
-        OPT_IO_BLKSIZE_MAX
-    }
-    .try_into()
-    .unwrap();
+    let read_limit = io_blksize.unwrap_or(OPT_IO_BLKSIZE_MAX) as u64;
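
To illustrate why the one-liner is equivalent, here is a small sketch. It assumes that io_blksize is a `&Option<usize>` (as the `*n` in the original suggests) and uses a made-up value for the constant; `read_limit` as a free function is purely illustrative.

```rust
// Assumed value for illustration; the real OPT_IO_BLKSIZE_MAX is not shown in the thread.
const IO_BLKSIZE_MAX: usize = 1024 * 1024;

fn read_limit(io_blksize: &Option<usize>) -> u64 {
    // Option<usize> is Copy, so unwrap_or can be called through the reference.
    // usize -> u64 is lossless on targets where usize is at most 64 bits wide.
    io_blksize.unwrap_or(IO_BLKSIZE_MAX) as u64
}
```

For example, `read_limit(&Some(4096))` yields 4096, while `read_limit(&None)` falls back to the cap.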

sylvestre · 2023-11-08T11:20:33Z

src/uu/split/src/split.rs

+        // could report incorrect file size via `metadata.len()`
+        // either `0` while there is content
+        // or a size larger than actual content
+        input_size = get_irregular_input_size(input, reader, buf, io_blksize)?;


same, early return

Suggested change

-        input_size = get_irregular_input_size(input, reader, buf, io_blksize)?;
+        get_irregular_input_size(input, reader, buf, io_blksize)

sylvestre · 2023-11-08T11:21:33Z

src/uu/split/src/split.rs

+        // Regular file
+        let metadata = metadata(input)?;
+        input_size = metadata.len();
+        // Double check the size if `metadata.len()` reports `0`
+        if input_size == 0 {
+            input_size = get_irregular_input_size(input, reader, buf, io_blksize)?;
+        }


Suggested change

-        // Regular file
-        let metadata = metadata(input)?;
-        input_size = metadata.len();
-        // Double check the size if `metadata.len()` reports `0`
-        if input_size == 0 {
-            input_size = get_irregular_input_size(input, reader, buf, io_blksize)?;
-        }
+        let metadata = metadata(input)?;
+        let size = metadata.len();
+        // If metadata reports size 0, use get_irregular_input_size to double-check
+        if size == 0 {
+            get_irregular_input_size(input, reader, buf, io_blksize)
+        } else {
+            Ok(size)
+        }
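
Taken together, the early-return suggestions amount to letting each branch produce the function's result directly instead of assigning to a mutable input_size. A self-contained sketch of that shape follows; `size_by_reading` is a hypothetical stand-in for get_irregular_input_size and simply counts bytes by reading the file, which may differ from what the PR actually does.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical fallback: determine the size by reading the file to the end.
fn size_by_reading(path: &Path) -> io::Result<u64> {
    let mut file = fs::File::open(path)?;
    // io::copy returns the number of bytes copied into the sink.
    io::copy(&mut file, &mut io::sink())
}

fn input_size(path: &Path) -> io::Result<u64> {
    let size = fs::metadata(path)?.len();
    // A reported size of 0 may be wrong (for example under /proc), so double-check.
    if size == 0 {
        size_by_reading(path)
    } else {
        Ok(size)
    }
}
```

Since every branch is an expression of type `io::Result<u64>`, no mutable accumulator and no trailing `Ok(input_size)` are needed.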

github-actions · 2023-11-08T18:44:31Z

GNU testsuite comparison:

Congrats! The gnu test tests/split/b-chunk is no longer failing!

tertsdiepraam

Impressive work! Seeing that GNU test pass is great!

tertsdiepraam · 2023-11-09T08:09:10Z

src/uu/split/src/split.rs

+// The ---io parameter is consumed and ignored.
+// The parameter is included to make GNU coreutils tests pass.
+static OPT_IO: &str = "-io";


Here's an example from GNU:

❯ split ---
split: option '---io-blksize' requires an argument
Try 'split --help' for more information.
❯ split ---io-bl
split: option '---io-blksize' requires an argument
Try 'split --help' for more information.

This is not a separate option, but one of those cases where a long argument can be shortened to its first few letters if that is unambiguous. clap supports this too and I think it's turned on (it's infer_long_args(true)), so do we need this?

yep, looks like it is the case. will remove
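
The abbreviation rule described above can be sketched without any argument-parsing library. This is not how clap or GNU getopt implement it; `resolve_long` and its error strings are hypothetical and only illustrate "a prefix is accepted when exactly one known long option starts with it".

```rust
/// Resolve a possibly abbreviated long option name (without the leading `--`)
/// against the list of known long option names.
fn resolve_long<'a>(given: &str, known: &[&'a str]) -> Result<&'a str, String> {
    // An exact match always wins, even if it is also a prefix of another option.
    if let Some(&name) = known.iter().find(|&&k| k == given) {
        return Ok(name);
    }
    let candidates: Vec<&'a str> = known
        .iter()
        .copied()
        .filter(|k| k.starts_with(given))
        .collect();
    match candidates.as_slice() {
        [only] => Ok(*only),
        [] => Err(format!("unrecognized option '--{}'", given)),
        _ => Err(format!("option '--{}' is ambiguous", given)),
    }
}
```

With a known list such as `["-io-blksize", "bytes", "lines", "line-bytes"]`, both `-` and `-io-bl` resolve to `-io-blksize` (matching the `split ---` session above), while `li` is rejected as ambiguous.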

tertsdiepraam · 2023-11-09T08:16:41Z

src/uu/split/src/split.rs

-    // of bytes per chunk.
-    //
+    // Get the size of the input in bytes
+    let initial_buf: &mut Vec<u8> = &mut vec![];


Suggested change

-    let initial_buf: &mut Vec<u8> = &mut vec![];
+    let initial_buf = Vec::new();

tertsdiepraam · 2023-11-09T08:20:38Z

src/uu/split/src/split.rs

+    if usize::BITS < 64 {
+        let _: usize = num_chunks
+            .try_into()
+            .map_err(|_| USimpleError::new(1, "Number of chunks too big"))?;


Why this change? The original code makes sense to me. If we need it to fit in a usize why wouldn't we make it a usize?

it made sense for the original code because num_chunks was later used in method calls that require usize like writers.iter_mut().take(num_chunks - 1), and there was a test to test for that error - looks like only for code coverage %. I left it in to keep the test, but looking at it more closely - it does not need to be usize, so I will remove it along with the test that checked for it.

tertsdiepraam · 2023-11-09T08:24:40Z

src/uu/split/src/split.rs

+    if input == "-"
+        || input.starts_with("/dev/")
+        || input.starts_with("/proc/")
+        || input.starts_with("/sys/")


This feels inherently brittle and I'm wondering whether there's a better way to do it. For example, we could have links to all of these filesystems and that should still work. Based on the syscalls they make, I don't think GNU has exceptions like this, though I'm not entirely sure what they are doing instead.

I think they might just try to get the length first and try this other method if that returns 0? I'm not sure.

In GNU they first go straight to reading the input into a buffer, then compare that to what the metadata/filesystem is reporting and do a seek on the file if those disagree (as well as copying things around into a temp file/buffer, etc.), plus some other edge cases. I tried to avoid reading the file into a buffer first for everything, but could re-implement this to follow the GNU approach more closely.

I think doing what GNU does is the right call! Nice investigative work! (As a precaution: remember not to look at the GNU source code)
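
A rough sketch of the "read first, then compare with the reported size" idea, based only on the description in this thread. The function name `probe_size`, its parameters, and the fallback rule are assumptions for illustration; the seek step and the other edge cases mentioned above are deliberately left out.

```rust
use std::io::{self, Read};

/// Read up to `limit` bytes and decide on an input size.
/// Returns the size together with the bytes already consumed, so the caller
/// can still write them out afterwards.
fn probe_size<R: Read>(reader: &mut R, reported: u64, limit: u64) -> io::Result<(u64, Vec<u8>)> {
    let mut buf = Vec::new();
    let read = reader.by_ref().take(limit).read_to_end(&mut buf)? as u64;
    let size = if read < limit {
        // End of input was reached within the limit: the bytes read are the true size,
        // regardless of what the filesystem reported.
        read
    } else {
        // The input has at least `limit` bytes. Assumption: fall back to the reported
        // size, but never to less than what has already been read.
        reported.max(read)
    };
    Ok((size, buf))
}
```

This avoids special-casing paths such as /dev or /proc: an input that reports 0 but has content, or reports more than it contains, is corrected whenever it fits within the read limit.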

should be ready to go

tertsdiepraam · 2023-11-09T08:32:13Z

src/uu/split/src/split.rs

    loop {
+        let chunk_size = chunk_size_base + (chunk_size_reminder > i) as u64;
        let buf: &mut Vec<u8> = &mut vec![];
+        i += 1;


Would this work?

for i in 1.. { ... }
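
The chunk-size expression in the hunk above spreads the remainder of the division over the leading chunks. A minimal sketch of that arithmetic, assuming a zero-based chunk index (the starting value of `i` is not visible in the hunk) and using a hypothetical helper name:

```rust
/// Sizes of `num_chunks` chunks covering `total` bytes, where the first
/// `total % num_chunks` chunks each get one extra byte.
fn chunk_sizes(total: u64, num_chunks: u64) -> Vec<u64> {
    assert!(num_chunks > 0, "num_chunks must be positive");
    let base = total / num_chunks;
    let remainder = total % num_chunks;
    (0..num_chunks)
        // `(remainder > i) as u64` is 1 for the first `remainder` chunks, 0 afterwards.
        .map(|i| base + (remainder > i) as u64)
        .collect()
}
```

For instance, 10 bytes in 3 chunks gives sizes 4, 3, 3, and the sizes always sum to the total. Note that with an index starting at 1, as in `for i in 1..`, the comparison would need to become `remainder >= i` to keep the same distribution.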

github-actions · 2023-11-09T17:56:09Z

GNU testsuite comparison:

Congrats! The gnu test tests/split/b-chunk is no longer failing!
GNU test failed: tests/tail/truncate. tests/tail/truncate is passing on 'main'. Maybe you have to rebase?

github-actions · 2023-11-09T19:09:52Z

GNU testsuite comparison:

Congrats! The gnu test tests/split/b-chunk is no longer failing!

zhitkoff · 2023-11-09T20:04:34Z

@tertsdiepraam @sylvestre
made changes based on review, please let me know if you have any more suggestions/comments

github-actions · 2023-11-13T18:04:19Z

GNU testsuite comparison:

Congrats! The gnu test tests/split/b-chunk is no longer failing!

zhitkoff · 2023-11-17T15:58:39Z

@tertsdiepraam @sylvestre I have another set of changes ready that pass the next GNU test for split, 'l-chunk', but they are based on code included in this PR. Should I add them to this PR, or wait until this one is resolved and then submit them in a new one?

--------- Co-authored-by: Terts Diepraam <terts.diepraam@gmail.com> Co-authored-by: Daniel Hofstetter <daniel.hofstetter@42dh.com> Co-authored-by: Brandon Elam Barker <brandon.barker@gmail.com> Co-authored-by: Kostiantyn Hryshchuk <statheres@gmail.com> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

zhitkoff marked this pull request as draft November 1, 2023 21:53

zhitkoff added 5 commits November 2, 2023 17:27

split: b-chunk refactor for GNU tests

bb86218

split: pass b-chunk GNU test

c9eafc1

split: comments

06ecde5

split: doc comments

4f50dab

split: comments

ad29e42

zhitkoff force-pushed the split-b-chunk branch from ed7c9ad to ad29e42 Compare November 2, 2023 21:27

tertsdiepraam and others added 19 commits November 3, 2023 11:37

cp: remove crash! call

7e81d29

It seems to be unnecessary since we have already made the path relative using `construct_dest_path`.

split: refactor suffix auto-widening and auto-width

d3dca9d

split: suffix auto-widening and auto-width tests

16322e2

split: refactor filename suffix

d823781

split: slash separator

7fb6e10

split: directory separator in additional suffix

5cf4147

deny.toml: remove two entries from skip list

79f648c

rustix & linux-raw-sys

split: rebase regression

143a7c0

split: input size and --number split refactor

4e9fcb0

du: add -P/--no-dereference

e8e1cf6

Add support in uucore for illumos and solaris (uutils#5489)

207db6b

* uucore support for illumos and solaris * use macro to consolidate illumos and solaris signals * fixing some CI issues * replaced macro with better cfg usage

Fix clippy::implicit_clone

8f15b85

chore(deps): update rust crate libc to 0.2.150

ef33865

du: ignore test under Android & FreeBSD

32d60a9

cp,tail: fix warnings in tests on Android

0a31e46

split: refactor filename suffix

0293504

split: changed to_owned to to_string

e938629

Merge branch 'main' into split-b-chunk

50d63c2

split: formatting

c4624aa

zhitkoff added 2 commits November 7, 2023 20:20

split: better 32bit vs 64bit handling + more tests

6e7c2fd

split: tests

2c454a2

zhitkoff marked this pull request as ready for review November 8, 2023 02:52

sylvestre reviewed Nov 8, 2023

View reviewed changes

split: review updates

7acc549

tertsdiepraam reviewed Nov 9, 2023

View reviewed changes

zhitkoff added 2 commits November 9, 2023 12:21

split: updates based on review

b0fa085

split: comments

e7dd254

zhitkoff and others added 2 commits November 9, 2023 13:02

Merge branch 'main' into split-b-chunk

c024b1f

split: formatting

0a08ced

zhitkoff requested review from sylvestre and tertsdiepraam November 9, 2023 20:04

Merge branch 'main' into split-b-chunk

850e5fc

sylvestre merged commit eb00c19 into uutils:main Nov 17, 2023

zhitkoff deleted the split-b-chunk branch December 3, 2023 21:07
