Write tar fixes#2261
Conversation
There are some test failures that must be taken care of.
The failure itself is not the problem, but rather what the right behavior should be. It seems that some errors should be ignored, but other errors maybe not. Do you have guidance? A cop-out would be to add an …
@giuseppe - let me propose the error handling commit in a separate PR, because it evidently needs more discussion.
Force-pushed from 87b1551 to 89d00b2
Force-pushed from 89d00b2 to f7ff038
Thanks; this needs another LGTM, I think. Do you know someone who could provide this?
mtrmac
left a comment
- Do we need to make `TarWithOptionsTo` public? c/storage attempts to have a stable API; every new function is a commitment. See also `NewPatternMatcher`.
- This does not actually substantially improve error handling, because errors in the goroutine are only logged and don’t cause a failure in the consumer.
pkg/archive/archive.go
Outdated
```go
// TarWithOptions creates an archive from the directory at `srcPath`, only including files whose relative
// paths are included in `options.IncludeFiles` (if non-nil) or not in `options.ExcludePatterns`.
func TarWithOptions(srcPath string, options *TarOptions) (io.ReadCloser, error) {
	if _, err := fileutils.NewPatternMatcher(options.ExcludePatterns); err != nil {
```
NewPatternMatcher is not quite free; doing that twice is awkward.
If TarWithOptionsTo were private, we could easily only do it once, and pass the value into the goroutine.
Alternatively, if the pipe writer were modified to use CloseWithError (which, I think, should definitely happen), we don’t need to specifically check here: we can just start the pipe-writer goroutine and receive the pattern matcher error through the pipe.
It's somewhat incompatible, though: the error will only be propagated once reading from the stream starts. If the caller does something like

```go
tarStream, err := Tar(...)
if err != nil {
	return err
}
dest, _ := os.Create("file.tar")
io.Copy(dest, tarStream)
```

then your proposed behavior would now leave "file.tar" around in case of a PatternMatcher failure.
I implemented this. PTAL.
the error will only be propagated once reading on the stream starts
That’s a fair point. There are several ways to structure that in a private function; for the eventual public API, I don’t know.
As an aside, the pattern matcher could be O(1) in the number of patterns by rewriting `NewPatternMatcher("a", "b")` to generate the regexp `(a|b)`.
pkg/archive/archive.go
Outdated
```go
	logrus.Errorf("%s", err)
	return
}
compressWriter, err := CompressStream(pipeWriter, options.Compression)
```
AFAICS compressWriter is ~entirely local to the pipe-writing goroutine; the reading goroutine doesn’t need to know it exists.
You mean: the writing routine (which becomes TarWithOptionsTo in this PR) should also do compression?
It is already doing the compression: compressWriter is a layer of indirection that does the compression within its Write implementation, and compressWriter.Write is ultimately called within the producer goroutine.
(After all, compressWriter is writing to pipeWriter; the caller of TarWithOptions is, most of the time, blocked on reading from pipeReader, and it has no opportunity to write to pipeWriter in the meantime.)
I'm not sure what your original point was. Can you rephrase it?
You mean: the writing routine (which becomes TarWithOptionsTo in this PR) should also do compression?
Yes. TarWithOptionsTo can be called with options.Compression set, and it seems simpler to implement that than to specifically refuse the option (or, worse, to silently ignore it while TarWithOptions implements it).
OTOH that is similar to the PatternMatcher situation, where moving the initialization into the writing goroutine would delay error reporting.
… and would the compression interfere with your plans to use reflinks?
and would the compression interfere with your plans to use reflinks?
Compression, either in the file system or on the tar stream, obviously breaks reflink copying.
Unfortunately, the CompressStream function does more than compression; it also introduces buffering for the Uncompressed case by means of the NewWriteCloserWrapper. AFAICT, the wrapper will hide the ReadFrom method of the underlying writer.
I'm not sure why the buffering is there; the tarWriter has its own buffer.
moby/moby#7542 claims a fairly significant performance improvement, although that’s not directly attributed to exactly these lines.
I don’t see archive.Tar doing all that much buffering.
The moby PR has a bunch of anecdotal data, along with a TarUntar benchmark. The latter looks rather unrepresentative: it creates 100 4-byte files in /tmp and then repeatedly tars them.
I made a quick test that tars a directory on a real file system. The buffering in CompressStream makes a ~10% impact when going from ext4 to ext4:
```
$ TARDEST=$HOME/tardest.tar TARDIR=$HOME/vc/linux go test -run TarLinux -bench TarLinux
2025/03/11 13:41:38 tarring "/home/hanwen/vc/linux"
2025/03/11 13:41:45 7.221876687s
2025/03/11 13:41:55 9.783580435s
2025/03/11 13:42:03 8.324693106s
2025/03/11 13:42:13 9.535414505s
2025/03/11 13:42:21 8.34720174s
2025/03/11 13:42:31 9.622281549s
2025/03/11 13:42:39 8.726733474s
2025/03/11 13:42:48 9.115060653s
2025/03/11 13:42:57 8.690927914s
2025/03/11 13:43:06 9.37279054s
2025/03/11 13:43:06 With 8.34720174s, without 9.535414505s
PASS
ok github.com/containers/storage/pkg/archive 88.748s
```
Surprisingly, with my local hack for zero-transport copy, the buffering (which disables zero-transport) actually makes things go faster:
```
$ TARDEST=$HOME/.cache/tardest.tar TARDIR=$HOME/.cache/linux go test -run TarLinux -bench TarLinux
2025/03/11 13:47:06 tarring "/home/hanwen/.cache/linux"
2025/03/11 13:47:15 9.059872899s
2025/03/11 13:47:30 14.882930582s
2025/03/11 13:47:37 6.968169459s
2025/03/11 13:47:53 16.237515011s
2025/03/11 13:48:00 7.07192259s
2025/03/11 13:48:15 15.151443769s
2025/03/11 13:48:23 7.171133439s
2025/03/11 13:48:38 15.060790794s
2025/03/11 13:48:46 8.301557066s
2025/03/11 13:49:01 14.920596331s
2025/03/11 13:49:01 With 7.171133439s, without 15.060790794s
PASS
ok github.com/containers/storage/pkg/archive 114.837s
```
For tarring larger files, zero copy does bring benefits:
```
$ TARDEST=$HOME/.cache/tardest.tar TARDIR=$HOME/.cache/lib64 go test -run TarLinux -bench TarLinux
2025/03/11 13:54:09 tarring "/home/hanwen/.cache/lib64"
2025/03/11 13:54:12 2.802090087s
2025/03/11 13:54:14 1.640192869s
2025/03/11 13:54:16 2.12680278s
2025/03/11 13:54:18 1.61872927s
2025/03/11 13:54:20 2.151480276s
2025/03/11 13:54:21 1.619020427s
2025/03/11 13:54:24 2.168449473s
2025/03/11 13:54:25 1.584594405s
2025/03/11 13:54:27 2.132707088s
2025/03/11 13:54:29 1.617399592s
2025/03/11 13:54:29 With 2.151480276s, without 1.61872927s
```
It seems I'll need a more sophisticated strategy here (using zero-copy only for large files), and right now there is no good argument for dropping the buffering from CompressStream.
pkg/archive/archive.go
Outdated
```go
// `dest`, only including files whose relative paths are included in
// `options.IncludeFiles` (if non-nil) or not in
// `options.ExcludePatterns`.
func TarWithOptionsTo(dest io.WriteCloser, srcPath string, options *TarOptions) error {
```
Given how this function silently ignores many errors (and, it seems, the plan is to change that), I’m very uncomfortable with making it a public API at this point.
That's a fair point.
I am primarily interested in creating tar files (in the Podman Checkpoint command) without the intermediate buffering introduced by io.Pipe. How should we proceed, then? IIUC, there are a couple of uncontested parts:
- passing errors in the writer through to the reader using CloseWithError
- avoiding writes to TarOptions
and a more hairy change, which is to propagate all or some errors from tarWriter.addFile(). IIUC, you'd rather see us decide what this looks like before making a new API.
FWIW, I think it would be better to export something like archive.tarWriter. Then, some of the functionality of TarOptions could be expressed in code by callers, rather than having 16 options that interact with each other in unpredictable ways.
I think the “uncontested parts” can be merged ~immediately.
I didn’t look at the failures triggered by adding error handling ~at all yet. I understand that they matter to you (don’t they?), so they block the overall effort anyway. I think that’s the next step.
And making the API public would follow, or be a part of the “add error handling” PR.
I think it would be better to export something like
archive.tarWriter.
I have ~no opinion on that in principle, I’d defer to actual c/storage maintainers.
At a high level, I’m a bit unclear on what is the point of a granular API for external consumers. archive.Tar, sure, a very simple one-shot call. archive.TarWithOptions, fine, for various tunables (although many would be pretty specific to needs of c/storage callers).
For the individual tarWriter.addFile… I’m tempted to question whether there are callers specialized enough that using just archive.TarWithOptions is insufficient, but not so much that they would want to use the standard-library archive/tar.Writer, and whether the intersection of these conditions is enough to worry about.
Then, some of the functionality of TarOptions could be expressed in code by callers
I don’t know what this means. Do you mean some other parameter-passing convention (e.g. “functional options”?) Using ordinary function parameters instead of a struct? I’d be rather opposed to the latter.
I am primarily interested in creating tar files (in the Podman Checkpoint command) without the intermediate buffering introduced by io.Pipe.
I’m curious, does that have a measurable performance impact?
I understand that they matter to you (don’t they?),
The failures in the test suite (caused by ignoring errors) don't matter to me. What matters to me is that in case of any error, the tar creation fails.
they would want to use the standard-library archive/tar.Writer, and whether the intersection of these conditions is enough to worry about.
The archive/tar package falls short because it doesn't connect the file system to the Writer/Reader, which requires handling, among other things:
- extended attributes
- file types (mkdir vs mknod vs creat)
- hard-links
Note that the standard-library archive/tar.Writer is structured as a type with methods, offering greater flexibility to callers, rather than as a function with a zillion options.
I’m curious, does that have a measurable performance impact?
See containers/podman@40e49cc and containers/podman#24826 (comment)
For writing large files to BTRFS, there is an effective speedup of up to 1000x.
Force-pushed from f7ff038 to ff06e02
Force-pushed from ff06e02 to 6e03dd5
Ping. @giuseppe
A nil and an empty slice are equivalent, except in the resulting JSON serialization. There is JSON serialization in applyLayerHandler(), but it is internal to the storage package and therefore not subject to compatibility promises. Signed-off-by: Han-Wen Nienhuys <hanwen@engflow.com>
Before, this broke the convention that options passed by reference are not mutated. In particular, TestOverlayTarUntar passes the same object to both TarWithOptions and Untar, resulting in a race-detector failure. Signed-off-by: Han-Wen Nienhuys <hanwen@engflow.com>
Gives a name to the anonymous goroutine populating the archive, and makes parameter dependencies explicit, in preparation for moving it out. No functional changes. Signed-off-by: Han-Wen Nienhuys <hanwen@engflow.com>
Force-pushed from 6e03dd5 to d39dba2
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: giuseppe, hanwen-flow.
No description provided.