chunked: allow conversion without zstd compression #2343
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: giuseppe
Force-pushed f593118 to f6a2815
mtrmac left a comment
Just an extremely brief skim…
This might well make sense as an intermediate step. For full performance, wouldn't we want to entirely avoid copyAllBlobToFile + convertTarToZstdChunked?
“Just” stream the incoming tar file as it comes; for every regular file, store it somewhere inside the staging destination, checksum it, and either hardlink/reflink it with a pre-existing one, or move it in place if it is entirely new. (Yes, that would be a third implementation of setting file metadata, along with pkg/archive (maybe there are even more??), and we’d really want to consolidate them.)
Alternatively, after #2325 , wouldn’t it be fairly easy to add chunked-like pre-staging of layer contents without holding locks to the regular pull path? And if we did that, could we teach pkg/archive to build the composefs structures ~directly? I think that (“just”, again) requires computing MeasureVerity, ordinary digests for RegularFilePathForValidatedDigest, and the tar header metadata.
Unlike truly-chunked pulls, on the “convert” path (and, also, for https://github.com/containers/image/issues/2792 ), there’s not that much benefit in trying not to read files, so the hole / roll-sum data computation can, I think (?) be entirely avoided; we only want to match entire files, by digest, for reflinks/hardlinks.
I know, “just”…
I've done some benchmarking, and the cost of writing the tarball is minimal compared to computing the checksum of each file/chunk. It is still useful to compute the chunks, because we fill the cache with this information and future downloads can use these chunks. I agree it would look cleaner to work directly on the tar stream, but the implementation cost is much higher because we would need to duplicate a lot of code paths.
Force-pushed f6a2815 to a7455c2
```go
if part != nil {
	limit := mf.CompressedSize
	// If we are reading from a source file, use the uncompressed size to limit the reader, because
	// the compressed size refers to the original layer stream.
	if missingPart.OriginFile != nil && partCompression == fileTypeNoCompression {
		limit = mf.UncompressedSize
	}
	c.rawReader = io.LimitReader(part, limit)
}
```
I’m generally unhappy with the complexity. I feel that there must be some way to express this simpler.
I think that conflating compressedFileType to mean both “format of the file” and “origin+format of this chunk” is confusing, although it seems somewhat convenient for this PR.
We have the if c.rawReader != nil code to discard remaining data even for OriginFile chunks, where we don’t actually need to do that.
This part, which duplicates the OriginFile condition, could certainly be done without repeating it, e.g. by setting an extra readingFromALocalFile boolean in the case missingPart.OriginFile != nil section.
As a possible starting point, is there any need to split the prepareCompressedStreamToFile / appendCompressedStreamToFile code into two parts, and to interleave the openDestinationFile in the middle? I am quite possibly missing something, but I don’t see why that is necessary. And if it is not, combining the two …StreamToFile into a single function with a single switch would be simpler.
I’m afraid I can’t predict the full scope of how the refactoring could be done; I can only see one or two steps ahead.
Thanks, that’s interesting.
Wait … we are writing this not-really-zstd data to durable storage as …
Flush() is only called before Close(), so it has no effect. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Force-pushed a7455c2 to 92ce27a
mtrmac left a comment
I’d really like storeMissingFiles to be simplified further, but I guess with the c.rawReader = io.LimitReader consolidation this PR is, on net, not making things worse.
Force-pushed f5b395e to b3ada72
The reader for the part is now limited before calling prepareCompressedStreamToFile(). Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
a new function NoCompression() is added to provide a way to create uncompressed zstd:chunked files. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
This commit introduces the capability to convert tar layers to the zstd:chunked format, without performing any zstd compression. This allows using zstd:chunked without the cost of compression and decompression. Closes: https://issues.redhat.com/browse/RUN-3056 Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Force-pushed b3ada72 to 87c6994
mtrmac left a comment
LGTM, with a bit of a heavy heart about the complexity of storeMissingFiles.
/lgtm