Fix estargz compression loses the original tar metadata (#2352)
AkihiroSuda merged 1 commit into moby:master
Conversation
 	type conversion struct {
 		target compression.Type
-		decompress func(io.Reader) (cdcompression.DecompressReadCloser, error)
+		decompress func(context.Context, ocispecs.Descriptor) (io.ReadCloser, error)
change it to ReaderAt or ReadSeeker instead. We should be able to define conversions for all supported compressions as a single array and then unit test them with random (tarball) data making sure that compression is lossless.
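The round-trip unit test suggested here can be sketched with the standard library alone (the names below are illustrative, not BuildKit's actual test code): compress random bytes standing in for a tarball, decompress, and verify the digest is unchanged.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"fmt"
	"io"
	"math/rand"
)

// roundTrip compresses blob, decompresses the result, and reports
// whether the decompressed bytes are identical to the input.
func roundTrip(blob []byte) (bool, error) {
	var compressed bytes.Buffer
	zw := gzip.NewWriter(&compressed)
	if _, err := zw.Write(blob); err != nil {
		return false, err
	}
	if err := zw.Close(); err != nil {
		return false, err
	}
	zr, err := gzip.NewReader(&compressed)
	if err != nil {
		return false, err
	}
	decompressed, err := io.ReadAll(zr)
	if err != nil {
		return false, err
	}
	return sha256.Sum256(blob) == sha256.Sum256(decompressed), nil
}

func main() {
	// Random data stands in for an arbitrary tarball.
	blob := make([]byte, 1<<16)
	rand.New(rand.NewSource(1)).Read(blob)
	ok, err := roundTrip(blob)
	if err != nil {
		panic(err)
	}
	fmt.Println(ok) // a lossless codec always prints true
}
```

The same harness generalizes to a table of compress/decompress pairs, one entry per supported compression type.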
was there an issue with this?
> change it to ReaderAt or ReadSeeker instead.
IIUC cdcompression.DecompressReadCloser is not seekable. So if we want to make it seekable, we need to change the current decompress implementation to store the decompressed blob somewhere seekable (e.g. the content store) and then return the ReadSeeker. Does this change SGTY?
Force-pushed 43b45b4 to d90d0fa
Fixed to use the lossless compressor/decompressor of estargz (ktock/stargz-snapshotter@f5999ba).
You don't seem to be using the tar-split parser there for the encode phase. Therefore, are you sure it works? It might work fine if the initial tarball was created by (the same version of) Go, as the encoding in estargz and the original tar would match. But try it with a tarball from GNU tar, for example. Tar allows padding etc., so there are many ways the same files could end up encoded into a tarball.
It just streams the original tar into the destination writer using
It's tested against three formats of tar (https://github.com/ktock/stargz-snapshotter/blob/f5999bafbd01e69a54adb7842d476afbc9eee52d/estargz/testutil.go#L621), but that uses the Go library, so I'll try with tools outside of Go.
Interesting. It looks like the gzip writer is switched in the parser function https://github.com/ktock/stargz-snapshotter/blob/f5999bafbd01e69a54adb7842d476afbc9eee52d/estargz/estargz.go#L804 but the actual write happens in the
Fixed to use tar-split (https://github.com/ktock/stargz-snapshotter/blob/aac1cee508136d135d55d9f51629bf9b493e048f/estargz/estargz.go#L817).
Force-pushed 04e0307 to ce62cbb
Force-pushed ce62cbb to 5ee07ef
cache/estargz.go (Outdated)
 		return false
 	}
 	defer r.Close()
 	_, _, err = estargz.OpenFooter(io.NewSectionReader(r, 0, r.Size()))
Could we leave a label on the blob after this has been determined for the first time, and then avoid opening the blob after that?
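The suggested labeling could look roughly like this (hypothetical names; the real implementation would persist the label in the cache record's metadata rather than an in-memory map): memoize the expensive footer check per digest so each blob is opened at most once.

```go
package main

import (
	"fmt"
	"sync"
)

// blobKind memoizes an expensive per-blob check (such as parsing the
// eStargz footer) under a label keyed by digest.
type blobKind struct {
	mu     sync.Mutex
	labels map[string]bool // digest -> "is eStargz?"
}

func (b *blobKind) isEstargz(digest string, check func() bool) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if v, ok := b.labels[digest]; ok {
		return v // label already set: skip opening the blob
	}
	v := check()
	b.labels[digest] = v
	return v
}

func main() {
	calls := 0
	bk := &blobKind{labels: map[string]bool{}}
	expensive := func() bool { calls++; return true } // stands in for OpenFooter

	bk.isEstargz("sha256:abc", expensive)
	bk.isEstargz("sha256:abc", expensive) // served from the label
	fmt.Println(calls)                    // the blob was inspected once
}
```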
cache/manager_test.go (Outdated)
 	require.NoError(t, err, testName)
 	}

 	// Check the uncompressed digest is the same as the original

 	// Note that we don't support eStargz compression for a tar that contains a file named
 	// `stargz.index.json` because we cannot create eStargz in a lossless way for such a blob
 	// (we must overwrite the stargz.index.json file).
 	if err := w.AppendTarLossLess(pr); err != nil {
Bit worried that other tools can still create unsupported tars. Can that other API be removed? What is the behavior when such tarballs are pulled? Maybe they should fail the isEstargz check?
The cases where an unsupported tar (= one that contains stargz.index.json) reaches here are the following:
- invalid eStargz: the source blob is a broken (or malformed) eStargz that contains stargz.index.json but has an invalid footer. In this case, the blob is recognized as a normal gzip blob and decompressEStargz is not applied.
- name conflict: the source blob is a non-eStargz blob (tar/gzip/zstd) but contains a stargz.index.json which is not a TOC.

(If the source blob is an eStargz, the decompressor (decompressEStargz) removes the TOC entry before appending, so AppendTarLossLess won't return the error.)
I think these cases are rare, so it's reasonable to return an error here.
Yes, I wasn't thinking of the error case, but rather that AppendTarLossLess is an additional API and AppendTar still remains. So other tools can create estargz that does not decompress properly. Can we make this a requirement for the stargz lib, and what is the behavior when we pull such a blob that was not created with a lossless encoder?
> other tools can create estargz that does not decompress properly.
If the client converts a plain-gzip image A to eStargz B on their side with tools other than BuildKit, B doesn't hold the original bits, so it's treated as a different image than A when pulled into BuildKit. When BuildKit converts B into a tar image, the produced tar is different from what is contained in A.
> what is the behavior when we pull such blob that was not created with a lossless encoder.
Could you elaborate on the problem of the conversion happening outside of BuildKit? Tools other than BuildKit mostly need to change the tar bits to enable optimization, which sorts tar entries in priority order and adds some landmark dummy files. I think the BuildKit cache should treat such blobs as different from the original.
Is there any reason other tools couldn't just also use the lossless method? Why can't this be the default?
Is it still correct that isEstargz detects lossy estargz as estargz, or should it treat it as regular gzip and try to recompress with the lossless method?
go.mod (Outdated)
 	gotest.tools/v3 v3.0.3 // indirect
 )

 require github.com/vbatts/tar-split v0.11.2 // indirect
this looks to be outside the block
go.mod (Outdated)
-	github.com/golang/groupcache v0.0.0-20200121045136-8c9f03a8e57e // indirect
+	github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect
 	github.com/golang/protobuf v1.5.2
 	// snappy: updated for go1.17 support
Make `make vendor` do this automatically. Maybe we need a replace directive if we want to lock the snappy version.
Force-pushed 971e364 to cb61cda
tonistiigi left a comment:
Overall LGTM but I think we need to wait for #2348 to sort out the go.mod issues and rebase then.
@ktock needs rebase
Force-pushed cb61cda to 63576c2
Currently, eStargz compression doesn't preserve the original tar metadata (header bytes and their order). This causes failure of `TestGetRemote` because an uncompressed blob converted from a gzip blob yields a different digest than the one converted from an eStargz blob, even when their original tars (computed by the differ) are the same. This commit solves this issue by fixing eStargz to preserve the original tar metadata that eStargz would otherwise modify. Signed-off-by: Kohei Tokunaga <ktokunaga.mail@gmail.com>
Force-pushed 63576c2 to da821a4
@AkihiroSuda sgty?
Currently, eStargz compression doesn't preserve the original tar metadata (header bytes and their order).
This causes failure of TestGetRemote because an uncompressed blob converted from a gzip blob yields a different digest than the one converted from an eStargz blob, even when their original tars (computed by the differ) are the same.
This commit solves this issue by fixing the eStargz compressor to preserve the original tar metadata that eStargz modifies.
The metadata is saved on the content store and used for decompression when the eStargz is converted into another compression type. (EDIT: fixed not to use the content store)