-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Erofs snapshotter and differ #10705
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Erofs snapshotter and differ #10705
Conversation
7b9dbc7 to
ccc50bf
Compare
f57841a to
e9eed08
Compare
e9eed08 to
879e392
Compare
626a4ca to
ec82d81
Compare
ktock
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests will be needed.
plugins/diff/erofs/differ.go
Outdated
| return "", fmt.Errorf("invalid filesystem type for erofs differ: %w", errdefs.ErrNotImplemented) | ||
| } | ||
| // If the layer is not prepared by the EROFS snapshotter, fall back to the next differ | ||
| if _, err := os.Stat(filepath.Join(layer, ".erofslayer")); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks difficult to maintain... I think applier shouldn't be aware of internal directory structure of a snapshotter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think applier shouldn't be aware of internal directory structure of a snapshotter.
For generic differs I think, yes. But this differ is only useful together with erofs snapshotter.
All cases are as below:
erofs snapshot layer + erofs differ : convert OCI media types (or erofs native blobs later)
erofs snapshot layer + other differs (walking for example) : just unpack to fs and behaves as overlay;
other snapshot layer (e.g. overlay or remote snapshotter) + erofs differ: move to the next differ;
other snapshot layer + other differ : don't care.
This looks difficult to maintain...
Why? Basically it needs a way to identify a layer prepared by erofs snapshotter to ensure the safety.
Do you have better ideas for this requirement?..
I've checked the previous windows differs, but they seems they just use a special mount structure. But I think this method is more clean compared to that since the mount structure can be mounted directly (so that erofs snapshotter can be used with walking differ) rather than a difficult-to-identify mount structure...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...
This looks difficult to maintain...
Why? Basically it needs a way to identify a layer prepared by erofs snapshotter to ensure the safety. Do you have better ideas for this requirement?.. I've checked the previous windows differs, but they seems they just use a special mount structure. But I think this method is more clean compared to that since the mount structure can be mounted directly (so that erofs snapshotter can be used with walking differ) rather than a difficult-to-identify mount structure...
Derek has an ongoing mount manager which could help us to clean up this with new mount[] structure too, I think after this work lands, EROFS snapshotter could have a more clean way then.
But I guess they could be done in parallel? I do hope if you could agree..
I will try to add tests to run erofs snapshotter later. |
03fc459 to
0c2285b
Compare
djdongjin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Hi @AkihiroSuda, thanks for pinging. I could summerize the benefits of
I do hope the kernel features above could be fully leveraged in a dedicated way for container ecosystem, that is why I worked on this. I do hope it can be upstreamed into containerd since it has been pending for almost two years since it was proposed. |
|
Sorry for late reply. It's on my list. Will take a look in this week. |
dmcgowan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to see the first version simplified a bit. If we have the merging implementation working, we should stick with that for the first version and add the overlayed erofs layers later on. We could also avoid mixing erofs layers and non-erofs layers by converting on Commit.
Let me know if that really won't work well or the merging is not the preferred solution. We still have time to make improvements and add features to this before the next release.
|
|
||
| _, imagePull = info.Labels[targetSnapshotLabel] | ||
| if !imagePull && s.mergeFsMeta { | ||
| s.GenerateFsMeta(ctx, snap) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just do this on commit? Create snapshot should be very fast. Also this could be done outside the transaction then renamed inside.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Derek, thanks for the suggestion.
The merged fsmeta generation is triggered by the non-unpacking snapshot preparation ("containerd.io/snapshot.ref").
Actually we don't want to generate merged fsmeta on every commited layer (otherwise it will impact performance, I don't want to generate fsmeta all the time), but only for all the parent layers if some layer is not using for unpacking. And fsmeta is kept into the runable layer, not every commited layers.
| }}, nil | ||
| } | ||
|
|
||
| // generate a metadata-only EROFS fsmeta.erofs if the parent layers are all EROFS blobs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allowing some parent layers to not be EROFS blobs makes this snapshotter more complicated.
| } | ||
|
|
||
| func (s *snapshotter) Commit(ctx context.Context, name, key string, opts ...snapshots.Opt) error { | ||
| return s.ms.WithTransaction(ctx, true, func(ctx context.Context) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think before we enter this transaction is the best time to generate the fsmeta blob or convert the flat directory to erofs (to avoid having a mix of erofs and non-erofs snapshots).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've explained why generating the fsmeta blob here is ineffiencient above.
I think it's reasonable to convert the flat directory to erofs, let me try this.
| ovlOptions []string | ||
|
|
||
| // If fsmerge feature is enabled or not | ||
| mergeFsmeta bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the performance like for this? In the initial version I think it would be best to have this as the default and possibly not even include the loopback version yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's mainly used to avoid too many erofs layered mounts and used for secure container virtio-blk passthrough. In other words, it's mainly be used as an advanced feature for now.
If fsmeta merging generation is stayed into Prepare(), it will almost the same performance number against mount one-by-one;
If fsmeta merging generation is moved into Commit(), it will be performed bad against mount one-by-one.
I could reconfirm the exact performance number, but if you think fsmerge feature is still somewhat unclean, I could just drop this feature for now and only leave a very basic version without fsmerge feature.
Yes, it's quite a clever idea to convert non-erofs layer to erofs layer on Commit (although non-erofs layer conversion will make erofs snapshotter together with the walking differ performs longer than the overlayfs snapshotter, but I could see some benefit is that mixing overlay dir is bad for VM passthrough use cases...)
Although "merging is not the preferred solution" but it needs to be generated for Active Snapshot for running containers later, rather than snapshots for unpacking.
Yeah, agreed, thanks! |
The EROFS differ only applies to EROFS layers which are marked by a special file `.erofslayer` generated by the EROFS snapshotter. Why it's needed? Since we'd like to parse []mount.Mount directly without actual mounting and convert OCI layers into EROFS blobs, `.erofslayer` gives a hint that the active snapshotter supports the output blob generated by the EROFS differ. I'd suggest it could be read together with the next commit. Signed-off-by: cardy.tang <zuniorone@gmail.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
1dae4f0 to
66f5a49
Compare
It allows us to mount each EROFS blob layer (generated by the EROFS differ) independently, or use the "unpacked" fs/ directories (if some other differ is used.) Currently, it's somewhat like the overlay snapshotter, but I tend to separate the new EROFS logic into a self-contained component, rather than keeping it tangled in the very beginning. Existing users who use the overlay snapshotter won't be impacted at all but they have a chance to use this new snapshotter to leverage the EROFS filesystem. Signed-off-by: cardy.tang <zuniorone@gmail.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Some basic tests for now. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
66f5a49 to
2f15d65
Compare
|
Just a general question, is this snapshotter planned to be in-tree? |
Yes. At least that is the plan on my side, which will provide EROFS kernel features to users much like other in-tree snapshotters. |
mythi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be interested in this from Kata/CoCo perspective. I did an initial check of the PR and test runs with runc.
|
|
||
| ``` | ||
| [plugins."io.containerd.service.v1.diff-service"] | ||
| default = ["erofs","walking"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm seeing this warning in the logs:
WARN[2025-01-31T14:27:36.659539529+02:00] multiple differs match for platform, set `differ` option to choose, skipping "walking"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems it relates to transfer service, but I don't really know the details too.
EROFS snapshotter can only work effectively with the related differ, so just
leaving "default = ["erofs"]" is fine in practice.
| in the very beginning, it'd be better to be left as an independent snapshotter | ||
| so that existing overlayfs users won't be impacted by the new behaviors and | ||
| users could have a chance to try and develop the related ecosystems (such as | ||
| ComposeFS, confidential containers, gVisor, Kata, gVisor, and more) together. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm interested in the "confidential containers" case. How the guest would know the snapshotter prepared everything correctly (and the image integrity is preserved)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some changes should be made in the kata-shim, currently I don't have enough time to prepare a optimized workable implementation for this, but after this lands, I will try to sort out these.
|
|
||
| - Native EROFS layers can be pulled from registries without conversion. | ||
|
|
||
| For VM containers, the EROFS snapshotter can efficiently pass through and share |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the VM based use-case, it feels to me this has some overlaps with the blockfile snapshotter too.
Is there currently a way to get this tried out with Kata based setup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in principle, the blockfile approach is ineffective if underlay fs doesn't support reflink.
Also I'd like to use vhost-user protocol (+ a merged flatten device) to pass through the container image for VM-based containers, rather than generating some loopback device on the host, which is ineffective too.
Is there currently a way to get this tried out with Kata based setup?
No, as I said above. I will not work out any further since I don't have enough time (and it's also useless) to complete all the use cases before upstreaming.
Personally I think it's a very effective approach for VM-based image pass through. but currently I don't really have a promising reason to persuade our development dept. to put more development resource on it, esp. before upstreaming.
| In order to leverage EROFS-formatted blobs, the EROFS differ is needed to be | ||
| used together to apply image layers. Otherwise, the EROFS snapshotter will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAUI, this handles the whiteouts also. perhaps add a note about it (and how it's done (in mkfs.erofs?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAUI, this handles the whiteouts also. perhaps add a note about it (and how it's done (in
mkfs.erofs?).
I could try to mention it a bit, but that is an implementation detail, since it's a real necessary part to support OCI images. --aufs option in mkfs.erofs will handle AUFS-like whiteouts and opaques properly.
dmcgowan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get this in, there is time to test and iterate on it before the 2.1 release
containerd 2.1.0 Welcome to the v2.1.0 release of containerd! The first minor release of containerd 2.x focuses on continued stability alongside new features and improvements. This is the first time-based released for containerd. Most the feature set and core functionality has long been stable and harderened in production environments, so now we transition to a balance of timely delivery of new functionality with the same high confidence in stability and performance. * Add no_sync option to boost boltDB performance on ephemeral environments ([containerd#10745](containerd#10745)) * Add content create event ([containerd#11006](containerd#11006)) * Erofs snapshotter and differ ([containerd#10705](containerd#10705)) * Update CRI to use transfer service for image pull by default ([containerd#8515](containerd#8515)) * Support multiple cni plugin bin dirs ([containerd#11311](containerd#11311)) * Support container restore through CRI/Kubernetes ([containerd#10365](containerd#10365)) * Add OCI/Image Volume Source support ([containerd#10579](containerd#10579)) * Enable Writable cgroups for unprivileged containers ([containerd#11131](containerd#11131)) * Fix recursive RLock() mutex acquisition ([containerd/go-cni#126](containerd/go-cni#126)) * Support CNI STATUS Verb ([containerd/go-cni#123](containerd/go-cni#123)) * Retry last registry host on 50x responses ([containerd#11484](containerd#11484)) * Multipart layer fetch ([containerd#10177](containerd#10177)) * Enable HTTP debug and trace for transfer based puller ([containerd#10762](containerd#10762)) * Add support for unpacking custom media types ([containerd#11744](containerd#11744)) * Add dial timeout field to hosts toml configuration ([containerd#11106](containerd#11106)) * Expose Pod assigned IPs to NRI plugins ([containerd#10921](containerd#10921)) * Support multiple uid/gid mappings ([containerd#10722](containerd#10722)) * Fix race between serve and immediate shutdown on the server ([containerd/ttrpc#175](containerd/ttrpc#175)) * Update FreeBSD defaults and re-organize platform defaults ([containerd#11017](containerd#11017)) * Postpone cri config deprecations to v2.2 ([containerd#11684](containerd#11684)) * Remove deprecated dynamic library plugins ([containerd#11683](containerd#11683)) * Remove the support for Schema 1 images ([containerd#11681](containerd#11681)) Please try out the release binaries and report any issues at https://github.com/containerd/containerd/issues. * Derek McGowan * Phil Estes * Akihiro Suda * Maksym Pavlenko * Jin Dong * Wei Fu * Sebastiaan van Stijn * Samuel Karp * Mike Brown * Adrien Delorme * Austin Vazquez * Akhil Mohan * Kazuyoshi Kato * Henry Wang * Gao Xiang * ningmingxiao * Krisztian Litkey * Yang Yang * Archit Kulkarni * Chris Henzie * Iceber Gu * Alexey Lunev * Antonio Ojea * Davanum Srinivas * Marat Radchenko * Michael Zappa * Paweł Gronowski * Rodrigo Campos * Alberto Garcia Hierro * Amit Barve * Andrey Smirnov * Divya * Etienne Champetier * Kirtana Ashok * Philip Laine * QiPing Wan * fengwei0328 * zounengren * Adrian Reber * Alfred Wingate * Amal Thundiyil * Athos Ribeiro * Brian Goff * Cesar Talledo * ChengyuZhu6 * Chongyi Zheng * Craig Ingram * Danny Canter * David Son * Fupan Li * HirazawaUi * Jing Xu * Jonathan A. Sternberg * Jose Fernandez * Kaita Nakamura * Kohei Tokunaga * Lei Liu * Marco Visin * Mike Baynton * Qiyuan Liang * Sameer * Shiming Zhang * Swagat Bora * Teresaliu * Tony Fang * Tõnis Tiigi * Vered Rosen * Vinayak Goyal * bo.jiang * chriskery * luchenhan * mahmut * zhaixiaojuan * **github.com/Microsoft/hcsshim** v0.12.9 -> v0.13.0-rc.3 * **github.com/cilium/ebpf** v0.11.0 -> v0.16.0 * **github.com/containerd/cgroups/v3** v3.0.3 -> v3.0.5 * **github.com/containerd/containerd/api** v1.8.0 -> v1.9.0 * **github.com/containerd/continuity** v0.4.4 -> v0.4.5 * **github.com/containerd/go-cni** v1.1.10 -> v1.1.12 * **github.com/containerd/imgcrypt/v2** v2.0.0-rc.1 -> v2.0.1 * **github.com/containerd/otelttrpc** ea5083fda723 -> v0.1.0 * **github.com/containerd/platforms** v1.0.0-rc.0 -> v1.0.0-rc.1 * **github.com/containerd/ttrpc** v1.2.6 -> v1.2.7 * **github.com/containerd/typeurl/v2** v2.2.2 -> v2.2.3 * **github.com/containernetworking/cni** v1.2.3 -> v1.3.0 * **github.com/containernetworking/plugins** v1.5.1 -> v1.7.1 * **github.com/containers/ocicrypt** v1.2.0 -> v1.2.1 * **github.com/davecgh/go-spew** d8f796af33cc -> v1.1.1 * **github.com/fsnotify/fsnotify** v1.7.0 -> v1.9.0 * **github.com/go-jose/go-jose/v4** v4.0.4 -> v4.0.5 * **github.com/google/go-cmp** v0.6.0 -> v0.7.0 * **github.com/grpc-ecosystem/grpc-gateway/v2** v2.22.0 -> v2.26.1 * **github.com/klauspost/compress** v1.17.11 -> v1.18.0 * **github.com/mdlayher/socket** v0.4.1 -> v0.5.1 * **github.com/moby/spdystream** v0.4.0 -> v0.5.0 * **github.com/moby/sys/user** v0.3.0 -> v0.4.0 * **github.com/opencontainers/image-spec** v1.1.0 -> v1.1.1 * **github.com/opencontainers/runtime-spec** v1.2.0 -> v1.2.1 * **github.com/opencontainers/selinux** v1.11.1 -> v1.12.0 * **github.com/pelletier/go-toml/v2** v2.2.3 -> v2.2.4 * **github.com/petermattis/goid** 4fcff4a6cae7 **_new_** * **github.com/pmezard/go-difflib** 5d4384ee4fb2 -> v1.0.0 * **github.com/prometheus/client_golang** v1.20.5 -> v1.22.0 * **github.com/prometheus/common** v0.55.0 -> v0.62.0 * **github.com/sasha-s/go-deadlock** v0.3.5 **_new_** * **github.com/smallstep/pkcs7** v0.1.1 **_new_** * **github.com/stretchr/testify** v1.9.0 -> v1.10.0 * **github.com/tchap/go-patricia/v2** v2.3.1 -> v2.3.2 * **github.com/urfave/cli/v2** v2.27.5 -> v2.27.6 * **github.com/vishvananda/netlink** v1.3.0 -> 0e7078ed04c8 * **github.com/vishvananda/netns** v0.0.4 -> v0.0.5 * **go.etcd.io/bbolt** v1.3.11 -> v1.4.0 * **go.opentelemetry.io/auto/sdk** v1.1.0 **_new_** * **go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc** v0.56.0 -> v0.60.0 * **go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp** v0.56.0 -> v0.60.0 * **go.opentelemetry.io/otel** v1.31.0 -> v1.35.0 * **go.opentelemetry.io/otel/exporters/otlp/otlptrace** v1.31.0 -> v1.35.0 * **go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc** v1.31.0 -> v1.35.0 * **go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp** v1.31.0 -> v1.35.0 * **go.opentelemetry.io/otel/metric** v1.31.0 -> v1.35.0 * **go.opentelemetry.io/otel/sdk** v1.31.0 -> v1.35.0 * **go.opentelemetry.io/otel/trace** v1.31.0 -> v1.35.0 * **go.opentelemetry.io/proto/otlp** v1.3.1 -> v1.5.0 * **golang.org/x/crypto** v0.28.0 -> v0.36.0 * **golang.org/x/exp** aacd6d4b4611 -> 2d47ceb2692f * **golang.org/x/mod** v0.21.0 -> v0.24.0 * **golang.org/x/net** v0.30.0 -> v0.38.0 * **golang.org/x/oauth2** v0.22.0 -> v0.27.0 * **golang.org/x/sync** v0.8.0 -> v0.14.0 * **golang.org/x/sys** v0.26.0 -> v0.33.0 * **golang.org/x/term** v0.25.0 -> v0.30.0 * **golang.org/x/text** v0.19.0 -> v0.23.0 * **golang.org/x/time** v0.3.0 -> v0.7.0 * **google.golang.org/genproto/googleapis/api** 5fefd90f89a9 -> 56aae31c358a * **google.golang.org/genproto/googleapis/rpc** 324edc3d5d38 -> 56aae31c358a * **google.golang.org/grpc** v1.67.1 -> v1.72.0 * **google.golang.org/protobuf** v1.35.1 -> v1.36.6 * **k8s.io/api** v0.31.2 -> v0.32.3 * **k8s.io/apimachinery** v0.31.2 -> v0.32.3 * **k8s.io/apiserver** v0.31.2 -> v0.32.3 * **k8s.io/client-go** v0.31.2 -> v0.32.3 * **k8s.io/cri-api** v0.31.2 -> v0.32.3 * **k8s.io/kubelet** v0.31.2 -> v0.32.3 * **k8s.io/utils** 18e509b52bc8 -> 3ea5e8cea738 * **sigs.k8s.io/json** bc3834ca7abd -> 9aa6b5e7a4b3 * **sigs.k8s.io/structured-merge-diff/v4** v4.4.1 -> v4.4.2 * **tags.cncf.io/container-device-interface** v0.8.0 -> v1.0.1 * **tags.cncf.io/container-device-interface/specs-go** v0.8.0 -> v1.0.0 Previous release can be found at [v2.0.0](https://github.com/containerd/containerd/releases/tag/v2.0.0) * `containerd-<VERSION>-<OS>-<ARCH>.tar.gz`: ✅Recommended. Dynamically linked with glibc 2.35 (Ubuntu 22.04). * `containerd-static-<VERSION>-<OS>-<ARCH>.tar.gz`: Statically linked. Expected to be used on Linux distributions that do not use glibc >= 2.35. Not position-independent. In addition to containerd, typically you will have to install [runc](https://github.com/opencontainers/runc/releases) and [CNI plugins](https://github.com/containernetworking/plugins/releases) from their official sites too. See also the [Getting Started](https://github.com/containerd/containerd/blob/main/docs/getting-started.md) documentation.

Implements #9361
Sorry about my limited Go skill, this version refines the previous @cardyok version (I revised it in my spare time),
I think it's clean for upstreaming.
In order to try this:
modprobe erofswith Linux 5.16 or higher (for now);nerdctl --snapshotter=erofs run -it wordpress:latestSome preliminary performance numbers: #10705 (comment)
TODOs: