-
Notifications
You must be signed in to change notification settings - Fork 3.8k
[release/2.0] Fix overlayfs issues related to user namespace #12223
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[release/2.0] Fix overlayfs issues related to user namespace #12223
Conversation
This is a planned follow-on from containerd#10721 primarily at the request of @fuweid, exchanging MNT_DETACH at unmount time for MOUNT_ATTR_RDONLY at mount time. The effect is to increase risk of unmount failure due to EBUSY (as observed in the wild) but add an additional protection that the then-leaked bind mount does not act as a conduit for inadvertent modification of the underlying data, including our own efforts to clean up the mountpoint. Tests covering the lifecycle of the temporary idmap mounts and integrity of the underlying lower layer data is also included in the normal and failed-unmount case. Fixes containerd#10704 Signed-off-by: Mike Baynton <mike@mbaynton.com> (cherry picked from commit 1e3d10d) Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
Per layer idmap'ed bind mounts are costly to performance, as shown in [0]. Each one requires taking various kernel locks, and each one shows up in the host's mount table leading to some components like systemd processing all these temporary mounts unnecessarily. Let's instead go ahead and idmap the common directory of all the layers to achieve the same effect. Now instead of being a function of the number of layers, its a constant idmap per overlayfs! This can have a big impact. For example, imagine running 100 containers at once, each with 50 layers. That's going from doing 100 * 50 (5000) bind mounts, to just 100. In reality both the shim and containerd proper do this, so its actually double that! [0]: containerd#12048 (comment) Signed-off-by: Andrew Halaney <ahalaney@netflix.com> (cherry picked from commit 6e9b6ea) Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
doPrepareIDMappedOverlay() can return nil as the cleanup function. As
the function mandates for the callback to be called, even when errors
are returned, we end-up calling the nil function that of course causes:
panic: runtime error: invalid memory address or nil pointer dereference
The containerd daemon continues to run fine, though.
With the current structure of always calling the cleanup function, we
would either need to check which error it is, to call it only on
specific errors (ugly) or return a "stub function" that doesn't do
anything on those cases (so the call works). None of these options seem
very nice.
Let's just switch the logic to a pattern more common on go: only need to
cleanup if it returned fine. This makes it easier for callers to not do
the wrong thing.
For this change we need to do the cleanup when returning errors on the
function itself, so the user doesn't need to do it for us.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
(cherry picked from commit 7a19c94)
Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
Before this patch, the cleanup was unconditionally trying an unmount, which is not needed if the mount never succeeded in the first place. This causes logs of failed stuff, that should never be there. Also, one function was creating the tmpDir and the nested one removing it (in the cleanup function), which complicates the reasoning about not leaking resources. This patch on one hand moves the creation of the tmpDir into the function that will remove it later and renames the parameter. On the other hand, it also separates the cleanup function in two functions: cleanDir() and cleanMount(). Each one will clean the directory or the mount, only if needed, and a new cleanup function that just calls cleanDir() and cleanMount() is returned as the cleanup function handler. Because we now create the tmp directory inside the function, we need to adjust the test: the directory passed as param now will exist, but it shoild be empty. Therefore, we change the os.Stat() to an os.Remove(). If it's not empty (the cleanup didn't work), the remove will fail. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com> (cherry picked from commit dd7fe0b) Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com> (cherry picked from commit 27ba690) Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
When a lot of pods are created all at the same time, the umount fails with EBUSY sometimes. This causes some mounts to be leaked. A way to repro this issue is here: containerd#12139 (comment) While using lsof/fuser it was not possible to get the culprits on time (the mount is busy for a few ms), @fuwei has found what is causing the mount to be busy: containerd#12139 (comment) When we fork in `GetUsernsFD()`, if a call to prepareIDMappedOverlay() is ongoing and has an open fd to the path to idmap, then the unmount callback will fail as the forked process still has an open fd to it. Let's handle the idmap unmounts in the same way other temp mounts are done, using the Unmount() helper, that retries the unmount. This retry fixes it 100% of the times on my system and in the systems of @halaney that reported the issue too. This was originally fixed by containerd#10721 by using a detached mount, but the mount was switched to non-detached again in containerd#10955 and we started to leak mounts. As @fuwei mentioned, using a detached mount is quite invisible for the admin and, therefore, a retry is a better alternative. It seems these umounts are a great candidate for containerd#11303, which will manage the life-cycle of mounts and can handle these retries whenever needed. [1]: To do it, I just run as root "unshare -m", that creates a mntns with private propagation, and then run the containerd daemon. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com> (cherry picked from commit da3dc1e) Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
|
Hi @halaney. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
rata
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
containerd 2.0.7 Welcome to the v2.0.7 release of containerd! The seventh patch release for containerd 2.0 includes various bug fixes and updates. * **containerd** * [**GHSA-pwhc-rpq9-4c8w**](GHSA-pwhc-rpq9-4c8w) * [**GHSA-m6hq-p25p-ffr2**](GHSA-m6hq-p25p-ffr2) * **runc** * [**GHSA-qw9x-cqr3-wc7r**](GHSA-qw9x-cqr3-wc7r) * [**GHSA-cgrx-mc8f-2prm**](GHSA-cgrx-mc8f-2prm) * [**GHSA-9493-h29p-rfm2**](GHSA-9493-h29p-rfm2) * **Disable event subscriber during task cleanup** ([containerd#12406](containerd#12406)) * **Add SystemdCgroup to default runtime options** ([containerd#12254](containerd#12254)) * **Fix userns with container image VOLUME mounts that need copy** ([containerd#12241](containerd#12241)) * **Add dial timeout field to hosts toml configuration** ([containerd#12136](containerd#12136)) * **Update runc binary to v1.3.3** ([containerd#12479](containerd#12479)) * **Fix lost container logs from quickly closing io** ([containerd#12376](containerd#12376)) * **Create bootstrap.json with 0644 permission** ([containerd#12184](containerd#12184)) * **Fix pidfd leak in UnshareAfterEnterUserns** ([containerd#12178](containerd#12178)) Please try out the release binaries and report any issues at https://github.com/containerd/containerd/issues. * Austin Vazquez * Phil Estes * Rodrigo Campos * Wei Fu * Akihiro Suda * Derek McGowan * Maksym Pavlenko * ningmingxiao * Kirtana Ashok * Akhil Mohan * Andrew Halaney * Jin Dong * Jose Fernandez * Mike Baynton * Philip Laine * Swagat Bora * wheat2018 <details><summary>56 commits</summary> <p> * Prepare release notes for v2.0.7 ([containerd#12482](containerd#12482)) * [`4931e24f1`](containerd@4931e24) Prepare release notes for v2.0.7 * [`205bc4f2d`](containerd@205bc4f) Update mailmap * [`5f708b76a`](containerd@5f708b7) Merge commit from fork * [`8cd112d82`](containerd@8cd112d) Fix directory permissions * [`05290b5bc`](containerd@05290b5) Merge commit from fork * [`4d1edf4ad`](containerd@4d1edf4) fix goroutine leak of container Attach * Update runc binary to v1.3.3 ([containerd#12479](containerd#12479)) * [`b46dc6a67`](containerd@b46dc6a) runc: Update runc binary to v1.3.3 * ci: bump Go 1.24.9; 1.25.3 ([containerd#12361](containerd#12361)) * [`5e9c82178`](containerd@5e9c821) Update GHA runners to use latest images for basic binaries build * [`7f59248dc`](containerd@7f59248) Update GHA runners to use latest image for most jobs * [`e1373e8a8`](containerd@e1373e8) ci: bump Go 1.24.9, 1.25.3 * [`e1a910a6a`](containerd@e1a910a) ci: bump Go 1.24.8; 1.25.2 * [`fd04b7f17`](containerd@fd04b7f) move exclude-dirs to issues.exclude-dirs * [`b49377975`](containerd@b493779) update golangci-lint to v1.64.2 * [`6e45022a1`](containerd@6e45022) build(deps): bump golangci/golangci-lint-action from 6.3.2 to 6.5.0 * [`09ce0f2a1`](containerd@09ce0f2) build(deps): bump golangci/golangci-lint-action from 6.2.0 to 6.3.2 * [`de63a740b`](containerd@de63a74) build(deps): bump golangci/golangci-lint-action from 6.1.1 to 6.2.0 * Fix lost container logs from quickly closing io ([containerd#12376](containerd#12376)) * [`f953ee8a3`](containerd@f953ee8) bugfix:fix container logs lost because io close too quickly * CI: update Fedora to 43 ([containerd#12448](containerd#12448)) * [`f6f15f513`](containerd@f6f15f5) CI: update Fedora to 43 * Disable event subscriber during task cleanup ([containerd#12406](containerd#12406)) * [`2a2329cbd`](containerd@2a2329c) cri/server/podsandbox: disable event subscriber * CI: skip ubuntu-24.04-arm on private repos ([containerd#12428](containerd#12428)) * [`dfb954743`](containerd@dfb9547) CI: skip ubuntu-24.04-arm on private repos * Remove additional fuzzers from instrumentation repo ([containerd#12420](containerd#12420)) * [`f6b02f6bb`](containerd@f6b02f6) Remove additional fuzzers from CI * runc:Update runc binary to v1.3.1 ([containerd#12275](containerd#12275)) * [`75c13ee3f`](containerd@75c13ee) runc:Update runc binary to v1.3.1 * Add SystemdCgroup to default runtime options ([containerd#12254](containerd#12254)) * [`427cdd06c`](containerd@427cdd0) add SystemdCgroup to default runtime options * install-runhcs-shim: fetch target commit instead of tags ([containerd#12255](containerd#12255)) * [`0b35e19fb`](containerd@0b35e19) install-runhcs-shim: fetch target commit instead of tags * Fix userns with container image VOLUME mounts that need copy ([containerd#12241](containerd#12241)) * [`3212afc2f`](containerd@3212afc) integration: Add test for directives with userns * [`b855c6e10`](containerd@b855c6e) cri: Fix userns with Dockerfile VOLUME mounts that need copy * Fix overlayfs issues related to user namespace ([containerd#12223](containerd#12223)) * [`05c0c99f4`](containerd@05c0c99) core/mount: Retry unmounting idmapped directories * [`afdede4ce`](containerd@afdede4) core/mount: Test cleanup of DoPrepareIDMappedOverlay() * [`47205f814`](containerd@47205f8) core/mount: Properly cleanup on doPrepareIDMappedOverlay errors * [`6f4abd970`](containerd@6f4abd9) core/mount: Don't call nil function on errors * [`a2f0d65d7`](containerd@a2f0d65) core/mount: Only idmap once per overlayfs, not per layer * [`1c32accd7`](containerd@1c32acc) Make ovl idmap mounts read-only * ci: bump Go 1.23.12, 1.24.6 ([containerd#12187](containerd#12187)) * [`9e72e91e6`](containerd@9e72e91) ci: bump Go 1.23.12, 1.24.6 * Create bootstrap.json with 0644 permission ([containerd#12184](containerd#12184)) * [`009622e04`](containerd@009622e) fix: create bootstrap.json with 0644 permission * Fix pidfd leak in UnshareAfterEnterUserns ([containerd#12178](containerd#12178)) * [`5bec0a332`](containerd@5bec0a3) sys: fix pidfd leak in UnshareAfterEnterUserns * Fix windows test failures ([containerd#12120](containerd#12120)) * [`2a2488131`](containerd@2a24881) Fix intermittent test failures on Windows CIs * [`018470948`](containerd@0184709) Remove WS2025 from CIs due to regression * Add dial timeout field to hosts toml configuration ([containerd#12136](containerd#12136)) * [`b50cbbc98`](containerd@b50cbbc) Add dial timeout field to hosts toml configuration </p> </details> This release has no dependency changes Previous release can be found at [v2.0.6](https://github.com/containerd/containerd/releases/tag/v2.0.6) * `containerd-<VERSION>-<OS>-<ARCH>.tar.gz`: ✅Recommended. Dynamically linked with glibc 2.31 (Ubuntu 20.04). * `containerd-static-<VERSION>-<OS>-<ARCH>.tar.gz`: Statically linked. Expected to be used on non-glibc Linux distributions. Not position-independent. In addition to containerd, typically you will have to install [runc](https://github.com/opencontainers/runc/releases) and [CNI plugins](https://github.com/containernetworking/plugins/releases) from their official sites too. See also the [Getting Started](https://github.com/containerd/containerd/blob/main/docs/getting-started.md) documentation.
Backport various overlayfs related user namespace fixes and improvements
This backports various fixes related to the following issues:
And pulls in 1e3d10d ("Make ovl idmap mounts read-only") as
a dependency.
These patches make:
These are great improvements to the stability of containerd when using
user namespaces + overlayfs.