Skip to content

Conversation

@halaney
Copy link
Contributor

@halaney halaney commented Aug 20, 2025

Backport various overlayfs related user namespace fixes and improvements

This backports various fixes related to the following issues:

Issue #12048
Issue #12139

And pulls in 1e3d10d ("Make ovl idmap mounts read-only") as
a dependency.

These patches make:

1. The temporary idmap'ed bind mounts r/o and not MNT_DETACHED,
   which prevents accidental modification and makes unmount semantics
   more clear
2. Makes far less bind mounts happen per overlayfs, greatly improving
   performance which was shown to be devastating enough to kill the node
3. Retries on umount errors (a common containerd pattern) for the temporary
   bind mounts to prevent leaking mounts on long running containerd daemons

These are great improvements to the stability of containerd when using
user namespaces + overlayfs.

mbaynton and others added 6 commits August 20, 2025 16:42
This is a planned follow-on from containerd#10721 primarily at the request of
@fuweid, exchanging MNT_DETACH at unmount time for MOUNT_ATTR_RDONLY at
mount time. The effect is to increase risk of unmount failure due to
EBUSY (as observed in the wild) but add an additional protection that the
then-leaked bind mount does not act as a conduit for inadvertent modification
of the underlying data, including our own efforts to clean up the mountpoint.

Tests covering the lifecycle of the temporary idmap mounts and integrity
of the underlying lower layer data is also included in the normal and
failed-unmount case.

Fixes containerd#10704

Signed-off-by: Mike Baynton <mike@mbaynton.com>
(cherry picked from commit 1e3d10d)
Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
Per layer idmap'ed bind mounts are costly to performance, as shown in
[0]. Each one requires taking various kernel locks, and each one shows
up in the host's mount table leading to some components like systemd
processing all these temporary mounts unnecessarily.

Let's instead go ahead and idmap the common directory of all the
layers to achieve the same effect. Now instead of being a function of
the number of layers, its a constant idmap per overlayfs! This can have
a big impact. For example, imagine running 100 containers at once, each
with 50 layers. That's going from doing 100 * 50 (5000) bind mounts, to
just 100. In reality both the shim and containerd proper do this, so its
actually double that!

[0]: containerd#12048 (comment)
Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
(cherry picked from commit 6e9b6ea)
Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
doPrepareIDMappedOverlay() can return nil as the cleanup function. As
the function mandates for the callback to be called, even when errors
are returned, we end-up calling the nil function that of course causes:

        panic: runtime error: invalid memory address or nil pointer dereference

The containerd daemon continues to run fine, though.

With the current structure of always calling the cleanup function, we
would either need to check which error it is, to call it only on
specific errors (ugly) or return a "stub function" that doesn't do
anything on those cases (so the call works). None of these options seem
very nice.

Let's just switch the logic to a pattern more common on go: only need to
cleanup if it returned fine. This makes it easier for callers to not do
the wrong thing.

For this change we need to do the cleanup when returning errors on the
function itself, so the user doesn't need to do it for us.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
(cherry picked from commit 7a19c94)
Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
Before this patch, the cleanup was unconditionally trying an unmount,
which is not needed if the mount never succeeded in the first place.
This causes logs of failed stuff, that should never be there.
Also, one function was creating the tmpDir and the nested one removing
it (in the cleanup function), which complicates the reasoning about
not leaking resources.

This patch on one hand moves the creation of the tmpDir into the
function that will remove it later and renames the parameter. On the
other hand, it also separates the cleanup function in two functions:
cleanDir() and cleanMount(). Each one will clean the directory or the
mount, only if needed, and a new cleanup function that just calls
cleanDir() and cleanMount() is returned as the cleanup function handler.

Because we now create the tmp directory inside the function, we need to
adjust the test: the directory passed as param now will exist, but it
shoild be empty. Therefore, we change the os.Stat() to an os.Remove().
If it's not empty (the cleanup didn't work), the remove will fail.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
(cherry picked from commit dd7fe0b)
Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
(cherry picked from commit 27ba690)
Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
When a lot of pods are created all at the same time, the umount fails
with EBUSY sometimes. This causes some mounts to be leaked. A way to
repro this issue is here:
	containerd#12139 (comment)

While using lsof/fuser it was not possible to get the culprits on time
(the mount is busy for a few ms), @fuwei has found what is causing the
mount to be busy:
	containerd#12139 (comment)

When we fork in `GetUsernsFD()`, if a call to prepareIDMappedOverlay()
is ongoing and has an open fd to the path to idmap, then the unmount
callback will fail as the forked process still has an open fd to it.

Let's handle the idmap unmounts in the same way other temp mounts are
done, using the Unmount() helper, that retries the unmount.

This retry fixes it 100% of the times on my system and in the systems
of @halaney that reported the issue too.

This was originally fixed by containerd#10721 by using a detached mount, but the
mount was switched to non-detached again in containerd#10955 and we started to
leak mounts. As @fuwei mentioned, using a detached mount is quite
invisible for the admin and, therefore, a retry is a better alternative.

It seems these umounts are a great candidate for containerd#11303, which will
manage the life-cycle of mounts and can handle these retries whenever
needed.

[1]: To do it, I just run as root "unshare -m", that creates a mntns
with private propagation, and then run the containerd daemon.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
(cherry picked from commit da3dc1e)
Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
@k8s-ci-robot
Copy link

Hi @halaney. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@halaney halaney changed the title Ahalaney/overlayfs idmap backports 2.0 Backport various overlayfs related user namespace fixes and improvements Aug 20, 2025
@halaney halaney marked this pull request as ready for review August 20, 2025 22:11
@dosubot dosubot bot added the area/runtime Runtime label Aug 20, 2025
@AkihiroSuda AkihiroSuda changed the title Backport various overlayfs related user namespace fixes and improvements [release/2.0] Backport various overlayfs related user namespace fixes and improvements Aug 20, 2025
Copy link
Contributor

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-project-automation github-project-automation bot moved this from Needs Triage to Review In Progress in Pull Request Review Aug 22, 2025
@mxpv mxpv merged commit c92bc89 into containerd:release/2.0 Aug 22, 2025
53 checks passed
@github-project-automation github-project-automation bot moved this from Review In Progress to Done in Pull Request Review Aug 22, 2025
@dmcgowan dmcgowan changed the title [release/2.0] Backport various overlayfs related user namespace fixes and improvements [release/2.0] Fix overlayfs issues related user namespace Nov 5, 2025
@dmcgowan dmcgowan changed the title [release/2.0] Fix overlayfs issues related user namespace [release/2.0] Fix overlayfs issues related to user namespace Nov 5, 2025
mansikulkarni96 added a commit to mansikulkarni96/containerd that referenced this pull request Dec 4, 2025
containerd 2.0.7

Welcome to the v2.0.7 release of containerd!

The seventh patch release for containerd 2.0 includes various bug fixes and updates.

* **containerd**
  * [**GHSA-pwhc-rpq9-4c8w**](GHSA-pwhc-rpq9-4c8w)
  * [**GHSA-m6hq-p25p-ffr2**](GHSA-m6hq-p25p-ffr2)

* **runc**
  * [**GHSA-qw9x-cqr3-wc7r**](GHSA-qw9x-cqr3-wc7r)
  * [**GHSA-cgrx-mc8f-2prm**](GHSA-cgrx-mc8f-2prm)
  * [**GHSA-9493-h29p-rfm2**](GHSA-9493-h29p-rfm2)

* **Disable event subscriber during task cleanup** ([containerd#12406](containerd#12406))
* **Add SystemdCgroup to default runtime options** ([containerd#12254](containerd#12254))
* **Fix userns with container image VOLUME mounts that need copy** ([containerd#12241](containerd#12241))

* **Add dial timeout field to hosts toml configuration** ([containerd#12136](containerd#12136))

* **Update runc binary to v1.3.3** ([containerd#12479](containerd#12479))
* **Fix lost container logs from quickly closing io** ([containerd#12376](containerd#12376))
* **Create bootstrap.json with 0644 permission** ([containerd#12184](containerd#12184))
* **Fix pidfd leak in UnshareAfterEnterUserns** ([containerd#12178](containerd#12178))

Please try out the release binaries and report any issues at
https://github.com/containerd/containerd/issues.

* Austin Vazquez
* Phil Estes
* Rodrigo Campos
* Wei Fu
* Akihiro Suda
* Derek McGowan
* Maksym Pavlenko
* ningmingxiao
* Kirtana Ashok
* Akhil Mohan
* Andrew Halaney
* Jin Dong
* Jose Fernandez
* Mike Baynton
* Philip Laine
* Swagat Bora
* wheat2018

<details><summary>56 commits</summary>
<p>

* Prepare release notes for v2.0.7 ([containerd#12482](containerd#12482))
  * [`4931e24f1`](containerd@4931e24) Prepare release notes for v2.0.7
  * [`205bc4f2d`](containerd@205bc4f) Update mailmap
  * [`5f708b76a`](containerd@5f708b7) Merge commit from fork
  * [`8cd112d82`](containerd@8cd112d) Fix directory permissions
  * [`05290b5bc`](containerd@05290b5) Merge commit from fork
  * [`4d1edf4ad`](containerd@4d1edf4) fix goroutine leak of container Attach
* Update runc binary to v1.3.3 ([containerd#12479](containerd#12479))
  * [`b46dc6a67`](containerd@b46dc6a) runc: Update runc binary to v1.3.3
* ci: bump Go 1.24.9; 1.25.3 ([containerd#12361](containerd#12361))
  * [`5e9c82178`](containerd@5e9c821) Update GHA runners to use latest images for basic binaries build
  * [`7f59248dc`](containerd@7f59248) Update GHA runners to use latest image for most jobs
  * [`e1373e8a8`](containerd@e1373e8) ci: bump Go 1.24.9, 1.25.3
  * [`e1a910a6a`](containerd@e1a910a) ci: bump Go 1.24.8; 1.25.2
  * [`fd04b7f17`](containerd@fd04b7f) move exclude-dirs to issues.exclude-dirs
  * [`b49377975`](containerd@b493779) update golangci-lint to v1.64.2
  * [`6e45022a1`](containerd@6e45022) build(deps): bump golangci/golangci-lint-action from 6.3.2 to 6.5.0
  * [`09ce0f2a1`](containerd@09ce0f2) build(deps): bump golangci/golangci-lint-action from 6.2.0 to 6.3.2
  * [`de63a740b`](containerd@de63a74) build(deps): bump golangci/golangci-lint-action from 6.1.1 to 6.2.0
* Fix lost container logs from quickly closing io ([containerd#12376](containerd#12376))
  * [`f953ee8a3`](containerd@f953ee8) bugfix:fix container logs lost because io close too quickly
* CI: update Fedora to 43 ([containerd#12448](containerd#12448))
  * [`f6f15f513`](containerd@f6f15f5) CI: update Fedora to 43
* Disable event subscriber during task cleanup ([containerd#12406](containerd#12406))
  * [`2a2329cbd`](containerd@2a2329c) cri/server/podsandbox: disable event subscriber
* CI: skip ubuntu-24.04-arm on private repos ([containerd#12428](containerd#12428))
  * [`dfb954743`](containerd@dfb9547) CI: skip ubuntu-24.04-arm on private repos
* Remove additional fuzzers from instrumentation repo ([containerd#12420](containerd#12420))
  * [`f6b02f6bb`](containerd@f6b02f6) Remove additional fuzzers from CI
* runc:Update runc binary to v1.3.1 ([containerd#12275](containerd#12275))
  * [`75c13ee3f`](containerd@75c13ee) runc:Update runc binary to v1.3.1
* Add SystemdCgroup to default runtime options ([containerd#12254](containerd#12254))
  * [`427cdd06c`](containerd@427cdd0) add SystemdCgroup to default runtime options
* install-runhcs-shim: fetch target commit instead of tags ([containerd#12255](containerd#12255))
  * [`0b35e19fb`](containerd@0b35e19) install-runhcs-shim: fetch target commit instead of tags
* Fix userns with container image VOLUME mounts that need copy ([containerd#12241](containerd#12241))
  * [`3212afc2f`](containerd@3212afc) integration: Add test for directives with userns
  * [`b855c6e10`](containerd@b855c6e) cri: Fix userns with Dockerfile VOLUME mounts that need copy
* Fix overlayfs issues related to user namespace ([containerd#12223](containerd#12223))
  * [`05c0c99f4`](containerd@05c0c99) core/mount: Retry unmounting idmapped directories
  * [`afdede4ce`](containerd@afdede4) core/mount: Test cleanup of DoPrepareIDMappedOverlay()
  * [`47205f814`](containerd@47205f8) core/mount: Properly cleanup on doPrepareIDMappedOverlay errors
  * [`6f4abd970`](containerd@6f4abd9) core/mount: Don't call nil function on errors
  * [`a2f0d65d7`](containerd@a2f0d65) core/mount: Only idmap once per overlayfs, not per layer
  * [`1c32accd7`](containerd@1c32acc) Make ovl idmap mounts read-only
* ci: bump Go 1.23.12, 1.24.6 ([containerd#12187](containerd#12187))
  * [`9e72e91e6`](containerd@9e72e91) ci: bump Go 1.23.12, 1.24.6
* Create bootstrap.json with 0644 permission ([containerd#12184](containerd#12184))
  * [`009622e04`](containerd@009622e) fix: create bootstrap.json with 0644 permission
* Fix pidfd leak in UnshareAfterEnterUserns ([containerd#12178](containerd#12178))
  * [`5bec0a332`](containerd@5bec0a3) sys: fix pidfd leak in UnshareAfterEnterUserns
* Fix windows test failures ([containerd#12120](containerd#12120))
  * [`2a2488131`](containerd@2a24881) Fix intermittent test failures on Windows CIs
  * [`018470948`](containerd@0184709) Remove WS2025 from CIs due to regression
* Add dial timeout field to hosts toml configuration ([containerd#12136](containerd#12136))
  * [`b50cbbc98`](containerd@b50cbbc) Add dial timeout field to hosts toml configuration
</p>
</details>

This release has no dependency changes

Previous release can be found at [v2.0.6](https://github.com/containerd/containerd/releases/tag/v2.0.6)
* `containerd-<VERSION>-<OS>-<ARCH>.tar.gz`:         ✅Recommended. Dynamically linked with glibc 2.31 (Ubuntu 20.04).
* `containerd-static-<VERSION>-<OS>-<ARCH>.tar.gz`:  Statically linked. Expected to be used on non-glibc Linux distributions. Not position-independent.

In addition to containerd, typically you will have to install [runc](https://github.com/opencontainers/runc/releases)
and [CNI plugins](https://github.com/containernetworking/plugins/releases) from their official sites too.

See also the [Getting Started](https://github.com/containerd/containerd/blob/main/docs/getting-started.md) documentation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

6 participants