Make ovl idmap mounts read-only #10955
Conversation
Hi @mbaynton. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
force-pushed from a1ad9ed to 7a2b1a2
force-pushed from 3e7685e to 2b89ebb
// Using os.Remove() so if it's not empty, we don't delete files in the
// rootfs.
if err := os.Remove(lowerDir); err != nil {
    log.L.WithError(err).Warnf("failed to remove temporary overlay lowerdir's")
Could you please remove the `unix.MNT_DETACH` at line 256? I think it's not required if we use read-only.
@fuweid sadly it is needed. There is some "contention" issue (the umount fails with EBUSY when creating a lot of pods); see here for a repro: #10704 (comment)
I'm not yet sure why the layers seem to be re-used by other mounts. We should debug and fix that, but until then, the detached mount is the best way around it.
Also, I'm not sure I'd remove it even after that is fixed. A detached mount just makes this more robust: if the mount is still used for some reason, all subsequent access will be disconnected and the unmount will be performed as soon as it is no longer in use.
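For reference, here is a minimal sketch of the two unmount strategies being discussed, using golang.org/x/sys/unix; the helper name and path are illustrative, not containerd's actual cleanup code.

```go
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

// cleanupLowerDir is a hypothetical helper contrasting the two strategies:
// a lazy (detached) unmount versus a plain unmount of a temporary lowerdir.
func cleanupLowerDir(lowerDir string, detach bool) error {
	if detach {
		// MNT_DETACH: the mountpoint disappears from the tree immediately and
		// the mount is freed once its last user goes away. It never fails with
		// EBUSY, but a long-lived user keeps the mount alive invisibly.
		return unix.Unmount(lowerDir, unix.MNT_DETACH)
	}
	// Plain unmount: fails with EBUSY while anything still uses the mount,
	// which makes a leak visible to the admin but must be handled by the caller.
	return unix.Unmount(lowerDir, 0)
}

func main() {
	// Illustrative path only; this is not containerd's real lowerdir layout.
	if err := cleanupLowerDir("/tmp/ovl-idmapped-lower", false); err != nil {
		log.Printf("unmount failed: %v", err)
	}
}
```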
I left it in for the reasons @rata stated. That said, I think @fuweid may understand the possibility of leaking mounts if we remove it and feel this is preferable to using MNT_DETACH; see their prior comment about it:
NOTE: Even if there are leaked mountpoints, containerd will clean them up during booting.
I'm curious if there's a specific risk or downside to using MNT_DETACH that you are aware of, @fuweid?
Hi @mbaynton, sorry for the late reply. I can't reproduce this issue with the script you provided. The EBUSY might be caused by other running processes, for example a security application scanning all the files. When one process holds an fd on that mountpoint, it's unlikely we can umount it without MNT_DETACH.
Personally, I don't like MNT_DETACH because it hides the real issue. We, containerd, have had similar issues, like #5538. That's why I want to use read-only here.
Since we have cleanup during containerd startup, if it's caused by a security application and it's a transient issue, it won't be a problem.
Additionally, MNT_DETACH will hide the mountpoint and we can't see it. If there are applications holding files in that mountpoint for a long time, we can say it's a leak. The application should be made aware of this, instead of us hiding it with MNT_DETACH.
I've removed MNT_DETACH as you requested.
I agree that without knowing the real issue in detail (is it in fact because both my clusters and @rata's were running other software that jumped right on the new mounts, or is there another explanation?), it is somewhat speculative whether MNT_DETACH is doing what we think.
I would like to explore achieving these idmap mounts using the new mount API and the overlayfs API extension to specify lower layers by fd (later, in other issues/PRs), because I think this method will lead to greater isolation to just containerd and the kernel. Lower-layer idmap mounts would never need to be attached to any mount namespace / directory tree. However, if that method does not work out, then I think it will be worth the tracing necessary to understand what is usually causing the EBUSY we originally saw. In that case, I may come back and advocate for MNT_DETACH again :)
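To make that idea concrete, here is a rough sketch (my reading of the idea, not code from this PR) of how a read-only, idmapped clone of a lower layer can be created as a detached mount fd with the new mount API, so it never needs to be attached to the host's tree. `usernsFD` is assumed to be an open user-namespace fd obtained elsewhere, and handing the resulting fd to overlayfs relies on the fd-based lowerdir extension discussed above.

```go
// Package idmapsketch is illustrative only.
package idmapsketch

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// idMappedLowerFD returns a read-only, idmapped clone of lowerDir as a
// detached mount fd, never attaching it to the host's directory tree.
// usernsFD must be an fd referring to the target user namespace.
func idMappedLowerFD(lowerDir string, usernsFD int) (int, error) {
	// Clone the lower layer as a detached mount object.
	treeFD, err := unix.OpenTree(unix.AT_FDCWD, lowerDir,
		unix.OPEN_TREE_CLONE|unix.OPEN_TREE_CLOEXEC)
	if err != nil {
		return -1, fmt.Errorf("open_tree %s: %w", lowerDir, err)
	}
	// Apply the idmapping and make the clone read-only in one step.
	attr := &unix.MountAttr{
		Attr_set:  unix.MOUNT_ATTR_IDMAP | unix.MOUNT_ATTR_RDONLY,
		Userns_fd: uint64(usernsFD),
	}
	if err := unix.MountSetattr(treeFD, "", unix.AT_EMPTY_PATH, attr); err != nil {
		unix.Close(treeFD)
		return -1, fmt.Errorf("mount_setattr: %w", err)
	}
	// The fd could then be handed to overlayfs via the fd-based lowerdir
	// extension, or move_mount'ed somewhere if attaching is unavoidable.
	return treeFD, nil
}
```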
I can't reproduce this issue with the script you provided.
Interesting, thanks for trying it out. You were using a build prior to #10721? This is good feedback. I only knew that it worked on a few of my clusters and for @rata, so I was starting to think it was a highly reliable repro.
When one process holds an fd on that mountpoint, it's unlikely we can umount it without MNT_DETACH
I think with MNT_DETACH we have a chance for the mount to become unused and freed, whereas without it, a failed unmount means the mount remains indefinitely, or at least until the restart cleanup you mentioned.
I would like to explore achieving these idmap mounts using the new mount API and overlayfs API extension to specify lower layers by fd (later, in other issues/PRs), because I think this method will lead to greater isolation to just containerd and the kernel.
+1 looking forward to that!
Interesting, thanks for trying it out. You were using a build prior to #10721? This is good feedback. I only knew that it worked on a few of my clusters and for @rata, so I was starting to think it was a highly reliable repro.
Yes, I reverted #10721 locally and I couldn't reproduce this issue.
The issue is that these mounts might be copied into another namespace with MS_PRIVATE, and the unmount will fail until those namespaces are torn down.
core/mount/mount_linux_test.go (outdated)

runtime.LockOSThread()
// Never unlock in this goroutine, ensures the thread isn't reused.
// Thus we don't have to worry about restoring the caps.
ourCaps, err := capability.NewPid2(0)
is it thread-leader level or sub thread level?
In practice it's coming up as not the group leader, but the reason why is perhaps fragile:
- At the test function level, e.g. `TestdoPrepareIdMappedOverlay`, `go test` runs the function on the leader thread.
- Each `t.Run()` is in a new goroutine, but because we have not `t.Parallel()`ized, there is no need to spawn new threads and the Go runtime maps the new goroutines to the leader thread. Here we call `runtime.LockOSThread()` the first time.
- Then we spawn a new goroutine in which we conduct our real business. At this point the runtime is required to create a new thread, since it previously only had the leader, which is now locked. This new thread must not be the leader.
We'll see if we keep this, as @rata also points out it is pretty complicated.
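Roughly, the pattern being described looks like the following sketch (simplified, not the exact test code; the unmount path is a placeholder), using the gocapability package referenced by this PR's go.mod change.

```go
package main

import (
	"fmt"
	"runtime"

	"github.com/syndtr/gocapability/capability"
	"golang.org/x/sys/unix"
)

func main() {
	// Lock the current (leader) thread and never unlock it, so the runtime is
	// forced to create a fresh thread for the next goroutine; capability
	// changes made there cannot leak into threads reused elsewhere.
	runtime.LockOSThread()

	done := make(chan error, 1)
	go func() {
		runtime.LockOSThread() // pin this goroutine to the newly created thread
		caps, err := capability.NewPid2(0)
		if err != nil {
			done <- err
			return
		}
		if err := caps.Load(); err != nil {
			done <- err
			return
		}
		// Drop CAP_SYS_ADMIN from the effective set of this thread only, so
		// the unmount below is expected to fail.
		caps.Unset(capability.EFFECTIVE, capability.CAP_SYS_ADMIN)
		if err := caps.Apply(capability.CAPS); err != nil {
			done <- err
			return
		}
		// Placeholder path; the real test unmounts its temporary lowerdir.
		done <- unix.Unmount("/tmp/ovl-idmapped-lower", 0)
	}()
	fmt.Println("unmount result:", <-done)
}
```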
Good to know this. Since you prepare a tempdir for the temporary mountpoint, we can try to open a file in that mountpoint, which can cause EBUSY during umount. Maybe it's easier to reproduce this issue that way instead of using runtime.LockOSThread(). What do you think? :)
If we're not using MNT_DETACH, then yes, that works and is easier.
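For example, something along these lines; the path is hypothetical and this is only a sketch of the open-fd approach.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical temporary lowerdir mountpoint prepared by the test.
	lowerDir := "/tmp/ovl-idmapped-lower"

	// Hold an open fd inside the mountpoint...
	f, err := os.Open(filepath.Join(lowerDir, "some-file"))
	if err != nil {
		fmt.Println("open:", err)
		return
	}
	defer f.Close()

	// ...so a plain (non-detached) unmount is expected to fail with EBUSY.
	fmt.Println("umount:", unix.Unmount(lowerDir, 0))
}
```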
rata
left a comment
@mbaynton thanks a lot for the PR! Some comments below
core/mount/mount_linux_test.go (outdated)

if tc.injectUmountFault {
    runtime.LockOSThread()
    // Never unlock in this goroutine, ensures the thread isn't reused.
    // Thus we don't have to worry about restoring the caps.
    ourCaps, err := capability.NewPid2(0)
I think this is quite complicated. This is basically to test the cleanup function, right?
I have a WIP change to just move the cleanup to a different function, so we can easily unit-test it. It's still a WIP; if you want, I can share it with you.
What I'm doing is: the cleanup function takes some dirs to unmount and some dirs to delete. So to "simulate" an unmount failure you just don't tell it to unmount that dir, and then the dir will not be unmounted.
But wouldn't making the cleanup a different function simplify the test here? Or am I missing something?
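If I'm reading the idea right, the shape would be roughly the following sketch (my interpretation, not the actual WIP; the function and package names are made up).

```go
// Package cleanupsketch is illustrative only.
package cleanupsketch

import (
	"os"

	"github.com/containerd/log"
	"golang.org/x/sys/unix"
)

// cleanupOverlayTmp takes exactly which dirs to unmount and which to delete.
// A unit test can then "inject" an unmount failure by leaving a mounted dir
// out of toUnmount and asserting that its contents survive the removal pass.
func cleanupOverlayTmp(toUnmount, toRemove []string) {
	for _, dir := range toUnmount {
		if err := unix.Unmount(dir, 0); err != nil {
			log.L.WithError(err).Warnf("failed to unmount temporary overlay lowerdir %s", dir)
		}
	}
	for _, dir := range toRemove {
		// os.Remove (not RemoveAll), so a still-mounted dir is never emptied.
		if err := os.Remove(dir); err != nil {
			log.L.WithError(err).Warnf("failed to remove temporary overlay lowerdir %s", dir)
		}
	}
}
```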
I think this is quite complicated.
I think so too.
This is basically to test the cleanup function, right?
Yes, specifically to generate a fault in unmount even when it has MNT_DETACH, and verify that we don't remove the underlying files. We could also just not test that; it's vanishingly unlikely with MNT_DETACH present, and we have multiple safeguards at this point to prevent it.
I'm also open to doing it in a more unit-test / dependency-injected way; I was just trying to keep the changes outside _test.go small. Happy to take a look at your approach!
go.mod (outdated)

github.com/prometheus/client_golang v1.20.5
github.com/sirupsen/logrus v1.9.3
github.com/stretchr/testify v1.9.0
github.com/syndtr/gocapability v0.0.0-20200815063812-42c35b437635
This pkg is kind of unmaintained. In runc we are moving away from it to the fork we moved to the moby org. I think avoiding it would be nice.
@mbaynton I'm unsure if we should mark the issue as fixed. I mean, the issue is fixed in 2.0 already; there is no data corruption. But there is "contention" and the unmount fails with resource busy, which is why we use the detached unmount for now (so it doesn't fail), and it is not clear to me why yet. I didn't write that code, so it's not easy for me to see it now. However, I think it needs to be investigated why the unmount fails without detach, with the repro case you shared in the issue. It might be legit, but it might not be. At first glance, it seems to me that those directories should be unused outside the current pod, and probably the container too, so I don't know why the unmount without detach fails. Doing a stress test of mount/unmount, it doesn't seem to be the kernel generating sporadic failures. So it seems there is something going on worth investigating, IMHO. If you want to investigate that too, @mbaynton, it would be great!
Sure, I don't really have a stake in whether the issue is marked fixed :). I just assumed it was reopened to do this follow-up. The approach I personally am going to take in the next iterations is to see what becomes available in kernel 6.13 and, if lower-layer specification by file handle indeed lands, see if the reproducer still produces busy idmapped unmounts when they are never added to the host's path tree, or if we can just sidestep the whole issue that way.
force-pushed from ca7cbfc to 478ec16
force-pushed from 48af384 to 65ac080
fuweid
left a comment
LGTM. Thanks!
Rebase/squash?
Yes. Please squash into one commit. Thanks.
force-pushed from 65ac080 to a72bc12
Sorry for the delay, missed your message.
fuweid
left a comment
LGTM
What should we do from here?
return target, mounts
}

func supportsIDMap(path string) bool {
Only glancing over, but wondering if there are expected and unexpected errors that could happen in this helper that we may be hiding?
Mostly considering whether this helper should return an error (instead of a bool), handle expected errors, and return unexpected ones so that they can be logged in the caller (possibly even make the test fail).
Skipping a test may be easy to overlook ("skipping test because it doesn't support idmapped mounts") but could hide actual issues that need looking into?
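Perhaps something of this shape (a sketch of the suggestion, not the existing helper; `usernsFD` is assumed to come from the existing test helpers, and the error classification is illustrative):

```go
// Package idmapprobe is illustrative only.
package idmapprobe

import (
	"errors"
	"fmt"

	"golang.org/x/sys/unix"
)

// supportsIDMap reports whether an idmapped mount can be created on path,
// distinguishing "not supported" from genuinely unexpected failures so the
// caller can log or fail instead of silently skipping a test.
func supportsIDMap(path string, usernsFD int) (bool, error) {
	treeFD, err := unix.OpenTree(unix.AT_FDCWD, path,
		unix.OPEN_TREE_CLONE|unix.OPEN_TREE_CLOEXEC)
	if err != nil {
		if errors.Is(err, unix.ENOSYS) || errors.Is(err, unix.EINVAL) {
			return false, nil // kernel without the new mount API: expected
		}
		return false, fmt.Errorf("open_tree %s: %w", path, err)
	}
	defer unix.Close(treeFD)

	attr := &unix.MountAttr{
		Attr_set:  unix.MOUNT_ATTR_IDMAP,
		Userns_fd: uint64(usernsFD),
	}
	if err := unix.MountSetattr(treeFD, "", unix.AT_EMPTY_PATH, attr); err != nil {
		if errors.Is(err, unix.ENOTSUP) || errors.Is(err, unix.EINVAL) || errors.Is(err, unix.EPERM) {
			return false, nil // filesystem/kernel doesn't support idmapped mounts
		}
		return false, fmt.Errorf("mount_setattr %s: %w", path, err)
	}
	return true, nil
}
```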
I think this function is copied from:

func supportsIDMap(path string) bool {

I think we can make an exported function in a follow-up to unify them.
Indeed, this was copied. Does it feel best to tinker with all copies of it in this PR, or should I look at doing that separately in a new issue? I see your point @thaJeztah, and it seems easy enough to tighten up either way.
ping @thaJeztah let's handle this in the follow-up. WDYT?
ping @AkihiroSuda @dmcgowan
}, nil
}

// IDMapMount applies GID/UID shift according to gidmap/uidmap for target path
The interface is now stable; add a function with the new signature. This one can be marked as deprecated if it should never be used with the default values.
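In Go terms, presumably something along these lines; the function names, signatures, and option shape here are illustrative, not the actual containerd API.

```go
// Package mountapisketch is illustrative only.
package mountapisketch

// IDMapMount applies the GID/UID mapping in the user namespace referred to by
// usernsFD to target, using the historical default mount attributes.
//
// Deprecated: use IDMapMountWithAttrs so callers can pass mount attributes
// (for example unix.MOUNT_ATTR_RDONLY for temporary overlay lowerdirs).
func IDMapMount(source, target string, usernsFD int) error {
	return IDMapMountWithAttrs(source, target, usernsFD, 0, 0)
}

// IDMapMountWithAttrs is a hypothetical new entry point that also accepts the
// mount_setattr attr_set/attr_clr values to apply alongside the idmapping.
func IDMapMountWithAttrs(source, target string, usernsFD int, attrSet, attrClr uint64) error {
	// ... same open_tree + mount_setattr + move_mount flow as the existing
	// implementation, with attrSet/attrClr merged into the MountAttr ...
	return nil
}
```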
force-pushed from 965c9df to 42e717e
This is a planned follow-on from containerd#10721, primarily at the request of @fuweid, exchanging MNT_DETACH at unmount time for MOUNT_ATTR_RDONLY at mount time. The effect is to increase the risk of unmount failure due to EBUSY (as observed in the wild) but add an additional protection: the then-leaked bind mount does not act as a conduit for inadvertent modification of the underlying data, including by our own efforts to clean up the mountpoint.

Tests covering the lifecycle of the temporary idmap mounts and the integrity of the underlying lower-layer data are also included, for both the normal and the failed-unmount case.

Fixes containerd#10704

Signed-off-by: Mike Baynton <mike@mbaynton.com>
force-pushed from 42e717e to 1e3d10d
Restored original. Rebased.
When a lot of pods are created all at the same time, the umount sometimes fails with EBUSY. This causes some mounts to be leaked. A way to repro this issue is here: containerd#12139 (comment)

While using lsof/fuser it was not possible to catch the culprits in time (the mount is busy for only a few ms), @fuweid has found what is causing the mount to be busy: containerd#12139 (comment). When we fork in `GetUsernsFD()`, if a call to prepareIDMappedOverlay() is ongoing and has an open fd to the path to idmap, then the unmount callback will fail, as the forked process still has an open fd to it.

Let's handle the idmap unmounts in the same way other temp mounts are done, using the Unmount() helper, which retries the unmount. This retry fixes it 100% of the time on my system and on the systems of @halaney, who reported the issue too.

This was originally fixed by containerd#10721 by using a detached mount, but the mount was switched back to non-detached in containerd#10955 and we started to leak mounts. As @fuweid mentioned, using a detached mount is quite invisible to the admin, and therefore a retry is a better alternative.

It seems these umounts are a great candidate for containerd#11303, which will manage the life-cycle of mounts and can handle these retries whenever needed.

[1]: To do it, I just run as root "unshare -m", which creates a mntns with private propagation, and then run the containerd daemon.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
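For context, the retry approach is conceptually along these lines (a simplified sketch, not containerd's actual Unmount() helper; the retry count, delay, and error handling are illustrative):

```go
// Package unmountretry is illustrative only.
package unmountretry

import (
	"errors"
	"fmt"
	"time"

	"golang.org/x/sys/unix"
)

// unmountWithRetry retries a plain unmount while the target is only briefly
// busy, instead of hiding the mount with a lazy (MNT_DETACH) unmount.
func unmountWithRetry(target string, attempts int, delay time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = unix.Unmount(target, 0)
		if err == nil || errors.Is(err, unix.EINVAL) {
			return nil // unmounted, or not (or no longer) a mountpoint
		}
		if !errors.Is(err, unix.EBUSY) {
			return err // unexpected failure: surface it immediately
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("failed to unmount %s after %d attempts: %w", target, attempts, err)
}
```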
This is an additional safeguard against inadvertent data modification or deletion as a follow-up from #10721.
Tests covering the lifecycle of the temporary idmap mounts and the integrity of the underlying lower-layer data are also included, for both the normal and the failed-unmount case.
Fixes #10704
From the discussion in #10721 I had three goals:
- Update `IdMapMount`'s signature to accept mount options.
- Avoid relying on `filepath.Dir`/`filepath.Join` on a particular layer's path. Not sure if this is exactly what @AkihiroSuda had in mind with his comment.