
Conversation

@rata
Contributor

@rata rata commented Sep 24, 2024

This fixes #10704. Big kudos to @mbaynton for reporting this issue with lots of detail, nailing it down to the containerd lines of code, and showing all the log lines needed to understand the big picture of what was happening.

PR #5890 added code to unmount and delete the temporary directories created for each layer of an idmapped overlayfs (idmap support for overlayfs needs to idmap each layer). The problem is that it used os.RemoveAll(), which removes everything recursively and does not skip the removal if the unmount failed. Therefore, if the unmount fails, this removes the rootfs of the container.

This PR just changes this to:

  1. Do the unmount detached: if the mount is busy (as reported in #10704, "Data loss in root overlayfs and other container mounts when using user namespaces"), the detached unmount still works; it makes the mount unavailable for new accesses, and the actual unmount happens once it stops being busy.
  2. Don't use os.RemoveAll(), as that can erase the rootfs. Use os.Remove(), which only deletes empty directories (or single files); the code now makes sure the directories are empty before removing them.
  3. Skip the remaining steps if the unmount fails, as it is pointless to remove the directories in that case. (A minimal sketch of the resulting cleanup flow is below.)
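
For reference, here is a minimal sketch of that flow. This is not the actual containerd code: the helper name and the assumption that all temporary lowerdirs share one parent directory are made up for illustration.

```go
// overlayutils is a hypothetical package name for this sketch.
package overlayutils

import (
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// cleanupIDMapMounts sketches the cleanup flow described above:
// tmpLowerDirs are the per-layer temporary idmapped mountpoints, all
// created under the same temporary parent directory.
func cleanupIDMapMounts(tmpLowerDirs []string) error {
	for _, dir := range tmpLowerDirs {
		// Detached unmount: even if the mount is busy right now, it
		// becomes unreachable immediately and is actually unmounted
		// once nothing uses it anymore.
		if err := unix.Unmount(dir, unix.MNT_DETACH); err != nil {
			// If the unmount failed there is no point in removing
			// directories (and doing so recursively would be dangerous).
			return err
		}
		// os.Remove only deletes empty directories, so even a bug here
		// can never recursively wipe a still-mounted rootfs.
		if err := os.Remove(dir); err != nil {
			return err
		}
	}
	if len(tmpLowerDirs) == 0 {
		return nil
	}
	// The shared parent of the temporary dirs should be empty by now.
	return os.Remove(filepath.Dir(tmpLowerDirs[0]))
}
```

The key property is that nothing in this path can delete more than an empty directory, so a failed or racing unmount can no longer take the container's rootfs with it.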

There is a repro for this issue, and I've confirmed that when the unmount fails we don't handle it correctly and cause data loss, which is horrible. Of course, this PR fixes the issue for the repro too.

I'll also open a follow-up PR to add tests for this and maybe some other small fixes and improvements. But let's fix this data loss ASAP; I'll follow up with the other changes.

Fixes #10704

cc @fuweid @AkihiroSuda

@k8s-ci-robot

Hi @rata. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@rata rata changed the title Fix data loss in root overlayfs when unmount of tmp dirs fail Fix data loss in rootfs overlayfs when unmount of tmp dirs fail Sep 24, 2024
@rata rata changed the title Fix data loss in rootfs overlayfs when unmount of tmp dirs fail Fix data loss in rootfs overlayfs when unmount of tmp dirs fail with idmap mounts Sep 24, 2024
Overlayfs needs to do an idmap mount of each layer and the cleanup
function just unmounts and deletes the directories. However, when the
resource is busy, the umount fails.

Let's make the unmount detached, so the mount immediately becomes
unreachable and the actual unmount happens once it is no longer busy.
Making it detached also avoids the unmount failing with EBUSY in the
first place.

Big kudos to @mbaynton for reporting this issue with lots of detail,
nailing it down to the containerd lines of code, and showing all the log
lines needed to understand the big picture.

Fixes: containerd#10704

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Using os.RemoveAll() is quite risky: if the unmount failed, we can end
up deleting files from the container rootfs. In fact, we were doing just
that.

Let's use os.Remove() to make sure we only delete empty dirs.

Big kudos to @mbaynton for reporting this issue with lots of detail,
nailing it down to the containerd lines of code, and showing all the log
lines needed to understand the big picture.

Fixes: containerd#10704

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
The detached unmount is less likely to fail in our case, but if we do see
any unmount failure, we should just skip the removal of the directories.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
log.L.WithError(terr).Warnf("failed to remove temporary overlay lowerdir's")

// This dir should be empty now. Otherwise, we don't do anything.
if err := os.Remove(filepath.Join(tmpLowerDirs[0], "..")); err != nil {
Member

Can't we just use filepath.Dir? (And validate the path)
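
For what it's worth, the two spellings are equivalent for already-clean paths; both are purely lexical, so validating the path still has to happen separately. A tiny illustration (the path below is made up):

```go
package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	// Hypothetical temporary lowerdir path, just for illustration.
	tmpLowerDir := "/var/run/containerd/ovl-idmapped/0"

	fmt.Println(filepath.Join(tmpLowerDir, "..")) // /var/run/containerd/ovl-idmapped
	fmt.Println(filepath.Dir(tmpLowerDir))        // /var/run/containerd/ovl-idmapped
}
```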

Contributor Author

Yes, I'm preparing a follow-up PR with more improvements. This is how it was written before; I'll improve it there (that and other things, hopefully including tests :))

This PR is just about fixing the data loss, so that it can land ASAP, if that's fine.

Contributor Author

@rata rata Sep 25, 2024

In the follow-up PR this code is removed and handled differently (IMHO, it is clearer), so I'd rather not polish lines that we will remove anyway :)

if err := unix.Unmount(lowerDir, 0); err != nil {
// Do a detached unmount so even if the resource is busy, the mount will be
// gone (eventually) and we can safely delete the directory too.
if err := unix.Unmount(lowerDir, unix.MNT_DETACH); err != nil {
Member

@fuweid fuweid Oct 11, 2024

MNT_DETACH isn't the best option :p
The cleanup is a defer callback, and it's hard to notify the caller to retry.
Given that, the change is acceptable to me.

However, I think we can handle this issue in another way.
Based on the mount_setattr manpage, MOUNT_ATTR_RDONLY can force the temporary mount to be read-only, so no one can delete the content through it.

NOTE: Even if there are leaked mountpoints, containerd will clean them up during boot:

warnings, err := mount.CleanupTempMounts(0)

This is an enhancement and it needs a change to the IdMapMount interface.
It should be handled in the follow-up.
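
For illustration, a rough sketch of that suggestion using the golang.org/x/sys/unix wrappers. This is not the actual IdMapMount change; the helper name is made up and it assumes Linux >= 5.12, where mount_setattr(2) is available.

```go
package overlayutils

import "golang.org/x/sys/unix"

// makeMountReadOnly flips an existing mount (and its subtree) to
// read-only via mount_setattr(2), so a leaked mountpoint cannot be
// used to modify or delete the underlying layer data.
func makeMountReadOnly(target string) error {
	attr := unix.MountAttr{
		Attr_set: unix.MOUNT_ATTR_RDONLY,
	}
	// AT_RECURSIVE applies the attribute to the whole subtree under target.
	return unix.MountSetattr(unix.AT_FDCWD, target, unix.AT_RECURSIVE, &attr)
}
```

If something like this is called right after the temporary idmapped mount is created, even a mount that later fails to unmount cannot act as a write path into the lower layer.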

Contributor Author

@mbaynton are you interested in tackling this?

Contributor

Hmm, potentially. I'm most interested in investigating a new code path for Linux 6.13+ using the new file-descriptor-based layer setup and the new mount API. I think that would keep the idmap temp mounts off the host filesystem tree at all times, so their one and only accessor would be the desired overlayfs and (I hypothesize) there would be no possibility of EBUSY failures, no need for MNT_DETACH, and no possibility of leaked mountpoints.

But an alternative implementation that removes MNT_DETACH, makes these mounts read-only as another safeguard, and accepts that the unmount may then occasionally fail seems pretty easy to put together too.
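
For illustration, a rough sketch of the first half of that idea (a hypothetical helper, not containerd code): clone a layer as a detached mount fd and idmap it without it ever appearing in the host mount table. How overlayfs would then consume such fds on 6.13+ is left out here.

```go
package overlayutils

import "golang.org/x/sys/unix"

// idmappedLayerFD returns a detached, idmapped mount fd for layerDir.
// usernsFD is an open fd to the target user namespace.
func idmappedLayerFD(layerDir string, usernsFD int) (int, error) {
	// OPEN_TREE_CLONE gives us a private copy of the mount that is not
	// attached anywhere in the host filesystem tree, so nothing on the
	// host can race with it or keep it busy.
	fd, err := unix.OpenTree(unix.AT_FDCWD, layerDir,
		unix.OPEN_TREE_CLONE|unix.OPEN_TREE_CLOEXEC|unix.AT_RECURSIVE)
	if err != nil {
		return -1, err
	}
	// Apply the idmapping to the detached mount itself (AT_EMPTY_PATH).
	attr := unix.MountAttr{
		Attr_set:  unix.MOUNT_ATTR_IDMAP,
		Userns_fd: uint64(usernsFD),
	}
	if err := unix.MountSetattr(fd, "", unix.AT_EMPTY_PATH, &attr); err != nil {
		unix.Close(fd)
		return -1, err
	}
	return fd, nil
}
```

In that model the fd, rather than a host path, is what identifies the layer, which is why the EBUSY window and the leaked-mountpoint concern would disappear.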

Contributor

@rata it sounds like you already started a follow-up?

In the follow-up PR this thing is removed and handled differently (IMHO, it is clearer).

I don't see one pushed to rata/containerd, but do you have improvements already?

Contributor

I should have something for this to address @fuweid's request in a week.


@fuweid fuweid added this pull request to the merge queue Oct 11, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 11, 2024
@fuweid fuweid added this pull request to the merge queue Oct 11, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 11, 2024
@AkihiroSuda AkihiroSuda added this pull request to the merge queue Oct 12, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 12, 2024
@fuweid fuweid added this pull request to the merge queue Oct 12, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 12, 2024
@AkihiroSuda AkihiroSuda added this pull request to the merge queue Oct 13, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 13, 2024
@mbaynton
Contributor

mbaynton commented Oct 13, 2024

For the wrong error in the OOM test, perhaps the shim process itself has started using more than this?

https://github.com/kubernetes-sigs/cri-tools/blob/22d7ff92e2b664d415d24943de631a9c886d3d2c/pkg/validate/container_linux.go#L296

@rata
Contributor Author

rata commented Oct 14, 2024

It seems to be the update to runc 1.1.15, which uses slightly more than 15 MB. It would be nice to adjust the tests, but we are preparing a fix anyway (opencontainers/runc#4439, opencontainers/runc#4427).

@fuweid fuweid added this pull request to the merge queue Oct 15, 2024
mbaynton added a commit to mbaynton/containerd that referenced this pull request Nov 5, 2024
This is an additional safeguard against inadvertent data modification or
deletion as a follow-up from containerd#10721.

Tests covering the lifecycle of the temporary idmap mounts and the
integrity of the underlying lower-layer data are also included, for both
the normal and failed-unmount cases.

Fixes containerd#10704

Signed-off-by: Mike Baynton <mike@mbaynton.com>
mbaynton added a commit to mbaynton/containerd that referenced this pull request Nov 5, 2024
mbaynton added a commit to mbaynton/containerd that referenced this pull request Nov 5, 2024
mbaynton added a commit to mbaynton/containerd that referenced this pull request Nov 6, 2024
mbaynton added a commit to mbaynton/containerd that referenced this pull request Nov 6, 2024
mbaynton added a commit to mbaynton/containerd that referenced this pull request Nov 6, 2024
mbaynton added a commit to mbaynton/containerd that referenced this pull request Nov 27, 2024
This is a planned follow-on from containerd#10721 primarily at the request of
@fuweid, exchanging MNT_DETACH at unmount time for MOUNT_ATTR_RDONLY at
mount time. The effect is to increase the risk of unmount failure due to
EBUSY (as observed in the wild), but to add an additional protection: the
then-leaked bind mount does not act as a conduit for inadvertent
modification of the underlying data, including by our own efforts to
clean up the mountpoint.

Tests covering the lifecycle of the temporary idmap mounts and the
integrity of the underlying lower-layer data are also included, for both
the normal and failed-unmount cases.

Fixes containerd#10704

Signed-off-by: Mike Baynton <mike@mbaynton.com>
mbaynton added a commit to mbaynton/containerd that referenced this pull request Dec 21, 2024
sreeram-venkitesh pushed a commit to sreeram-venkitesh/containerd that referenced this pull request Feb 3, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 12, 2025
When a lot of pods are created at the same time, the umount sometimes
fails with EBUSY. This causes some mounts to be leaked. A way to repro
this issue is here:
	containerd#12139 (comment)

While it was not possible to catch the culprits in time with lsof/fuser
(the mount is only busy for a few ms), I tried running the containerd
daemon in a mount namespace with private propagation[1], and in that
case there are no leaks.

So it seems clear that it's not containerd, but probably systemd scanning
all host mounts to act on them, that causes the EBUSY for a few ms.

Also, it seems EBUSY is not just returned in the context of userns, but
also for other temp mounts containerd does. See here for more details:
	containerd#12139 (comment)

Let's handle the idmap unmounts in the same way other temp mounts are
done, using the Unmount() helper, that retries the unmount.

This retry fixes it 100% of the time on my system and on the systems
of @halaney, who reported the issue.

This was originally fixed by containerd#10721 by using a detached mount, but the
mount was switched back to non-detached in containerd#10955 and we started to
leak mounts. As @fuweid mentioned, a detached mount is quite invisible
to the admin, so a retry is a better alternative.

It seems these umounts are a great candidate for containerd#11303, which will
manage the life-cycle of mounts and can handle these retries whenever
needed.

It's left for future research whether it's possible to make the host not
scan these mounts and, if it is, whether we want to do that.

[1]: To do this, I just ran "unshare -m" as root, which creates a mntns
with private propagation, and then ran the containerd daemon.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
rata added a commit to rata/containerd that referenced this pull request Aug 14, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 14, 2025
When a lot of pods are created at the same time, the umount sometimes
fails with EBUSY. This causes some mounts to be leaked. A way to repro
this issue is here:
	containerd#12139 (comment)

While it was not possible to catch the culprits in time with lsof/fuser
(the mount is only busy for a few ms), @fuweid found what is causing the
mount to be busy:
	containerd#12139 (comment)

When we fork in `GetUsernsFD()`, if a call to prepareIDMappedOverlay()
is ongoing and holds an open fd to the path to idmap, then the unmount
callback fails because the forked process still has an open fd to it.

Let's handle the idmap unmounts in the same way other temp mounts are
done, using the Unmount() helper, that retries the unmount.

This retry fixes it 100% of the time on my system and on the systems
of @halaney, who reported the issue.

This was originally fixed by containerd#10721 by using a detached mount, but the
mount was switched back to non-detached in containerd#10955 and we started to
leak mounts. As @fuweid mentioned, a detached mount is quite invisible
to the admin, so a retry is a better alternative.

It seems these umounts are a great candidate for containerd#11303, which will
manage the life-cycle of mounts and can handle these retries whenever
needed.

It's left for future research whether it's possible to make the host not
scan these mounts and, if it is, whether we want to do that.

[1]: To do it, I just run as root "unshare -m", that creates a mntns
with private propagation, and then run the containerd daemon.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
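
For illustration, a simplified version of the retry this commit message describes. The real change uses the Unmount() helper mentioned above; the retry count and sleep below are made-up values, and the helper name is hypothetical.

```go
package overlayutils

import (
	"time"

	"golang.org/x/sys/unix"
)

// unmountWithRetry retries a plain (non-detached) unmount while the
// target is still briefly busy.
func unmountWithRetry(target string) error {
	var err error
	for i := 0; i < 10; i++ {
		err = unix.Unmount(target, 0)
		if err == nil || err == unix.EINVAL {
			// EINVAL here typically means the target is not mounted
			// anymore, which is fine for cleanup purposes.
			return nil
		}
		if err != unix.EBUSY {
			return err
		}
		// The mount is usually only busy for a few milliseconds, so a
		// short sleep before retrying is enough in practice.
		time.Sleep(50 * time.Millisecond)
	}
	return err
}
```
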
rata added a commit to rata/containerd that referenced this pull request Aug 14, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 18, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 19, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 19, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 19, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 19, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 19, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 19, 2025
halaney pushed a commit to halaney/containerd that referenced this pull request Aug 20, 2025
halaney pushed a commit to halaney/containerd that referenced this pull request Aug 20, 2025
halaney pushed a commit to halaney/containerd that referenced this pull request Aug 20, 2025
Cindia-blue pushed a commit to Cindia-blue/containerd that referenced this pull request Aug 31, 2025
Cindia-blue pushed a commit to Cindia-blue/containerd that referenced this pull request Sep 1, 2025
Cindia-blue pushed a commit to Cindia-blue/containerd that referenced this pull request Oct 20, 2025
Cindia-blue pushed a commit to Cindia-blue/containerd that referenced this pull request Nov 8, 2025

Development

Successfully merging this pull request may close these issues.

Data loss in root overlayfs and other container mounts when using user namespaces

7 participants