
Conversation

@rata
Contributor

@rata rata commented Sep 24, 2024

This fixes #10704. Big kudos to @mbaynton for reporting this issue with lots of detail, nailing it down to the containerd lines of code, and showing all the log lines needed to understand the big picture of what was happening.

PR #5890 added code to unmount and delete the temporary directories created for each layer of an idmapped overlayfs (idmap support for overlayfs needs to idmap each layer). The problem is that it used os.RemoveAll(), which removes everything recursively and does not skip the removal if the unmount failed. Therefore, if the unmount fails, this removes the rootfs of the container.

This PR just changes this to:

  1. Do the unmount detached: if the mount is busy (as reported in #10704, "Data loss in root overlayfs and other container mounts when using user namespaces"), the detached unmount still works; it makes the mount unavailable for new accesses, and the actual unmount happens once it stops being busy.
  2. Don't use os.RemoveAll(), as that can erase the rootfs. Use os.Remove(), which only deletes empty directories (or single files); the code now makes sure the directories are empty before removing them.
  3. Skip the remaining steps if the unmount fails, as it is pointless to remove the directories in that case. (A minimal sketch of the resulting cleanup flow is below.)
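
For reference, here is a minimal sketch of that flow. This is not the actual containerd code: the helper name and the assumption that all temporary lowerdirs share one parent directory are made up for illustration.

```go
// overlayutils is a hypothetical package name for this sketch.
package overlayutils

import (
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// cleanupIDMapMounts sketches the cleanup flow described above:
// tmpLowerDirs are the per-layer temporary idmapped mountpoints, all
// created under the same temporary parent directory.
func cleanupIDMapMounts(tmpLowerDirs []string) error {
	for _, dir := range tmpLowerDirs {
		// Detached unmount: even if the mount is busy right now, it
		// becomes unreachable immediately and is actually unmounted
		// once nothing uses it anymore.
		if err := unix.Unmount(dir, unix.MNT_DETACH); err != nil {
			// If the unmount failed there is no point in removing
			// directories (and doing so recursively would be dangerous).
			return err
		}
		// os.Remove only deletes empty directories, so even a bug here
		// can never recursively wipe a still-mounted rootfs.
		if err := os.Remove(dir); err != nil {
			return err
		}
	}
	if len(tmpLowerDirs) == 0 {
		return nil
	}
	// The shared parent of the temporary dirs should be empty by now.
	return os.Remove(filepath.Dir(tmpLowerDirs[0]))
}
```

The key property is that nothing in this path can delete more than an empty directory, so a failed or racing unmount can no longer take the container's rootfs with it.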

There is a repro for this issue, and I've confirmed that when the unmount fails we don't handle it correctly and cause data loss, which is horrible. Of course, this PR fixes the issue for the repro too.

I'll also open a follow-up PR to add tests for this and maybe some other small fixes and improvements. But let's fix this data loss ASAP; I'll follow up with the other changes.

Fixes #10704

cc @fuweid @AkihiroSuda

@k8s-ci-robot

Hi @rata. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@rata rata changed the title Fix data loss in root overlayfs when unmount of tmp dirs fail Fix data loss in rootfs overlayfs when unmount of tmp dirs fail Sep 24, 2024
@rata rata changed the title Fix data loss in rootfs overlayfs when unmount of tmp dirs fail Fix data loss in rootfs overlayfs when unmount of tmp dirs fail with idmap mounts Sep 24, 2024
Overlayfs needs to do an idmap mount of each layer and the cleanup
function just unmounts and deletes the directories. However, when the
resource is busy, the umount fails.

Let's make the unmount detached, so the mount immediately becomes
unreachable and the actual unmount happens once it is no longer busy.
Making it detached also avoids the unmount failing with EBUSY in the
first place.

Big kudos to @mbaynton for reporting this issue with lots of detail,
nailing it down to the containerd lines of code, and showing all the log
lines needed to understand the big picture.

Fixes: containerd#10704

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Using os.RemoveAll() is quite risky: if the unmount failed, we can end
up deleting files from the container rootfs. In fact, we were doing just
that.

Let's use os.Remove() to make sure we only delete empty dirs.

Big kudos to @mbaynton for reporting this issue with lots of detail,
nailing it down to the containerd lines of code, and showing all the log
lines needed to understand the big picture.

Fixes: containerd#10704

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
The detached unmount is less likely to fail in our case, but if we do see
any unmount failure, we should just skip the removal of the directories.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
log.L.WithError(terr).Warnf("failed to remove temporary overlay lowerdir's")

// This dir should be empty now. Otherwise, we don't do anything.
if err := os.Remove(filepath.Join(tmpLowerDirs[0], "..")); err != nil {
Member

Can't we just use filepath.Dir? (And validate the path)
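
For what it's worth, the two spellings are equivalent for already-clean paths; both are purely lexical, so validating the path still has to happen separately. A tiny illustration (the path below is made up):

```go
package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	// Hypothetical temporary lowerdir path, just for illustration.
	tmpLowerDir := "/var/run/containerd/ovl-idmapped/0"

	fmt.Println(filepath.Join(tmpLowerDir, "..")) // /var/run/containerd/ovl-idmapped
	fmt.Println(filepath.Dir(tmpLowerDir))        // /var/run/containerd/ovl-idmapped
}
```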

Contributor Author

Yes, I'm preparing a follow-up PR with more improvements. This is how it was written before; I'll improve it there (that and other things, hopefully including tests :))

This PR is just about fixing the data loss, so that it can land ASAP, if that's fine.

Contributor Author

@rata rata Sep 25, 2024

In the follow-up PR this code is removed and handled differently (IMHO, it is clearer), so I'd rather not polish lines that we will remove anyway :)

if err := unix.Unmount(lowerDir, 0); err != nil {
// Do a detached unmount so even if the resource is busy, the mount will be
// gone (eventually) and we can safely delete the directory too.
if err := unix.Unmount(lowerDir, unix.MNT_DETACH); err != nil {
Member

@fuweid fuweid Oct 11, 2024

MNT_DETACH isn't the best option :p
The cleanup is a defer callback, and it's hard to notify the caller to retry.
Given that, the change is acceptable to me.

However, I think we can handle this issue in another way.
Based on the mount_setattr manpage, MOUNT_ATTR_RDONLY can force the temporary mount to be read-only, so no one can delete the content through it.

NOTE: Even if there are leaked mountpoints, containerd will clean them up during boot:

warnings, err := mount.CleanupTempMounts(0)

This is an enhancement and it needs a change to the IdMapMount interface.
It should be handled in the follow-up.
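
For illustration, a rough sketch of that suggestion using the golang.org/x/sys/unix wrappers. This is not the actual IdMapMount change; the helper name is made up and it assumes Linux >= 5.12, where mount_setattr(2) is available.

```go
package overlayutils

import "golang.org/x/sys/unix"

// makeMountReadOnly flips an existing mount (and its subtree) to
// read-only via mount_setattr(2), so a leaked mountpoint cannot be
// used to modify or delete the underlying layer data.
func makeMountReadOnly(target string) error {
	attr := unix.MountAttr{
		Attr_set: unix.MOUNT_ATTR_RDONLY,
	}
	// AT_RECURSIVE applies the attribute to the whole subtree under target.
	return unix.MountSetattr(unix.AT_FDCWD, target, unix.AT_RECURSIVE, &attr)
}
```

If something like this is called right after the temporary idmapped mount is created, even a mount that later fails to unmount cannot act as a write path into the lower layer.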

Contributor Author

@mbaynton are you interested in tackling this?

Contributor

Hmm, potentially. I'm most interested in investigating a new code path for Linux 6.13+ using the new file-descriptor-based layer setup and the new mount API. I think that would keep the idmap temp mounts off the host filesystem tree at all times, so their one and only accessor would be the desired overlayfs and (I hypothesize) there would be no possibility of EBUSY failures, no need for MNT_DETACH, and no possibility of leaked mountpoints.

But an alternative implementation that removes MNT_DETACH, makes these mounts read-only as another safeguard, and accepts that the unmount may then occasionally fail seems pretty easy to put together too.
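
For illustration, a rough sketch of the first half of that idea (a hypothetical helper, not containerd code): clone a layer as a detached mount fd and idmap it without it ever appearing in the host mount table. How overlayfs would then consume such fds on 6.13+ is left out here.

```go
package overlayutils

import "golang.org/x/sys/unix"

// idmappedLayerFD returns a detached, idmapped mount fd for layerDir.
// usernsFD is an open fd to the target user namespace.
func idmappedLayerFD(layerDir string, usernsFD int) (int, error) {
	// OPEN_TREE_CLONE gives us a private copy of the mount that is not
	// attached anywhere in the host filesystem tree, so nothing on the
	// host can race with it or keep it busy.
	fd, err := unix.OpenTree(unix.AT_FDCWD, layerDir,
		unix.OPEN_TREE_CLONE|unix.OPEN_TREE_CLOEXEC|unix.AT_RECURSIVE)
	if err != nil {
		return -1, err
	}
	// Apply the idmapping to the detached mount itself (AT_EMPTY_PATH).
	attr := unix.MountAttr{
		Attr_set:  unix.MOUNT_ATTR_IDMAP,
		Userns_fd: uint64(usernsFD),
	}
	if err := unix.MountSetattr(fd, "", unix.AT_EMPTY_PATH, &attr); err != nil {
		unix.Close(fd)
		return -1, err
	}
	return fd, nil
}
```

In that model the fd, rather than a host path, is what identifies the layer, which is why the EBUSY window and the leaked-mountpoint concern would disappear.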

Contributor

@rata it sounds like you already started a follow-up?

In the follow-up PR this thing is removed and handled differently (IMHO, it is clearer).

I don't see one pushed to rata/containerd, but do you have improvements already?

Contributor

I should have something for this to address @fuweid's request in a week.


@fuweid fuweid added this pull request to the merge queue Oct 11, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 11, 2024
@fuweid fuweid added this pull request to the merge queue Oct 11, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 11, 2024
@AkihiroSuda AkihiroSuda added this pull request to the merge queue Oct 12, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 12, 2024
@fuweid fuweid added this pull request to the merge queue Oct 12, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 12, 2024
@AkihiroSuda AkihiroSuda added this pull request to the merge queue Oct 13, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 13, 2024
@mbaynton
Contributor

mbaynton commented Oct 13, 2024

For the wrong error in the OOM test, perhaps the shim process itself has started using more than this?

https://github.com/kubernetes-sigs/cri-tools/blob/22d7ff92e2b664d415d24943de631a9c886d3d2c/pkg/validate/container_linux.go#L296

@rata
Contributor Author

rata commented Oct 14, 2024

It seems to be the update to runc 1.1.15, which uses slightly more than 15 MB. It would be nice to adjust the tests, but we are preparing a fix anyway (opencontainers/runc#4439, opencontainers/runc#4427).

@fuweid fuweid added this pull request to the merge queue Oct 15, 2024
mbaynton added a commit to mbaynton/containerd that referenced this pull request Nov 5, 2024
This is an additional safeguard against inadvertent data modification or
deletion as a follow-up from containerd#10721.

Tests covering the lifecycle of the temporary idmap mounts and the
integrity of the underlying lower-layer data are also included, for both
the normal and failed-unmount cases.

Fixes containerd#10704

Signed-off-by: Mike Baynton <mike@mbaynton.com>
mbaynton added a commit to mbaynton/containerd that referenced this pull request Nov 5, 2024
mbaynton added a commit to mbaynton/containerd that referenced this pull request Nov 5, 2024
mbaynton added a commit to mbaynton/containerd that referenced this pull request Nov 6, 2024
mbaynton added a commit to mbaynton/containerd that referenced this pull request Nov 6, 2024
mbaynton added a commit to mbaynton/containerd that referenced this pull request Nov 6, 2024
mbaynton added a commit to mbaynton/containerd that referenced this pull request Nov 27, 2024
This is a planned follow-on from containerd#10721 primarily at the request of
@fuweid, exchanging MNT_DETACH at unmount time for MOUNT_ATTR_RDONLY at
mount time. The effect is to increase the risk of unmount failure due to
EBUSY (as observed in the wild), but to add an additional protection: the
then-leaked bind mount does not act as a conduit for inadvertent
modification of the underlying data, including by our own efforts to
clean up the mountpoint.

Tests covering the lifecycle of the temporary idmap mounts and the
integrity of the underlying lower-layer data are also included, for both
the normal and failed-unmount cases.

Fixes containerd#10704

Signed-off-by: Mike Baynton <mike@mbaynton.com>
mbaynton added a commit to mbaynton/containerd that referenced this pull request Dec 21, 2024
sreeram-venkitesh pushed a commit to sreeram-venkitesh/containerd that referenced this pull request Feb 3, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 12, 2025
When a lot of pods are created at the same time, the umount sometimes
fails with EBUSY. This causes some mounts to be leaked. A way to repro
this issue is here:
	containerd#12139 (comment)

While it was not possible to catch the culprits in time with lsof/fuser
(the mount is only busy for a few ms), I tried running the containerd
daemon in a mount namespace with private propagation[1], and in that
case there are no leaks.

So it seems clear that it's not containerd, but probably systemd scanning
all host mounts to act on them, that causes the EBUSY for a few ms.

Also, it seems EBUSY is not just returned in the context of userns, but
also for other temp mounts containerd does. See here for more details:
	containerd#12139 (comment)

Let's handle the idmap unmounts in the same way other temp mounts are
done, using the Unmount() helper, that retries the unmount.

This retry fixes it 100% of the time on my system and on the systems
of @halaney, who reported the issue.

This was originally fixed by containerd#10721 by using a detached mount, but the
mount was switched back to non-detached in containerd#10955 and we started to
leak mounts. As @fuweid mentioned, a detached mount is quite invisible
to the admin, so a retry is a better alternative.

It seems these umounts are a great candidate for containerd#11303, which will
manage the life-cycle of mounts and can handle these retries whenever
needed.

It's left for future research whether it's possible to make the host not
scan these mounts and, if it is, whether we want to do that.

[1]: To do this, I just ran "unshare -m" as root, which creates a mntns
with private propagation, and then ran the containerd daemon.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
rata added a commit to rata/containerd that referenced this pull request Aug 14, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 14, 2025
When a lot of pods are created at the same time, the umount sometimes
fails with EBUSY. This causes some mounts to be leaked. A way to repro
this issue is here:
	containerd#12139 (comment)

While it was not possible to catch the culprits in time with lsof/fuser
(the mount is only busy for a few ms), @fuweid found what is causing the
mount to be busy:
	containerd#12139 (comment)

When we fork in `GetUsernsFD()`, if a call to prepareIDMappedOverlay()
is ongoing and holds an open fd to the path to idmap, then the unmount
callback fails because the forked process still has an open fd to it.

Let's handle the idmap unmounts in the same way other temp mounts are
done, using the Unmount() helper, that retries the unmount.

This retry fixes it 100% of the time on my system and on the systems
of @halaney, who reported the issue.

This was originally fixed by containerd#10721 by using a detached mount, but the
mount was switched back to non-detached in containerd#10955 and we started to
leak mounts. As @fuweid mentioned, a detached mount is quite invisible
to the admin, so a retry is a better alternative.

It seems these umounts are a great candidate for containerd#11303, which will
manage the life-cycle of mounts and can handle these retries whenever
needed.

It's left for future research whether it's possible to make the host not
scan these mounts and, if it is, whether we want to do that.

[1]: To do it, I just run as root "unshare -m", that creates a mntns
with private propagation, and then run the containerd daemon.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
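
For illustration, a simplified version of the retry this commit message describes. The real change uses the Unmount() helper mentioned above; the retry count and sleep below are made-up values, and the helper name is hypothetical.

```go
package overlayutils

import (
	"time"

	"golang.org/x/sys/unix"
)

// unmountWithRetry retries a plain (non-detached) unmount while the
// target is still briefly busy.
func unmountWithRetry(target string) error {
	var err error
	for i := 0; i < 10; i++ {
		err = unix.Unmount(target, 0)
		if err == nil || err == unix.EINVAL {
			// EINVAL here typically means the target is not mounted
			// anymore, which is fine for cleanup purposes.
			return nil
		}
		if err != unix.EBUSY {
			return err
		}
		// The mount is usually only busy for a few milliseconds, so a
		// short sleep before retrying is enough in practice.
		time.Sleep(50 * time.Millisecond)
	}
	return err
}
```
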
rata added a commit to rata/containerd that referenced this pull request Aug 14, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 18, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 19, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 19, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 19, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 19, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 19, 2025
rata added a commit to rata/containerd that referenced this pull request Aug 19, 2025
halaney pushed a commit to halaney/containerd that referenced this pull request Aug 20, 2025
halaney pushed a commit to halaney/containerd that referenced this pull request Aug 20, 2025
halaney pushed a commit to halaney/containerd that referenced this pull request Aug 20, 2025
Cindia-blue pushed a commit to Cindia-blue/containerd that referenced this pull request Aug 31, 2025
Cindia-blue pushed a commit to Cindia-blue/containerd that referenced this pull request Sep 1, 2025
Cindia-blue pushed a commit to Cindia-blue/containerd that referenced this pull request Oct 20, 2025
Cindia-blue pushed a commit to Cindia-blue/containerd that referenced this pull request Nov 8, 2025

Development

Successfully merging this pull request may close these issues.

Data loss in root overlayfs and other container mounts when using user namespaces

7 participants