Fix data loss in rootfs overlayfs when unmount of tmp dirs fail with idmap mounts #10721
Conversation
Hi @rata. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Overlayfs needs to do an idmap mount of each layer, and the cleanup function just unmounts and deletes the directories. However, when the resource is busy, the umount fails. Let's make the unmount detached, so the unmount will eventually happen once the mount is no longer busy; this also means the unmount call itself no longer fails because the mount is busy. Big kudos to @mbaynton for reporting this issue with a lot of detail, nailing it down to the containerd lines of code, and showing all the log lines needed to understand the big picture. Fixes: containerd#10704 Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Using os.RemoveAll() is quite risky: if the unmount failed, we can delete files from the container rootfs. In fact, we were doing just that. Let's use os.Remove() to make sure we only delete empty dirs. Big kudos to @mbaynton for reporting this issue with a lot of detail, nailing it down to the containerd lines of code, and showing all the log lines needed to understand the big picture. Fixes: containerd#10704 Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
The detached unmount is less likely to fail in our case, but if we do see any failure to unmount, we should just skip the removal of the directories. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
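Taken together, the three commits amount to a cleanup path roughly like the following. This is a minimal sketch based on the commit messages, not containerd's exact code; the `cleanupIDMapMounts` name and `tmpLowerDirs` parameter are illustrative.

```go
package sketch

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// cleanupIDMapMounts sketches the fixed cleanup logic: detach-unmount each
// temporary idmapped lowerdir, skip any removal if the unmount fails, and
// only ever delete empty directories with os.Remove (never os.RemoveAll).
func cleanupIDMapMounts(tmpLowerDirs []string) error {
	for _, lowerDir := range tmpLowerDirs {
		// MNT_DETACH removes the mount from the tree immediately; the
		// kernel tears it down once the last user is gone, so a busy
		// mount no longer makes the unmount call fail.
		if err := unix.Unmount(lowerDir, unix.MNT_DETACH); err != nil {
			// If even the detached unmount fails, do not touch the
			// directory at all: deleting under a live mount is how
			// the data loss happened.
			return fmt.Errorf("unmount %s: %w", lowerDir, err)
		}
		// os.Remove refuses to delete a non-empty directory, so a
		// still-mounted rootfs can never be recursively wiped.
		if err := os.Remove(lowerDir); err != nil {
			return fmt.Errorf("remove %s: %w", lowerDir, err)
		}
	}
	return nil
}
```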
log.L.WithError(terr).Warnf("failed to remove temporary overlay lowerdir's")

// This dir should be empty now. Otherwise, we don't do anything.
if err := os.Remove(filepath.Join(tmpLowerDirs[0], "..")); err != nil {
Can't we just use filepath.Dir? (And validate the path)
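For reference, the suggestion amounts to something like this (sketch only; as noted below, the follow-up removes this code anyway):

```go
package sketch

import (
	"os"
	"path/filepath"
)

// removeParentDir shows the reviewer's suggestion: resolve the parent
// lexically with filepath.Dir instead of joining "..". tmpLowerDirs[0] is
// the first temporary lowerdir, as in the snippet above.
func removeParentDir(tmpLowerDirs []string) error {
	parent := filepath.Dir(tmpLowerDirs[0])
	// A real implementation should also validate that parent still points
	// inside the snapshotter's temp area before removing it.
	return os.Remove(parent)
}
```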
Yes, I'm preparing a follow-up PR with more improvements. This is how it was written before; I'd improve that and other things there, hopefully including tests :)
This PR is just about fixing the data loss, if that is fine, so it can land ASAP.
In the follow-up PR this code is removed and handled differently (IMHO, it is clearer). So I'd avoid improving lines that we will remove anyway :)
-	if err := unix.Unmount(lowerDir, 0); err != nil {
+	// Do a detached unmount so even if the resource is busy, the mount will be
+	// gone (eventually) and we can safely delete the directory too.
+	if err := unix.Unmount(lowerDir, unix.MNT_DETACH); err != nil {
MNT_DETACH isn't the best option :p
The cleanup is a defer callback, so it's hard to notify the caller to retry. Based on that fact, the change is acceptable to me.
However, I think we can handle this issue in another way. Based on the mount_setattr manpage, MOUNT_ATTR_RDONLY can force the temporary mount to be read-only, so no one can delete the content.
NOTE: Even if there are leaked mountpoints, containerd will clean them up during boot:
containerd/cmd/containerd/command/main.go, line 181 in 08037e7:
warnings, err := mount.CleanupTempMounts(0)
It's a kind of enhancement and it needs a change to the IdMapMount interface. It should be handled in the follow-up.
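A sketch of what that could look like with golang.org/x/sys/unix (assumption: the temporary idmap mount already exists and is flipped read-only afterwards; error handling and the IdMapMount interface change are elided):

```go
package sketch

import "golang.org/x/sys/unix"

// makeTempMountReadOnly applies MOUNT_ATTR_RDONLY to an existing temporary
// mount via mount_setattr(2), so even a leaked mountpoint cannot be used to
// modify (or recursively delete) the underlying layer data.
func makeTempMountReadOnly(target string) error {
	attr := &unix.MountAttr{
		Attr_set: unix.MOUNT_ATTR_RDONLY,
	}
	// AT_RECURSIVE makes the whole subtree read-only, not just the top mount.
	return unix.MountSetattr(unix.AT_FDCWD, target, unix.AT_RECURSIVE, attr)
}
```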
@mbaynton are you interested in tackling this?
Hmm, potentially. I'm most interested in investigating a new code path for Linux 6.13+ using the new file-descriptor-based layer setup and the new mount API. I think this would keep the idmap temp mounts off the host filesystem tree at all times, so the one and only accessor of them would be the desired overlayfs; thus (I hypothesize) there would be no possibility of EBUSY failures, no need for MNT_DETACH, and no possibility of leaked mountpoints.
But an alternative implementation that removes MNT_DETACH, makes these mounts read-only as another safeguard, and accepts that the unmount may fail without MNT_DETACH seems pretty easy to put together too.
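Very roughly, the fd-based flow described here could look like the following. This is an untested sketch of the hypothesis: the function name is made up, and handing the fd to overlayfs relies on the fd-based lowerdir support merged around Linux 6.13.

```go
package sketch

import "golang.org/x/sys/unix"

// idmapLayerFD clones a layer as a detached mount (never attached to the
// host mount tree) and applies the idmapping to the clone. The returned fd
// could then be passed to overlayfs via the new mount API, so no temporary
// mountpoint ever exists on the host for anything to hold busy.
func idmapLayerFD(layerPath string, usernsFD int) (int, error) {
	// OPEN_TREE_CLONE gives us a private, detached copy of the mount.
	fd, err := unix.OpenTree(unix.AT_FDCWD, layerPath,
		unix.OPEN_TREE_CLONE|unix.OPEN_TREE_CLOEXEC)
	if err != nil {
		return -1, err
	}
	attr := &unix.MountAttr{
		Attr_set:  unix.MOUNT_ATTR_IDMAP,
		Userns_fd: uint64(usernsFD),
	}
	// AT_EMPTY_PATH: operate on the detached mount fd itself.
	if err := unix.MountSetattr(fd, "", unix.AT_EMPTY_PATH, attr); err != nil {
		unix.Close(fd)
		return -1, err
	}
	return fd, nil
}
```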
@rata it sounds like you already started a follow-up?
In the follow-up PR this thing is removed and handled differently (IMHO, it is clearer).
I don't see one pushed to rata/containerd, but do you have improvements ready?
I should have something for this to address @fuweid's request in a week.
For the wrong error in the OOM test, perhaps the shim process itself has started using more than this?

It seems to be the update to runc 1.1.15 that is using slightly more than 15MB. It would be nice to adjust the tests, but we are preparing a fix anyway (opencontainers/runc#4439, opencontainers/runc#4427).
This is an additional safeguard against inadvertent data modification or deletion, as a follow-up from containerd#10721. Tests covering the lifecycle of the temporary idmap mounts and the integrity of the underlying lower layer data are also included, for both the normal and failed-unmount cases. Fixes containerd#10704 Signed-off-by: Mike Baynton <mike@mbaynton.com>
This is a planned follow-on from containerd#10721, primarily at the request of @fuweid, exchanging MNT_DETACH at unmount time for MOUNT_ATTR_RDONLY at mount time. The effect is to increase the risk of unmount failure due to EBUSY (as observed in the wild), but it adds an additional protection: the then-leaked bind mount cannot act as a conduit for inadvertent modification of the underlying data, including by our own efforts to clean up the mountpoint. Tests covering the lifecycle of the temporary idmap mounts and the integrity of the underlying lower layer data are also included, for both the normal and failed-unmount cases. Fixes containerd#10704 Signed-off-by: Mike Baynton <mike@mbaynton.com>
When a lot of pods are created at the same time, the umount sometimes fails with EBUSY. This causes some mounts to be leaked. A way to repro this issue is here: containerd#12139 (comment) While using lsof/fuser it was not possible to catch the culprits in time (the mount is busy for only a few ms), I've tried running the containerd daemon in a mount namespace with private propagation [1], and in that case there are no leaks. So it seems clear it's not containerd, but probably systemd, which scans all host mounts to act on them, that is causing the EBUSY for a few ms. Also, it seems EBUSY is not returned just in the context of userns, but also for other temp mounts containerd does. See here for more details: containerd#12139 (comment) Let's handle the idmap unmounts in the same way other temp mounts are handled, using the Unmount() helper, which retries the unmount. This retry fixes it 100% of the time on my system and on the systems of @halaney, who reported the issue too. This was originally fixed by containerd#10721 by using a detached mount, but the mount was switched back to non-detached in containerd#10955 and we started to leak mounts. As @fuwei mentioned, using a detached mount is quite invisible to the admin and, therefore, a retry is a better alternative. It seems these umounts are a great candidate for containerd#11303, which will manage the life-cycle of mounts and can handle these retries whenever needed. It's left for future research whether it's possible to make the host not scan these mounts and, if so, whether we want to do that. [1]: To do this, I just run "unshare -m" as root, which creates a mntns with private propagation, and then run the containerd daemon. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
When a lot of pods are created at the same time, the umount sometimes fails with EBUSY. This causes some mounts to be leaked. A way to repro this issue is here: containerd#12139 (comment) While using lsof/fuser it was not possible to catch the culprits in time (the mount is busy for only a few ms), @fuwei has found what is causing the mount to be busy: containerd#12139 (comment) When we fork in `GetUsernsFD()`, if a call to prepareIDMappedOverlay() is ongoing and has an open fd to the path to idmap, then the unmount callback will fail because the forked process still has an open fd to it. Let's handle the idmap unmounts in the same way other temp mounts are handled, using the Unmount() helper, which retries the unmount. This retry fixes it 100% of the time on my system and on the systems of @halaney, who reported the issue too. This was originally fixed by containerd#10721 by using a detached mount, but the mount was switched back to non-detached in containerd#10955 and we started to leak mounts. As @fuwei mentioned, using a detached mount is quite invisible to the admin and, therefore, a retry is a better alternative. It seems these umounts are a great candidate for containerd#11303, which will manage the life-cycle of mounts and can handle these retries whenever needed. It's left for future research whether it's possible to make the host not scan these mounts and, if so, whether we want to do that. [1]: To do this, I just run "unshare -m" as root, which creates a mntns with private propagation, and then run the containerd daemon. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
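Conceptually, the Unmount() helper referenced in this commit does something like the following. This is a hedged sketch, not containerd's actual helper: the function name, retry count, and delay are illustrative.

```go
package sketch

import (
	"time"

	"golang.org/x/sys/unix"
)

// unmountWithRetry retries a plain (non-detached) unmount while the target
// is transiently busy, e.g. because a concurrently forked child briefly
// holds an fd to it.
func unmountWithRetry(target string, flags int) error {
	var err error
	for i := 0; i < 10; i++ {
		if err = unix.Unmount(target, flags); err != unix.EBUSY {
			// Success, or a non-transient error worth reporting.
			return err
		}
		// The mount is typically busy for only a few ms, so a short
		// backoff is enough.
		time.Sleep(50 * time.Millisecond)
	}
	return err
}
```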
When a lot of pods are created at the same time, the umount sometimes fails with EBUSY. This causes some mounts to be leaked. A way to repro this issue is here: containerd#12139 (comment) While using lsof/fuser it was not possible to catch the culprits in time (the mount is busy for only a few ms), @fuwei has found what is causing the mount to be busy: containerd#12139 (comment) When we fork in `GetUsernsFD()`, if a call to prepareIDMappedOverlay() is ongoing and has an open fd to the path to idmap, then the unmount callback will fail because the forked process still has an open fd to it. Let's handle the idmap unmounts in the same way other temp mounts are handled, using the Unmount() helper, which retries the unmount. This retry fixes it 100% of the time on my system and on the systems of @halaney, who reported the issue too. This was originally fixed by containerd#10721 by using a detached mount, but the mount was switched back to non-detached in containerd#10955 and we started to leak mounts. As @fuwei mentioned, using a detached mount is quite invisible to the admin and, therefore, a retry is a better alternative. It seems these umounts are a great candidate for containerd#11303, which will manage the life-cycle of mounts and can handle these retries whenever needed. [1]: To do this, I just run "unshare -m" as root, which creates a mntns with private propagation, and then run the containerd daemon. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com> (cherry picked from commit da3dc1e) Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
This is a planned follow-on from containerd#10721, primarily at the request of @fuweid, exchanging MNT_DETACH at unmount time for MOUNT_ATTR_RDONLY at mount time. The effect is to increase the risk of unmount failure due to EBUSY (as observed in the wild), but it adds an additional protection: the then-leaked bind mount cannot act as a conduit for inadvertent modification of the underlying data, including by our own efforts to clean up the mountpoint. Tests covering the lifecycle of the temporary idmap mounts and the integrity of the underlying lower layer data are also included, for both the normal and failed-unmount cases. Fixes containerd#10704 Signed-off-by: Mike Baynton <mike@mbaynton.com> (cherry picked from commit 1e3d10d) Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
This fixes #10704. Big kudos to @mbaynton for reporting this issue with a lot of detail, nailing it down to the containerd lines of code, and showing all the log lines needed to understand the big picture of what was happening.
PR #5890 added code to unmount and delete the tmp directories for each layer of the overlayfs that is idmapped (idmap support for overlayfs needs to idmap each layer). The problem is that it used os.RemoveAll(), which removes everything recursively and does not skip the removal if the unmount failed. Therefore, if the unmount fails, this removes the rootfs of the container.
This PR just changes this to: do a detached unmount so a busy mount is eventually torn down; use os.Remove() so only empty directories can be deleted; and skip the directory removal entirely if the unmount fails.
There is a repro for this issue, and I've confirmed that when the unmount fails, we don't handle it correctly and we cause data loss, which is horrible. Of course, this PR fixes the issue for the repro too.
I'll also open a follow-up PR to add tests for this and maybe some other small fixes and improvements. But let's fix this data loss ASAP; I'll follow up with the other changes.
Fixes #10704
cc @fuweid @AkihiroSuda