Make ovl idmap mounts read-only #10955
Conversation
Hi @mbaynton. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
force-pushed from a1ad9ed to 7a2b1a2
force-pushed from 3e7685e to 2b89ebb
// Using os.Remove() so if it's not empty, we don't delete files in the
// rootfs.
if err := os.Remove(lowerDir); err != nil {
    log.L.WithError(err).Warnf("failed to remove temporary overlay lowerdir's")
Could you please remove the `unix.MNT_DETACH` at line 256? I think it's not required if we use read-only.
@fuweid sadly it is needed. There is some "contention" issue (the umount fails with EBUSY when creating a lot of pods); see here for a repro: #10704 (comment)
I'm not yet sure why the layers seem to be re-used by other mounts. We should debug and fix that, but until then, the detached mount is the best way around it.
Also, I'm not sure I'd remove it even after that is fixed. A detached mount just makes this more robust: if the mount is still used for some reason, all subsequent access will be disconnected and the unmount will be performed as soon as it is no longer in use.
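For reference, here is a minimal sketch of the two unmount strategies being discussed, using golang.org/x/sys/unix; the helper name and path are illustrative, not containerd's actual cleanup code.

```go
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

// cleanupLowerDir is a hypothetical helper contrasting the two strategies:
// a lazy (detached) unmount versus a plain unmount of a temporary lowerdir.
func cleanupLowerDir(lowerDir string, detach bool) error {
	if detach {
		// MNT_DETACH: the mountpoint disappears from the tree immediately and
		// the mount is freed once its last user goes away. It never fails with
		// EBUSY, but a long-lived user keeps the mount alive invisibly.
		return unix.Unmount(lowerDir, unix.MNT_DETACH)
	}
	// Plain unmount: fails with EBUSY while anything still uses the mount,
	// which makes a leak visible to the admin but must be handled by the caller.
	return unix.Unmount(lowerDir, 0)
}

func main() {
	// Illustrative path only; this is not containerd's real lowerdir layout.
	if err := cleanupLowerDir("/tmp/ovl-idmapped-lower", false); err != nil {
		log.Printf("unmount failed: %v", err)
	}
}
```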
I left it in for the reasons @rata stated. That said, I think @fuweid may understand the possibility of leaking mounts if we remove it and feel this is preferable to using MNT_DETACH; see their prior comment about it:
NOTE: Even if there are leaked mountpoints, containerd will clean them up during booting.
I'm curious if there's a specific risk or downside to using MNT_DETACH that you are aware of, @fuweid?
Hi @mbaynton, sorry for the late reply. I can't reproduce this issue with the script you provided. The EBUSY might be caused by other running processes, for example a security application scanning all the files. When one process holds an fd on that mountpoint, it's unlikely we can umount it without MNT_DETACH.
Personally, I don't like MNT_DETACH because it hides the real issue. We, containerd, have had similar issues, like #5538. That's why I want to use read-only here.
Since we have cleanup during containerd startup, if it's caused by a security application and it's a transient issue, it won't be a problem.
Additionally, MNT_DETACH will hide the mountpoint and we can't see it. If there are applications holding files in that mountpoint for a long time, we can say it's a leak. The application should be made aware of this, instead of us hiding it with MNT_DETACH.
I've removed MNT_DETACH as you requested.
I agree that without knowing the real issue in detail (is it in fact because both my clusters and @rata's were running other software that jumped right on the new mounts, or is there another explanation?), it is somewhat speculative whether MNT_DETACH is doing what we think.
I would like to explore achieving these idmap mounts using the new mount API and the overlayfs API extension to specify lower layers by fd (later, in other issues/PRs), because I think this method will lead to greater isolation to just containerd and the kernel. Lower-layer idmap mounts would never need to be attached to any mount namespace / directory tree. However, if that method does not work out, then I think it will be worth the tracing necessary to understand what is usually causing the EBUSY we originally saw. In that case, I may come back and advocate for MNT_DETACH again :)
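To make that idea concrete, here is a rough sketch (my reading of the idea, not code from this PR) of how a read-only, idmapped clone of a lower layer can be created as a detached mount fd with the new mount API, so it never needs to be attached to the host's tree. `usernsFD` is assumed to be an open user-namespace fd obtained elsewhere, and handing the resulting fd to overlayfs relies on the fd-based lowerdir extension discussed above.

```go
// Package idmapsketch is illustrative only.
package idmapsketch

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// idMappedLowerFD returns a read-only, idmapped clone of lowerDir as a
// detached mount fd, never attaching it to the host's directory tree.
// usernsFD must be an fd referring to the target user namespace.
func idMappedLowerFD(lowerDir string, usernsFD int) (int, error) {
	// Clone the lower layer as a detached mount object.
	treeFD, err := unix.OpenTree(unix.AT_FDCWD, lowerDir,
		unix.OPEN_TREE_CLONE|unix.OPEN_TREE_CLOEXEC)
	if err != nil {
		return -1, fmt.Errorf("open_tree %s: %w", lowerDir, err)
	}
	// Apply the idmapping and make the clone read-only in one step.
	attr := &unix.MountAttr{
		Attr_set:  unix.MOUNT_ATTR_IDMAP | unix.MOUNT_ATTR_RDONLY,
		Userns_fd: uint64(usernsFD),
	}
	if err := unix.MountSetattr(treeFD, "", unix.AT_EMPTY_PATH, attr); err != nil {
		unix.Close(treeFD)
		return -1, fmt.Errorf("mount_setattr: %w", err)
	}
	// The fd could then be handed to overlayfs via the fd-based lowerdir
	// extension, or move_mount'ed somewhere if attaching is unavoidable.
	return treeFD, nil
}
```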
I can't reproduce this issue with the script you provided.
Interesting, thanks for trying it out. You were using a build prior to #10721? This is good feedback. I only knew that it worked on a few of my clusters and for @rata, so I was starting to think it was a highly reliable repro.
When one process holds an fd on that mountpoint, it's unlikely we can umount it without MNT_DETACH
I think with MNT_DETACH we have a chance for the mount to become unused and freed, whereas without it, a failed unmount means the mount remains indefinitely, or at least until the restart cleanup you mentioned.
I would like to explore achieving these idmap mounts using the new mount API and overlayfs API extension to specify lower layers by fd (later, in other issues/PRs), because I think this method will lead to greater isolation to just containerd and the kernel.
+1 looking forward to that!
Interesting, thanks for trying it out. You were using a build prior to #10721? This is good feedback. I only knew that it worked on a few of my clusters and for @rata, so I was starting to think it was a highly reliable repro.
Yes, I reverted #10721 locally and I couldn't reproduce this issue.
The issue is that these mounts might be copied into another namespace with MS_PRIVATE, and the unmount will fail until those namespaces are torn down.
core/mount/mount_linux_test.go (outdated)

runtime.LockOSThread()
// Never unlock in this goroutine, ensures the thread isn't reused.
// Thus we don't have to worry about restoring the caps.
ourCaps, err := capability.NewPid2(0)
is it thread-leader level or sub thread level?
In practice it's coming up as not the group leader, but the reason why is perhaps fragile:
- At the test function level, e.g. `TestdoPrepareIdMappedOverlay`, `go test` runs the function on the leader thread.
- Each `t.Run()` is in a new goroutine, but because we have not `t.Parallel()`ized, there is no need to spawn new threads and the Go runtime maps the new goroutines to the leader thread. Here we call `runtime.LockOSThread()` the first time.
- Then we spawn a new goroutine in which we conduct our real business. At this point the runtime is required to create a new thread, since it previously only had the leader, which is now locked. This new thread must not be the leader.
We'll see if we keep this, as @rata also points out it is pretty complicated.
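Roughly, the pattern being described looks like the following sketch (simplified, not the exact test code; the unmount path is a placeholder), using the gocapability package referenced by this PR's go.mod change.

```go
package main

import (
	"fmt"
	"runtime"

	"github.com/syndtr/gocapability/capability"
	"golang.org/x/sys/unix"
)

func main() {
	// Lock the current (leader) thread and never unlock it, so the runtime is
	// forced to create a fresh thread for the next goroutine; capability
	// changes made there cannot leak into threads reused elsewhere.
	runtime.LockOSThread()

	done := make(chan error, 1)
	go func() {
		runtime.LockOSThread() // pin this goroutine to the newly created thread
		caps, err := capability.NewPid2(0)
		if err != nil {
			done <- err
			return
		}
		if err := caps.Load(); err != nil {
			done <- err
			return
		}
		// Drop CAP_SYS_ADMIN from the effective set of this thread only, so
		// the unmount below is expected to fail.
		caps.Unset(capability.EFFECTIVE, capability.CAP_SYS_ADMIN)
		if err := caps.Apply(capability.CAPS); err != nil {
			done <- err
			return
		}
		// Placeholder path; the real test unmounts its temporary lowerdir.
		done <- unix.Unmount("/tmp/ovl-idmapped-lower", 0)
	}()
	fmt.Println("unmount result:", <-done)
}
```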
Good to know this. Since you prepare a tempdir for the temporary mountpoint, we can try to open a file in that mountpoint, which can cause EBUSY during umount. Maybe it's easier to reproduce this issue that way instead of using runtime.LockOSThread(). What do you think? :)
If we're not using MNT_DETACH, then yes, that works and is easier.
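For example, something along these lines; the path is hypothetical and this is only a sketch of the open-fd approach.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical temporary lowerdir mountpoint prepared by the test.
	lowerDir := "/tmp/ovl-idmapped-lower"

	// Hold an open fd inside the mountpoint...
	f, err := os.Open(filepath.Join(lowerDir, "some-file"))
	if err != nil {
		fmt.Println("open:", err)
		return
	}
	defer f.Close()

	// ...so a plain (non-detached) unmount is expected to fail with EBUSY.
	fmt.Println("umount:", unix.Unmount(lowerDir, 0))
}
```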
rata
left a comment
@mbaynton thanks a lot for the PR! Some comments below
core/mount/mount_linux_test.go (outdated)

if tc.injectUmountFault {
    runtime.LockOSThread()
    // Never unlock in this goroutine, ensures the thread isn't reused.
    // Thus we don't have to worry about restoring the caps.
    ourCaps, err := capability.NewPid2(0)
I think this is quite complicated. This is basically to test the cleanup function, right?
I have a WIP change to just move the cleanup to a different function, so we can easily unit-test it. It's still a WIP; if you want, I can share it with you.
What I'm doing is: the cleanup function takes some dirs to unmount and some dirs to delete. So to "simulate" an unmount failure you just don't tell it to unmount that dir, and then the dir will not be unmounted.
But wouldn't making the cleanup a different function simplify the test here? Or am I missing something?
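If I'm reading the idea right, the shape would be roughly the following sketch (my interpretation, not the actual WIP; the function and package names are made up).

```go
// Package cleanupsketch is illustrative only.
package cleanupsketch

import (
	"os"

	"github.com/containerd/log"
	"golang.org/x/sys/unix"
)

// cleanupOverlayTmp takes exactly which dirs to unmount and which to delete.
// A unit test can then "inject" an unmount failure by leaving a mounted dir
// out of toUnmount and asserting that its contents survive the removal pass.
func cleanupOverlayTmp(toUnmount, toRemove []string) {
	for _, dir := range toUnmount {
		if err := unix.Unmount(dir, 0); err != nil {
			log.L.WithError(err).Warnf("failed to unmount temporary overlay lowerdir %s", dir)
		}
	}
	for _, dir := range toRemove {
		// os.Remove (not RemoveAll), so a still-mounted dir is never emptied.
		if err := os.Remove(dir); err != nil {
			log.L.WithError(err).Warnf("failed to remove temporary overlay lowerdir %s", dir)
		}
	}
}
```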
I think this is quite complicated.
I think so too.
This is basically to test the cleanup function, right?
Yes, specifically to generate a fault in unmount even when it has MNT_DETACH, and verify that we don't remove the underlying files. We could also just not test that; it's vanishingly unlikely with MNT_DETACH present, and we have multiple safeguards at this point to prevent it.
I'm also open to doing it in a more unit-test / dependency-injected way; I was just trying to keep the changes outside _test.go small. Happy to take a look at your approach!
go.mod (outdated)

github.com/prometheus/client_golang v1.20.5
github.com/sirupsen/logrus v1.9.3
github.com/stretchr/testify v1.9.0
github.com/syndtr/gocapability v0.0.0-20200815063812-42c35b437635
This pkg is kind of unmaintained. In runc we are moving away from it to the fork we moved to the moby org. I think avoiding it would be nice.
@mbaynton I'm unsure if we should mark the issue as fixed. I mean, the issue is fixed in 2.0 already; there is no data corruption. But there is "contention" and the unmount fails with resource busy, which is why we use the detached unmount for now (so it doesn't fail), and it is not clear to me why yet. I didn't write that code, so it's not easy for me to see it now. However, I think it needs to be investigated why the unmount fails without detach, with the repro case you shared in the issue. It might be legit, but it might not be. At first glance, it seems to me that those directories should be unused outside the current pod, and probably the container too, so I don't know why the unmount without detach fails. Doing a stress test of mount/unmount, it doesn't seem to be the kernel generating sporadic failures. So it seems there is something going on worth investigating, IMHO. If you want to investigate that too, @mbaynton, it would be great!
Sure, I don't really have a stake in whether the issue is marked fixed :). I just assumed it was reopened to do this follow-up. The approach I personally am going to take in the next iterations is to see what becomes available in kernel 6.13 and, if lower-layer specification by file handle indeed lands, see if the reproducer still produces busy idmapped unmounts when they are never added to the host's path tree, or if we can just sidestep the whole issue that way.
force-pushed from ca7cbfc to 478ec16
force-pushed from 48af384 to 65ac080
fuweid
left a comment
LGTM. Thanks!
Rebase/squash?
Yes. Please squash into one commit. Thanks.
force-pushed from 65ac080 to a72bc12
Sorry for the delay, missed your message.
fuweid
left a comment
LGTM
What should we do from here?
return target, mounts
}

func supportsIDMap(path string) bool {
Only glancing over, but wondering if there are expected and unexpected errors that could happen in this helper that we may be hiding?
Mostly considering whether this helper should return an error (instead of a bool), handle expected errors, and return unexpected ones so that they can be logged in the caller (possibly even make the test fail).
Skipping a test may be easy to overlook ("skipping test because it doesn't support idmapped mounts") but could hide actual issues that need looking into?
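Perhaps something of this shape (a sketch of the suggestion, not the existing helper; `usernsFD` is assumed to come from the existing test helpers, and the error classification is illustrative):

```go
// Package idmapprobe is illustrative only.
package idmapprobe

import (
	"errors"
	"fmt"

	"golang.org/x/sys/unix"
)

// supportsIDMap reports whether an idmapped mount can be created on path,
// distinguishing "not supported" from genuinely unexpected failures so the
// caller can log or fail instead of silently skipping a test.
func supportsIDMap(path string, usernsFD int) (bool, error) {
	treeFD, err := unix.OpenTree(unix.AT_FDCWD, path,
		unix.OPEN_TREE_CLONE|unix.OPEN_TREE_CLOEXEC)
	if err != nil {
		if errors.Is(err, unix.ENOSYS) || errors.Is(err, unix.EINVAL) {
			return false, nil // kernel without the new mount API: expected
		}
		return false, fmt.Errorf("open_tree %s: %w", path, err)
	}
	defer unix.Close(treeFD)

	attr := &unix.MountAttr{
		Attr_set:  unix.MOUNT_ATTR_IDMAP,
		Userns_fd: uint64(usernsFD),
	}
	if err := unix.MountSetattr(treeFD, "", unix.AT_EMPTY_PATH, attr); err != nil {
		if errors.Is(err, unix.ENOTSUP) || errors.Is(err, unix.EINVAL) || errors.Is(err, unix.EPERM) {
			return false, nil // filesystem/kernel doesn't support idmapped mounts
		}
		return false, fmt.Errorf("mount_setattr %s: %w", path, err)
	}
	return true, nil
}
```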
I think this function is copied from:

func supportsIDMap(path string) bool {

I think we can make an exported function in a follow-up to unify them.
Indeed, this was copied. Does it feel best to tinker with all copies of it in this PR, or should I look at doing that separately in a new issue? I see your point @thaJeztah, and it seems easy enough to tighten up either way.
ping @thaJeztah let's handle this in the follow-up. WDYT?
ping @AkihiroSuda @dmcgowan
}, nil
}

// IDMapMount applies GID/UID shift according to gidmap/uidmap for target path
The interface is now stable; add a function with the new signature. This one can be marked as deprecated if it should never be used with the default values.
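In Go terms, presumably something along these lines; the function names, signatures, and option shape here are illustrative, not the actual containerd API.

```go
// Package mountapisketch is illustrative only.
package mountapisketch

// IDMapMount applies the GID/UID mapping in the user namespace referred to by
// usernsFD to target, using the historical default mount attributes.
//
// Deprecated: use IDMapMountWithAttrs so callers can pass mount attributes
// (for example unix.MOUNT_ATTR_RDONLY for temporary overlay lowerdirs).
func IDMapMount(source, target string, usernsFD int) error {
	return IDMapMountWithAttrs(source, target, usernsFD, 0, 0)
}

// IDMapMountWithAttrs is a hypothetical new entry point that also accepts the
// mount_setattr attr_set/attr_clr values to apply alongside the idmapping.
func IDMapMountWithAttrs(source, target string, usernsFD int, attrSet, attrClr uint64) error {
	// ... same open_tree + mount_setattr + move_mount flow as the existing
	// implementation, with attrSet/attrClr merged into the MountAttr ...
	return nil
}
```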
force-pushed from 965c9df to 42e717e
This is a planned follow-on from containerd#10721, primarily at the request of @fuweid, exchanging MNT_DETACH at unmount time for MOUNT_ATTR_RDONLY at mount time. The effect is to increase the risk of unmount failure due to EBUSY (as observed in the wild) but add an additional protection: the then-leaked bind mount does not act as a conduit for inadvertent modification of the underlying data, including by our own efforts to clean up the mountpoint.

Tests covering the lifecycle of the temporary idmap mounts and the integrity of the underlying lower-layer data are also included, for both the normal and the failed-unmount case.

Fixes containerd#10704

Signed-off-by: Mike Baynton <mike@mbaynton.com>
force-pushed from 42e717e to 1e3d10d
Restored original. Rebased.
When a lot of pods are created all at the same time, the umount sometimes fails with EBUSY. This causes some mounts to be leaked. A way to repro this issue is here: containerd#12139 (comment)

While using lsof/fuser it was not possible to catch the culprits in time (the mount is busy for only a few ms), @fuweid has found what is causing the mount to be busy: containerd#12139 (comment). When we fork in `GetUsernsFD()`, if a call to prepareIDMappedOverlay() is ongoing and has an open fd to the path to idmap, then the unmount callback will fail, as the forked process still has an open fd to it.

Let's handle the idmap unmounts in the same way other temp mounts are done, using the Unmount() helper, which retries the unmount. This retry fixes it 100% of the time on my system and on the systems of @halaney, who reported the issue too.

This was originally fixed by containerd#10721 by using a detached mount, but the mount was switched back to non-detached in containerd#10955 and we started to leak mounts. As @fuweid mentioned, using a detached mount is quite invisible to the admin, and therefore a retry is a better alternative.

It seems these umounts are a great candidate for containerd#11303, which will manage the life-cycle of mounts and can handle these retries whenever needed.

[1]: To do it, I just run as root "unshare -m", which creates a mntns with private propagation, and then run the containerd daemon.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
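For context, the retry approach is conceptually along these lines (a simplified sketch, not containerd's actual Unmount() helper; the retry count, delay, and error handling are illustrative):

```go
// Package unmountretry is illustrative only.
package unmountretry

import (
	"errors"
	"fmt"
	"time"

	"golang.org/x/sys/unix"
)

// unmountWithRetry retries a plain unmount while the target is only briefly
// busy, instead of hiding the mount with a lazy (MNT_DETACH) unmount.
func unmountWithRetry(target string, attempts int, delay time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = unix.Unmount(target, 0)
		if err == nil || errors.Is(err, unix.EINVAL) {
			return nil // unmounted, or not (or no longer) a mountpoint
		}
		if !errors.Is(err, unix.EBUSY) {
			return err // unexpected failure: surface it immediately
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("failed to unmount %s after %d attempts: %w", target, attempts, err)
}
```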
This is an additional safeguard against inadvertent data modification or deletion as a follow-up from #10721.
Tests covering the lifecycle of the temporary idmap mounts and the integrity of the underlying lower-layer data are also included, for both the normal and the failed-unmount case.
Fixes #10704
From the discussion in #10721 I had three goals:
- Update `IdMapMount`'s signature to accept mount options.
- Avoid relying on `filepath.Dir`/`filepath.Join` on a particular layer's path. Not sure if this is exactly what @AkihiroSuda had in mind with his comment.