Use a retry loop for DeleteDevice in deleteTransaction #33846
slp wants to merge 1 commit into moby:master from slp:retry-devmapper-delete
Conversation
On an environment using the devmapper driver with deferredRemove but
without deferredDelete, when a container running with the auto-removal
option ("run --rm") exits, the call to devicemapper.DeleteDevice in
deleteTransaction may fail with EBUSY, as removal of the device has been
deferred while deletion is expected to succeed immediately.
This change implements a retry loop for DeleteDevice, similar to the one
already present in removeDevice, which is only used when running with
the combination deferredRemove==true && deferredDelete=false.
In my tests, this retry loop always succeeded in one of the first 5
iterations.
Signed-off-by: Sergio Lopez <slp@sinrega.org>
|
So what problem are we trying to solve here? The user asked for synchronous deletion, and if that fails (because deferred removal has not succeeded yet), so be it. Is this a common problem? Are you really running into cases where deferred removal keeps the device busy for a short while? My understanding is that deferred removal will probably remove the device immediately if the device is free at the time the removal is scheduled. If that's the case, then this patch is not required, because there is no guarantee that the device will be free soon.
|
In my test environment, I've empirically determined that when devicemapper.DeleteDevice fails with EBUSY, most of the time it succeeds on the second try. I haven't had time yet to determine why the DM device is still busy on the first try, but I can work on it tomorrow.
|
@slp, if a device pending deferred deletion is temporarily busy and becomes free very soon, then it makes sense to add a loop in DeleteDevice(). That will make the "docker --rm" experience better. Is it easy to reproduce? Can I just run "docker run --rm -ti fedora bash" in a loop and it should reproduce? I want to see it happening first. Not sure what's keeping the device busy though. udev rules?
|
We do see this as well on some systems. libdm itself already retries DM_DEVICE_REMOVE ioctls that fail transiently:

```c
	/* FIXME Detect and warn if cookie set but should not be. */
repeat_ioctl:
	if (!(dmi = _do_dm_ioctl(dmt, command, _ioctl_buffer_double_factor,
				 ioctl_retry, &retryable))) {
		/*
		 * Async udev rules that scan devices commonly cause transient
		 * failures. Normally you'd expect the user to have made sure
		 * nothing was using the device before issuing REMOVE, so it's
		 * worth retrying in case the failure is indeed transient.
		 */
		if (retryable && dmt->type == DM_DEVICE_REMOVE &&
		    dmt->retry_remove && ++ioctl_retry <= DM_IOCTL_RETRIES) {
			usleep(DM_RETRY_USLEEP_DELAY);
			goto repeat_ioctl;
		}
		_udev_complete(dmt);
		return 0;
	}
```

But yes, it appears this may be related to udev rules. However, I would prefer that we figure out a way of solving this rather than doubling up on the retries libdm already does. Though, I actually am fairly sure this might fix a bug we've been debugging for the past few days. And while debugging I wrote #33845. /cc @vrothberg
|
As mentioned by @cyphar, we've been experiencing similar issues lately, so I want to share my thoughts on the proposed patch. Looking at the code, the deletion has a clear dependency on the removal of a device. Hence, shouldn't a deletion be forced to be deferred when deferred removal is set? We could either enforce a constraint on those options or, for instance, change startDeviceDeletionWorker():

```go
func (devices *DeviceSet) startDeviceDeletionWorker() {
	// Deferred deletion is not enabled. Don't do anything.
	if !devices.deferredDelete {
		return
	}

	logrus.Debug("devmapper: Worker to cleanup deleted devices started")
	for range devices.deletionWorkerTicker.C {
		devices.cleanupDeletedDevices()
	}
}
```
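
One way to read the first suggestion above is as a start-time constraint check on the two options. The sketch below is hypothetical (checkDeferredOptions is not the real DeviceSet setup code); it only illustrates rejecting the problematic deferredRemove==true && deferredDelete==false combination up front:

```go
package main

import (
	"errors"
	"fmt"
)

// checkDeferredOptions is an illustrative validation helper: it rejects the
// combination discussed in this thread, where removals may be deferred but
// deletions are still expected to succeed immediately.
func checkDeferredOptions(deferredRemove, deferredDelete bool) error {
	if deferredRemove && !deferredDelete {
		return errors.New("deferredRemove requires deferredDelete: " +
			"a deferred removal can leave the device busy when a " +
			"synchronous delete runs")
	}
	return nil
}

func main() {
	// The combination this PR is working around would be refused at startup.
	fmt.Println(checkDeferredOptions(true, false))
	fmt.Println(checkDeferredOptions(true, true))
}
```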
|
So far, I've found two reasons why the device may still be busy when arriving at deleteTransaction.

One is a bug in systemd which causes some mount points from the namespace of the first container started to be leaked into systemd-machined (https://bugzilla.redhat.com/show_bug.cgi?id=1465485). This bug affects RHEL 7.3, but not Fedora 25, due to the way in which the service is defined. Not sure about other distros. In this case, retrying won't help, as the mount points won't be released until systemd-machined exits.

The other one is a bit more complicated. Playing with SystemTap, I came to this: first we see an attempt to remove (in the DM sense, which is more like deactivating) the device failing with EBUSY, and immediately after, a close operation on the same device, coming from a mount-point cleanup procedure of a process named "runc:[2:INIT]" with PID 21341. The name "runc:[2:INIT]" is used temporarily by runc before exec'ing the actual containerized pseudo-init process. And, even more interestingly, PID 21341 corresponds to a different container than the one being cleaned up here, one which was started at the same time.

I think what's happening is that, when starting multiple containers in parallel, the mount namespace from one of them may be temporarily leaked into another one. Apparently, runc drops the foreign namespaces early on, but depending on how this process is scheduled, dockerd may reach deleteTransaction before that has been done, causing it to fail with EBUSY. It may also happen that the namespaces were already dropped at deleteTransaction, but were still present at deactivateDevice, causing the removal to be deferred. In that case, dockerd may reach deleteTransaction before the DM kernel worker task has been scheduled, causing it to fail with EBUSY too.

In either case, in my tests the retry loop always succeeds on the first retry, which makes me think that 200 retries may be too many. Perhaps lowering it to 3 would make more sense: if the call doesn't succeed within 300 ms, it'll probably never succeed. What do you think?
|
I forgot to describe the test setup I'm using to reproduce the issue:
On this setup, I run dockerd in the foreground with the following arguments: And then I use Terminator to launch 3 "docker run --rm busybox" instances (it's also reproducible with other container images) in parallel. At least once every five runs, something like this will be logged by dockerd: With the failing CLI instance logging a corresponding message: If I'm right, this should be easily reproducible on slightly different environments.
I don't think the problem is namespaces leaking between containers -- that really should not happen, because the namespaces are created by two completely separate What I think might be happening is this:

The thing is, I don't buy that this race window is large enough. This is it in its entirety:

```go
	if err := unix.PivotRoot(".", "."); err != nil {
		return fmt.Errorf("pivot_root %s", err)
	}
	// Currently our "." is oldroot (according to the current kernel code).
	// However, purely for safety, we will fchdir(oldroot) since there isn't
	// really any guarantee from the kernel what /proc/self/cwd will be after a
	// pivot_root(2).
	if err := unix.Fchdir(oldroot); err != nil {
		return err
	}
	// Make oldroot rprivate to make sure our unmounts don't propagate to the
	// host (and thus bork the machine).
	if err := unix.Mount("", ".", "", unix.MS_PRIVATE|unix.MS_REC, ""); err != nil {
		return err
	}
	// Preform the unmount. MNT_DETACH allows us to unmount /proc/self/cwd.
	if err := unix.Unmount(".", unix.MNT_DETACH); err != nil {
		return err
	}
```

But if this is the cause, then we'll have a fun time fixing this in runc (/cc @crosbymichael). I think if we used However, if this is all it takes to cause
I think we need to solve the cause of the issue, not its symptoms. I agree that As an aside, the issue we're seeing is slightly different (it's related to
|
@slp Can you see if this patch helps:

```diff
From 117c92745bd098bf05a69489b7b78cac6364e1d0 Mon Sep 17 00:00:00 2001
From: Aleksa Sarai <asarai@suse.de>
Date: Thu, 29 Jun 2017 01:20:23 +1000
Subject: [PATCH] rootfs: switch ms_private remount of oldroot to ms_slave

Using MS_PRIVATE meant that there was a race between the mount(2) and
the umount2(2) calls where runc inadvertently has a live reference to a
mountpoint that existed on the host (which the host cannot kill
implicitly through an unmount and peer sharing).

In particular, this means that if we have a devicemapper mountpoint and
the host is trying to delete the underlying device, the delete will fail
because it is "in use" during the race. While the race is _very_ small
(and libdm actually retries to avoid these sorts of cases) this appears
to manifest in various cases.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
---
 libcontainer/rootfs_linux.go | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/libcontainer/rootfs_linux.go b/libcontainer/rootfs_linux.go
index e2e734a8..c4dbe3d5 100644
--- a/libcontainer/rootfs_linux.go
+++ b/libcontainer/rootfs_linux.go
@@ -668,9 +668,12 @@ func pivotRoot(rootfs string) error {
 		return err
 	}

-	// Make oldroot rprivate to make sure our unmounts don't propagate to the
-	// host (and thus bork the machine).
-	if err := unix.Mount("", ".", "", unix.MS_PRIVATE|unix.MS_REC, ""); err != nil {
+	// Make oldroot rslave to make sure our unmounts don't propagate to the
+	// host (and thus bork the machine). We don't use rprivate because this is
+	// known to cause issues due to races where we still have a reference to a
+	// mount while a process in the host namespace are trying to operate on
+	// something they think has no mounts (devicemapper in particular).
+	if err := unix.Mount("", ".", "", unix.MS_SLAVE|unix.MS_REC, ""); err != nil {
 		return err
 	}

 	// Preform the unmount. MNT_DETACH allows us to unmount /proc/self/cwd.
--
2.13.1
```
|
|
Why don't we make sure that the device is removed before we try non-deferred deletion? IOW, Now callers of this call can decide when deactivateDevice() needs to be synchronous and when it can be deferred. When a device is being deactivated in the UnmountDevice() path, we can say deferred removal is fine. But when the same device is being deleted in the DeleteDevice() path, we will do synchronous deactivation (for the case where deferred removal is enabled and deferred deletion is not). Given there is already a busy loop in deactivateDevice(), we don't have to introduce another loop in DeleteDevice().
|
@rhvgoyal Does removal work if you still have live mounts? I'm going to see whether my theory about |
|
I think how many times we should try to remove a device, and whether libdm is already doing it, is a separate optimization. That does not help explain why devices can be busy for a very short period of time. So I am fine with reducing the retry loop or completely getting rid of it (if it can be proven that for all our use cases libdm is going to retry anyway). But for now, the first issue seems to be that if deferred removal is enabled but deferred deletion is not, then it is possible that something like udev might keep the device busy for a brief period of time and deletion of the device will fail. In the device deletion path, there is a call to deactivateDevice(), and if we can make that synchronous, it will take care of this small window.
|
@cyphar if you have live mounts, device removal will fail with -EBUSY. I think the runc race is a possibility on both old and new kernels. The theory is that on new kernels directory removal will remove mount points in other namespaces as well. But I think we first try to delete the device before we try directory removal. That means if there is a small window of mount-point leakage (due to pivot_root), that mount point can keep the device busy. In daemon/graphdriver/devmapper/driver.go, we delete the device first before we try removing the mnt dir.
|
I am wondering if, in the device deletion path, we should try to remove And that increases the probability of successful device deletion. Anyway, I think making deleteDevice() synchronous in the DeleteDevice() path will take care of most of the issues, as long as the device being busy is a temporary situation. If this is something long-lived, then
Yeah, probably "leaked" isn't the best word to describe this.
Aaaaand we have a winner! I've added timestamps to the SystemTap scripts, and created a new one to monitor all syscalls from processes named "runc:[2:INIT]". Take a look at this run that failed to clean up the device:

Note the gap between the return of umount and the entry of chdir, which should be called immediately after. I suspect this happens because, this time, this runc instance is the last owner of a reference to this XFS mount point, and is implicitly tasked with whatever work is needed to properly unmount this kind of filesystem. That work is executed in the return path of the syscall, via "task_work_run", and judging by the timestamps, it can be quite expensive. This gives dockerd time to catch up and reach both RemoveDeviceDeferred and DeleteDevice before the DM device has actually been closed. BTW, remounting with MS_SLAVE instead of MS_PRIVATE doesn't help (you can see in the trace that this test was already running with the former).

Please note the libdm retry path (from I know that "The Right Way" to solve this would be making sure that foreign mount points are dropped from the namespace ASAP, ideally without suffering from side effects like this, but I'm pretty sure that kind of change won't be trivial. On the other hand, a (sane) retry loop is way simpler and safer, can be easily backported to stable versions, and can eventually be dropped when the actual root cause is fixed for good.
I feel like I should be happy, but I don't feel good about this one. 😸
Well that's frustrating (as an aside, I would really like to see what the mount tables look like for the two processes). I think the "proper" fix for this problem would be that Docker/containerd mounts the As you said though, this is quite complicated to do properly and the symptomatic fix is much simpler. I think doing a busy loop waiting for the device to have been removed (which is what @rhvgoyal is proposing) is probably the most sane course of action, barring us actually making image handling sane.
Yeah, I realised this after re-reading the issue a few times. We've been hitting a similar problem with |
This is the mount table before and after (it didn't change): And yes, it does indeed contain mount points from another container. This is the table after: There's been a noticeable cleanup. Also, the foreign mount points are not there anymore.
While I'm not really familiar with containerd/runc, this sounds like a good idea to me. On the other hand, probably the kernel should provide more facilities to make this work a bit easier, something like being able to "tag" mount points, which IIRC is what Solaris does with zones. |
I don't think this will help. From PoV of the process working on deleting the layers, the mount point has already been unmounted. The problem is that the DM device is still open by a filesystem still active in the mount namespace of a foreign container, so it can't be removed/deleted.
This is what I initially proposed in the first BZ, but now I don't think it's worth the effort. The decision to make deleteDevice() synchronous must be taken from a place where we can check whether forceRemove is true, which implies propagating the decision through multiple layers until reaching the corresponding graphdriver. But, in the end, the actual effect would be retrying in devices.removeDevice() instead of doing it in devices.deleteTransaction(). Maybe I'm missing something, but IMHO it's not worth touching so many places for little-to-no benefit.
|
ping @slp #33846 (comment) |
|
@thaJeztah @rhvgoyal Sure, closing this PR. |
|
Great! Thanks everyone 👍 |