Use a retry loop for DeleteDevice in deleteTransaction by slp · Pull Request #33846 · moby/moby

slp · 2017-06-27T15:09:52Z

On an environment using the devmapper driver with deferredRemove but
without deferredDelete, when a container running with the auto-removal
option ("run --rm") exits, the call to devicemapper.DeleteDevice in
deleteTransaction may fail with EBUSY, as removal of the device has been
deferred while deletion is expected to succeed immediately.

This change implements a retry loop for DeleteDevice, similar to the one
already present in removeDevice, which is only used when running with
the combination deferredRemove==true && deferredDelete=false.

In my tests, this retry loop always succeeded in one of the first 5
iterations.

Signed-off-by: Sergio Lopez slp@sinrega.org

On an environment using the devmapper driver with deferredRemove but without deferredDelete, when a container running with the auto-removal option ("run --rm") exits, the call to devicemapper.DeleteDevice in deleteTransaction may fail with EBUSY, as removal of the device has been deferred while deletion is expected to succeed immediately. This change implements a retry loop for DeleteDevice, similar to the one already present in removeDevice, which is only used when running with the combination deferredRemove==true && deferredDelete=false. In my tests, this retry loop always succeeded in one of the first 5 iterations. Signed-off-by: Sergio Lopez <slp@sinrega.org>

rhvgoyal · 2017-06-27T16:01:26Z

So what problem are we trying to solve here? User asked for synchronous deletion and if that fails (because deferred removal has not succeeded yet), so be it.

Is this is a common problem. Are you really running into cases where deferred removal is keeping device busy for a short while.

My understanding is that deferred removal will probably remove device immediately if device is free at the time of scheduling removal. If that's the case then this patch is not required. Because there is no guarantee that device will be free soon.

slp · 2017-06-27T17:03:29Z

In my test environment, I've empirically determined that when devicemapper.DeleteDevice fails with EBUSY, most time it success on the second try.

I did not have time yet to determine why the DM device is still busy on the first try, but I could work on it tomorrow.

rhvgoyal · 2017-06-27T20:50:25Z

@slp, if deferred deleted device is temporarily busy and becomes free very soon, then it makes sense to add a loop in DeleteDevice(). That will make "docker --rm" experience better.

Is it easy to reproduce? Can I just do "docker run --rm -ti fedora bash" in a loop and it should reproduce? I want to see it happening first. Not sure what's keeping the device busy though. udev rules?

cyphar · 2017-06-28T07:27:21Z

We do see this as well on some systems, however libdm already does this internally so we need to be careful to avoid retrying far too liberally (from libdm/ioctl/libdm-iface.c):

	/* FIXME Detect and warn if cookie set but should not be. */
repeat_ioctl:
	if (!(dmi = _do_dm_ioctl(dmt, command, _ioctl_buffer_double_factor,
				 ioctl_retry, &retryable))) {
		/*
		 * Async udev rules that scan devices commonly cause transient
		 * failures.  Normally you'd expect the user to have made sure
		 * nothing was using the device before issuing REMOVE, so it's
		 * worth retrying in case the failure is indeed transient.
		 */
		if (retryable && dmt->type == DM_DEVICE_REMOVE &&
		    dmt->retry_remove && ++ioctl_retry <= DM_IOCTL_RETRIES) {
			usleep(DM_RETRY_USLEEP_DELAY);
			goto repeat_ioctl;
		}

		_udev_complete(dmt);
		return 0;
	}

But yes, it appears this may be related to udev rules. However, I would prefer if we can figure out a way of solving this than doubling-up on the retries done through libdm. If you read the above comment it's clear that you shouldn't get EBUSY and it's a bug if you get it, and that "as a user" we should be making sure nothing is using the device.

Though, I actually am fairly sure this might fix a bug we've been debugging for the past few days. And while debugging I wrote #33845.

/cc @vrothberg

vrothberg · 2017-06-28T09:16:25Z

As mentioned by @cyphar, we've been experiencing similar issues lately, so I want to share my thoughts on the proposed patch.

When looking at the code, the deletion has a clear dependency on the removal of a device. Hence, shouldn't a deletion be forced to deferral when deferred removal is set?

We could either enforce a constraint on those options or, for instance, change startDeviceDeletionWorker() (daemon/graphdriver/devmapper/deviceset.go) to also start the DeletionWorker when devices.deferredRemove is set. Personally, the latter seems more like a symptomatic fix to me.

694 func (devices *DeviceSet) startDeviceDeletionWorker() {                      
695         // Deferred deletion is not enabled. Don't do anything.              
696         if !devices.deferredDelete {                                         
697                 return                                                       
698         }                                                                    
699                                                                              
700         logrus.Debug("devmapper: Worker to cleanup deleted devices started") 
701         for range devices.deletionWorkerTicker.C {                           
702                 devices.cleanupDeletedDevices()                              
703         }                                                                    
704 }

slp · 2017-06-28T13:59:30Z

So far, I found two reasons for which the device may still be busy when arriving at deleteTranstacion.

One is a bug in systemd which causes some mount points from the namespace of the first container started to be leaked into systemd-machined (https://bugzilla.redhat.com/show_bug.cgi?id=1465485). This bug affects RHEL7.3, but not Fedora 25, due to the way in which the service is defined. Not sure about other distros.

In this case, retrying won't help as the mount points won't be released until systemd-machined exits.

The other one is a bit more complicated. Playing with SystemTap, I came to this:

dm_remove EBUSY: (0xffff88d5c998b000) (dockerd)
dm_blk_close: (0xffff88d5c998b000) - 21341 (runc:[2:INIT])
 0xffffffff99666390 : dm_blk_close+0x20/0x80 [kernel]
 0xffffffff9928dc74 : __blkdev_put+0x234/0x290 [kernel] (inexact)
 0xffffffffc019b03e [xfs] (inexact)
 0xffffffff9928dd1c : blkdev_put+0x4c/0x110 [kernel] (inexact)
 0xffffffff992533c1 : kill_block_super+0x41/0x70 [kernel] (inexact)
 0xffffffff99253703 : deactivate_locked_super+0x43/0x70 [kernel] (inexact)
 0xffffffff9925378a : deactivate_super+0x5a/0x60 [kernel] (inexact)
 0xffffffff992722cf : cleanup_mnt+0x3f/0x90 [kernel] (inexact)
 0xffffffff99272362 : __cleanup_mnt+0x12/0x20 [kernel] (inexact)
 0xffffffff990bf070 : task_work_run+0x80/0xa0 [kernel] (inexact)
 0xffffffff990032d2 : exit_to_usermode_loop+0xc2/0xd0 [kernel] (inexact)
 0xffffffff99003be1 : syscall_return_slowpath+0xa1/0xb0 [kernel] (inexact)
 0xffffffff998025fa : entry_SYSCALL_64_fastpath+0xa2/0xa4 [kernel] (inexact)

Here, first we see an attempt to remove (in DM sense, which is more like deactivating) the device failing with EBUSY, and immediately after a close operation of the same device, coming from a mount points cleanup procedure of a process named "runc:[2:INIT]" with PID 21341.

The name "runc:[2:INIT]" is used temporarily by runc before exec'ing the actual containerized pseudo-init process. And, even more interestingly, PID 21341 corresponds to a different container to the one which is being cleaned here, which was started at the same time.

I think what's happening here is that, when starting multiple containers in parallel, the mount namespace from one of them may be temporarily leaked into another one. Apparently, runc drops the foreign namespaces early on, but depending on how is this process scheduled, dockerd may reach deleteTransaction before it has been done, causing it to fail with EBUSY.

It may also happen that the namespaces were dropped at deleteTransaction, but were still present at deactivateDevice, causing the removal to be deferred. In this case, dockerd may reach deleteTransaction before the DM kernel work task has been scheduled yet, causing it to fail to EBUSY too.

In either case, in my tests the retry loop always succeeds on the first retry, which makes me thing that the 200 retries may be too much. Perhaps lowering to 3 would make more sense, as if the call won't succeed in the 300 ms, it'll probably never succeed.

What do you think?

slp · 2017-06-28T14:08:26Z

I forgot to describe the test setup I'm using for reproducing the issue:

Fedora 25
kernel 4.8.6-300.fc25.x86_64
VM with 4 cores and 8GB of RAM running under KVM
docker built from upstream sources

On this setup, I run dockerd foreground with the following arguments:

sudo ./bundles/17.06.0-dev/dynbinary-daemon/dockerd --add-runtime oci=/var/tmp/binary-daemon/docker-runc --default-runtime=oci --containerd /run/containerd.sock --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/var/tmp/binary-daemon/docker-proxy --selinux-enabled --log-driver=journald --storage-driver devicemapper --storage-opt dm.fs=xfs --storage-opt dm.thinpooldev=/dev/mapper/fedora-docker--pool --storage-opt dm.use_deferred_removal=true --storage-opt dm.use_deferred_deletion=false

And then I use Terminator to launch 3 "docker run --rm busybox" (it's also reproducible with other container images) in parallel. At least once every five runs, something like this will be logged by dockerd:

ERRO[1687] Error removing mounted layer a9f5346229fa49055811938d88cf1f4bbd353757ffb6da65faeb030477741b53: failed to remove device d3b939e7cc1fc6a80affb3d69b8971448f27ad77b6e92ecf9bd129fe8eea96f8: Device is Busy 
ERRO[1687] Handler for DELETE /v1.24/containers/a9f5346229fa49055811938d88cf1f4bbd353757ffb6da65faeb030477741b53 returned error: driver "devicemapper" failed to remove root filesystem for a9f5346229fa49055811938d88cf1f4bbd353757ffb6da65faeb030477741b53: failed to remove device d3b939e7cc1fc6a80affb3d69b8971448f27ad77b6e92ecf9bd129fe8eea96f8: Device is Busy

With the failing CLI instance logging a corresponding message:

Error response from daemon: driver "devicemapper" failed to remove root filesystem for a9f5346229fa49055811938d88cf1f4bbd353757ffb6da65faeb030477741b53: failed to remove device d3b939e7cc1fc6a80affb3d69b8971448f27ad77b6e92ecf9bd129fe8eea96f8: Device is Busy

If I'm right, this should be easily reproducible on slightly different environments.

cyphar · 2017-06-28T15:16:07Z

@slp

The name "runc:[2:INIT]" is used temporarily by runc before exec'ing the actual containerized pseudo-init process.

runc:[2:INIT] does more than just exec, it does the entire setup of the container's mounts and so on. runc:[2:INIT] is the first point in the runc execution when you can say that the container "exists" and is being set up from the "inside".

[...] And, even more interestingly, PID 21341 corresponds to a different container to the one which is being cleaned here, which was started at the same time. I think what's happening here is that, when starting multiple containers in parallel, the mount namespace from one of them may be temporarily leaked into another one.

I don't think the problem is leaking between containers namespaces -- that really should not happen, because the namespaces are created by two completely separate runc processes running under two completely different containerd-shims (they're several hops away from each other).

What I think might be happening is this:

runc's inherited namespace is containerd's mount namespace (which is the same as Docker's).
Thus, runc will see all of the storage driver mountpoints when it's being started.
So, when it does an unshare its namespace will contain those other storage drivers. This is fine because the mounts aren't MS_PRIVATE so any unmounts in the host are not a problem.
However, after runc's pivot_root before doing the unmount we remount the old root as MS_PRIVATE so that the unmount doesn't mess with the host. This means that in this race window we have a private reference to the devicemapper mount.
At this point, dockerd has unmounted the filesystem and is trying to delete the mount.

The thing is, I don't buy that this race window is large enough. This is it in it's entirety:

	if err := unix.PivotRoot(".", "."); err != nil {
		return fmt.Errorf("pivot_root %s", err)
	}

	// Currently our "." is oldroot (according to the current kernel code).
	// However, purely for safety, we will fchdir(oldroot) since there isn't
	// really any guarantee from the kernel what /proc/self/cwd will be after a
	// pivot_root(2).
	if err := unix.Fchdir(oldroot); err != nil {
		return err
	}

	// Make oldroot rprivate to make sure our unmounts don't propagate to the
	// host (and thus bork the machine).
	if err := unix.Mount("", ".", "", unix.MS_PRIVATE|unix.MS_REC, ""); err != nil {
		return err
	}
	// Preform the unmount. MNT_DETACH allows us to unmount /proc/self/cwd.
	if err := unix.Unmount(".", unix.MNT_DETACH); err != nil {
		return err
	}

But if this is the cause then we'll have a fun time fixing this in runc (/cc @crosbymichael). I think if we used MS_SLAVE this might be solved but I would have to check that this wouldn't break machines inadvertently.

However, if this is all it takes to cause devicemapper to break, this is a kernel bug. Recently (3.17 IIRC) they made it so that rmdir works for a directory even if it is a mountpoint in another mount namespace. The argument was specifically to avoid errors like this where you could cause a devicemapper script to be DoS'd by just creating an unprivileged mount namespace and doing an MS_PRIVATE remount.

In either case, in my tests the retry loop always succeeds on the first retry, which makes me thing that the 200 retries may be too much. Perhaps lowering to 3 would make more sense, as if the call won't succeed in the 300 ms, it'll probably never succeed.

I think we need to solve the cause of the issue, not it's symptoms. I agree that 200 retries is overkill, but doing additional retries in above libdm is just patching over the underlying issue (which is that for some reason parallel containers are causing inter-namespace interference).

As an aside, the issue we're seeing is slightly different (it's related to dm_task_run failed) but I'll see if we can reproduce it in a similar way to you.

cyphar · 2017-06-28T15:28:00Z

@slp Can you see if this runc patch fixes the problem for you (it applies on master but shouldn't need backporting). Please don't run this on any important machines, changes like this have caused working machines to need a reboot to fix the mountpoints:

From 117c92745bd098bf05a69489b7b78cac6364e1d0 Mon Sep 17 00:00:00 2001
From: Aleksa Sarai <asarai@suse.de>
Date: Thu, 29 Jun 2017 01:20:23 +1000
Subject: [PATCH] rootfs: switch ms_private remount of oldroot to ms_slave

Using MS_PRIVATE meant that there was a race between the mount(2) and
the umount2(2) calls where runc inadvertently has a live reference to a
mountpoint that existed on the host (which the host cannot kill
implicitly through an unmount and peer sharing).

In particular, this means that if we have a devicemapper mountpoint and
the host is trying to delete the underlying device, the delete will fail
because it is "in use" during the race. While the race is _very_ small
(and libdm actually retries to avoid these sorts of cases) this appears
to manifest in various cases.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
---
 libcontainer/rootfs_linux.go | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/libcontainer/rootfs_linux.go b/libcontainer/rootfs_linux.go
index e2e734a8..c4dbe3d5 100644
--- a/libcontainer/rootfs_linux.go
+++ b/libcontainer/rootfs_linux.go
@@ -668,9 +668,12 @@ func pivotRoot(rootfs string) error {
 		return err
 	}
 
-	// Make oldroot rprivate to make sure our unmounts don't propagate to the
-	// host (and thus bork the machine).
-	if err := unix.Mount("", ".", "", unix.MS_PRIVATE|unix.MS_REC, ""); err != nil {
+	// Make oldroot rslave to make sure our unmounts don't propagate to the
+	// host (and thus bork the machine). We don't use rprivate because this is
+	// known to cause issues due to races where we still have a reference to a
+	// mount while a process in the host namespace are trying to operate on
+	// something they think has no mounts (devicemapper in particular).
+	if err := unix.Mount("", ".", "", unix.MS_SLAVE|unix.MS_REC, ""); err != nil {
 		return err
 	}
 	// Preform the unmount. MNT_DETACH allows us to unmount /proc/self/cwd.
-- 
2.13.1

rhvgoyal · 2017-06-28T19:37:55Z

Why don't we make sure that device is removed before we try non-deferred deletion. IOW, devices.deactivateDevice() function could be modified to pass in extra parameter which says whether deactivation need to be synchronous or deferred.

Now callers of this call can decide when deactivateDevice() need to be synchronous and when it can be deferred. When a device is being deactivated in UnmountDevice() path, we can say deferred removal is fine. But when same device is being deleted in DeleteDevice() path, we will say do the synchronous deactivation (for the case of when deferred removal is enabled and deferred deletion is not enabled).

Given there is already a busy loop in deactivateDevice(), we don't have to introduce another loop in DeleteDevice().

cyphar · 2017-06-28T19:53:15Z

@rhvgoyal Does removal work if you still have live mounts? I'm going to see whether my theory about runc and the race window is sane later today (I somewhat doubt it to be honest), but is it possible that the deferred removal will succeed (because it's deferred) but the deletion will not work because it hasn't been removed (on account of there still being live mounts)? I'm not very familiar with libdm -- is there a way to check if a device has been removed and should we just wait until it has (which sort of makes the separation of the two options meaningless because they become blocking if one of them is disabled)?

rhvgoyal · 2017-06-28T19:53:30Z

I think how many times we should try to remove a device and whether libdm is already doing it is a separate optimization. That does not help explain why devices can be busy for a very short period of time. So I am fine reducing the retry loop or completely getting rid of it (if it can be proven that for all our use case libdm is going to retry anyway).

But for now, first issue seems to be that if deferred removal is enabled but deferred deletion is not enabled, then it is possible that something like udev might keep the device busy for a brief period of time and deletion of device will fail.

If device deletion path, there is a call to deactivateDevice() and if we can make that synchronous, that will take care of this small window.

rhvgoyal · 2017-06-28T19:58:47Z

@cyphar if you have live mounts, device removal will fail with -EBUSY.

I think runc race is a possibility on both old and new kernels. Theory is that on new kernels directory removal will remove mount points in other namespace as well. But I think we first try to delete device before we try directory removal. That means if there is a small window of mount point leak (due to pivot_root), that mount point can keep device busy.

in daemon/graphdriver/devmapper/driver.go, we delete device first before we try removing mnt dir.

    // This assumes the device has been properly Get/Put:ed and thus is unmounted
    if err := d.DeviceSet.DeleteDevice(id, false); err != nil {
            return fmt.Errorf("failed to remove device %s: %v", id, err)
    }

    mp := path.Join(d.home, "mnt", id)
    if err := system.EnsureRemoveAll(mp); err != nil {
            return err
    }

rhvgoyal · 2017-06-28T20:32:39Z

I am wondering if in device deletion path, should we try to remove mnt directory first and then do DeleteDevice(). That way, if this directory is a mount point in some other mount namespace, those mount points will go away. (With new kernels and assuming that some process is not inside that mount point).

And that increases the probability of successful device deletion.

Anyway, I think making deleteDevice() synchronous in DeleteDevice() path will take care of most of the issues as long as device being busy is a temporary situation. If this is something which is long lived, then
doing directory removal before device deletion should help a bit.

slp · 2017-06-28T23:05:00Z

@cyphar

I don't think the problem is leaking between containers namespaces -- that really should not happen, because the namespaces are created by two completely separate runc processes running under two completely different containerd-shims (they're several hops away from each other).

Yeah, probably "leaked" isn't the best word to describe this.

What I think might be happening is this:

runc's inherited namespace is containerd's mount namespace (which is the same as Docker's).
Thus, runc will see all of the storage driver mountpoints when it's being started.
So, when it does an unshare its namespace will contain those other storage drivers. This is fine because the mounts aren't MS_PRIVATE so any unmounts in the host are not a problem.
However, after runc's pivot_root before doing the unmount we remount the old root as MS_PRIVATE so that the unmount doesn't mess with the host. This means that in this race window we have a private reference to the devicemapper mount.
At this point, dockerd has unmounted the filesystem and is trying to delete the mount.

Aaaaand we have a winner!

I've added timestamps to the SystemTap scripts, and created a new one to monitor all syscalls from processes named "runc:[2:INIT]". Take a look at this run that failed to cleanup the device:

runc

1498689712.875296 runc:[2:INIT][3303]mount("", ".", "", MS_REC|MS_SLAVE, 0x0) ENTRY
1498689712.875309 runc:[2:INIT][3303]mount RETURN=0
1498689712.875314 runc:[2:INIT][3303]umount(".", MNT_DETACH) ENTRY
1498689712.889023 runc:[2:INIT][3303]umount RETURN=0
1498689713.097825 runc:[2:INIT][3303]chdir("/") ENTRY

dmtrace

dm_remove EBUSY: (0xffff88d4a876e000) (dockerd)
1498689713.097621 : dm_blk_close: (0xffff88d4a876e000) - 3303 (runc:[2:INIT])
 0xffffffff99666390 : dm_blk_close+0x20/0x80 [kernel]
 0xffffffff9928dc74 : __blkdev_put+0x234/0x290 [kernel] (inexact)
 0xffffffffc01b4f58 [xfs] (inexact)
 0xffffffff9928dd1c : blkdev_put+0x4c/0x110 [kernel] (inexact)
 0xffffffff992533c1 : kill_block_super+0x41/0x70 [kernel] (inexact)
 0xffffffff99253703 : deactivate_locked_super+0x43/0x70 [kernel] (inexact)
 0xffffffff9925378a : deactivate_super+0x5a/0x60 [kernel] (inexact)
 0xffffffff992722cf : cleanup_mnt+0x3f/0x90 [kernel] (inexact)
 0xffffffff99272362 : __cleanup_mnt+0x12/0x20 [kernel] (inexact)
 0xffffffff990bf070 : task_work_run+0x80/0xa0 [kernel] (inexact)
 0xffffffff990032d2 : exit_to_usermode_loop+0xc2/0xd0 [kernel] (inexact)
 0xffffffff99003d47 : do_syscall_64+0x157/0x160 [kernel] (inexact)
 0xffffffff99802621 : return_from_SYSCALL_64+0x0/0x6a [kernel] (inexact)

Note the gap between the return of umount and the entry of chdir, which should be called immediately after. I suspect this happens because, this time, this runc instance is last owner of a reference to this XFS mount point, and it's implicitly tasked with whatever work is needed to properly unmount this kind of filesystem.

This task is executed in the return path of the syscall, via "task_work_run". And judging by the timestamps, can be quite expensive. This gives time to dockerd to catch up and reach both RemoveDeviceDeferred and DeleteDevice before the DM device has actually been closed.

BTW, remounting with MS_SLAVE instead of MS_PRIVATE doesn't help (you can see in the trace that this test was already running with the first).

I think we need to solve the cause of the issue, not it's symptoms. I agree that 200 retries is overkill, but doing additional retries in above libdm is just patching over the underlying issue (which is that for some reason parallel containers are causing inter-namespace interference).

Please note the libdm retry path (from libdm/ioctl/libdm-iface.c) referenced in a previous comment doesn't apply to this case, as AFAIK we're using dm_task_deferred_remove, not dm_task_retry_remove, so dmt->retry_remove should be 0.

I know that "The Right Way" to solve this would be making sure that foreign mount points are dropped from the namespace ASAP, ideally without suffering from side-effects like this, but I'm pretty sure this kind of change won't be trivial.

On the other hand, a (sane) retry loop is way simpler and safer, can be easily backported to stable versions, and can be eventually dropped when the actual root cause if fixed for good.

cyphar · 2017-06-29T04:56:12Z

Aaaaand we have a winner!

I feel like I should be happy, but I don't feel good about this one. 😸

BTW, remounting with MS_SLAVE instead of MS_PRIVATE doesn't help (you can see in the trace that this test was already running with the first).

Well that's frustrating (as an aside, I would really like to see what the mount tables look like for the two processes). I think the "proper" fix for this problem would be that Docker/containerd mounts the devicemapper image inside a new mount namespace for every container that runc will then inherit -- you can't drop "foreign mounts" until you've hit pivot_root which is very late in the setup of a container. I have some idea of what plans @stevvooe has for containerd and image mounting, but this is something that he will have to consider and weigh in on.

As you said though, this is quite complicated to do properly and the symptomatic fix is much simpler. I think doing a busy loop waiting for the device to have been removed (which is what @rhvgoyal is proposing) is probably the most sane course of action, barring us actually making image handling sane.

Please note the libdm retry path (from libdm/ioctl/libdm-iface.c) referenced in a previous comment doesn't apply to this case, as AFAIK we're using dm_task_deferred_remove, not dm_task_retry_remove, so dmt->retry_remove should be 0.

Yeah, I realised this after re-reading the issue a few times. We've been hitting a similar problem with libdm when you don't have deferred removal or deletion so I assumed this issue was the same.

slp · 2017-06-29T08:54:18Z

@cyphar

Well that's frustrating (as an aside, I would really like to see what the mount tables look like for the two processes).

This is the mount table before and after (it didn't change) unix.Mount("", ".", "", unix.MS_SLAVE|unix.MS_REC, ""):

/dev/mapper/fedora-root / xfs rw,seclabel,relatime,attr2,inode64,noquota 0 0
devtmpfs /dev devtmpfs rw,seclabel,nosuid,size=4075736k,nr_inodes=1018934,mode=755 0 0
tmpfs /dev/shm tmpfs rw,seclabel,nosuid,nodev 0 0
devpts /dev/pts devpts rw,seclabel,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,seclabel,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=37,pgrp=0,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=16418 0 0
nfsd /proc/fs/nfsd nfsd rw,relatime 0 0
sysfs /sys sysfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /sys/fs/cgroup tmpfs ro,seclabel,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
pstore /sys/fs/pstore pstore rw,seclabel,nosuid,nodev,noexec,relatime 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
selinuxfs /sys/fs/selinux selinuxfs rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,seclabel,relatime 0 0
tmpfs /run tmpfs rw,seclabel,nosuid,nodev,mode=755 0 0
tmpfs /run/user/1000 tmpfs rw,seclabel,nosuid,nodev,relatime,size=817528k,mode=700,uid=1000,gid=1000 0 0
nsfs /run/docker/netns/f6093fb5b747 nsfs rw 0 0
tmpfs /tmp tmpfs rw,seclabel,nosuid,nodev 0 0
/dev/vda1 /boot ext4 rw,seclabel,relatime,data=ordered 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
/dev/mapper/fedora-root /var/lib/docker/devicemapper xfs rw,seclabel,relatime,attr2,inode64,noquota 0 0
/dev/mapper/docker-253:0-17074721-203aceb50e88c6c2dabba4ab9c051e32443d35ad7581677a1646bbe6cb478076 /var/lib/docker/devicemapper/mnt/203aceb50e88c6c2dabba4ab9c051e32443d35ad7581677a1646bbe6cb478076 xfs rw,context="system_u:object_r:container_file_t:s0:c138,c629",relatime,nouuid,attr2,inode64,sunit=1024,swidth=1024,noquota 0 0
/dev/mapper/docker-253:0-17074721-51aa1e506ca845b8725024a7a668464c420ca17dbd8b15c4765556c0d293167a /var/lib/docker/devicemapper/mnt/51aa1e506ca845b8725024a7a668464c420ca17dbd8b15c4765556c0d293167a xfs rw,context="system_u:object_r:container_file_t:s0:c753,c826",relatime,nouuid,attr2,inode64,sunit=1024,swidth=1024,noquota 0 0
shm /var/lib/docker/containers/9d802fa05bf74520dd15a7575fa7c67fdba40e309c9d704f29c9fb5cebc88f13/shm tmpfs rw,context="system_u:object_r:container_file_t:s0:c138,c629",nosuid,nodev,noexec,relatime,size=65536k 0 0
shm /var/lib/docker/containers/2fe59eedd79ae64a7e5505ded70b51030c01285425a5c55927d10f2f70cc4a4a/shm tmpfs rw,context="system_u:object_r:container_file_t:s0:c753,c826",nosuid,nodev,noexec,relatime,size=65536k 0 0
/dev/mapper/docker-253:0-17074721-51aa1e506ca845b8725024a7a668464c420ca17dbd8b15c4765556c0d293167a / xfs rw,context="system_u:object_r:container_file_t:s0:c753,c826",relatime,nouuid,attr2,inode64,sunit=1024,swidth=1024,noquota 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev tmpfs rw,context="system_u:object_r:container_file_t:s0:c753,c826",nosuid,mode=755 0 0
devpts /dev/pts devpts rw,context="system_u:object_r:container_file_t:s0:c753,c826",nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666 0 0
sysfs /sys sysfs ro,seclabel,nosuid,nodev,noexec,relatime 0 0
tmpfs /sys/fs/cgroup tmpfs ro,context="system_u:object_r:container_file_t:s0:c753,c826",nosuid,nodev,noexec,relatime,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup ro,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup ro,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup ro,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup ro,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/memory cgroup ro,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/devices cgroup ro,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/blkio cgroup ro,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup ro,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/cpuset cgroup ro,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/freezer cgroup ro,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/pids cgroup ro,nosuid,nodev,noexec,relatime,pids 0 0
mqueue /dev/mqueue mqueue rw,seclabel,nosuid,nodev,noexec,relatime 0 0
/dev/mapper/fedora-root /etc/resolv.conf xfs rw,seclabel,relatime,attr2,inode64,noquota 0 0
/dev/mapper/fedora-root /etc/hostname xfs rw,seclabel,relatime,attr2,inode64,noquota 0 0
/dev/mapper/fedora-root /etc/hosts xfs rw,seclabel,relatime,attr2,inode64,noquota 0 0
shm /dev/shm tmpfs rw,context="system_u:object_r:container_file_t:s0:c753,c826",nosuid,nodev,noexec,relatime,size=65536k 0 0

And yes, it does indeed contain mount points from another container.

This is the table after unix.Unmount(".", unix.MNT_DETACH):

/dev/mapper/docker-253:0-17074721-51aa1e506ca845b8725024a7a668464c420ca17dbd8b15c4765556c0d293167a / xfs rw,context="system_u:object_r:container_file_t:s0:c753,c826",relatime,nouuid,attr2,inode64,sunit=1024,swidth=1024,noquota 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev tmpfs rw,context="system_u:object_r:container_file_t:s0:c753,c826",nosuid,mode=755 0 0
devpts /dev/pts devpts rw,context="system_u:object_r:container_file_t:s0:c753,c826",nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666 0 0
sysfs /sys sysfs ro,seclabel,nosuid,nodev,noexec,relatime 0 0
tmpfs /sys/fs/cgroup tmpfs ro,context="system_u:object_r:container_file_t:s0:c753,c826",nosuid,nodev,noexec,relatime,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup ro,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup ro,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup ro,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup ro,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/memory cgroup ro,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/devices cgroup ro,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/blkio cgroup ro,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup ro,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/cpuset cgroup ro,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/freezer cgroup ro,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/pids cgroup ro,nosuid,nodev,noexec,relatime,pids 0 0
mqueue /dev/mqueue mqueue rw,seclabel,nosuid,nodev,noexec,relatime 0 0
/dev/mapper/fedora-root /etc/resolv.conf xfs rw,seclabel,relatime,attr2,inode64,noquota 0 0
/dev/mapper/fedora-root /etc/hostname xfs rw,seclabel,relatime,attr2,inode64,noquota 0 0
/dev/mapper/fedora-root /etc/hosts xfs rw,seclabel,relatime,attr2,inode64,noquota 0 0
shm /dev/shm tmpfs rw,context="system_u:object_r:container_file_t:s0:c753,c826",nosuid,nodev,noexec,relatime,size=65536k 0 0

There's been a noticeable clean up. Also, foreign mount points are not there anymore.

I think the "proper" fix for this problem would be that Docker/containerd mounts the devicemapper image inside a new mount namespace for every container that runc will then inherit -- you can't drop "foreign mounts" until you've hit pivot_root which is very late in the setup of a container. I have some idea of what plans @stevvooe has for containerd and image mounting, but this is something that he will have to consider and weigh in on.

While I'm not really familiar with containerd/runc, this sounds like a good idea to me. On the other hand, probably the kernel should provide more facilities to make this work a bit easier, something like being able to "tag" mount points, which IIRC is what Solaris does with zones.

slp · 2017-06-29T09:09:24Z

@rhvgoyal

I am wondering if in device deletion path, should we try to remove mnt directory first and then do DeleteDevice(). That way, if this directory is a mount point in some other mount namespace, those mount points will go away. (With new kernels and assuming that some process is not inside that mount point).

I don't think this will help. From PoV of the process working on deleting the layers, the mount point has already been unmounted. The problem is that the DM device is still open by a filesystem still active in the mount namespace of a foreign container, so it can't be removed/deleted.

Anyway, I think making deleteDevice() synchronous in DeleteDevice() path will take care of most of the issues as long as device being busy is a temporary situation. If this is something which is long lived, then
doing directory removal before device deletion should help a bit.

This is what I initially proposed in the first BZ, but now I don't think it's worth the effort. The decision to make deleteDevice() synchronous must be taken from a place where we can check if forceRemove is true, which implies propagating the decision through multiple layers until reaching the corresponding graphdriver.

But, in the end, the actual effect would be retrying in devices.removeDevice(), instead of doing it in device.deleteTransaction(). Maybe I'm missing something, but IMHO is not worth touching so many places for a little-to-none benefit.

rhvgoyal · 2017-06-29T14:47:18Z

@slp I have created a PR for forcing sync device removal if needed. Can you please give it a try and see if this fixes the issue you are facing. I have only compile tested it so far.

#33877

thaJeztah · 2017-07-13T02:24:18Z

ping @slp #33846 (comment)

rhvgoyal · 2017-07-13T11:39:52Z

I think @slp verified that other PR fixes this. So I think we can close this one. @slp WDYT?

slp · 2017-07-13T12:09:05Z

@thaJeztah @rhvgoyal Sure, closing this PR.

thaJeztah · 2017-07-14T08:14:09Z

Great! Thanks everyone 👍

GordonTheTurtle added the status/0-triage label Jun 27, 2017

cyphar mentioned this pull request Jun 28, 2017

rootfs: switch ms_private remount of oldroot to ms_slave opencontainers/runc#1500

Merged

cyphar mentioned this pull request Jun 29, 2017

devicemapper: Wait for device removal if deferredRemoval=true and deferredDeletion=… #33877

Merged

GordonTheTurtle assigned aluzzardi Jul 12, 2017

slp closed this Jul 13, 2017

Conversation

slp commented Jun 27, 2017

Uh oh!

rhvgoyal commented Jun 27, 2017

Uh oh!

slp commented Jun 27, 2017

Uh oh!

rhvgoyal commented Jun 27, 2017

Uh oh!

cyphar commented Jun 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vrothberg commented Jun 28, 2017

Uh oh!

slp commented Jun 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

slp commented Jun 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cyphar commented Jun 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cyphar commented Jun 28, 2017

Uh oh!

rhvgoyal commented Jun 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cyphar commented Jun 28, 2017

Uh oh!

rhvgoyal commented Jun 28, 2017

Uh oh!

rhvgoyal commented Jun 28, 2017

Uh oh!

rhvgoyal commented Jun 28, 2017

Uh oh!

slp commented Jun 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cyphar commented Jun 29, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

slp commented Jun 29, 2017

Uh oh!

slp commented Jun 29, 2017

Uh oh!

rhvgoyal commented Jun 29, 2017

Uh oh!

thaJeztah commented Jul 13, 2017

Uh oh!

rhvgoyal commented Jul 13, 2017

Uh oh!

slp commented Jul 13, 2017

Uh oh!

thaJeztah commented Jul 14, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

cyphar commented Jun 28, 2017 •

edited

Loading

slp commented Jun 28, 2017 •

edited

Loading

slp commented Jun 28, 2017 •

edited

Loading

cyphar commented Jun 28, 2017 •

edited

Loading

rhvgoyal commented Jun 28, 2017 •

edited

Loading

slp commented Jun 28, 2017 •

edited

Loading

cyphar commented Jun 29, 2017 •

edited

Loading