virtcontainers: set private propagation in rootfs by devimc · Pull Request #980 · kata-containers/runtime

devimc · 2018-12-05T18:38:27Z

When overlay is used as the storage driver, the kata runtime creates a new bind
mount point for the merged directory so that this directory can be shared with
the VM through 9p. By default the mount propagation is shared, which means mount
events are propagated but umount events are not. To deal with this problem and
to avoid leaving mount points on the host once the container finishes, the mount
propagation of bind mounts should be set to private.

Depends-on: github.com/kata-containers/tests#971

fixes #794

Signed-off-by: Julio Montes <julio.montes@intel.com>

devimc · 2018-12-05T18:38:35Z

/test

Unskip docker cp test to check mount points are not left after running docker cp.

Depends-on: github.com/kata-containers/runtime#980

fixes kata-containers#970

Signed-off-by: Julio Montes <julio.montes@intel.com>

sboeuf · 2018-12-05T18:47:03Z

virtcontainers/mount.go

 		return fmt.Errorf("Could not bind mount %v to %v: %v", absSource, destination, err)
 	}

+	if err := syscall.Mount("none", destination, "", syscall.MS_PRIVATE, ""); err != nil {


Just curious: I'd like to really understand how this solves the problem, and how the PRIVATE flag helps.

http://man7.org/linux/man-pages/man8/mount.8.html

and http://man7.org/linux/man-pages/man2/mount.2.html

amshinde · 2018-12-05T21:19:03Z

Nice. So this means we do not strictly need to create a new mount namespace anymore @devimc .

amshinde

lgtm

devimc · 2018-12-05T21:30:07Z

@amshinde

> Nice. So this means we do not strictly need to create a new mount namespace anymore @devimc .

It depends. If we want to improve isolation/security we still have to, because currently the host's rootfs is visible to qemu, the shim and the proxy. This change only fixes the issues with docker cp.

sboeuf · 2018-12-06T00:47:51Z

@devimc by using this private mount, are we sure we're not breaking the mount propagation?

devimc · 2018-12-06T13:56:27Z

@kata-containers/runtime please take a look

caoruidong · 2018-12-07T01:43:48Z

Why are umount events not propagated? For mount propagation, mount and umount events should be handled the same way.

devimc · 2018-12-07T13:14:13Z

@caoruidong

> Why umount events are not propagated?

It depends on the propagation type (shared, private or slave); see http://man7.org/linux/man-pages/man2/mount.2.html

Please take a look at kata-containers/tests#971

caoruidong · 2018-12-07T16:39:50Z

I see shared is “Mount and unmount events immediately under this mount point will propagate to the other mount points that are members of this mount's peer group.”

raravena80 · 2018-12-20T22:30:11Z

@devimc any updates in this PR?

devimc · 2019-01-07T16:29:20Z

/test

devimc · 2019-01-07T16:30:02Z

added Depends-on: github.com/kata-containers/tests#971 to make sure this PR fixes #794

devimc · 2019-01-07T16:38:02Z

@caoruidong could you please add more details? I'd like to understand your concerns.

jodh-intel · 2019-01-07T17:13:58Z

Thanks for raising this @devimc. This looks like a delicate issue. If I'm reading the man pages correctly, I think the following summarises what this PR does (but please confirm @devimc et al :)...

Currently (No `MS_PRIVATE`):

| Operation | Host outcome | Guest outcome |
|---|---|---|
| unmount bind-mounted sandbox dir | bind mount removed | bind mount NOT removed (bug) |
| host creates sub-mount below bind-mounted sandbox dir | mount created | mount visible |
| guest creates new sub-mount below sandbox dir | mount visible | mount created |

New behaviour (using `MS_PRIVATE`):

| Operation | Host outcome | Guest outcome |
|---|---|---|
| unmount bind-mounted sandbox dir | bind mount removed | bind mount removed (yay!) |
| host creates sub-mount below bind-mounted sandbox dir | mount created | mount NOT visible |
| guest creates new sub-mount below sandbox dir | mount NOT visible | mount created |

Clearly, we're happy for the bug scenario to be fixed. The question is, are we happy with the change in behaviour for the other two rows in the tables? I think the answer is "yes" since:

- The host context should not be trying to modify the existing container mounts.
- If the guest creates new sub-mounts, those should not be visible to the host.

Both scenarios can (and really should) be handled directly using "docker run -v from:to" or equivalent.
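One way to check which propagation type a given mount actually has is to read it from /proc/self/mountinfo, where the optional fields before the `-` separator carry tags such as `shared:N` and `master:N`; a mount with neither tag is private. The following is a minimal sketch of that check. The function name `propagationOf` and the sample lines are hypothetical, and nothing here is taken from the kata runtime.

```go
package main

import (
	"strings"
)

// propagationOf reports the propagation type encoded in a single line of
// /proc/self/mountinfo: "shared", "slave", "shared,slave", "unbindable",
// "private", or "unknown" if the line is malformed.
func propagationOf(line string) string {
	fields := strings.Fields(line)
	// Fields 0..5 are fixed; optional fields start at index 6 and run
	// until the "-" separator.
	if len(fields) < 7 {
		return "unknown"
	}
	var shared, slave, unbindable, sawSeparator bool
	for _, f := range fields[6:] {
		if f == "-" {
			sawSeparator = true
			break
		}
		switch {
		case strings.HasPrefix(f, "shared:"):
			shared = true
		case strings.HasPrefix(f, "master:"):
			slave = true
		case f == "unbindable":
			unbindable = true
		}
	}
	switch {
	case !sawSeparator:
		return "unknown"
	case unbindable:
		return "unbindable"
	case shared && slave:
		return "shared,slave"
	case shared:
		return "shared"
	case slave:
		return "slave"
	}
	return "private"
}
```

Feeding each line of /proc/self/mountinfo to `propagationOf` and filtering on the mount point column (field 5) would show whether the sandbox bind mount is in a peer group before and after the change.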

jodh-intel · 2019-01-07T17:32:26Z

/cc @bergwolf, @sboeuf.

egernst · 2019-01-07T17:37:28Z

@mcastelino PTAL?

@jodh-intel thanks for the table - this helps a lot. I agree with your assessment.

caoruidong · 2019-01-08T01:42:12Z

@devimc this PR lgtm. It's just that, to my knowledge, mount and umount events are propagated equally within a mount ns. Maybe I'm wrong.

sboeuf · 2019-01-08T15:13:36Z

@jodh-intel
Thanks for the analysis! I have a few comments:

> If the guest creates new sub-mounts, those should not be visible to the host.

I agree with that. I don't see why a bind mount inside the guest should be propagated back to the host.

> the host context should not be trying to modify the existing container mounts.

I'm not sure about that. What about the use case where you do something like docker run -v /foo/bar:/foo/bar busybox, and later you bind mount something into /foo/bar from the host? Don't you expect this mount to be propagated?

Now, talking about the initial issue described by @devimc:

> When overlay is used as storage driver, kata runtime creates a new bind mount
> point to the merged directory, that way this directory can be shared with the
> VM through 9p. By default the mount propagation is shared, that means mount
> events are propagated, but umount events not, to deal with this problem and to
> avoid left mount points in the host once container finishes, the mount
> propagation of bind mounts should be set to private.

It looks to me that using MS_PRIVATE will simply work around the root cause. IIUC, the problem we're trying to solve here is to make sure everything is properly unmounted when the container is being stopped and removed. Since MS_PRIVATE does not propagate anything, it allows for a simpler de-init path because nothing has been propagated previously. But it sounds like the root cause is that our runtime code is not able to unmount what we've been mounting on the create/start path.

Maybe I'm missing one piece here, but I'd like to make sure it's obvious for everybody which issue we're solving here, and how we solve it.

devimc · 2019-01-08T19:29:02Z

@sboeuf

> I'm not sure about that. What about the use case where you do something like docker run -v /foo/bar:/foo/bar busybox, and later you bind mount something into /foo/bar from the host? Don't you expect this mount to be propagated?

/foo/bar is shared using 9p, not through the rootfs. As far as I know, mount events are not propagated through 9p. For example:

$ mkdir -p /tmp/dir/{bar,foo}
$ sudo mount --bind /tmp/dir/{bar,foo}
$ touch /tmp/dir/bar/file
$ ls -R /tmp/dir/{bar,foo}
/tmp/dir/bar:
file

/tmp/dir/foo:
file

$ docker run --rm -ti --runtime=kata-runtime -v /tmp/dir:/tmp/dir debian bash -c 'ls -R /tmp/dir/{bar,foo}'
/tmp/dir/bar:
file

/tmp/dir/foo:

In the container, foo is a plain directory, not a mount point, so this is another limitation of 9p.

> But it sounds like the root cause is about our runtime code not being able to unmount what we've been mounting on the create/start path.

No, the runtime doesn't mount anything there; docker does. For example, -v /foo/bar:/foo/bar is mounted and unmounted by docker in the overlay fs each time cp is used. Mount events are propagated to our bind mount point but unmount events are not; even with MS_SLAVE those events are not propagated. Having said that, the solution I propose is to not propagate any event (mount/umount). We don't need them, since volumes are shared through 9p rather than through the rootfs as runc does.

`docker cp` might bind mount some files/dirs under container rootfs without notifying runtime but expect runtime to unmount them. We need to unmount them otherwise docker will fail to clean up containers.

Depends-on: github.com/kata-containers#980

Signed-off-by: Peng Tao <bergwolf@gmail.com>
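As an illustration of the cleanup idea in the commit message above, the sketch below picks out the mount points that sit below a container rootfs and orders them deepest first, so that nested mounts can be unmounted before their parents. This is not the code from that commit: the function name `submountsUnder` and the example paths are hypothetical, and the list of mount points is assumed to have been collected elsewhere (for instance from /proc/self/mountinfo).

```go
package main

import (
	"sort"
	"strings"
)

// submountsUnder returns the mount points strictly below rootfs, ordered so
// that deeper paths come first and can be unmounted before their parents.
func submountsUnder(rootfs string, mountPoints []string) []string {
	// The trailing slash keeps "/run/rootfsx" from matching "/run/rootfs".
	prefix := strings.TrimSuffix(rootfs, "/") + "/"
	var out []string
	for _, mp := range mountPoints {
		if strings.HasPrefix(mp, prefix) && len(mp) > len(prefix) {
			out = append(out, mp)
		}
	}
	// More path separators means a deeper mount; ties are broken
	// lexically to keep the order deterministic.
	sort.Slice(out, func(i, j int) bool {
		di, dj := strings.Count(out[i], "/"), strings.Count(out[j], "/")
		if di != dj {
			return di > dj
		}
		return out[i] < out[j]
	})
	return out
}
```

Each returned path could then be passed to `syscall.Unmount(path, 0)` in order. Whether a lazy or forced unmount is appropriate is left open here.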

bergwolf · 2019-01-18T04:08:30Z

I think the rationale behind the patch is that docker cp is really a dirty hack, in that it bind-mounts a host dir under the container rootfs without notifying the runtime by any means. So it makes sense to set the rootfs mount private so that mounts made by docker cp are not propagated to it. For volumes OTOH, we use a dedicated 9pfs mountpoint for each of them and they can still propagate new mounts to the guest properly.

LGTM! Thanks @devimc!

sboeuf · 2019-01-18T08:37:40Z

/test

sboeuf · 2019-01-18T08:38:25Z

Thanks @bergwolf and @devimc for your explanations. Let's move forward with this PR and merge it when the CI turns green!

bergwolf · 2019-01-21T05:28:18Z

jenkins-ci-fedora-vsocks and jenkins-metrics-ubuntu-16-04 are broken for reasons unrelated to this PR. Merging.

devimc mentioned this pull request Dec 5, 2018

unskip docker cp "check mount points" test kata-containers/tests#970

Closed

devimc mentioned this pull request Dec 5, 2018

integration/docker: unskip docer cp test kata-containers/tests#971

Merged

sboeuf reviewed Dec 5, 2018

amshinde approved these changes Dec 5, 2018

devimc force-pushed the topic/left_mount_points branch 2 times, most recently from 75326f5 to 71cdab5 Compare January 7, 2019 16:29

devimc force-pushed the topic/left_mount_points branch from 71cdab5 to b029e44 Compare January 8, 2019 19:15

devimc mentioned this pull request Jan 17, 2019

clean up container dir #1139

Merged

bergwolf merged commit 0c09d2b into kata-containers:master Jan 21, 2019

jodh-intel mentioned this pull request Mar 7, 2019

nvdimm: support nvdimm on arm64 kernel kata-containers/packaging#377

Merged

devimc deleted the topic/left_mount_points branch April 8, 2019 14:51

8 participants
