virtcontainers: set private propagation in rootfs#980
virtcontainers: set private propagation in rootfs#980bergwolf merged 1 commit intokata-containers:masterfrom
Conversation
|
/test |
Unskip docker cp test to check mount points are not left after running docker cp. Depends-on: github.com/kata-containers/runtime#980 fixes kata-containers#970 Signed-off-by: Julio Montes <julio.montes@intel.com>
| return fmt.Errorf("Could not bind mount %v to %v: %v", absSource, destination, err) | ||
| } | ||
|
|
||
| if err := syscall.Mount("none", destination, "", syscall.MS_PRIVATE, ""); err != nil { |
There was a problem hiding this comment.
Just curious to really understand how this solves the problem, and how is the PRIVATE flag helping?
There was a problem hiding this comment.
There was a problem hiding this comment.
|
Nice. So this means we do not strictly need to create a new mount namespace anymore @devimc . |
it depends, if we want to improve the isolation/security we have to, because currently host's rootfs is available/visible for qemu,shim and proxy, this change is just to fix the issues with |
|
@devimc by using this private mount, are we sure we're not breaking the mount propagation? |
|
@kata-containers/runtime please take a look |
|
Why umount events are not propagated? For mount propagation, mount and umount events should be same. |
It depends of the propagation type: shared, private or slave, see http://man7.org/linux/man-pages/man2/mount.2.html please take a look to kata-containers/tests#971 |
|
I see shared is “Mount and unmount events |
|
@devimc any updates in this PR? |
Unskip docker cp test to check mount points are not left after running docker cp. Depends-on: github.com/kata-containers/runtime#980 fixes kata-containers#970 Signed-off-by: Julio Montes <julio.montes@intel.com>
75326f5 to
71cdab5
Compare
|
/test |
|
added |
|
@caoruidong could you please add more details?, I'd like to understand your concerns |
|
Thanks for raising this @devimc. This looks like a delicate issue. If I'm reading the man pages correctly, I think the following summarises what this PR does (but please confirm @devimc et al :)... Currently (No
|
| Operation | Host outcome | Guest outcome |
|---|---|---|
| unmount bind-mounted sandbox dir | bind mount removed | bind mount NOT removed (bug) |
| host creates sub-mount below bind-mounted sandbox dir | mount created | mount visible |
| guest creates new sub-mount below sandbox dir | mount visible | mount created |
New behaviour (using MS_PRIVATE):
| Operation | Host outcome | Guest outcome |
|---|---|---|
| unmount bind-mounted sandbox dir | bind mount removed | bind mount removed (yay!) |
| host creates sub-mount below bind-mounted sandbox dir | mount created | mount NOT visible |
| guest creates new sub-mount below sandbox dir | mount NOT visible | mount created |
Clearly, we're happy for the bug scenario to be fixed. The question is, are we happy for the change in behaviour for the other two rows in the tables? I think the answer is "yes" since:
- the host context should not be trying to modify the existing container mounts.
- If the guest creates new sub-mounts, those should not be visible to the host.
Both of these two scenarios can (and really should) be handled directly using "docker run -v from:to" or equivalent.
|
@mcastelino PTAL? @jodh-intel thanks for the table - this helps a lot. I agree with your assessment. |
|
@devimc this PR lgtm. It's just to my knowledge mount and umount events are propagated equally in a mount ns. Maybe I'm wrong. |
|
@jodh-intel
I agree with that, I don't see why bind mount inside the guest should be propagated back to the host.
I'm not sure about that. What about the use case where you do something like Now, talking about the initial issue described by @devimc:
It looks to me that using Maybe I'm missing one piece here, but I'd like to make sure it's obvious for everybody which issue we're solving here, and how we solve it. |
When overlay is used as storage driver, kata runtime creates a new bind mount point to the merged directory, that way this directory can be shared with the VM through 9p. By default the mount propagation is shared, that means mount events are propagated, but umount events not, to deal with this problem and to avoid left mount points in the host once container finishes, the mount propagation of bind mounts should be set to private. Depends-on: github.com/kata-containers/tests#971 fixes kata-containers#794 Signed-off-by: Julio Montes <julio.montes@intel.com>
71cdab5 to
b029e44
Compare
in the container
nop, the runtime doesn't mount anything, docker does it, for example |
`docker cp` might bind mount some files/dirs under container rootfs without notifying runtime but expect runtime to unmount them. We need to unmount them otherwise docker will fail to clean up containers. Depends-on: github.com/kata-containers#980 Signed-off-by: Peng Tao <bergwolf@gmail.com>
|
I think the rational behind the patch is that LGTM! Thanks @devimc! |
|
/test |
|
|
Unskip docker cp test to check mount points are not left after running docker cp. Depends-on: github.com/kata-containers/runtime#980 fixes kata-containers#970 Signed-off-by: Julio Montes <julio.montes@intel.com>
When overlay is used as storage driver, kata runtime creates a new bind mount
point to the merged directory, that way this directory can be shared with the
VM through 9p. By default the mount propagation is shared, that means mount
events are propagated, but umount events not, to deal with this problem and to
avoid left mount points in the host once container finishes, the mount
propagation of bind mounts should be set to private.
Depends-on: github.com/kata-containers/tests#971
fixes #794
Signed-off-by: Julio Montes julio.montes@intel.com