volumes: Attach volumes that are block device files as block devices #138
sboeuf merged 1 commit into kata-containers:master
Conversation
@sboeuf Can you take a look at this?

@sboeuf I have reworked the PR a bit, associating a block device to a virtcontainers mount. PTAL.

@amshinde thx, I will!
}

// Attach this block device, all other devices passed in the config have been attached at this point
if err := b.attach(c.pod.hypervisor, c); err != nil {
This is being attached, but where are we detaching (unplugging) this device from the VM? We need to make sure the device is properly detached when the container is stopped (destroyed).

Do you need to check for duplication? Or is it OK to attach/detach the same device multiple times?

@bergwolf I had not seen any issues in attaching/detaching the same device file multiple times to a VM. However, I tried mounting the device file at two different locations and am seeing some sync issues. I think it would be necessary to keep track at a pod level of whether the device has already been passed, and not pass it again. But I am thinking of opening a separate PR to handle that issue, as this is an existing issue for block devices passed with --device as well. What do you think @bergwolf?

@amshinde I'm not sure that I followed you. AFAIK, when we mount the same block device at different mount points, they share the same superblock data structure in the kernel, which makes sure data is always consistent. What sync issues did you see with that kind of setup?

@bergwolf Right now, if a device file, say "/dev/sdb", is attached twice, I did not see any issues in attaching and detaching it. However, since the file is attached twice, it may appear twice inside the VM (say as "/dev/vda" and "/dev/vdb"). Sync issues will arise if you mount both of those.

@amshinde @bergwolf I performed the same testing last week and I can confirm that if you pass a device/volume to a container A, and you modify the content of the mounted filesystem, by adding a new file for instance, a new container B created after this will see the change. But unfortunately, after both containers are running, any change is only reported back to the host and not synced back to the guest.
            break
        }
    }
    return nil
Seems there is no need to have this function return an error, since you only return nil at the end.

I am performing some error checking now, to return an error if a mount is not found in the list.
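A sketch of the error handling mentioned above: report an error when the mount is not found instead of always returning nil. The `Mount` type and `findMount` helper are simplified stand-ins, not the real virtcontainers API:

```go
package main

import "fmt"

// Mount is a simplified stand-in for the virtcontainers mount type.
type Mount struct {
	Source      string
	Destination string
}

// findMount returns the mount with the given destination, or an error
// when no such mount exists, instead of silently returning nothing.
func findMount(mounts []Mount, dest string) (*Mount, error) {
	for i := range mounts {
		if mounts[i].Destination == dest {
			return &mounts[i], nil
		}
	}
	return nil, fmt.Errorf("mount %q not found in container mount list", dest)
}

func main() {
	mounts := []Mount{{Source: "/dev/sdb", Destination: "/data"}}
	if _, err := findMount(mounts, "/missing"); err != nil {
		fmt.Println("error:", err)
	}
}
```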
// BlockDevice represents block device that is attached to the
// VM in case this mount is a block device file or a directory
// backed by a block device.
BlockDevice *BlockDevice
Just wondering, is a volume device always going to be a block device? Or can it be something else?

Volumes will mostly be block-based devices. I suppose you could pass a character device here, but then we don't do any special handling, we just pass it through 9p.

In runv, we also support NFS volumes and ceph rbd volumes. I guess we can add a structure similar to the agent's storage that can describe them all. But it does not block this PR and can be left for future improvements.
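A rough illustration of the kind of generalized structure described here, able to cover block, NFS and ceph rbd backed volumes with one type. The type and field names are hypothetical, loosely modeled on the agent's storage description:

```go
package main

import "fmt"

// VolumeStorage is a hypothetical generalization of BlockDevice: a single
// descriptor for any volume backend, selected by a driver string rather
// than a dedicated struct per backend.
type VolumeStorage struct {
	Driver  string   // e.g. "blk", "nfs", "rbd"
	Source  string   // device path, NFS export, or rbd image spec
	FSType  string   // filesystem to mount inside the guest
	Options []string // backend-specific mount options
}

func main() {
	vols := []VolumeStorage{
		{Driver: "blk", Source: "/dev/sdb", FSType: "ext4"},
		{Driver: "nfs", Source: "server:/export", FSType: "nfs", Options: []string{"vers=4"}},
	}
	for _, v := range vols {
		fmt.Printf("%s volume from %s\n", v.Driver, v.Source)
	}
}
```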
I like the design, so it looks good to me. Will check the implementation later 👍
@WeiZhang555 @sboeuf

@jshachm Not a silly question! You're right! The device being added is going to be part of the container creation, meaning it will also need to be part of the rollback.

@amshinde Could you rebase this PR please? There are some conflicts on the first commit.

@sboeuf Rebase done.

@amshinde Thanks, I'll do another review!
sboeuf left a comment

Comment about optimization, but looks good otherwise!
// all hotplugged devices are unplugged, so this needs be done
// after devices passed with --device are handled.
volumeStorages := k.handleBlockVolumes(c)
if err := k.replaceOCIMountsForStorages(ociSpec, volumeStorages); err != nil {
The fact that those functions are called in sequence makes me think you could factorize them into a unique function. This would allow for a more optimized loop when you're going over every container mount. You would not have to loop twice for the same thing, reducing the complexity.

@sboeuf The first function loops over the virtcontainers mounts and the second one over the OCI mounts, so we would need both in any case, if I understand correctly.
// BlockDevice represents block device that is attached to the
// VM in case this mount is a block device file or a directory
// backed by a block device.
BlockDevice *BlockDevice
Can you let the runtime cli make the decision to pass the mount source as shared 9pfs or as a block device? When passing it as a block device, it is mounted both in the host and in the guest, which is not supported by any of the upstream Linux block-based file systems, and we potentially face data corruption and kernel crashes by doing it. So I would expect a cli configuration option to toggle it, and I suggest we set the default to off, with a warning if someone wants to enable it.

@bergwolf Actually this is also not true; we can choose NOT to mount the volume on the host, and pass the block device directly to kata-runtime. Mounting the volume on the host isn't a MUST-DO step for the volume to work.

Then you have to call something like

Exactly.
    }
}

func (c *Container) checkBlockDeviceSupport() bool {
So we are going to have different semantics with different underlying hypervisors? We might just reject the usage if the hypervisor does not support passing block devices like this; at least then we are sticking with constant API semantics.

Different hypervisors would indicate whether they support block devices using the capabilities() interface, whereas DisableBlockDeviceUse is meant to be a user config to choose between block devices and 9p.

My point is that we need to make sure the kata library API is constant across different hypervisors. E.g., when we get the same OCI spec, we should deliver the same container fs layout, no matter which hypervisor is chosen. So if DisableBlockDeviceUse is set to false and the hypervisor does not support it, we should just fail the API instead of trying to work around it.

The approach so far has been to fall back to 9p in case the hypervisor does not support it. The block device usage is an optimization after all. Don't you agree? We pass the rootfs through 9p as well, in the case of devicemapper, if the hypervisor does not support block devices.

Well, as a side effect of the performance optimization, we are delivering a different container fs layout than runc, which can be seen as less OCI-compliant. I think that is OK because 9pfs really sucks. However, one would at least expect the kata runtime to deliver the same fs layout given the same OCI spec, IMO. That's why I'm asking to fail the case if the hypervisor does not support block devices.
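A minimal sketch of the fail-fast semantics proposed here, assuming a simplified capabilities() interface. All types below are stand-ins for the real virtcontainers ones:

```go
package main

import (
	"errors"
	"fmt"
)

// capabilities is a simplified stand-in for the hypervisor capability set.
type capabilities struct {
	blockDeviceSupported bool
}

type hypervisor interface {
	capabilities() capabilities
}

type qemu struct{}

func (q qemu) capabilities() capabilities { return capabilities{blockDeviceSupported: true} }

type mockHypervisor struct{}

func (m mockHypervisor) capabilities() capabilities { return capabilities{} }

// checkBlockDeviceSupport fails the API call when block device use was
// requested but the chosen hypervisor cannot hotplug block devices,
// instead of silently falling back to 9p.
func checkBlockDeviceSupport(h hypervisor, disableBlockDeviceUse bool) error {
	if disableBlockDeviceUse {
		return nil // the user explicitly asked for 9p, nothing to check
	}
	if !h.capabilities().blockDeviceSupported {
		return errors.New("block device use requested but the hypervisor does not support it")
	}
	return nil
}

func main() {
	fmt.Println(checkBlockDeviceSupport(qemu{}, false))
	fmt.Println(checkBlockDeviceSupport(mockHypervisor{}, false))
}
```

This encodes the "same OCI spec, same fs layout" guarantee: the choice between block devices and 9p is made only by configuration, never by which hypervisor happens to be in use.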
@bergwolf @WeiZhang555 Yes, this PR only handles

@bergwolf I am using a cli config option to pass the volume as 9pfs or block device, listed here: runtime/cli/config/configuration.toml.in, line 66 in 204e402. This is what we used for the rootfs; I suppose it makes sense to introduce another option specific to volumes being passed as 9p/block device.
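Such a pair of toggles could look like the following configuration.toml fragment. The first option name matches the existing rootfs toggle referenced above; the second is purely hypothetical, shown only to illustrate the volume-specific option being discussed:

```toml
# Existing rootfs-oriented toggle (name taken from the configuration.toml
# referenced above; the default and surrounding comment may differ).
# When set to true, block devices are not hotplugged and 9pfs is used instead.
disable_block_device_use = false

# Hypothetical analogous option for volumes passed with -v, as discussed
# in this thread; this name is an illustration, not an existing setting.
disable_block_device_use_for_volumes = true
```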
@amshinde I think

@amshinde We might also want to document the behavior somewhere in the documentation repo, because we are delivering different results when compared to runc for
Check if a volume passed to the container with -v is a block device file, and if so, pass the block device by hotplugging it to the VM instead of passing it as a 9pfs volume. This would give us better performance.

Add the block device associated with a volume to the list of container devices, so that it is detached along with all other devices when the container is stopped with detachDevices().

Fixes kata-containers#137

Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
@sboeuf Can you merge this?

Merging as only CentOS is failing (which is expected for now).