volumes: Attach volumes that are block device files as block devices #138
sboeuf merged 1 commit into kata-containers:master
Conversation
@sboeuf Can you take a look at this?

@sboeuf I have reworked the PR a bit, associating a block device to a virtcontainers mount. PTAL.

@amshinde thx, I will!
}

// Attach this block device, all other devices passed in the config have been attached at this point
if err := b.attach(c.pod.hypervisor, c); err != nil {
This is being attached, but where are we detaching (unplugging) this device from the VM? We need to make sure the device is properly detached when the container is stopped (destroyed).

Do you need to check for duplication? Or is it OK to attach/detach the same device multiple times?

@bergwolf I had not seen any issues in attaching/detaching the same device file multiple times to a VM. However, I tried mounting the device file at two different locations and am seeing some sync issues. I think it would be necessary to keep track at a pod level of whether the device has already been passed, and not pass it again. But I am thinking of opening a separate PR to handle that issue, as this is an existing issue for block devices passed with --device as well. What do you think @bergwolf?

@amshinde I'm not sure that I followed you. AFAIK, when we mount the same block device at different mount points, they share the same superblock data structure in the kernel, which makes sure data is always consistent. What sync issues did you see with that kind of setup?

@bergwolf Right now, if a device file, say "/dev/sdb", is attached twice, I did not see any issues in attaching and detaching it. However, since the file is attached twice, it may appear twice inside the VM (say as "/dev/vda" and "/dev/vdb"). Sync issues will arise if you mount both of those.

@amshinde @bergwolf I performed the same testing last week and I can confirm that if you pass a device/volume to a container A, and you modify the content of the mounted filesystem, by adding a new file for instance, a new container B created after this will see the change. But unfortunately, after both containers are running, any change is only reported back to the host and not synced back to the guest.
            break
        }
    }
    return nil
Seems there is no need to have this function return an error, since you only return nil at the end.

I am performing some error checking now, to return an error if a mount is not found in the list.
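A sketch of the error handling mentioned above: report an error when the mount is not found instead of always returning nil. The `Mount` type and `findMount` helper are simplified stand-ins, not the real virtcontainers API:

```go
package main

import "fmt"

// Mount is a simplified stand-in for the virtcontainers mount type.
type Mount struct {
	Source      string
	Destination string
}

// findMount returns the mount with the given destination, or an error
// when no such mount exists, instead of silently returning nothing.
func findMount(mounts []Mount, dest string) (*Mount, error) {
	for i := range mounts {
		if mounts[i].Destination == dest {
			return &mounts[i], nil
		}
	}
	return nil, fmt.Errorf("mount %q not found in container mount list", dest)
}

func main() {
	mounts := []Mount{{Source: "/dev/sdb", Destination: "/data"}}
	if _, err := findMount(mounts, "/missing"); err != nil {
		fmt.Println("error:", err)
	}
}
```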
// BlockDevice represents block device that is attached to the
// VM in case this mount is a block device file or a directory
// backed by a block device.
BlockDevice *BlockDevice
Just wondering, is a volume device always going to be a block device? Or can it be something else?

Volumes will mostly be block-based devices. I suppose you could pass a character device here, but then we don't do any special handling, we just pass it through 9p.

In runv, we also support NFS volumes and ceph rbd volumes. I guess we can add a structure similar to the agent's storage that can describe them all. But it does not block this PR and can be left for future improvements.
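A rough illustration of the kind of generalized structure described here, able to cover block, NFS and ceph rbd backed volumes with one type. The type and field names are hypothetical, loosely modeled on the agent's storage description:

```go
package main

import "fmt"

// VolumeStorage is a hypothetical generalization of BlockDevice: a single
// descriptor for any volume backend, selected by a driver string rather
// than a dedicated struct per backend.
type VolumeStorage struct {
	Driver  string   // e.g. "blk", "nfs", "rbd"
	Source  string   // device path, NFS export, or rbd image spec
	FSType  string   // filesystem to mount inside the guest
	Options []string // backend-specific mount options
}

func main() {
	vols := []VolumeStorage{
		{Driver: "blk", Source: "/dev/sdb", FSType: "ext4"},
		{Driver: "nfs", Source: "server:/export", FSType: "nfs", Options: []string{"vers=4"}},
	}
	for _, v := range vols {
		fmt.Printf("%s volume from %s\n", v.Driver, v.Source)
	}
}
```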
I like the design, so it looks good to me. Will check the implementation later 👍
@WeiZhang555 @sboeuf

@jshachm Not a silly question! You're right! The device being added is going to be part of the container creation, meaning it will also need to be part of the rollback.

@amshinde Could you rebase this PR please? There are some conflicts on the first commit.

@sboeuf Rebase done.

@amshinde Thanks, I'll do another review!
sboeuf left a comment

Comment about optimization, but looks good otherwise!
// all hotplugged devices are unplugged, so this needs be done
// after devices passed with --device are handled.
volumeStorages := k.handleBlockVolumes(c)
if err := k.replaceOCIMountsForStorages(ociSpec, volumeStorages); err != nil {
The fact that those functions are called in sequence makes me think you could factorize them into a unique function. This would allow for a more optimized loop when you're going over every container mount. You would not have to loop twice for the same thing, reducing the complexity.

@sboeuf The first function loops over the virtcontainers mounts and the second one over the OCI mounts, so we would need both in any case, if I understand correctly.
// BlockDevice represents block device that is attached to the
// VM in case this mount is a block device file or a directory
// backed by a block device.
BlockDevice *BlockDevice
Can you let the runtime cli make the decision to pass the mount source as shared 9pfs or as a block device? When passing it as a block device, it is mounted both in the host and in the guest, which is not supported by any of the upstream Linux block-based file systems, and we potentially face data corruption and kernel crashes by doing it. So I would expect a cli configuration option to toggle it, and I suggest we set the default to off, with a warning if someone wants to enable it.

@bergwolf Actually this is also not true; we can choose NOT to mount the volume on the host, and pass the block device directly to kata-runtime. Mounting the volume on the host isn't a MUST-DO step for the volume to work.

Then you have to call something like

Exactly.
    }
}

func (c *Container) checkBlockDeviceSupport() bool {
So we are going to have different semantics with different underlying hypervisors? We might just reject the usage if the hypervisor does not support passing block devices like this; at least then we are sticking with constant API semantics.

Different hypervisors would indicate whether they support block devices using the capabilities() interface, whereas DisableBlockDeviceUse is meant to be a user config to choose between block devices and 9p.

My point is that we need to make sure the kata library API is constant across different hypervisors. E.g., when we get the same OCI spec, we should deliver the same container fs layout, no matter which hypervisor is chosen. So if DisableBlockDeviceUse is set to false and the hypervisor does not support it, we should just fail the API instead of trying to work around it.

The approach so far has been to fall back to 9p in case the hypervisor does not support it. The block device usage is an optimization after all. Don't you agree? We pass the rootfs through 9p as well, in the case of devicemapper, if the hypervisor does not support block devices.

Well, as a side effect of the performance optimization, we are delivering a different container fs layout than runc, which can be seen as less OCI-compliant. I think that is OK because 9pfs really sucks. However, one would at least expect the kata runtime to deliver the same fs layout given the same OCI spec, IMO. That's why I'm asking to fail the case if the hypervisor does not support block devices.
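A minimal sketch of the fail-fast semantics proposed here, assuming a simplified capabilities() interface. All types below are stand-ins for the real virtcontainers ones:

```go
package main

import (
	"errors"
	"fmt"
)

// capabilities is a simplified stand-in for the hypervisor capability set.
type capabilities struct {
	blockDeviceSupported bool
}

type hypervisor interface {
	capabilities() capabilities
}

type qemu struct{}

func (q qemu) capabilities() capabilities { return capabilities{blockDeviceSupported: true} }

type mockHypervisor struct{}

func (m mockHypervisor) capabilities() capabilities { return capabilities{} }

// checkBlockDeviceSupport fails the API call when block device use was
// requested but the chosen hypervisor cannot hotplug block devices,
// instead of silently falling back to 9p.
func checkBlockDeviceSupport(h hypervisor, disableBlockDeviceUse bool) error {
	if disableBlockDeviceUse {
		return nil // the user explicitly asked for 9p, nothing to check
	}
	if !h.capabilities().blockDeviceSupported {
		return errors.New("block device use requested but the hypervisor does not support it")
	}
	return nil
}

func main() {
	fmt.Println(checkBlockDeviceSupport(qemu{}, false))
	fmt.Println(checkBlockDeviceSupport(mockHypervisor{}, false))
}
```

This encodes the "same OCI spec, same fs layout" guarantee: the choice between block devices and 9p is made only by configuration, never by which hypervisor happens to be in use.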
@bergwolf @WeiZhang555 Yes, this PR only handles

@bergwolf I am using a cli config option to pass the volume as 9pfs or block device, listed here: runtime/cli/config/configuration.toml.in, line 66 in 204e402. This is what we used for the rootfs; I suppose it makes sense to introduce another option specific to volumes being passed as 9p/block device.
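Such a pair of toggles could look like the following configuration.toml fragment. The first option name matches the existing rootfs toggle referenced above; the second is purely hypothetical, shown only to illustrate the volume-specific option being discussed:

```toml
# Existing rootfs-oriented toggle (name taken from the configuration.toml
# referenced above; the default and surrounding comment may differ).
# When set to true, block devices are not hotplugged and 9pfs is used instead.
disable_block_device_use = false

# Hypothetical analogous option for volumes passed with -v, as discussed
# in this thread; this name is an illustration, not an existing setting.
disable_block_device_use_for_volumes = true
```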
@amshinde I think

@amshinde We might also want to document the behavior somewhere in the documentation repo, because we are delivering different results when compared to runc for
Check if a volume passed to the container with -v is a block device file, and if so, pass the block device by hotplugging it to the VM instead of passing it as a 9pfs volume. This would give us better performance.

Add the block device associated with a volume to the list of container devices, so that it is detached along with all other devices when the container is stopped with detachDevices().

Fixes kata-containers#137

Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
@sboeuf Can you merge this?

Merging as only CentOS is failing (which is expected for now).