This repository was archived by the owner on May 12, 2021. It is now read-only.

volumes: Attach volumes that are block device files as block devices #138

Merged: sboeuf merged 1 commit into kata-containers:master from amshinde:block-volumes on Apr 12, 2018

Conversation

@amshinde (Member)

Check if a volume passed to the container with -v is a block device
file, and if so pass the block device by hotplugging it to the VM
instead of passing this as a 9pfs volume. This would give us
better performance.

Fixes #137

Signed-off-by: Archana Shinde archana.m.shinde@intel.com

@amshinde (Member Author) commented Apr 2, 2018

@sboeuf Can you take a look at this?

@egernst (Member) commented Apr 2, 2018

@bergwolf @laijs - PTAL?

@amshinde (Member Author) commented Apr 2, 2018

@sboeuf I have reworked the PR a bit associating a block device to a virtcontainers mount. PTAL.

@sboeuf commented Apr 2, 2018

@amshinde thx, I will !

}

// Attach this block device, all other devices passed in the config have been attached at this point
if err := b.attach(c.pod.hypervisor, c); err != nil {
sboeuf:

This is being attached, but where are we detaching (unplugging) this device from the VM? We need to make sure the device is properly detached when the container is stopped (destroyed).

amshinde (Member Author):

@sboeuf This has been taken care of now.

bergwolf (Member):

Do you need to check for duplication? Or is it OK to attach/detach the same device multiple times?

amshinde (Member Author):

@bergwolf I have not seen any issues attaching/detaching the same device file multiple times to a VM. However, I tried mounting the device file at two different locations and am seeing some sync issues. I think it will be necessary to keep track, at the pod level, of whether the device has already been passed, and not pass it again. But I am thinking of opening a separate PR to handle that issue, as it also affects block devices passed with --device. What do you think @bergwolf ?

bergwolf (Member):

@amshinde I'm not sure that I followed you. AFAIK, when we mount the same block device at different mount points, they share the same superblock data structure in the kernel, which makes sure data is always consistent. What sync issues did you see with that kind of setup?

amshinde (Member Author):

@bergwolf Right now if a device file say "/dev/sdb" is attached twice, I did not see any issues in attaching it twice and detaching it. However, since the file is attached twice, it may appear twice inside the VM. (say as "/dev/vda" and "/dev/vdb"). Sync issues will arise if you mount those two.

sboeuf:

@amshinde @bergwolf I performed the same testing last week and I can confirm that if you pass a device/volume to a container A, and you modify the content of the mounted filesystem by adding a new file for instance, a new container B created after this will see the change. But unfortunately, after both containers are running, any change is only reported back to the host, not sync'ed back to the guest.
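The pod-level bookkeeping proposed in this thread (attach a shared device once, unplug only when the last user releases it) could be sketched as a reference-counted map keyed by the device's major/minor numbers. All names here are hypothetical, not the actual virtcontainers types:

```go
package main

import (
	"fmt"
	"sync"
)

// devKey identifies a host block device by major/minor number.
type devKey struct{ major, minor int64 }

// attachedDevices tracks which devices are already hotplugged into
// the VM, so the same device is never attached twice to one pod.
type attachedDevices struct {
	mu   sync.Mutex
	refs map[devKey]int
}

func newAttachedDevices() *attachedDevices {
	return &attachedDevices{refs: make(map[devKey]int)}
}

// attach returns true when a real hotplug is needed (first user);
// later users only bump the reference count.
func (a *attachedDevices) attach(major, minor int64) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	k := devKey{major, minor}
	a.refs[k]++
	return a.refs[k] == 1
}

// detach returns true when the last reference is dropped and the
// device should actually be unplugged from the VM.
func (a *attachedDevices) detach(major, minor int64) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	k := devKey{major, minor}
	if a.refs[k] == 0 {
		return false
	}
	a.refs[k]--
	return a.refs[k] == 0
}

func main() {
	a := newAttachedDevices()
	fmt.Println(a.attach(8, 16)) // true: first attach hotplugs
	fmt.Println(a.attach(8, 16)) // false: already attached
	fmt.Println(a.detach(8, 16)) // false: still referenced
	fmt.Println(a.detach(8, 16)) // true: last ref, unplug now
}
```

Note this only prevents double hotplug; it does not solve the guest-side coherence problem of mounting the same filesystem in two places.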

break
}
}
return nil
sboeuf:

It seems there is no need for this function to return an error, since you only return nil at the end.

amshinde (Member Author):

I am performing some error checking now to return an error if a mount is not found in the list.
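The error path described above could look like the following sketch; the types and the `setMountBlockDevice` helper are minimal stand-ins, not the actual virtcontainers code:

```go
package main

import "fmt"

// Minimal stand-ins for the virtcontainers types; illustrative only.
type BlockDevice struct{ VirtPath string }

type Mount struct {
	Destination string
	BlockDevice *BlockDevice
}

// setMountBlockDevice associates a hotplugged block device with the
// matching container mount, returning an error when no mount matches
// instead of unconditionally returning nil.
func setMountBlockDevice(mounts []Mount, dest string, b *BlockDevice) error {
	for i := range mounts {
		if mounts[i].Destination == dest {
			mounts[i].BlockDevice = b
			return nil
		}
	}
	return fmt.Errorf("mount %q not found in container mount list", dest)
}

func main() {
	mounts := []Mount{{Destination: "/data"}}
	fmt.Println(setMountBlockDevice(mounts, "/data", &BlockDevice{VirtPath: "/dev/vda"}))
	fmt.Println(setMountBlockDevice(mounts, "/missing", nil))
}
```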

// BlockDevice represents block device that is attached to the
// VM in case this mount is a block device file or a directory
// backed by a block device.
BlockDevice *BlockDevice
sboeuf:

Just wondering, is a volume device always going to be a block device? Or can it be something else?

sboeuf:

You didn't answer this question.

amshinde (Member Author):

Volumes will mostly be block-based devices. I suppose you could pass a character device here, but then we don't do any special handling and just pass it through 9p.

bergwolf (Member):

In runv, we also support NFS volumes and ceph rbd volumes. I guess we can add a structure similar to agent's storage that can describe them all. But it does not block this PR and can be left for future improvements.

@WeiZhang555 (Member)

> Check if a volume passed to the container with -v is a block device file

I like the design, so good to me. Will check the implementation later 👍

@jshachm (Member) commented Apr 3, 2018

@WeiZhang555 @sboeuf
A silly question: is a rollback needed when adding a device to the container fails?

@sboeuf commented Apr 3, 2018

@jshachm not a silly question! You're right! The device being added is going to be part of the container creation, meaning it will also need to be part of the rollback.

@amshinde (Member Author) commented Apr 3, 2018

@jshachm @sboeuf I had missed unplugging the device for the kata agent, but now I make sure I do so. The block device associated with a volume is unplugged along with all other devices associated with a container.

@sboeuf commented Apr 4, 2018

@amshinde could you rebase this PR please? There are some conflicts on the first commit.

@amshinde (Member Author) commented Apr 4, 2018

@sboeuf Rebase done.

@sboeuf commented Apr 4, 2018

@amshinde thanks I'll do another review!

@sboeuf left a review:

Comment about optimization but looks good otherwise !

// all hotplugged devices are unplugged, so this needs be done
// after devices passed with --device are handled.
volumeStorages := k.handleBlockVolumes(c)
if err := k.replaceOCIMountsForStorages(ociSpec, volumeStorages); err != nil {
sboeuf:

The fact that those functions are called in sequence makes me think you could factor them into a single function. This would allow for a more optimized loop when going over every container mount: you would not have to loop twice for the same thing, reducing the complexity.

amshinde (Member Author):

@sboeuf The first function loops over the virtcontainers mounts and the second one over the OCI mounts, so we would need both in any case, if I understand correctly.
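The second loop's job can be sketched as follows: for every OCI mount whose destination matches a hotplugged volume, the 9p host source is swapped for the in-guest storage path. The types, names, and paths here are illustrative, not the actual kata-agent client code:

```go
package main

import "fmt"

// OCIMount is a minimal stand-in for an OCI spec mount entry.
type OCIMount struct {
	Destination string
	Source      string
	Type        string
}

// replaceMountsForStorages rewrites OCI mounts whose destinations
// match a hotplugged volume, pointing them at the guest-side path
// where the block device's filesystem was mounted.
func replaceMountsForStorages(mounts []OCIMount, storages map[string]string) {
	for i := range mounts {
		if guestPath, ok := storages[mounts[i].Destination]; ok {
			mounts[i].Source = guestPath
			mounts[i].Type = "bind"
		}
	}
}

func main() {
	mounts := []OCIMount{
		{Destination: "/data", Source: "/var/lib/docker/volumes/data", Type: "bind"},
		{Destination: "/etc/hosts", Source: "/run/hosts", Type: "bind"},
	}
	// Hypothetical guest path where the hotplugged device was mounted.
	storages := map[string]string{"/data": "/run/kata-containers/storage/data"}
	replaceMountsForStorages(mounts, storages)
	fmt.Println(mounts[0].Source) // replaced with the guest path
	fmt.Println(mounts[1].Source) // untouched
}
```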

@bergwolf (Member) commented Apr 7, 2018

Can you let the runtime CLI decide whether to pass the mount source as shared 9pfs or as a block device? When passed as a block device, it is mounted both on the host and in the guest, which no upstream Linux block-based file system supports, and we potentially face data corruption and kernel crashes by doing it. So I would expect a CLI configuration option to toggle this, and I suggest we set the default to off, with a warning if someone wants to enable it.

@WeiZhang555 (Member)

@bergwolf
The runtime CLI can't make that decision because the volume is passed through the OCI spec.

> because when passing it as a block device, it is mounted both in the host and in the guest, which is not supported by any of ...

And actually this is not true either: we can choose not to mount the volume on the host and pass the block device directly to kata-runtime. Mounting the volume on the host isn't a must-do step for the volume to work.

@bergwolf (Member) commented Apr 8, 2018

@WeiZhang555

> And actually this is also not true, we can choose NOT mount the volume in host, and pass the block directly to the kata-runtime. Mounting volume in host isn't a MUST-DO step for volume to work.

Then you have to call something like docker -v <block-device>:/path. I'm fine with it as long as we do not use it to handle something like docker -v <volume-name-or-path>:/path that would mount the block device on the host.

@WeiZhang555 (Member) commented Apr 8, 2018

@bergwolf

> Then you have to call something like docker -v &lt;block-device&gt;:/path. I'm fine with it as long as we do not use it to handle something like docker -v &lt;volume-name-or-path&gt;:/path that would mount the block device on the host.

Exactly.
My understanding is docker run -v <volume-name-or-path>:/path isn't involved in this PR's scenario. Am I right? @amshinde


}
}

func (c *Container) checkBlockDeviceSupport() bool {
bergwolf (Member):

So we are going to have different semantics with different underlying hypervisors? We might just reject the usage if the hypervisor does not support passing block devices like this; at least then we stick with consistent API semantics.

amshinde (Member Author):

Different hypervisors indicate whether they support block devices through the capabilities() interface, whereas DisableBlockDeviceUse is meant to be a user config option to choose between block devices and 9p.

bergwolf (Member):

My point is that we need to make sure the kata library API is constant across different hypervisors. E.g., when we get the same OCI spec, we should deliver the same container fs layout, no matter which hypervisor is chosen. So if DisableBlockDeviceUse is set to false, and the hypervisor does not support it, we should just fail the API instead of trying to work around it.

amshinde (Member Author) commented Apr 10, 2018:

The approach so far has been to fall back to 9p in case the hypervisor does not support it. The block device usage is an optimization after all. You don't agree? We pass the rootfs through 9p as well in the devicemapper case if the hypervisor does not support block devices.

bergwolf (Member):

Well, as a side effect of the performance optimization, we are delivering a different container fs layout than runc, which can be seen as less OCI-compliant. I think that is OK because 9pfs really sucks. However, one would at least expect the kata runtime to deliver the same fs layout given the same OCI spec, IMO. That's why I'm asking to fail the case if the hypervisor does not support block devices.
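The fail-fast behaviour bergwolf is asking for, as opposed to the PR's silent fallback to 9pfs, could be sketched like this. The `hypervisorCaps` type and function name are hypothetical; the real code consults the hypervisor's capabilities() interface and the DisableBlockDeviceUse config:

```go
package main

import (
	"errors"
	"fmt"
)

// hypervisorCaps is a hypothetical stand-in for the result of the
// hypervisor capabilities() interface.
type hypervisorCaps struct{ blockDeviceHotplug bool }

// checkBlockDeviceSupport fails when block device use is requested
// but the chosen hypervisor cannot hotplug block devices, so the
// same OCI spec yields the same container fs layout everywhere.
func checkBlockDeviceSupport(disableBlockDeviceUse bool, caps hypervisorCaps) error {
	if disableBlockDeviceUse {
		return nil // user asked for 9pfs; nothing to verify
	}
	if !caps.blockDeviceHotplug {
		return errors.New("block device use requested but hypervisor does not support block device hotplug")
	}
	return nil
}

func main() {
	fmt.Println(checkBlockDeviceSupport(true, hypervisorCaps{false}))  // ok: 9pfs requested
	fmt.Println(checkBlockDeviceSupport(false, hypervisorCaps{true}))  // ok: hotplug supported
	fmt.Println(checkBlockDeviceSupport(false, hypervisorCaps{false})) // error: fail fast
}
```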


@amshinde (Member Author) commented Apr 9, 2018

@bergwolf @WeiZhang555 Yes this PR only handles docker -v <block-device>:/path case. The other scenario docker run -v <volume-name-or-path>:/path is not involved in this PR.

@amshinde (Member Author) commented Apr 9, 2018

@bergwolf I am using a CLI config option to choose between passing the volume as 9pfs or as a block device, listed here:

disable_block_device_use = @DEFDISABLEBLOCK@

This is what we used for the rootfs, I suppose it makes sense to introduce another option specific to volumes being passed as 9p/block device.
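For reference, here is how the toggle might appear in a generated configuration.toml. The value shown is a hypothetical rendering of @DEFDISABLEBLOCK@, which is substituted at build time:

```toml
# If true, block-device-backed volumes and rootfs are shared over
# 9pfs instead of being hotplugged as block devices into the VM.
disable_block_device_use = false
```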

@bergwolf (Member)

@amshinde I think disable_block_device_use is enough for this. No need for another cli option.

@bergwolf (Member)

@amshinde we might also want to document the behavior somewhere in the documentation repo, because we are delivering different results when compared to runc for docker run -v <block-dev>:/path. kata-containers/documentation/pull/48 might be a candidate place to do it. cc @jodh-intel

@sboeuf commented Apr 11, 2018

@amshinde this PR needs some rebasing.
@bergwolf what is missing for this PR to be merged ?

Check if a volume passed to the container with -v is a block device
file, and if so pass the block device by hotplugging it to the VM
instead of passing this as a 9pfs volume. This would give us
better performance.

Add block device associated with a volume to the list of
container devices, so that it is detached with all other devices
when the container is stopped with detachDevices()

Fixes kata-containers#137

Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>

@amshinde (Member Author):

@bergwolf @sboeuf I have opened a separate issue to address the issue of attaching a device just once when the device appears in the devices/volumes of several containers in a pod sandbox. I think this is a complex issue in itself and I plan to handle this in a separate PR.

@bergwolf (Member) commented Apr 12, 2018

I'm fine with handling device/storage sharing in #202. @amshinde please fix CI. Other than that,
LGTM

Approved with PullApprove

@amshinde (Member Author):

@sboeuf Can you merge this ?

@sboeuf commented Apr 12, 2018

Merging as only centos is failing (which is expected for now).

sboeuf merged commit ca25177 into kata-containers:master on Apr 12, 2018
sboeuf removed the review label on Apr 12, 2018
amshinde deleted the block-volumes branch on Jul 11, 2019