virtcontainers: reimplement sandbox cgroup #1189
GabyCT merged 5 commits into kata-containers:master
Conversation
Force-pushed 009a12a to 2da5814 (Compare)
@WeiZhang555 please take a look
@jcvenegas please take a look
/test
WeiZhang555 left a comment

Thanks for the effort! I left some comments. Great work!
```go
for _, i := range tids.vcpus {
	if i <= 0 {
		continue
	}
}
```
Adding only the vcpu threads is an optimization, which can benefit small-chunk IO performance according to our previous tests.
It's OK to remove this if it makes the handling logic too complicated; I just wish to share the performance tuning result with the community.
Sure, I can check that. What tests did you run?
You can test with network benchmark tools such as netperf, or disk IO benchmark tools such as FIO or iozone. Try to send/receive with small chunks (e.g. 4k packets). For one VM with 1 CPU, compare with and without the "cpu cgroup of 1 cpu constraint"; I believe you will see how the performance differs.
```go
// first move all processes in subgroup to parent in case live process blocks
// deletion of cgroup
if err := s.cgroup.sandboxSub.MoveTo(s.cgroup.commonParent); err != nil {
	return fmt.Errorf("failed to clear cgroup processes")
```
In my previous tests, a left-over process (a zombie, or one removed too slowly) can block the cgroup's deletion, which leaves the cgroup directory on disk after the sandbox is destroyed.
Have you tested this case? I don't see similar code in deleteCgroups.
I ran a simple test, and it seems the cgroups are removed properly.
I noticed leftover cgroups occasionally before, without a specific test case.
I think you can emulate the behaviour by adding some other PID from the host OS to that cgroup; when you delete the container, the cgroup directory will be left behind. This is an easy emulation, not the real case.
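The emulation trick described above can be sketched roughly as below. This is only an illustration, not code from this PR: the `/sys/fs/cgroup/cpu` mount point is an assumption (it varies by distribution), and the helper names `procsFile`/`pinProcess` are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// procsFile builds the path to the cgroup.procs file of a container's CPU
// cgroup. The /sys/fs/cgroup/cpu mount point is an assumption here.
func procsFile(cgroupPath string) string {
	return filepath.Join("/sys/fs/cgroup/cpu", cgroupPath, "cgroup.procs")
}

// pinProcess moves an unrelated host PID into the cgroup, so that a later
// removal of the cgroup directory fails with EBUSY, emulating a left-over
// process that blocks deletion.
func pinProcess(cgroupPath string, pid int) error {
	return os.WriteFile(procsFile(cgroupPath), []byte(strconv.Itoa(pid)), 0644)
}

func main() {
	fmt.Println(procsFile("kata/sandbox1"))
}
```

Writing to `cgroup.procs` requires root and a mounted v1 cpu hierarchy; the sketch only shows the mechanism.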
Yes, cgroups are removed. I filed an issue to test this once the PR lands.
```go
// CgroupPath is the cgroup hierarchy where sandbox's processes
// including the hypervisor are placed.
CgroupPath string `json:"cgroupPath,omitempty"`
```
This can potentially break compatibility.
Suppose sandboxA is created by an old kata-runtime and runs for some time. After we upgrade kata-runtime, and use the new kata-runtime to delete the old container, it will search the state.json file on disk and end up finding nothing; deleting the cgroup directory can fail and leave garbage behind.
That's why https://github.com/kata-containers/runtime/pull/883 is important.
The impact of this modification is not that big, so I can accept it for now.
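One way to soften the compatibility concern raised above is to fall back to a conventional per-sandbox path when state.json predates the CgroupPath field. This is only a sketch: the `cgroupPathFor` helper and the `/kata/<sandboxID>` fallback path are assumptions for illustration, not the runtime's actual behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path/filepath"
)

// State mirrors the relevant slice of state.json. CgroupPath is absent in
// files written by an older runtime.
type State struct {
	CgroupPath string `json:"cgroupPath,omitempty"`
}

// cgroupPathFor returns the stored path when present, and otherwise falls
// back to a conventional per-sandbox path so delete can still locate the
// hierarchy instead of leaving garbage behind.
func cgroupPathFor(raw []byte, sandboxID string) (string, error) {
	var st State
	if err := json.Unmarshal(raw, &st); err != nil {
		return "", err
	}
	if st.CgroupPath != "" {
		return st.CgroupPath, nil
	}
	return filepath.Join("/kata", sandboxID), nil
}

func main() {
	p, _ := cgroupPathFor([]byte(`{}`), "sb1") // old-style state: falls back
	fmt.Println(p)
}
```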
Force-pushed 2da5814 to 747636e (Compare)
@WeiZhang555 changes applied, thanks
Force-pushed 747636e to df502f1 (Compare)
```go
func (c *Container) newCgroups() error {
	ann := c.GetAnnotations()

	config, ok := ann[annotations.ConfigJSONKey]
```
We use the config JSON in many places; at some point it would be better to move it out of annotations. virtcontainers is not agnostic at all.
Agree. We can definitely save the config in Container; that way we save many disk reads and JSON unmarshalings.
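The caching idea could look roughly like this. `Container`, `Spec`, and the `"config"` annotation key are simplified stand-ins for the real virtcontainers types, used only to illustrate the pattern:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Spec is a stand-in for the OCI runtime spec stored in the config
// annotation.
type Spec struct {
	Hostname string `json:"hostname"`
}

// Container caches the decoded spec so repeated callers do not pay for
// another disk read plus json.Unmarshal.
type Container struct {
	annotations map[string]string
	spec        *Spec
}

// ociSpec parses the annotation once and returns the cached result on
// subsequent calls.
func (c *Container) ociSpec() (*Spec, error) {
	if c.spec != nil {
		return c.spec, nil // cached: no re-parse
	}
	var s Spec
	if err := json.Unmarshal([]byte(c.annotations["config"]), &s); err != nil {
		return nil, err
	}
	c.spec = &s
	return c.spec, nil
}

func main() {
	c := &Container{annotations: map[string]string{"config": `{"hostname":"kata"}`}}
	s, _ := c.ociSpec()
	fmt.Println(s.Hostname)
}
```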
jcvenegas left a comment

I already tested cpuset with k8s and Docker; they work as I expected. I added a quick review and will do a second round later.
```go
sandboxSub, err := parent.New(s.id, &specs.LinuxResources{})

// V1Constraints returns the cgroups that are compatible with the VC architecture
// and hypervisor, constraints can be applied to these cgroups.
func V1Constraints() ([]cgroups.Subsystem, error) {
```
I don't fully get the naming of the functions; is it because of cgroups v1? I'd prefer to remove that prefix if it doesn't add any information.
This implementation is based on containerd's cgroups package: https://github.com/kata-containers/runtime/blob/master/vendor/github.com/containerd/cgroups/v1.go#L28
```go
fields = strings.Split(text, " ")
// safe as mountinfo encodes mountpoints with spaces as \040.
index = strings.Index(text, " - ")
postSeparatorFields = strings.Fields(text[index+3:])
```
This is a copy-paste of a private containerd function: https://github.com/kata-containers/runtime/blob/master/vendor/github.com/containerd/cgroups/v1.go#L49
```go
if numPostFields == 0 {
	return "", fmt.Errorf("Found no fields post '-' in %q", text)
}
if postSeparatorFields[0] == "cgroup" {
```
Index access without checking the length first.
```go
)
// this is an error as we can't detect if the mount is for "cgroup"
if numPostFields == 0 {
	return "", fmt.Errorf("Found no fields post '-' in %q", text)
```
Not a really useful error message (at least for me).
```go
}
if postSeparatorFields[0] == "cgroup" {
	// check that the mount is properly formatted.
	if numPostFields < 3 {
```
```go
if numPostFields < 3 {
	return "", fmt.Errorf("Error found less than 3 fields post '-' in %q", text)
}
return filepath.Dir(fields[4]), nil
```
Index access without checking the length first.
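A guarded version of such an access might look like the sketch below. `mountRoot` is a hypothetical helper, not a function from this PR; it just shows checking `len(fields)` before indexing `fields[4]` (the mount point field in proc mountinfo).

```go
package main

import (
	"fmt"
	"strings"
)

// mountRoot returns fields[4] (the mount point) only after verifying the
// line actually has that many fields, avoiding a panic on malformed input.
func mountRoot(line string) (string, error) {
	fields := strings.Split(line, " ")
	if len(fields) < 5 {
		return "", fmt.Errorf("malformed mountinfo line %q: only %d fields", line, len(fields))
	}
	return fields[4], nil
}

func main() {
	root, err := mountRoot("36 27 0:33 / /sys/fs/cgroup/cpuset rw - cgroup cgroup rw,cpuset")
	fmt.Println(root, err)
}
```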
```go
	return filepath.Dir(fields[4]), nil
	}
}
return "", cgroups.ErrMountPointNotExist
```
I wonder if the cgroups API does not provide a function like this.
From this file, what string is it supposed to return?
```
$ cat /proc/self/mountinfo | grep cgroup
27 20 0:24 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:4 - tmpfs tmpfs ro,seclabel,mode=755
28 27 0:25 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime shared:5 - cgroup2 cgroup2 rw,seclabel,nsdelegate
29 27 0:26 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,seclabel,xattr,name=systemd
33 27 0:30 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:7 - cgroup cgroup rw,seclabel,cpu,cpuacct
34 27 0:31 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:8 - cgroup cgroup rw,seclabel,freezer
35 27 0:32 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:9 - cgroup cgroup rw,seclabel,blkio
36 27 0:33 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:10 - cgroup cgroup rw,seclabel,cpuset
37 27 0:34 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:11 - cgroup cgroup rw,seclabel,devices
38 27 0:35 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:12 - cgroup cgroup rw,seclabel,perf_event
39 27 0:36 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,seclabel,hugetlb
40 27 0:37 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,seclabel,memory
41 27 0:38 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,seclabel,net_cls,net_prio
42 27 0:39 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,seclabel,pids
```
It does, but it's private: https://github.com/kata-containers/runtime/blob/master/vendor/github.com/containerd/cgroups/v1.go#L49
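For reference, the private containerd logic being discussed can be approximated as follows. `cgroupMountDir` is a hypothetical name and the parsing is a simplified sketch of mountinfo handling: it finds the first line whose post-separator filesystem type is "cgroup" and returns the parent directory of its mount point (e.g. `/sys/fs/cgroup` from the output above).

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// cgroupMountDir scans mountinfo lines for the first "cgroup" filesystem
// and returns the parent directory of its mount point, roughly what
// containerd's private v1MountPoint does.
func cgroupMountDir(mountinfo string) (string, error) {
	for _, line := range strings.Split(mountinfo, "\n") {
		// the " - " separator splits optional fields from fstype/source/options
		index := strings.Index(line, " - ")
		if index < 0 {
			continue
		}
		fields := strings.Split(line, " ")
		post := strings.Fields(line[index+3:])
		if len(post) > 0 && post[0] == "cgroup" && len(fields) >= 5 {
			// fields[4] is the mount point; its parent holds all v1 hierarchies
			return path.Dir(fields[4]), nil
		}
	}
	return "", fmt.Errorf("cgroup mount point not found")
}

func main() {
	mi := "27 20 0:24 / /sys/fs/cgroup ro - tmpfs tmpfs ro\n" +
		"29 27 0:26 / /sys/fs/cgroup/systemd rw - cgroup cgroup rw,name=systemd"
	dir, err := cgroupMountDir(mi)
	fmt.Println(dir, err)
}
```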
Force-pushed df502f1 to 37a8448 (Compare)
/test
Force-pushed eb36e74 to 14e5ea3 (Compare)
/test
ping @raravena80
@devimc did you mean to ping @WeiZhang555? 😄
@raravena80 nope, this needs review 😄
raravena80 left a comment

Looks fine to me; I assume it passed all tests. Did you address @WeiZhang555's comments? I believe I'm not an approver.
@raravena80 thanks
Force-pushed 14e5ea3 to ca4b9cf (Compare)
/test
Force-pushed ca4b9cf to 498c7bc (Compare)
/test
Force-pushed 89e8893 to 01c86c3 (Compare)
Shimv2 tests are failing with this change. I'm trying to debug it, but I don't see logs in the journal. @lifupan any thoughts?
Force-pushed 5dcce06 to 746f6ad (Compare)
/test
ping @kata-containers/runtime
Force-pushed 746f6ad to 107faaf (Compare)
/test
@WeiZhang555 @lifupan @jcvenegas please take a look
```go
// Add shim into cgroup
if c.process.Pid > 0 {
	if err := cgroup.Add(cgroups.Process{Pid: c.process.Pid}); err != nil {
		return fmt.Errorf("Could not add PID %d to cgroup %v: %v", c.process.Pid, spec.Linux.CgroupsPath, err)
```
What about deleting the cgroup if the operation was not successful? In a defer, maybe?
I'd prefer to leave it there and delete it once the delete command is used; otherwise we'll face issues with tools that monitor the cgroups.
In that case, I wonder: shouldn't we avoid delaying `c.state.CgroupPath = spec.Linux.CgroupsPath` to the end? If we return in the middle, we don't save the path in the state.
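The suggestion of recording the path in the state before the fallible step can be sketched as below. All names here (`state`, `cgroupMgr`, `createCgroup`) are illustrative, not the PR's actual code; the point is only the ordering.

```go
package main

import "fmt"

// state records what we created, so that a later "delete" can clean up
// even if a step below fails.
type state struct {
	CgroupPath string
}

// cgroupMgr abstracts the "add a process to the cgroup" operation.
type cgroupMgr interface {
	Add(pid int) error
}

// createCgroup records the path in the state *before* adding the shim, so
// an error mid-way still leaves enough information for delete to find the
// cgroup later (rather than deleting eagerly, which confuses monitors).
func createCgroup(st *state, mgr cgroupMgr, path string, pid int) error {
	st.CgroupPath = path // persist first: delete can always find it
	if pid > 0 {
		if err := mgr.Add(pid); err != nil {
			return fmt.Errorf("could not add PID %d to cgroup %v: %v", pid, path, err)
		}
	}
	return nil
}

// failMgr simulates a cgroup backend whose Add always fails.
type failMgr struct{}

func (failMgr) Add(pid int) error { return fmt.Errorf("boom") }

func main() {
	st := &state{}
	err := createCgroup(st, failMgr{}, "/kata/sb1", 123)
	fmt.Println(st.CgroupPath, err != nil)
}
```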
CPU cgroups are container-specific, hence all containers, even the sandbox, should be able to create, delete, and update their cgroups. The cgroup created matches the cgroup path passed by the container manager.

fixes kata-containers#1117
fixes kata-containers#1118
fixes kata-containers#1021

Signed-off-by: Julio Montes <julio.montes@intel.com>
All containers run in different cgroups, even the sandbox. With this new implementation the sandbox CPU cgroup will be equal to the sum of all its containers, and the hypervisor process will be placed there, impacting the containers running in the sandbox (VM). The default number of vcpus is used when the sandbox has no constraints. For example, if default_vcpus is 2, then quota will be 200000 and period 100000.

**c-ray test**
http://www.futuretech.blinkenlights.nl/c-ray.html

```
+=============================================+
|         | 6 threads 6cpus | 1 thread 1 cpu  |
+=============================================+
| current | 40 seconds      | 122 seconds     |
+=============================================+
| new     | 37 seconds      | 124 seconds     |
+=============================================+
```

current = current cgroups implementation
new = new cgroups implementation

**workload**

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: c-ray
  annotations:
    io.kubernetes.cri.untrusted-workload: "true"
spec:
  restartPolicy: Never
  containers:
  - name: c-ray-1
    image: docker.io/devimc/c-ray:latest
    imagePullPolicy: IfNotPresent
    args: ["-t", "6", "-s", "1600x1200", "-r", "8", "-i", "/c-ray-1.1/sphfract", "-o", "/tmp/output.ppm"]
    resources:
      limits:
        cpu: 6
  - name: c-ray-2
    image: docker.io/devimc/c-ray:latest
    imagePullPolicy: IfNotPresent
    args: ["-t", "1", "-s", "1600x1200", "-r", "8", "-i", "/c-ray-1.1/sphfract", "-o", "/tmp/output.ppm"]
    resources:
      limits:
        cpu: 1
```

fixes kata-containers#1153

Signed-off-by: Julio Montes <julio.montes@intel.com>
When a container is killed by force, it MUST change its state to stopped immediately, to avoid leaving it in a bad state.

fixes kata-containers#1088

Signed-off-by: Julio Montes <julio.montes@intel.com>
To avoid long timeouts, the runtime shouldn't try to talk with the proxy when it's not running.

Signed-off-by: Julio Montes <julio.montes@intel.com>
Force-pushed 107faaf to 58d2785 (Compare)
lgtm
/test
Oh no... I wanted to review this, and I was expecting more approvals for this PR, since it is a pretty important/big change...
```go
// container was killed by force, container MUST change its state
// as soon as possible just in case one of below operations fail leaving
// the containers in a bad state.
if err := c.setContainerState(types.StateStopped); err != nil {
```
How is this related to the PR? I'm not saying it doesn't fix something, but it's been hidden in a PR that is supposed to fix cgroups, not the lifecycle of a container. This is the type of change that can have a big impact, and it would be good to review it standalone in its own PR.
```go
span.SetTag("request", request)
defer span.Finish()

if k.state.ProxyPid > 0 {
```
Same thing here: I don't think this is related to cgroups, and it should not be part of the same PR.
```go
}

func (q *qemu) pid() int {
	data, err := ioutil.ReadFile(q.pidFile())
```
The PID is stored in its own file? Don't you think we could improve this by making the PID part of the qemu state structure? It would be saved as part of the state.json file associated with the VM, avoiding yet another file being created.
All containers run in different cgroups, even the sandbox. With this new implementation the sandbox CPU cgroup will be equal to the sum of all its containers, and the hypervisor process will be placed there, impacting the containers running in the sandbox (VM). The default number of vcpus is used when the sandbox has no constraints. For example, if default_vcpus is 2, then quota will be 200000 and period 100000.

c-ray test
http://www.futuretech.blinkenlights.nl/c-ray.html

current = current cgroups implementation
new = new cgroups implementation

workload

fixes #1125

Signed-off-by: Julio Montes julio.montes@intel.com