Always enable cgroup namespace for containers
Fix #3734
In the cgroup v2 hierarchy, cgroup setup for nested containers (i.e. docker) is incorrect unless a cgroup namespace is enabled. This change enables a cgroup namespace for all containers to fix the incorrect cgroup setup.
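For context, giving a container its own cgroup namespace amounts to adding a `cgroup` entry to the `namespaces` list in its OCI runtime spec. A minimal illustrative fragment of a `config.json` (not the actual linuxkit change) might look like:

```json
{
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "mount" },
      { "type": "cgroup" }
    ]
  }
}
```

With the `cgroup` namespace present, processes inside the container see cgroup paths relative to their own namespace root rather than the host's full hierarchy.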
Thanks for the PR! I've enabled the CI, although it might not test the change. I'll investigate.
A progress update: when I enabled this locally I wasn't able to get dockerd + containerd to start in a container. It might be a problem in my setup. I'll investigate more and propose a test case if the problem persists.
@djs55 I think this might be related to something I'm looking at as well related to docker. Ignoring for the moment that the example on master doesn't build (something is wrong with the container image that gets stored, and so the resultant image doesn't appear to actually even have docker in it...), docker fails with the error below in my testing which I believe is related to cgroups:
ctr -n services.linuxkit task exec -exec-id debug docker docker run crccheck/hello-world
Unable to find image 'crccheck/hello-world:latest' locally
latest: Pulling from crccheck/hello-world
e685c5c858e3: Pulling fs layer
7bf3c383dbcd: Pulling fs layer
7bf3c383dbcd: Verifying Checksum
7bf3c383dbcd: Download complete
e685c5c858e3: Pull complete
7bf3c383dbcd: Pull complete
Digest: sha256:0404ca69b522f8629d7d4e9034a7afe0300b713354e8bf12ec9657581cf59400
Status: Downloaded newer image for crccheck/hello-world:latest
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: process_linux.go:508: setting cgroup config for procHooks process caused: failed to write "c 5:1 rwm": write /sys/fs/cgroup/devices/docker/18ee78ce6bc9e313c760dc1a1428ed719a980dbb96eadd4ba55b527458e66aa2/devices.allow: operation not permitted: unknown.
time="2022-01-13T07:58:34Z" level=error msg="error waiting for container: context canceled"
@the-maldridge interesting! I'll take a look at the example on master when I get a moment. It would be good to fix it and make sure the end-to-end tests are properly testing it.
I've seen a very similar error with devices.allow, but only with cgroup v1 and it seemed to be transient (!). My understanding is that this controller is used to prevent unauthorised containers running mknod to grant themselves access to the physical hardware. In cgroup v1 the controller (documented here) is configured by writing to the devices.allow file. I think in cgroup v2 it was removed and replaced with eBPF programs, so on a working cgroup v2 system I see:
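One quick way to tell which hierarchy a given machine is running (a sketch, assuming `/sys/fs/cgroup` is the standard mount point) is to check the filesystem type mounted there:

```shell
# cgroup2fs -> unified cgroup v2 hierarchy
# tmpfs     -> legacy cgroup v1 (per-controller mounts like devices/ underneath)
fstype=$(stat -fc %T /sys/fs/cgroup/ 2>/dev/null || echo unknown)
echo "cgroup filesystem: $fstype"
```

On a v1 system you would then expect to find the `devices` controller directory (and its `devices.allow` file) under `/sys/fs/cgroup/devices`, while on v2 there is no such directory and device policy comes from attached eBPF programs, as in the `bpftool cgroup tree` output above.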
dave@m1 ~ % docker run -it --privileged -v /sys/fs/bpf:/sys/fs/bpf -v /sys/fs/cgroup:/sys/fs/cgroup djs55/bpftool cgroup tree
CgroupPath
ID AttachType AttachFlags Name
/sys/fs/cgroup/006-metadata
21 device multi
/sys/fs/cgroup/011-bridge
36 device multi
/sys/fs/cgroup/dhcpcd
53 device multi
...
So I guess we should make sure the example works in both cgroup v1 and (default) cgroup v2 mode, if possible.
Well, this is far from transient: it happens without fail on every machine and instance I try this image on. I'm happy to pull logs or whatever else might be helpful to get this figured out, because it is preventing docker from working, and that in turn is blocking me from updating things.
I have done some more checking. I'm still not sure what's going on, but I figured I'd add more information. FWIW, all of this information is obtained from a system built with a patched linuxkit that includes this PR.
The docker error is the same as above: it cannot set up the container because the controller is in the "wrong" spot. I really think this may be a case where docker wants to see the host cgroup paths, since (as far as it's concerned) it is talking to a host containerd.
node1:/# ctr -n services.linuxkit task exec -exec-id debug docker ls /sys/fs/cgroup/devices
cgroup.clone_children
cgroup.procs
devices.allow
devices.deny
devices.list
notify_on_release
tasks
node1:/# ctr -n services.linuxkit task exec -exec-id debug docker ls /sys/fs/cgroup/devices/docker
ls: /sys/fs/cgroup/devices/docker: No such file or directory
node1:/# ls /sys/fs/cgroup/devices/
000-sysctl cgroup.clone_children devices.deny logwrite sshd
001-sysfs cgroup.procs devices.list nomad tasks
002-rngd_boot cgroup.sane_behavior dhcpcd notify_on_release vault
003-dhcpcd_boot consul docker openntpd
004-metadata coredns emissary release_agent
acpid devices.allow getty rngd
node1:/# ls /sys/fs/cgroup/devices/docker/
cgroup.clone_children devices.allow devices.list tasks
cgroup.procs devices.deny notify_on_release
node1:/# ctr -n services.linuxkit task exec -exec-id debug docker docker run --rm -i crccheck/hello-world
Unable to find image 'crccheck/hello-world:latest' locally
latest: Pulling from crccheck/hello-world
e685c5c858e3: Pulling fs layer
7bf3c383dbcd: Pulling fs layer
7bf3c383dbcd: Verifying Checksum
7bf3c383dbcd: Download complete
e685c5c858e3: Verifying Checksum
e685c5c858e3: Download complete
e685c5c858e3: Pull complete
7bf3c383dbcd: Pull complete
Digest: sha256:0404ca69b522f8629d7d4e9034a7afe0300b713354e8bf12ec9657581cf59400
Status: Downloaded newer image for crccheck/hello-world:latest
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: process_linux.go:508: setting cgroup config for procHooks process caused: failed to write "c 5:1 rwm": write /sys/fs/cgroup/devices/docker/37de90b430b4e77cb718c7105e1a62a22a0c47ee33cd259c0f2cea1722f454ba/devices.allow: operation not permitted: unknown.
This really does look like something is wrong with the way docker interacts with the host cgroups, and it seems to have changed between v0.8 and here (though so has a lot of other stuff).
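One way to see the path mismatch directly (a sketch; the idea is to run this inside the docker service container) is to look at `/proc/self/cgroup`. Without a cgroup namespace, each line names the full host-side path, which is what runc then tries to write under; inside a cgroup namespace, paths are reported relative to the namespace root, i.e. just `/`:

```shell
# Each line is hierarchy-id:controller-list:path. A path like
# "/docker" here means the process sees the host's view of the
# hierarchy; a bare "/" means it is inside its own cgroup namespace.
cat /proc/self/cgroup
```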
Further research shows that the component that broke between v0.8 and here is runc. This is where I have to temporarily admit defeat, as I don't understand enough of how runc and containerd fit together to fully understand what's going on here. All I know is that older versions of runc work and newer versions do not. I'll look through the runc changelog when I have time to see if I can find a specific issue.
Further progress!
Any version of runc past v1.0.0-rc90 breaks linuxkit. I'm not sure if it's worth running a bisect through runc, as runc appears to have introduced Go modules at some point during that window, which makes it tricky to build a nice one-line bisect.
A complete bisect finds that opencontainers/runc 60e21ec is the first bad commit. I don't see what to do from here, but a lot of the work after that point in the runc tree relates to cgroup v2 and to changes in the way it handles cgroups in general.
@the-maldridge I suspect the problem causing the docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: process_linux.go:508: setting cgroup config for procHooks process caused: failed to write "c 5:1 rwm": write /sys/fs/cgroup/devices/docker/18ee78ce6bc9e313c760dc1a1428ed719a980dbb96eadd4ba55b527458e66aa2/devices.allow: operation not permitted: unknown. error is that the docker service does not have access to the console device. I added this to my docker service definition and the problem went away:
devices:
- path: "/dev/console"
  type: "c"
  major: 5
  minor: 1
  mode: "0666"
- path: all
  type: b