Process capabilities cannot be retained when starting a container as non-root with --security-opt=no-new-privileges #45491
Description
When using Docker as the container runtime in Kubernetes, the capabilities specified in the container's security context (in the pod YAML manifest) are not respected if the container runs as a non-root user:
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    add:
    - NET_BIND_SERVICE
    drop:
    - ALL
  privileged: false
  runAsGroup: 107
  runAsNonRoot: true
  runAsUser: 107

$ k exec -ti virt-launcher-testvm-XXXX -- bash
bash-5.1$ grep Cap /proc/1/status
CapInh: 0000000000000000
CapPrm: 0000000000000000 # permitted caps zeroed
CapEff: 0000000000000000 # effective caps zeroed
CapBnd: 0000000000000400 # cap_net_bind_service
CapAmb: 0000000000000000

In the KubeVirt project we have had several similar issues reported: kubevirt/kubevirt#9465
This can be easily reproduced with minikube. Other runtimes (containerd and crio) handle the capabilities correctly:
CapInh: 0000000000000000
CapPrm: 0000000000000400 # cap_net_bind_service
CapEff: 0000000000000400 # cap_net_bind_service
CapBnd: 0000000000000400 # cap_net_bind_service
CapAmb: 0000000000000000
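
The hexadecimal masks above can be decoded by hand: bit N of a mask corresponds to capability number N in linux/capability.h, so 0x400 (bit 10) is cap_net_bind_service. A minimal shell sketch of that decoding follows. The `decode_caps` helper is hypothetical (not part of Docker or Kubernetes), and it only knows capability numbers 0 to 13:

```shell
# Hypothetical helper: decode a capability mask as printed in /proc/<pid>/status.
# Only capability numbers 0-13 are listed; higher bits are ignored in this sketch.
decode_caps() {
    mask=$((0x$1))
    bit=0
    out=""
    for name in chown dac_override dac_read_search fowner fsetid kill \
                setgid setuid setpcap linux_immutable net_bind_service \
                net_broadcast net_admin net_raw; do
        # Append the name when the bit at the current position is set.
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            out="${out:+$out,}cap_$name"
        fi
        bit=$((bit + 1))
    done
    printf '%s\n' "$out"
}

decode_caps 0000000000000400   # CapBnd in both outputs: prints cap_net_bind_service
decode_caps 0000000000000000   # CapPrm/CapEff under Docker: prints an empty line
```

Where libcap is installed, `capsh --decode=0000000000000400` performs the same decoding.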
I briefly looked at the sources. I am not 100% confident that this snippet is actually causing the problem, but the code below looked suspicious to me:
Lines 31 to 35 in c651a53
// Do not set Effective and Permitted capabilities for non-root users,
// to match what execve does.
s.Process.Capabilities = &specs.LinuxCapabilities{
	Bounding: caplist,
}
It was introduced by commit 349aeea (and refactored in 0d9a37d).
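
The comment in that snippet refers to how execve(2) recomputes the capability sets. As a rough illustration, the transformation documented in capabilities(7) can be sketched in shell for a non-root process. This is a simplified model under stated assumptions: it ignores set-user-ID/set-group-ID binaries, the root special cases, and the effect of no_new_privs. The `execve_caps` helper and its argument order are hypothetical.

```shell
# Hypothetical helper modelling the execve() rules from capabilities(7)
# for a non-root process. Not modelled: set-user-ID/set-group-ID files,
# root special cases, no_new_privs.
# Usage: execve_caps P_INH P_BND P_AMB F_INH F_PRM F_EFF
#   P_* = process sets before execve, F_* = file capability sets (hex, no 0x)
#   F_EFF = file effective bit, 0 or 1
execve_caps() {
    p_inh=$((0x$1)); p_bnd=$((0x$2)); p_amb=$((0x$3))
    f_inh=$((0x$4)); f_prm=$((0x$5)); f_eff=$6

    # A file that carries capabilities is "privileged": ambient caps are dropped.
    if [ $((f_inh | f_prm)) -ne 0 ]; then
        new_amb=0
    else
        new_amb=$p_amb
    fi

    # P'(permitted) = (P(inh) & F(inh)) | (F(prm) & P(bounding)) | P'(ambient)
    new_prm=$(( (p_inh & f_inh) | (f_prm & p_bnd) | new_amb ))
    # P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
    if [ "$f_eff" -eq 1 ]; then new_eff=$new_prm; else new_eff=$new_amb; fi

    printf 'CapPrm: %016x\nCapEff: %016x\nCapAmb: %016x\n' \
        "$new_prm" "$new_eff" "$new_amb"
}

# Only the bounding set is populated and the file has no capabilities:
execve_caps 0 400 0 0 0 0
# cap_net_bind_service is also in the inheritable and ambient sets:
execve_caps 400 400 400 0 0 0
```

In this model, a process that has only the bounding set populated ends up with empty permitted and effective sets after execve, which matches the Docker output in this report. The sketch does not by itself explain why containerd and crio report different values.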
Reproduce
$ minikube start --driver=kvm2
$ k create -f https://github.com/kubevirt/kubevirt/releases/download/v0.59.0/kubevirt-operator.yaml
$ k create -f https://github.com/kubevirt/kubevirt/releases/download/v0.59.0/kubevirt-cr.yaml
$ wget https://kubevirt.io/labs/manifests/vm.yaml
$ vim vm.yaml # add annotation `kubevirt.io/keep-launcher-alive-after-failure: "true"`
$ k create -f vm.yaml
$ k edit vm testvm # set `running: true`
$ k logs -f virt-launcher-testvm-XXXX
...
{"component":"virt-launcher","level":"error","msg":"failed to start virtqemud","pos":"libvirt_helper.go:250","reason":"fork/exec /usr/sbin/virtqemud: errno 0","timestamp":"2023-05-08T09:34:32.370373Z"}
panic: fork/exec /usr/sbin/virtqemud: errno 0
...
$ k exec -ti virt-launcher-testvm-XXXX -- bash
bash-5.1$ grep Cap /proc/1/status
CapInh: 0000000000000000
CapPrm: 0000000000000000 # permitted caps zeroed
CapEff: 0000000000000000 # effective caps zeroed
CapBnd: 0000000000000400 # cap_net_bind_service
CapAmb: 0000000000000000
Expected behavior
Effective/permitted caps should be set correctly:
CapPrm: 0000000000000400
CapEff: 0000000000000400
docker version
Client:
Version: 20.10.23
API version: 1.41
Go version: go1.18.10
Git commit: 7155243
Built: Thu Jan 19 17:30:35 2023
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.23
API version: 1.41 (minimum version 1.12)
Go version: go1.18.10
Git commit: 6051f14
Built: Thu Jan 19 17:36:08 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.7.0
GitCommit: 1fbd70374134b891f97ce19c70b6e50c7b9f4e0d
runc:
Version: 1.1.5
GitCommit: f19387a6bec4944c770f7668ab51c4348d9c2f38
docker-init:
Version: 0.19.0
GitCommit: de40ad0
docker info
Client:
Context: default
Debug Mode: false
Server:
Containers: 34
Running: 28
Paused: 0
Stopped: 6
Images: 14
Server Version: 20.10.23
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 1fbd70374134b891f97ce19c70b6e50c7b9f4e0d
runc version: f19387a6bec4944c770f7668ab51c4348d9c2f38
init version: de40ad0
Security Options:
seccomp
Profile: default
Kernel Version: 5.10.57
Operating System: Buildroot 2021.02.12
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.22GiB
Name: minikube
ID: 462Q:TJOC:6UQE:VT5O:7XAO:AS3J:5M6Q:VOT3:HXV2:HTVP:4TFY:4W7K
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
provider=kvm2
Experimental: false
Insecure Registries:
10.96.0.0/12
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
WARNING: No blkio throttle.read_bps_device support
WARNING: No blkio throttle.write_bps_device support
WARNING: No blkio throttle.read_iops_device support
WARNING: No blkio throttle.write_iops_device support
Additional Info
This can also be reproduced without KubeVirt:
$ k apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: sleeper
spec:
  restartPolicy: Never
  terminationGracePeriodSeconds: 30
  containers:
  - name: sleeper
    image: busybox
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        add:
        - NET_BIND_SERVICE
        drop:
        - ALL
      privileged: false
      runAsGroup: 107
      runAsNonRoot: true
      runAsUser: 107
    command:
    - /bin/sh
    - "-euxc"
    - |
      sleep infinity
EOF
$ k exec -ti sleeper -- sh
~ $ ps aux
PID USER TIME COMMAND
1 107 0:00 /bin/sh -euxc sleep infinity
13 107 0:00 sh
19 107 0:00 ps aux
~ $ grep Cap /proc/1/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000400
CapAmb: 0000000000000000