Process capabilities cannot be retained when starting a container as non-root with --security-opt=no-new-privileges #45491

@vasiliy-ul

Description

When Docker is used as the container runtime in Kubernetes, the capabilities specified in the container's security context (in the pod YAML manifest) are not respected if the container runs as a non-root user:

    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        add:
        - NET_BIND_SERVICE
        drop:
        - ALL
      privileged: false
      runAsGroup: 107
      runAsNonRoot: true
      runAsUser: 107

$ k exec -ti virt-launcher-testvm-XXXX -- bash
bash-5.1$ grep Cap /proc/1/status 
CapInh:	0000000000000000
CapPrm:	0000000000000000 # permitted caps zeroed
CapEff:	0000000000000000 # effective caps zeroed
CapBnd:	0000000000000400 # cap_net_bind_service
CapAmb:	0000000000000000

In the KubeVirt project we have had several similar issues reported: kubevirt/kubevirt#9465

This can be easily reproduced with minikube. Other runtimes (containerd and crio) handle the capabilities correctly:

CapInh:	0000000000000000
CapPrm:	0000000000000400 # cap_net_bind_service
CapEff:	0000000000000400 # cap_net_bind_service
CapBnd:	0000000000000400 # cap_net_bind_service
CapAmb:	0000000000000000
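For readers comparing the two outputs: each Cap* field is a hex bitmask in which bit N corresponds to capability number N from linux/capability.h. CAP_NET_BIND_SERVICE is number 10, so 0x400 is 1<<10. The following is a minimal Go sketch for decoding such a field. The function name decodeCapMask and the short name table are illustrative only and are not part of moby or any library.

```go
package main

import (
	"fmt"
	"strconv"
)

// capNames maps a few capability numbers (as defined in linux/capability.h)
// to their names. The table is deliberately incomplete; bits that are not
// listed are reported by number.
var capNames = map[int]string{
	0:  "cap_chown",
	10: "cap_net_bind_service",
	12: "cap_net_admin",
	21: "cap_sys_admin",
}

// decodeCapMask parses a hex mask as printed in /proc/<pid>/status and
// returns the names of the set bits, lowest bit first.
func decodeCapMask(hexMask string) ([]string, error) {
	mask, err := strconv.ParseUint(hexMask, 16, 64)
	if err != nil {
		return nil, err
	}
	var names []string
	for bit := 0; bit < 64; bit++ {
		if mask&(1<<uint(bit)) == 0 {
			continue
		}
		if name, ok := capNames[bit]; ok {
			names = append(names, name)
		} else {
			names = append(names, fmt.Sprintf("cap_%d", bit))
		}
	}
	return names, nil
}

func main() {
	for _, field := range []string{"0000000000000400", "0000000000000000"} {
		names, err := decodeCapMask(field)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s -> %v\n", field, names)
	}
}
```

With the values from this report, the bounding set decodes to cap_net_bind_service, while the zeroed permitted and effective sets decode to an empty list.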

I briefly looked at the sources. I am not 100% confident that this snippet is actually causing the problem, but the code below looked suspicious to me:

moby/oci/oci.go

Lines 31 to 35 in c651a53

// Do not set Effective and Permitted capabilities for non-root users,
// to match what execve does.
s.Process.Capabilities = &specs.LinuxCapabilities{
	Bounding: caplist,
}

It was introduced by this commit 349aeea (and refactored in 0d9a37d).
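To make the suspicion concrete, here is a rough Go sketch that contrasts the bounding-only assignment quoted above with one that also fills the Effective and Permitted sets, which would be consistent with the masks seen under containerd and crio. It uses a local stand-in type instead of importing the OCI runtime-spec package. The helper setCaps and its keepForNonRoot switch are hypothetical, the root-user branch is an assumption, and none of this is claimed to be the actual moby code or its eventual fix.

```go
package main

import "fmt"

// linuxCapabilities is a local stand-in that mirrors the capability-set
// fields of specs.LinuxCapabilities from the OCI runtime spec.
type linuxCapabilities struct {
	Bounding    []string
	Effective   []string
	Inheritable []string
	Permitted   []string
	Ambient     []string
}

// setCaps is a hypothetical helper. With keepForNonRoot=false it reproduces
// the quoted behaviour for non-root users (bounding set only); otherwise it
// also populates the effective and permitted sets.
func setCaps(caplist []string, uid uint32, keepForNonRoot bool) *linuxCapabilities {
	if uid != 0 && !keepForNonRoot {
		// Quoted snippet: only the bounding set is passed to the runtime.
		return &linuxCapabilities{Bounding: caplist}
	}
	// Assumed alternative: request the same list in all three sets.
	return &linuxCapabilities{
		Bounding:  caplist,
		Effective: caplist,
		Permitted: caplist,
	}
}

func main() {
	caps := []string{"CAP_NET_BIND_SERVICE"}

	current := setCaps(caps, 107, false)
	fmt.Printf("bounding-only: Bnd=%v Prm=%v Eff=%v\n",
		current.Bounding, current.Permitted, current.Effective)

	alternative := setCaps(caps, 107, true)
	fmt.Printf("all sets:      Bnd=%v Prm=%v Eff=%v\n",
		alternative.Bounding, alternative.Permitted, alternative.Effective)
}
```

Whether the runtime retains the permitted and effective sets after switching to the non-root UID is a separate question that this sketch does not answer.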

Reproduce

$ minikube start --driver=kvm2
$ k create -f https://github.com/kubevirt/kubevirt/releases/download/v0.59.0/kubevirt-operator.yaml
$ k create -f https://github.com/kubevirt/kubevirt/releases/download/v0.59.0/kubevirt-cr.yaml
$ wget https://kubevirt.io/labs/manifests/vm.yaml
$ vim vm.yaml # add annotation `kubevirt.io/keep-launcher-alive-after-failure: "true"`
$ k create -f vm.yaml
$ k edit vm testvm # set `running: true`
$ k logs -f virt-launcher-testvm-XXXX
...
{"component":"virt-launcher","level":"error","msg":"failed to start virtqemud","pos":"libvirt_helper.go:250","reason":"fork/exec /usr/sbin/virtqemud: errno 0","timestamp":"2023-05-08T09:34:32.370373Z"}
panic: fork/exec /usr/sbin/virtqemud: errno 0
...
$ k exec -ti virt-launcher-testvm-XXXX -- bash
bash-5.1$ grep Cap /proc/1/status 
CapInh:	0000000000000000
CapPrm:	0000000000000000 # permitted caps zeroed
CapEff:	0000000000000000 # effective caps zeroed
CapBnd:	0000000000000400 # cap_net_bind_service
CapAmb:	0000000000000000

Expected behavior

Effective/permitted caps should be set correctly:

CapPrm:	0000000000000400
CapEff:	0000000000000400
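A check for this expectation could be scripted along the following lines. This is only a sketch: it parses text in the format of /proc/<pid>/status and reports whether a capability bit is present in both CapPrm and CapEff. The names parseCapFields and inPermittedAndEffective are made up for illustration, and the sample inputs are copied from the outputs in this report.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// capNetBindService is the number of CAP_NET_BIND_SERVICE in linux/capability.h.
const capNetBindService = 10

// parseCapFields extracts the Cap* lines from /proc/<pid>/status-style text
// and returns their masks keyed by field name (for example "CapEff").
func parseCapFields(status string) (map[string]uint64, error) {
	fields := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		key, value, ok := strings.Cut(sc.Text(), ":")
		if !ok || !strings.HasPrefix(key, "Cap") {
			continue
		}
		// Use only the first token so trailing annotations are ignored.
		tokens := strings.Fields(value)
		if len(tokens) == 0 {
			continue
		}
		mask, err := strconv.ParseUint(tokens[0], 16, 64)
		if err != nil {
			return nil, fmt.Errorf("parsing %s: %w", key, err)
		}
		fields[key] = mask
	}
	return fields, sc.Err()
}

// inPermittedAndEffective reports whether the capability bit is set in both
// the permitted and the effective set.
func inPermittedAndEffective(fields map[string]uint64, bit uint) bool {
	want := uint64(1) << bit
	return fields["CapPrm"]&want != 0 && fields["CapEff"]&want != 0
}

func main() {
	samples := []struct{ name, status string }{
		{"docker", "CapPrm:\t0000000000000000\nCapEff:\t0000000000000000\nCapBnd:\t0000000000000400\n"},
		{"containerd", "CapPrm:\t0000000000000400\nCapEff:\t0000000000000400\nCapBnd:\t0000000000000400\n"},
	}
	for _, s := range samples {
		fields, err := parseCapFields(s.status)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s: net_bind_service permitted+effective = %v\n",
			s.name, inPermittedAndEffective(fields, capNetBindService))
	}
}
```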

docker version

Client:
 Version:           20.10.23
 API version:       1.41
 Go version:        go1.18.10
 Git commit:        7155243
 Built:             Thu Jan 19 17:30:35 2023
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.23
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.10
  Git commit:       6051f14
  Built:            Thu Jan 19 17:36:08 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.7.0
  GitCommit:        1fbd70374134b891f97ce19c70b6e50c7b9f4e0d
 runc:
  Version:          1.1.5
  GitCommit:        f19387a6bec4944c770f7668ab51c4348d9c2f38
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 34
  Running: 28
  Paused: 0
  Stopped: 6
 Images: 14
 Server Version: 20.10.23
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1fbd70374134b891f97ce19c70b6e50c7b9f4e0d
 runc version: f19387a6bec4944c770f7668ab51c4348d9c2f38
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.10.57
 Operating System: Buildroot 2021.02.12
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 3.22GiB
 Name: minikube
 ID: 462Q:TJOC:6UQE:VT5O:7XAO:AS3J:5M6Q:VOT3:HXV2:HTVP:4TFY:4W7K
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
  provider=kvm2
 Experimental: false
 Insecure Registries:
  10.96.0.0/12
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

WARNING: No blkio throttle.read_bps_device support
WARNING: No blkio throttle.write_bps_device support
WARNING: No blkio throttle.read_iops_device support
WARNING: No blkio throttle.write_iops_device support

Additional Info

This can also be reproduced without KubeVirt:

$ k apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: sleeper
spec:
  restartPolicy: Never
  terminationGracePeriodSeconds: 30
  containers:
  - name: sleeper
    image: busybox
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        add:
        - NET_BIND_SERVICE
        drop:
        - ALL
      privileged: false
      runAsGroup: 107
      runAsNonRoot: true
      runAsUser: 107
    command:
    - /bin/sh
    - "-euxc"
    - |
      sleep infinity
EOF
$ k exec -ti sleeper -- sh
~ $ ps aux
PID   USER     TIME  COMMAND
    1 107       0:00 /bin/sh -euxc sleep infinity 
   13 107       0:00 sh
   19 107       0:00 ps aux
~ $ grep Cap /proc/1/status
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	0000000000000400
CapAmb:	0000000000000000
