'runc exec' errors with 'failed to setns into net namespace: Operation not permitted' #4390

@thundergolfer

Description

At modal.com we run a custom multi-tenant container runtime that can use either runc or runsc (gVisor). runsc exec works for us, but runc exec fails, and after a long debugging session I still can't find the root cause.

Running runc exec ta-01J5P4BZS64CE57EXK048QMNE1 bash fails with EPERM when it tries to enter the runc container's network namespace.

Using sudo strace -ft runc exec -cap CAP_SYS_ADMIN ta-01J5P4BZS64CE57EXK048QMNE1 bash I can see that specifically it's failing on the setns syscall like this:

[pid 1021859] 20:17:39 setns(11, CLONE_NEWNET) = -1 EPERM (Operation not permitted)

Oddly, running sudo nsenter --all --target=267854 ls from the same terminal works. If I strace that command, I can see that it makes the same syscalls as runc exec, albeit in a different order.

17:30:23 openat(AT_FDCWD, "/proc/267854/ns/user", O_RDONLY) = 3
17:30:23 openat(AT_FDCWD, "/proc/267854/ns/cgroup", O_RDONLY) = 4
17:30:23 openat(AT_FDCWD, "/proc/267854/ns/ipc", O_RDONLY) = 5
17:30:23 openat(AT_FDCWD, "/proc/267854/ns/uts", O_RDONLY) = 6
17:30:23 openat(AT_FDCWD, "/proc/267854/ns/net", O_RDONLY) = 7
17:30:23 openat(AT_FDCWD, "/proc/267854/ns/pid", O_RDONLY) = 8
17:30:23 openat(AT_FDCWD, "/proc/267854/ns/mnt", O_RDONLY) = 9
17:30:23 setgroups(0, NULL)             = 0
17:30:23 setns(4, CLONE_NEWCGROUP)      = 0
17:30:23 close(4)                       = 0
17:30:23 setns(5, CLONE_NEWIPC)         = 0
17:30:23 close(5)                       = 0
17:30:23 setns(6, CLONE_NEWUTS)         = 0
17:30:23 close(6)                       = 0
17:30:23 setns(7, CLONE_NEWNET)         = 0
17:30:23 close(7)                       = 0
17:30:23 setns(8, CLONE_NEWPID)         = 0
17:30:23 close(8)                       = 0
17:30:23 setns(9, CLONE_NEWNS)          = 0
17:30:23 close(9)                       = 0
17:30:23 setns(3, CLONE_NEWUSER)        = 0
17:30:23 close(3)                       = 0
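
To make the ordering difference easier to see, here is a minimal sketch that extracts the setns order from strace output. The helper name setns_order and the trimmed SAMPLE text are illustrative only. In the nsenter trace above the user namespace is joined last, while the runc debug log further down joins it first. That may matter because, per setns(2), joining a network namespace requires CAP_SYS_ADMIN in the user namespace that owns the target namespace.

```python
import re

# Matches strace lines such as:
#   17:30:23 openat(AT_FDCWD, "/proc/267854/ns/net", O_RDONLY) = 7
OPEN_RE = re.compile(
    r'openat\(AT_FDCWD, "(/proc/\d+/ns/(\w+))", O_RDONLY\)\s*=\s*(\d+)'
)
# Matches strace lines such as:
#   17:30:23 setns(7, CLONE_NEWNET)         = 0
SETNS_RE = re.compile(r'setns\((\d+), (CLONE_NEW\w+)\)\s*=\s*(-?\d+)')


def setns_order(strace_text):
    """Return [(namespace_name, clone_flag, return_code), ...] in call order."""
    fd_to_ns = {}
    order = []
    for line in strace_text.splitlines():
        m = OPEN_RE.search(line)
        if m:
            # Remember which namespace file each descriptor refers to.
            fd_to_ns[int(m.group(3))] = m.group(2)
            continue
        m = SETNS_RE.search(line)
        if m:
            fd = int(m.group(1))
            order.append((fd_to_ns.get(fd, "?"), m.group(2), int(m.group(3))))
    return order


# Trimmed excerpt of the nsenter trace, for illustration.
SAMPLE = """\
17:30:23 openat(AT_FDCWD, "/proc/267854/ns/user", O_RDONLY) = 3
17:30:23 openat(AT_FDCWD, "/proc/267854/ns/net", O_RDONLY) = 7
17:30:23 setns(7, CLONE_NEWNET)         = 0
17:30:23 close(7)                       = 0
17:30:23 setns(3, CLONE_NEWUSER)        = 0
"""

if __name__ == "__main__":
    for name, flag, rc in setns_order(SAMPLE):
        print(f"{name:8s} {flag:16s} rc={rc}")
```

Feeding it the full output of both strace runs gives two short lists that can be compared side by side.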

Things I've looked into:

  • Capabilities — I'm running with sudo so this shouldn't be a problem
  • Namespace hierarchy — looks correct
  • Seccomp — doesn't appear active
  • SELinux — is disabled
  • AppArmor — is disabled

I'm stuck on figuring out what's wrong here. My next move was going to be compiling my own runc to add debugging code into nsexec.c.

Steps to reproduce the issue

I fear this is tricky to reproduce, but I will provide details on what we're doing:

  1. From our container runtime running as root: runc --systemd-cgroup run ta-123 --bundle $BUNDLE_PATH
    a. config.json given below
  2. From a terminal on the same host: sudo runc --debug exec -c CAP_SYS_ADMIN ta-01J6NQG0GEHAQ07FTVHC4GAS64 ls

The container's network namespace is created by our container runtime with ip netns add ta-123 before the container is created, and inside a CreateRuntime hook we use the CNI Bridge and Loopback plugins to set up lo and eth0.

config.json
{
  "annotations": {
    "org.systemd.property.IPAccounting": "true"
  },
  "hooks": {
    "createRuntime": [
      {
        "args": ["$OMITTED$"],
        "path": "/usr/bin/python3"
      }
    ],
    "postStop": [
      {
        "args": ["$OMITTED$"],
        "path": "/usr/bin/python3"
      }
    ]
  },
  "hostname": "modal",
  "linux": {
    "cgroupsPath": "modal.slice:container:ta-01J6NG5R94KSZSJAHD2XXMYXMS",
    "gidMappings": [
      {
        "containerID": 0,
        "hostID": 0,
        "size": 1
      },
      {
        "containerID": 1,
        "hostID": 100000,
        "size": 16777216
      }
    ],
    "maskedPaths": [
      "/proc/acpi",
      "/proc/asound",
      "/proc/kcore",
      "/proc/keys",
      "/proc/latency_stats",
      "/proc/timer_list",
      "/proc/timer_stats",
      "/proc/sched_debug",
      "/sys/devices/virtual",
      "/sys/firmware",
      "/proc/scsi"
    ],
    "namespaces": [
      {
        "type": "pid"
      },
      {
        "type": "ipc"
      },
      {
        "type": "uts"
      },
      {
        "type": "mount"
      },
      {
        "type": "user"
      },
      {
        "type": "cgroup"
      },
      {
        "path": "/run/netns/ta-01J6NG5R94KSZSJAHD2XXMYXMS",
        "type": "network"
      }
    ],
    "readonlyPaths": [
      "/proc/bus",
      "/proc/fs",
      "/proc/irq",
      "/proc/sys",
      "/proc/sysrq-trigger"
    ],
    "resources": {
      "cpu": {
        "period": 100000,
        "quota": 412500,
        "shares": 128
      },
      "memory": {
        "reservation": 134217728
      }
    },
    "sysctl": {},
    "uidMappings": [
      {
        "containerID": 0,
        "hostID": 0,
        "size": 1
      },
      {
        "containerID": 1,
        "hostID": 100000,
        "size": 16777216
      }
    ]
  },
  "mounts": [
    {
      "destination": "/proc",
      "source": "proc",
      "type": "proc"
    },
    {
      "destination": "/dev",
      "options": [
        "nosuid",
        "strictatime",
        "mode=755",
        "size=65536k"
      ],
      "source": "tmpfs",
      "type": "tmpfs"
    },
    {
      "destination": "/dev/pts",
      "options": [
        "nosuid",
        "noexec",
        "newinstance",
        "ptmxmode=0666",
        "mode=0620"
      ],
      "source": "devpts",
      "type": "devpts"
    },
    {
      "destination": "/dev/shm",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "mode=1777",
        "size=65536k"
      ],
      "source": "shm",
      "type": "tmpfs"
    },
    {
      "destination": "/dev/mqueue",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ],
      "source": "mqueue",
      "type": "mqueue"
    },
    {
      "destination": "/sys",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "ro",
        "rbind"
      ],
      "source": "/sys",
      "type": "bind"
    },
    {
      "destination": "/sys/fs/cgroup",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "relatime",
        "ro"
      ],
      "source": "cgroup",
      "type": "cgroup"
    },
    {
      "destination": "/etc/resolv.conf",
      "options": [
        "ro",
        "rbind",
        "rprivate",
        "nosuid",
        "noexec",
        "nodev"
      ],
      "source": "/opt/container-etc-resolv.conf",
      "type": "bind"
    },
    {
      "destination": "/run/modal.sock",
      "options": [
        "nosuid",
        "nodev",
        "noexec",
        "bind",
        "private"
      ],
      "source": "/run/modal-ta-01J6NG5R94KSZSJAHD2XXMYXMS-388233379.sock",
      "type": "bind"
    }
  ],
  "ociVersion": "1.0.2-dev",
  "process": {
    "args": [
      "/bin/dumb-init",
      "--",
      "python",
      "-u",
      "-R",
      "--check-hash-based-pycs",
      "never",
      "-m",
      "modal._container_entrypoint",
     "Ch10YS0wMUo2Tkc1Ujk0S1NaU0pBSEQyWFhNWVhNUxIZZnUtMGRmNTA1WFU4ZnFrZkp5a2RCQUV3bCIZYXAtd1JBaXM5UlhJWnpFcDhrRXhFT2lMeDqPAQoCZjESAWYaGW1vLVFBUnBqRDNvbDJhWFd1V0pyYUxBclIaGW1vLWNNQkhGWUJLTHZyemowN1lXeWM5NjQiGWltLWN1eWltWlBhSXR3SjFQamowTFB5OTc4AkACSgIiAKgBkE7yAQRydW5j2gIbChlpbS1jdXlpbVpQYUl0d0oxUGpqMExQeTk3wgMA+AMBagRtYWlu"
    ],
    "capabilities": {
      "ambient": [
        "CAP_AUDIT_WRITE",
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FOWNER",
        "CAP_FSETID",
        "CAP_KILL",
        "CAP_MKNOD",
        "CAP_NET_BIND_SERVICE",
        "CAP_NET_RAW",
        "CAP_SETFCAP",
        "CAP_SETGID",
        "CAP_SETPCAP",
        "CAP_SETUID",
        "CAP_SYS_CHROOT"
      ],
      "bounding": [
        "CAP_AUDIT_WRITE",
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FOWNER",
        "CAP_FSETID",
        "CAP_KILL",
        "CAP_MKNOD",
        "CAP_NET_BIND_SERVICE",
        "CAP_NET_RAW",
        "CAP_SETFCAP",
        "CAP_SETGID",
        "CAP_SETPCAP",
        "CAP_SETUID",
        "CAP_SYS_CHROOT"
      ],
      "effective": [
        "CAP_AUDIT_WRITE",
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FOWNER",
        "CAP_FSETID",
        "CAP_KILL",
        "CAP_MKNOD",
        "CAP_NET_BIND_SERVICE",
        "CAP_NET_RAW",
        "CAP_SETFCAP",
        "CAP_SETGID",
        "CAP_SETPCAP",
        "CAP_SETUID",
        "CAP_SYS_CHROOT"
      ],
      "inheritable": [
        "CAP_AUDIT_WRITE",
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FOWNER",
        "CAP_FSETID",
        "CAP_KILL",
        "CAP_MKNOD",
        "CAP_NET_BIND_SERVICE",
        "CAP_NET_RAW",
        "CAP_SETFCAP",
        "CAP_SETGID",
        "CAP_SETPCAP",
        "CAP_SETUID",
        "CAP_SYS_CHROOT"
      ],
      "permitted": [
        "CAP_AUDIT_WRITE",
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FOWNER",
        "CAP_FSETID",
        "CAP_KILL",
        "CAP_MKNOD",
        "CAP_NET_BIND_SERVICE",
        "CAP_NET_RAW",
        "CAP_SETFCAP",
        "CAP_SETGID",
        "CAP_SETPCAP",
        "CAP_SETUID",
        "CAP_SYS_CHROOT"
      ]
    },
    "cwd": "/root",
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "TERM=xterm",
      "SSL_CERT_DIR=/etc/ssl/certs",
      "SOURCE_DATE_EPOCH=1641013200",
      "PIP_NO_CACHE_DIR=off",
      "PYTHONHASHSEED=0",
      "PIP_ROOT_USER_ACTION=ignore",
      "CFLAGS=-g0",
      "PIP_DEFAULT_TIMEOUT=30",
      "BLIS_NUM_THREADS=1",
      "GPG_KEY=A035C8C19219BA821ECEA86B64E628F8D684696D",
      "LANG=C.UTF-8",
      "MKL_NUM_THREADS=1",
      "OMP_NUM_THREADS=1",
      "OPENBLAS_NUM_THREADS=1",
      "PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "PYTHONPATH=/pkg/:/root/"
    ],
    "noNewPrivileges": true,
    "rlimits": [
      {
        "hard": 65536,
        "soft": 65536,
        "type": "RLIMIT_NOFILE"
      }
    ],
    "terminal": false,
    "user": {
      "gid": 0,
      "uid": 0
    }
  },
  "root": {
    "path": "/tmp/task-data-cTR0Dv/ta-01J6NG5R94KSZSJAHD2XXMYXMS/.tmpZue7Vk/rootfs",
    "readonly": false
  }
}
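
The config above asks runc to create a fresh user namespace while joining a network namespace by path. Because the owner of a namespace is fixed when it is created, a namespace that exists before the container starts cannot be owned by the container's new user namespace. The following sketch flags that combination in a parsed config.json. The function names and the trimmed SAMPLE dict are illustrative, and whether this combination explains the EPERM is an assumption to verify.

```python
def namespaces_joined_by_path(config):
    """Return {type: path} for namespaces the container joins rather than creates."""
    return {
        ns["type"]: ns["path"]
        for ns in config.get("linux", {}).get("namespaces", [])
        if ns.get("path")
    }


def creates_new_userns(config):
    """True if the config asks the runtime to create a fresh user namespace."""
    return any(
        ns["type"] == "user" and not ns.get("path")
        for ns in config.get("linux", {}).get("namespaces", [])
    )


def ownership_warnings(config):
    """Flag namespaces joined by path while a fresh user namespace is created."""
    if not creates_new_userns(config):
        return []
    return [
        f"{ns_type} namespace at {path} was created outside "
        "the container's new user namespace"
        for ns_type, path in sorted(namespaces_joined_by_path(config).items())
    ]


# Trimmed version of the "namespaces" section from the config above.
SAMPLE = {
    "linux": {
        "namespaces": [
            {"type": "pid"},
            {"type": "user"},
            {"path": "/run/netns/ta-01J6NG5R94KSZSJAHD2XXMYXMS", "type": "network"},
        ]
    }
}

if __name__ == "__main__":
    for warning in ownership_warnings(SAMPLE):
        print(warning)
```

To run it against a real bundle, load the file with json.load and pass the resulting dict to ownership_warnings.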

Describe the results you received and expected

I expect that runc exec will succeed, but it fails on entering the network namespace. Full failure:

sudo runc --debug exec -c CAP_SYS_ADMIN ta-01J6NQG0GEHAQ07FTVHC4GAS64 ip
DEBU[0000] nsexec[1889889]: => nsexec container setup
DEBU[0000] nsexec[1889889]: set process as non-dumpable
DEBU[0000] nsexec-0[1889889]: ~> nsexec stage-0
DEBU[0000] nsexec-0[1889889]: spawn stage-1
DEBU[0000] nsexec-0[1889889]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[1889891]: ~> nsexec stage-1
DEBU[0000] nsexec-1[1889891]: setns(0x10000000) into user namespace (with path /proc/1813667/ns/user)
DEBU[0000] nsexec-1[1889891]: setns(0x8000000) into ipc namespace (with path /proc/1813667/ns/ipc)
DEBU[0000] nsexec-1[1889891]: setns(0x4000000) into uts namespace (with path /proc/1813667/ns/uts)
DEBU[0000] nsexec-1[1889891]: setns(0x40000000) into net namespace (with path /proc/1813667/ns/net)
FATA[0000] nsexec-1[1889891]: failed to setns into net namespace: Operation not permitted
FATA[0000] nsexec-0[1889889]: failed to sync with stage-1: next state: Invalid argument
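
The hex values in the debug log are CLONE_NEW* flags. A small sketch like the one below (the names decode_setns_log and SAMPLE_LOG are illustrative) turns the log into a readable join order. For this log it shows the user namespace being joined before the network namespace.

```python
import re

# CLONE_NEW* values from <linux/sched.h>.
CLONE_FLAGS = {
    0x00020000: "CLONE_NEWNS",
    0x02000000: "CLONE_NEWCGROUP",
    0x04000000: "CLONE_NEWUTS",
    0x08000000: "CLONE_NEWIPC",
    0x10000000: "CLONE_NEWUSER",
    0x20000000: "CLONE_NEWPID",
    0x40000000: "CLONE_NEWNET",
}

# Matches runc --debug lines such as:
#   setns(0x40000000) into net namespace (with path /proc/1813667/ns/net)
LOG_RE = re.compile(
    r'setns\((0x[0-9a-fA-F]+)\) into (\w+) namespace \(with path ([^)]+)\)'
)


def decode_setns_log(log_text):
    """Return [(namespace, flag_name, path), ...] in the order runc attempted them."""
    result = []
    for m in LOG_RE.finditer(log_text):
        flag = int(m.group(1), 16)
        result.append((m.group(2), CLONE_FLAGS.get(flag, hex(flag)), m.group(3)))
    return result


SAMPLE_LOG = """\
DEBU[0000] nsexec-1[1889891]: setns(0x10000000) into user namespace (with path /proc/1813667/ns/user)
DEBU[0000] nsexec-1[1889891]: setns(0x8000000) into ipc namespace (with path /proc/1813667/ns/ipc)
DEBU[0000] nsexec-1[1889891]: setns(0x4000000) into uts namespace (with path /proc/1813667/ns/uts)
DEBU[0000] nsexec-1[1889891]: setns(0x40000000) into net namespace (with path /proc/1813667/ns/net)
"""

if __name__ == "__main__":
    for name, flag, path in decode_setns_log(SAMPLE_LOG):
        print(f"{name:5s} {flag:16s} {path}")
```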

What version of runc are you using?

runc --version
runc version 1.1.13
commit: v1.1.13-0-g58aa920
spec: 1.0.2-dev
go: go1.21.12
libseccomp: 2.5.1

and

./runc.amd64 --version
runc version 1.1.13
commit: v1.1.13-0-g58aa9203-dirty
spec: 1.0.2-dev
go: go1.21.11
libseccomp: 2.5.5

Host OS information

cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Also reproduced on Oracle Linux.

Host kernel information

Linux ip-10-1-1-198 5.15.0-1068-aws #74~20.04.1-Ubuntu SMP Tue Aug 6 19:32:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
