[20.10 backport] seccomp: add support for "clone3" syscall in default policy#42836
[20.10 backport] seccomp: add support for "clone3" syscall in default policy#42836thaJeztah merged 1 commit intomoby:20.10from
Conversation
|
Doh, |
|
hmmmm yes. ISTR there was a recent change in runc related to that (at least, I recall a PR about ignoring defaults) |
|
Yup; fix for that is in v1.0.2. But that seems to indicate we may not need this patch / backport? opencontainers/runc#3173 |
|
Now that I'm reading that error message with a fresh eye, I realize it's complaining that |
|
Scratch that; looking at the implementation there, I see it does take |
eee326b to
0a709f7
Compare
|
I think I found what was wrong. 🤞 |
0a709f7 to
df18e28
Compare
|
Got it -- the only failure on the last run was |
|
FWIW, @lucaskanashiro tested this and reported in https://bugs.launchpad.net/cloud-images/+bug/1943049/comments/25 that it does fix the problem 😄 |
This is a backport of 9f6b562, adapted to avoid the refactoring that happened in d927397. Original commit message is as follows: > If no seccomp policy is requested, then the built-in default policy in > dockerd applies. This has no rule for "clone3" defined, nor any default > errno defined. So when runc receives the config it attempts to determine > a default errno, using logic defined in its commit: > > opencontainers/runc@7a8d716 > > As explained in the above commit message, runc uses a heuristic to > decide which errno to return by default: > > [quote] > The solution applied here is to prepend a "stub" filter which returns > -ENOSYS if the requested syscall has a larger syscall number than any > syscall mentioned in the filter. The reason for this specific rule is > that syscall numbers are (roughly) allocated sequentially and thus newer > syscalls will (usually) have a larger syscall number -- thus causing our > filters to produce -ENOSYS if the filter was written before the syscall > existed. > [/quote] > > Unfortunately clone3 appears to one of the edge cases that does not > result in use of ENOSYS, instead ending up with the historical EPERM > errno. > > Latest glibc (2.33.9000, in Fedora 35 rawhide) will attempt to use > clone3 by default. If it sees ENOSYS then it will automatically > fallback to using clone. Any other errno is treated as a fatal > error. Thus when docker seccomp policy triggers EPERM from clone3, > no fallback occurs and programs are thus unable to spawn threads. > > The clone3 syscall is much more complicated than clone, most notably its > flags are not exposed as a directly argument any more. Instead they are > hidden inside a struct. This means that seccomp filters are unable to > apply policy based on values seen in flags. Thus we can't directly > replicate the current "clone" filtering for "clone3". We can at least > ensure "clone3" returns ENOSYS errno, to trigger fallback to "clone" > at which point we can filter on flags. Signed-off-by: Tianon Gravi <admwiggin@gmail.com> Co-authored-by: Daniel P. Berrangé <berrange@redhat.com>
df18e28 to
567c01f
Compare
|
Re-reviewing this, it felt silly to still have the |
|
Hi, Can you please consider merging this change and expediting a release containing the fix? This clone3 issue has caused a lot of problems over the past couple months. Thanks you for your work. I apologize if I'm being too pushy. -- Maxwell 😀 |
Following commit migrates CI to latest stable `f35`. * Make sure we are using latest docker: Since support for `clone3` was added in default seccomp profile was backported via moby/moby#42836. * Bump podman to `v3.4.0` for `f35` * New versions of `clang-check` adds a check for `-Wunused-but-set-variable` so mark variables with `__attribute__ ((unused))` where its needed. Signed-off-by: Aditya Rajan <arajan@redhat.com>
Following commit migrates CI to latest stable `f35`. * Make sure we are using latest docker: Since support for `clone3` was added in default seccomp profile was backported via moby/moby#42836. * Bump podman to `v3.4.0` for `f35` and skip newly added tests which are not supported in this env. * New versions of `clang-check` adds a check for `-Wunused-but-set-variable` so mark variables with `__attribute__ ((unused))` where its needed. Signed-off-by: Aditya Rajan <arajan@redhat.com>
Following commit migrates CI to latest stable `f35`. * Make sure we are using latest docker: Since support for `clone3` was added in default seccomp profile was backported via moby/moby#42836. * Bump podman to `v3.4.0` for `f35` and skip newly added tests which are not supported in this env. Signed-off-by: Aditya Rajan <arajan@redhat.com>
|
@zhsj brought https://buildd.debian.org/status/fetch.php?pkg=docker.io&arch=mips64el&ver=20.10.10%2Bdfsg1-1&stamp=1636015943&raw=0 to my attention in which I think this test will probably fail on MIPS arches on the version that made it into master, too (for the same reason). Edit: I've raised this further up the chain at opencontainers/runtime-spec#1087 (comment) now |
|
On the plus side, at least this only affects MIPS* and sparc64? 😬 $ git grep -inE 'ENOSYS[[:space:]]+=' | grep linux | column -t
unix/zerrors_linux_386.go:588: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_amd64.go:588: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_arm.go:594: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_arm64.go:585: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_mips.go:591: ENOSYS = syscall.Errno(0x59)
unix/zerrors_linux_mips64.go:591: ENOSYS = syscall.Errno(0x59)
unix/zerrors_linux_mips64le.go:591: ENOSYS = syscall.Errno(0x59)
unix/zerrors_linux_mipsle.go:591: ENOSYS = syscall.Errno(0x59)
unix/zerrors_linux_ppc.go:646: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_ppc64.go:650: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_ppc64le.go:650: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_riscv64.go:575: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_s390x.go:650: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_sparc64.go:639: ENOSYS = syscall.Errno(0x5a)Edit: I've raised this further up the chain at opencontainers/runtime-spec#1087 (comment) now |
|
Ahm.. yes, so we could adjust the test and use a different moby/profiles/seccomp/default_linux.go Lines 613 to 619 in 2b70006 |
|
But I guess the "portable" way would be to use a string for this, and have the conversion done by the runtime that sets the actual profile |
…MIPS The value of ENOSYS differs between MIPS and non-MIPS architectures. While this is not problematic for the embedded seccomp profile, it prevents the profile from being saved as a portable JSON file that can be used for both architectures. To work around this situation, we include conditional rules for both arches. and hard-code the value for ENOSYS in both. For more details, refer to moby#42836 (comment) and opencontainers/runtime-spec#1087 (comment) A test was added, which can be run with: go test --tags=seccomp -run TestLoadConditionalClone3 -v ./profiles/seccomp/ === RUN TestLoadConditionalClone3 === RUN TestLoadConditionalClone3/clone3_default_amd64 === RUN TestLoadConditionalClone3/clone3_default_mips === RUN TestLoadConditionalClone3/clone3_cap_sys_admin_amd64 === RUN TestLoadConditionalClone3/clone3_cap_sys_admin_mips --- PASS: TestLoadConditionalClone3 (0.01s) --- PASS: TestLoadConditionalClone3/clone3_default_amd64 (0.00s) --- PASS: TestLoadConditionalClone3/clone3_default_mips (0.00s) --- PASS: TestLoadConditionalClone3/clone3_cap_sys_admin_amd64 (0.00s) --- PASS: TestLoadConditionalClone3/clone3_cap_sys_admin_mips (0.00s) PASS ok github.com/docker/docker/profiles/seccomp 0.015s Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
…MIPS The value of ENOSYS differs between MIPS and non-MIPS architectures. While this is not problematic for the embedded seccomp profile, it prevents the profile from being saved as a portable JSON file that can be used for both architectures. To work around this situation, we include conditional rules for both arches. and hard-code the value for ENOSYS in both. For more details, refer to moby#42836 (comment) and opencontainers/runtime-spec#1087 (comment) A test was added, which can be run with: go test --tags=seccomp -run TestLoadConditionalClone3 -v ./profiles/seccomp/ === RUN TestLoadConditionalClone3 === RUN TestLoadConditionalClone3/clone3_default_amd64 === RUN TestLoadConditionalClone3/clone3_default_mips === RUN TestLoadConditionalClone3/clone3_cap_sys_admin_amd64 === RUN TestLoadConditionalClone3/clone3_cap_sys_admin_mips --- PASS: TestLoadConditionalClone3 (0.01s) --- PASS: TestLoadConditionalClone3/clone3_default_amd64 (0.00s) --- PASS: TestLoadConditionalClone3/clone3_default_mips (0.00s) --- PASS: TestLoadConditionalClone3/clone3_cap_sys_admin_amd64 (0.00s) --- PASS: TestLoadConditionalClone3/clone3_cap_sys_admin_mips (0.00s) PASS ok github.com/docker/docker/profiles/seccomp 0.015s Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Fedora containers fail with QEMU 6.1.0 and Docker 20.10.9. It can be worked around with option '--security-opt seccomp=unconfined', but that is not allowed on GitHub Actions. Anyway, it is fixed in Docker 20.10.10. * moby/moby#42963 * moby/moby#42836
glibc 2.34 uses the clone3 syscall, which is not part of the seccomp filters that moby ships on older versions. While as a workaround you might be able to run containers with `--privileged`, it's the better call to just run a more recent Docker runtime. References: - docker/buildx#772 - moby/buildkit#2379 - moby/moby#42836 - NixOS/nixpkgs#170900
This is a backport of 9f6b562 (#42681, see also #42680), adapted to avoid the refactoring that happened in d927397 (#42005).
fixes #42963
fixes #42876