seccomp: Explicitly block socketcall to prevent AF_ALG filter bypass #13330
vvoland wants to merge 4 commits into
Add a comment explaining the purpose of the socket rules and noting that on 32-bit x86, socket() goes through socketcall(2), which is allowed unconditionally, so these arg filters only apply to the direct socket syscall.

Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
AF_ALG (address family 38) exposes the Linux kernel crypto API to userspace via socket(2). Containers have no legitimate need for this interface under the default profile, and leaving it accessible widens the kernel attack surface unnecessarily (see https://copy.fail/).

The previous socket rule used a single "arg0 != AF_VSOCK" condition. Adding a second OpNotEqual for AF_ALG does not work because seccomp evaluates multiple argument conditions within a single rule as a logical AND against the same argument index.

Instead, restructure the socket allowlist into three range-based rules that cover every domain except AF_ALG (38) and AF_VSOCK (40):

1. Allow socket when arg0 < 38 (all families below AF_ALG)
2. Allow socket when arg0 == 39 (the single family between them)
3. Allow socket when arg0 > 40 (all families above AF_VSOCK)

Domains 38 and 40 match none of the three rules and fall through to the default SCMP_ACT_ERRNO action.

Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
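The range logic in this commit can be checked with a small standalone model. This is only a sketch, not the profile code: `rule`, `matches`, `socketRules`, and `allowed` are hypothetical names, and the model assumes that separate rules for the same syscall are OR'ed together, with any unmatched domain falling through to the default deny action.

```go
package main

import "fmt"

// op is a hypothetical stand-in for a seccomp comparison operator.
type op int

const (
	opLessThan op = iota
	opEqualTo
	opGreaterThan
)

// rule models one condition on arg0, the socket domain.
type rule struct {
	op    op
	value uint64
}

// matches reports whether arg0 satisfies the rule's condition.
func (r rule) matches(arg0 uint64) bool {
	switch r.op {
	case opLessThan:
		return arg0 < r.value
	case opEqualTo:
		return arg0 == r.value
	case opGreaterThan:
		return arg0 > r.value
	}
	return false
}

// socketRules mirrors the three range-based rules from the commit message.
var socketRules = []rule{
	{opLessThan, 38},    // all families below AF_ALG
	{opEqualTo, 39},     // the single family between AF_ALG and AF_VSOCK
	{opGreaterThan, 40}, // all families above AF_VSOCK
}

// allowed returns true if any rule matches; otherwise the default
// (deny) action applies in this model.
func allowed(domain uint64) bool {
	for _, r := range socketRules {
		if r.matches(domain) {
			return true
		}
	}
	return false
}

func main() {
	for _, d := range []uint64{2, 10, 37, 38, 39, 40, 41} {
		fmt.Printf("domain %d allowed: %v\n", d, allowed(d))
	}
}
```

In this model only domains 38 and 40 are rejected, which is the property the three rules are meant to provide.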
The socket rules depend on AF_ALG and AF_VSOCK being exactly 38 and 40 with a single family between them. Add compile-time array size checks that will fail the build if these constants ever change.

Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
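One way such compile-time checks can be written in Go is shown below. This is a sketch of the general idiom, not necessarily the form used in the PR: the constants `afALG` and `afVSOCK` are local, hypothetical stand-ins with values taken from the commit message rather than from a system package.

```go
package main

import "fmt"

// Assumed values, copied from the commit message for illustration.
const (
	afALG   = 38
	afVSOCK = 40
)

// An array length must be a non-negative constant. Each pair of
// declarations therefore compiles only when both differences are zero,
// that is, when the constant has exactly the expected value.
var (
	_ [afALG - 38]struct{}
	_ [38 - afALG]struct{}
	_ [afVSOCK - 40]struct{}
	_ [40 - afVSOCK]struct{}
)

func main() {
	fmt.Println("AF_ALG:", afALG, "AF_VSOCK:", afVSOCK)
}
```

Changing either constant makes one declaration in its pair have a negative array length, so the build fails instead of silently shipping a filter with a gap in the wrong place.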
The socket arg filters that block AF_ALG and AF_VSOCK only apply to the direct socket(2) syscall. On architectures with the legacy socketcall(2) multiplexer (i386, s390, MIPS o32), libseccomp auto-generates a socketcall(SYS_SOCKET) -> ALLOW companion for each socket ALLOW rule. This companion only checks the socketcall sub-command number, not the address family (behind a pointer BPF cannot dereference), bypassing the AF_ALG block for 32-bit binaries.

Move socketcall from the unconditional allow list to an explicit ERRNO(ENOSYS) deny rule placed before the socket allow rules. ENOSYS must be used instead of EPERM because the deny errno must differ from DefaultErrnoRet (EPERM): runc skips calling seccomp_rule_add() when a rule's action matches the default action, so an EPERM deny is never passed to libseccomp and the auto-generated socketcall ALLOW path survives unchallenged.

Since Linux 4.3, all affected architectures provide direct socket syscalls and modern glibc/musl already use them.

Port of moby/profiles#21

Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
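The bypass described in this commit can be illustrated with a toy model of what a filter is able to see. This is a simplified sketch, not real BPF or libseccomp code: `syscallData` and `filterAllows` are hypothetical names, the pointer value `0x1000` is arbitrary, and the `denySocketcall` flag stands in for the explicit deny rule added here.

```go
package main

import "fmt"

const (
	sysSocket     = 1  // SYS_SOCKET sub-command of socketcall(2)
	afALG         = 38 // AF_ALG
	sockSeqpacket = 5  // SOCK_SEQPACKET
)

// syscallData models the only inputs a filter has: the syscall and its
// raw register arguments. Pointers are opaque numbers to the filter.
type syscallData struct {
	name string
	args [6]uint64
}

// filterAllows models the socket range rules plus the companion
// socketcall(SYS_SOCKET) allow path described above.
func filterAllows(c syscallData, denySocketcall bool) bool {
	switch c.name {
	case "socket":
		d := c.args[0] // the domain is directly visible here
		return d < 38 || d == 39 || d > 40
	case "socketcall":
		if denySocketcall {
			return false
		}
		// Only the sub-command is visible; args[1] is a pointer to the
		// real arguments, which the filter cannot dereference.
		return c.args[0] == sysSocket
	}
	return false
}

func main() {
	// User memory holding socketcall's real arguments; the filter never reads it.
	userMem := map[uint64][3]uint64{0x1000: {afALG, sockSeqpacket, 0}}

	direct := syscallData{name: "socket", args: [6]uint64{afALG, sockSeqpacket, 0}}
	muxed := syscallData{name: "socketcall", args: [6]uint64{sysSocket, 0x1000}}

	fmt.Println("domain hidden behind the pointer:", userMem[muxed.args[1]][0])
	fmt.Println("direct socket(AF_ALG) allowed:", filterAllows(direct, false))
	fmt.Println("socketcall path allowed (before):", filterAllows(muxed, false))
	fmt.Println("socketcall path allowed (after):", filterAllows(muxed, true))
}
```

The model shows why range rules on `socket` alone are not enough: the multiplexed call carries the same AF_ALG request, but the domain never appears in an argument the filter can compare.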
Pull request overview
Hardens the default seccomp profile to prevent 32-bit socketcall(2) from bypassing the existing socket(2) address-family filters (notably the AF_ALG block), aligning containerd’s default profile behavior with the referenced moby/profiles changes.
Changes:
- Removes `socketcall` from the unconditional allow list and adds an explicit `ERRNO(ENOSYS)` deny rule for `socketcall`.
- Keeps the AF_ALG / AF_VSOCK exclusions effective by ensuring the `socket(2)` allow rules remain range-based (including changing the last rule to `> AF_VSOCK`).
- Adds compile-time assertions to lock in assumptions used by the socket-family filtering logic.
	// socketcall(2) is explicitly denied to prevent bypassing the socket
	// address family filters below on architectures where socketcall is
	// supported (i386, s390, MIPS o32).
	// Seccomp cannot inspect socketcall's pointer argument, so allowing it
	// would let an attacker open AF_ALG sockets via socketcall(SYS_SOCKET,
	// ...). Since Linux 4.3 all affected architectures provide direct
	// socket syscalls, so modern userspace is not impacted.
	//
	// ENOSYS (not EPERM) is used because the errno must differ from
	// DefaultErrnoRet; otherwise both runc and libseccomp treat the rule
	// as identical to the default action and silently omit it from the
	// generated BPF, which lets libseccomp's auto-generated
	// socketcall(SYS_SOCKET) -> ALLOW path survive unchallenged.
	{
		Names:    []string{"socketcall"},
		Action:   specs.ActErrno,
		ErrnoRet: &nosys,
	},
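The ENOSYS-versus-EPERM point in the comment above can be modelled in a few lines. This is a hedged sketch of the behaviour the comment describes, not runc's or libseccomp's actual code: `action`, `rule`, and `installed` are hypothetical names, and the only assumption is that a rule whose action equals the default action is dropped before the filter is generated.

```go
package main

import "fmt"

// action is a hypothetical model of a seccomp action plus its errno.
type action struct {
	kind  string // "ALLOW" or "ERRNO"
	errno string // only meaningful when kind is "ERRNO"
}

// rule pairs a syscall name with the action to take for it.
type rule struct {
	syscall string
	act     action
}

// installed returns the rules that survive in this model: any rule
// whose action is identical to the default action is omitted.
func installed(def action, rules []rule) []rule {
	var out []rule
	for _, r := range rules {
		if r.act == def {
			continue // indistinguishable from the default, so dropped
		}
		out = append(out, r)
	}
	return out
}

func main() {
	def := action{kind: "ERRNO", errno: "EPERM"} // models DefaultErrnoRet

	withEPERM := []rule{{"socketcall", action{"ERRNO", "EPERM"}}}
	withENOSYS := []rule{{"socketcall", action{"ERRNO", "ENOSYS"}}}

	fmt.Println("EPERM deny rules installed:", len(installed(def, withEPERM)))
	fmt.Println("ENOSYS deny rules installed:", len(installed(def, withENOSYS)))
}
```

Under this assumption an EPERM deny for `socketcall` never reaches the generated filter, so nothing overrides the auto-generated allow path, while an ENOSYS deny is kept.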

Noting that the only new commit in this PR is 27093dc; the other 3 are part of #13327 already. (This would be a good test of stacked PRs if we had access to them.)

FWIW; some issues reported related to the blocking; not sure if it's this patch or the other one; see

PR needs rebase.