seccomp: Explicitly block socketcall to prevent AF_ALG filter bypass #13330

Draft
vvoland wants to merge 4 commits into containerd:main from vvoland:block-socketcall

@vvoland vvoland commented May 1, 2026

Contributor

The socket arg filters that block AF_ALG and AF_VSOCK only apply to the direct socket(2) syscall. On architectures with the legacy socketcall(2) multiplexer (i386, s390, MIPS o32), libseccomp auto-generates a socketcall(SYS_SOCKET) -> ALLOW companion for each socket ALLOW rule. This companion only checks the socketcall sub-command number, not the address family (behind a pointer BPF cannot dereference), bypassing the AF_ALG block for 32-bit binaries.

Move socketcall from the unconditional allow list to an explicit ERRNO(ENOSYS) deny rule placed before the socket allow rules. ENOSYS must be used instead of EPERM because the deny errno must differ from DefaultErrnoRet (EPERM): runc skips calling seccomp_rule_add() when a rule's action matches the default action, so an EPERM deny is never passed to libseccomp and the auto-generated socketcall ALLOW path survives unchallenged.

Since Linux 4.3, all affected architectures provide direct socket syscalls and modern glibc/musl already use them.
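
The bypass described above can be illustrated with a toy model of what a seccomp filter sees. This is only a sketch: `seccompData`, `allowed`, and the pointer value are hypothetical names and values invented for illustration, and syscall names stand in for syscall numbers. It is not the BPF that libseccomp generates.

```go
package main

import "fmt"

// seccompData is a toy stand-in for the data a seccomp BPF program can see:
// the syscall (named here for readability) and six raw argument registers.
// Pointers are visible only as integers and cannot be followed.
type seccompData struct {
	nr   string
	args [6]uint64
}

const (
	afALG     = 38 // AF_ALG on Linux
	afVSOCK   = 40 // AF_VSOCK on Linux
	sysSocket = 1  // SYS_SOCKET sub-command of socketcall(2)
)

// allowed mimics a profile that filters socket(2) by address family but
// lets socketcall(SYS_SOCKET, ...) through, as the PR description says the
// auto-generated companion rule does.
func allowed(d seccompData) bool {
	switch d.nr {
	case "socket":
		family := d.args[0]
		return family != afALG && family != afVSOCK
	case "socketcall":
		// args[1] is a userspace pointer to {domain, type, protocol}.
		// The filter cannot read through it, so only the sub-command
		// number in args[0] is checked.
		return d.args[0] == sysSocket
	}
	return false
}

func main() {
	direct := seccompData{nr: "socket", args: [6]uint64{afALG, 5, 0}}
	// 0x7ffd1000 is a made-up address pointing at {AF_ALG, ...}.
	muxed := seccompData{nr: "socketcall", args: [6]uint64{sysSocket, 0x7ffd1000}}

	fmt.Println("socket(AF_ALG) allowed:", allowed(direct))            // false
	fmt.Println("socketcall(SYS_SOCKET, ptr) allowed:", allowed(muxed)) // true
}
```

In this model the address family never reaches the filter on the `socketcall` path, which is why the PR denies the multiplexer outright rather than trying to filter it.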

vvoland added 4 commits April 30, 2026 22:44
Add a comment explaining the purpose of the socket rules and noting that
on 32-bit x86, socket() goes through socketcall(2) which is allowed
unconditionally, so these arg filters only apply to the direct socket
syscall.

Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
AF_ALG (address family 38) exposes the Linux kernel crypto API to
userspace via socket(2). Containers have no legitimate need for this
interface under the default profile, and leaving it accessible widens
the kernel attack surface unnecessarily (see https://copy.fail/).

The previous socket rule used a single "arg0 != AF_VSOCK" condition.
Adding a second OpNotEqual for AF_ALG does not work because seccomp
evaluates multiple argument conditions within a single rule as a
logical AND against the same argument index.

Instead, restructure the socket allowlist into three range-based rules
that cover every domain except AF_ALG (38) and AF_VSOCK (40):

1. Allow socket when arg0 < 38   (all families below AF_ALG)
2. Allow socket when arg0 == 39  (the single family between them)
3. Allow socket when arg0 > 40   (all families above AF_VSOCK)

Domains 38 and 40 match none of the three rules and fall through to
the default SCMP_ACT_ERRNO action.

Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
The socket rules depend on AF_ALG and AF_VSOCK being exactly 38 and 40
with a single family between them. Add compile-time array size checks
that will fail the build if these constants ever change.

Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
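
One common way to write such compile-time checks in Go is the array-length idiom sketched below. This is an assumption about the general technique, not a copy of the PR's code: the local constants `afALG` and `afVSOCK` stand in for whatever constants the profile actually references, and `familiesBetween` is a hypothetical helper added only so the example has something to print.

```go
package main

import "fmt"

// Local stand-ins for the address family constants (an assumption for this
// sketch; the real profile would reference its own constants).
const (
	afALG   = 38
	afVSOCK = 40
)

// An array length must be a non-negative constant. Each pair below compiles
// only if both differences are >= 0, that is, only if the operands are equal.
// Changing afALG or afVSOCK makes one line of a pair fail to compile.
var (
	_ [afALG - 38]struct{}
	_ [38 - afALG]struct{}
	_ [afVSOCK - 40]struct{}
	_ [40 - afVSOCK]struct{}
)

// familiesBetween returns how many address families sit strictly between
// AF_ALG and AF_VSOCK; the range rules assume exactly one.
func familiesBetween() int {
	return afVSOCK - afALG - 1
}

func main() {
	fmt.Println("families between AF_ALG and AF_VSOCK:", familiesBetween())
}
```

The checks cost nothing at runtime because zero-length arrays of `struct{}` occupy no space.
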
The socket arg filters that block AF_ALG and AF_VSOCK only apply to the
direct socket(2) syscall. On architectures with the legacy socketcall(2)
multiplexer (i386, s390, MIPS o32), libseccomp auto-generates a
socketcall(SYS_SOCKET) -> ALLOW companion for each socket ALLOW rule.
This companion only checks the socketcall sub-command number, not the
address family (behind a pointer BPF cannot dereference), bypassing the
AF_ALG block for 32-bit binaries.

Move socketcall from the unconditional allow list to an explicit
ERRNO(ENOSYS) deny rule placed before the socket allow rules. ENOSYS
must be used instead of EPERM because the deny errno must differ from
DefaultErrnoRet (EPERM): runc skips calling seccomp_rule_add() when a
rule's action matches the default action, so an EPERM deny is never
passed to libseccomp and the auto-generated socketcall ALLOW path
survives unchallenged.

Since Linux 4.3, all affected architectures provide direct socket
syscalls and modern glibc/musl already use them.

Port of moby/profiles#21

Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>

Copilot AI left a comment

Contributor

Pull request overview

Hardens the default seccomp profile to prevent 32-bit socketcall(2) from bypassing the existing socket(2) address-family filters (notably the AF_ALG block), aligning containerd’s default profile behavior with the referenced moby/profiles changes.

Changes:

  • Removes socketcall from the unconditional allow list and adds an explicit ERRNO(ENOSYS) deny rule for socketcall.
  • Keeps the AF_ALG / AF_VSOCK exclusions effective by ensuring the socket(2) allow rules remain range-based (including changing the last rule to > AF_VSOCK).
  • Adds compile-time assertions to lock in assumptions used by the socket-family filtering logic.

Comment on lines +437 to +454
// socketcall(2) is explicitly denied to prevent bypassing the socket
// address family filters below on architectures where socketcall is
// supported (i386, s390, MIPS o32).
// Seccomp cannot inspect socketcall's pointer argument, so allowing it
// would let an attacker open AF_ALG sockets via socketcall(SYS_SOCKET,
// ...). Since Linux 4.3 all affected architectures provide direct
// socket syscalls, so modern userspace is not impacted.
//
// ENOSYS (not EPERM) is used because the errno must differ from
// DefaultErrnoRet; otherwise both runc and libseccomp treat the rule
// as identical to the default action and silently omit it from the
// generated BPF, which lets libseccomp's auto-generated
// socketcall(SYS_SOCKET) -> ALLOW path survive unchallenged.
{
	Names:    []string{"socketcall"},
	Action:   specs.ActErrno,
	ErrnoRet: &nosys,
},
@samuelkarp

Member

Noting that the only new commit in this PR is 27093dc; the other 3 are part of #13327 already. (This would be a good test of stacked PRs if we had access to them.)

@thaJeztah

Member

FWIW, some issues have been reported related to the blocking; not sure whether they come from this patch or the other one; see

@k8s-ci-robot


PR needs rebase.

