Implement seccomp notify support#2682
Conversation
98e32e0 to
5a620a8
Compare
5a620a8 to
c8e2d6f
Compare
|
CI failure is a known (but rare) flake, see #2425 (restarted the job) |
c8e2d6f to
98b5487
Compare
98b5487 to
1751ec3
Compare
1751ec3 to
c25bb2b
Compare
|
opencontainers/runtime-spec#1074 merged now |
|
Hi! I'm a coworker of @mauriciovasquezbernal, I'll coordinate with him to fix the conflicts in this PR and move it to a regular PR (not draft) when ready! :) |
c25bb2b to
e7d0927
Compare
|
Hi! I’ve rebased the PR and fixed all the conflicts, all tests should be green too. The OCI runtime-spec change this patch implements was recently merged and I see that the spec change is already merged in runc too. Should we mark this PR for 1.0 milestone? Currently it is in the post 1.0 milestone. I’m keeping the PR as a draft for now, as the libseccomp-golang PR is not yet merged and we need it for this PR (not only for tests, we use some consts in core runc too). However, changes on the libseccomp PR will have minimal (most likely none) consequences to this code. So, we keep it as draft until that PR is merged, but please review whenever you can :) |
e7d0927 to
30ae27b
Compare
30ae27b to
ee8e1cb
Compare
|
Updated, fixed a conflict in go.sum too |
ee8e1cb to
15343c8
Compare
|
Rebased on top of latest master |
libcontainer/process_linux.go
Outdated
| // receive seccomp-fd | ||
| pidfd, _, errno := unix.Syscall(unix.SYS_PIDFD_OPEN, uintptr(childPid), 0, 0) | ||
| if errno != 0 { | ||
| return newSystemErrorWithCause(errno, "performing SYS_PIDFD_OPEN syscall") | ||
| } | ||
| defer unix.Close(int(pidfd)) | ||
|
|
||
| seccompFd, _, errno := unix.Syscall(unix.SYS_PIDFD_GETFD, pidfd, uintptr(sync.Fd), 0) | ||
| if errno != 0 { | ||
| return newSystemErrorWithCause(errno, "performing SYS_PIDFD_GETFD syscall") | ||
| } | ||
| defer unix.Close(int(seccompFd)) | ||
|
|
||
| if p.config.Config.Seccomp.ListenerPath == "" { | ||
| return newSystemError(errors.New("listenerPath is not set")) | ||
| } |
There was a problem hiding this comment.
Perhaps it makes sense to consolidate this code since it's repeated twice. Something like
seccompFd, err := recvSeccompFd(childPid, sync.Fd)maybe?
There was a problem hiding this comment.
Yeap, good call. I'm moving only these bits to the new function and not the if p.config.Config.Seccomp.ListenerPath == "" part as we need p for that and it would be called with p being different structs. Also, IMHO, semantically seems better to keep both things separate
libcontainer/process_linux.go
Outdated
| func sendContainerProcessState(listenerPath string, state *specs.ContainerProcessState, fd int) error { | ||
| conn, err := net.Dial("unix", listenerPath) | ||
| if err != nil { | ||
| return fmt.Errorf("cannot connect to %q: %v\n", listenerPath, err) |
There was a problem hiding this comment.
\nshould be removed.- Please to use %w for error.
There was a problem hiding this comment.
- I think
net.Dialalready includes path/address in the error message, as well as the error (dial), so there's no need to wrap it.
There was a problem hiding this comment.
I think net.Dial already includes path/address in the error message, as well as the error (dial), so there's no need to wrap it.
Correct:
$ cat main.go
package main
import (
"fmt"
"net"
)
func main() {
_, err := net.Dial("unix", "/fdsjkfhdjk")
if err != nil {
fmt.Printf("%s\n", err)
}
}
$ go run main.go
dial unix /fdsjkfhdjk: connect: no such file or directory
|
@AkihiroSuda this needs your still-LGTM after a rebase. @cyphar PTAL |
|
@AkihiroSuda if it helps, the rebase were to solve very simple conflicts (the makefile has a new target now, needed to adjust for that, some bats helpers conflicts but trivial, and things like that) |
cyphar
left a comment
There was a problem hiding this comment.
Looked through it and it all seems very reasonable -- kolyshkin has probably reviewed this several times more thoroughly than I have. 😉
I only have one question about the seemingly arbitrary write restriction.
| if call.Name == "write" { | ||
| return -1, errors.New("SCMP_ACT_NOTIFY cannot be used for the write syscall") | ||
| } |
There was a problem hiding this comment.
Is there a reason for this restriction? I can imagine how it might be unsafe (or unwise) to do this on paper, but is it really necessary to disallow this within runc?
There was a problem hiding this comment.
Yes, basically it will hang for ever if we allow it. We need the write syscall to write the seccomp fd plain number over the pipe to the parent (we write to the pipe here), then the parent gets the fd with pifd_getfd and sends it over the unix socket to the seccomp agent.
If the write syscall has the notify action, we can't never send a msg to the parent, which will mean the seccomp agent won't get the seccomp fd and the write will be blocked forever.
However read/close (functions that we also use to sync with parent via the pipe) are not restricted as it is possible to send the seccomp fd to the parent, the seccomp agent will get the seccomp fd and the first few notifications it will get if read/close are handled are for the ones done in syncParentSeccomp() (and quite trivially allow to proceed syscalls on fds that are not known by the agent, for example).
Please note the parent is able to proceed, while we are blocked in read/close, because the parent is another process and not subject to the seccomp filter.
I will add a comment to explain this in the code, if that sounds good to you @cyphar
There was a problem hiding this comment.
Ah of course. Yeah that's a bit sad :/ -- but we can always figure out some workaround for this in the future. (I wonder how the LXC folks worked around this problem -- though I think they don't have a similar write-based synchronisation model since they're written in C which means you don't need to do so much busy work.)
There was a problem hiding this comment.
@cyphar I've asked Christian Brauner in the past. It seems they use clone3 sharing the fds table and use a fixed number for the seccomp notify fd. Then, they get it with pidfd_getfd().
Not sure how LXC uses a fixed fd number for seccomp notify, though.
There was a problem hiding this comment.
You could just pick an arbitrarily high fd and use dup3 to clone it to the right fd number. It might be interesting to look into whether we could do it with pidfd_getfd though since we'd need to have a fallback for quite a while, it'd probably complicate the code a bit too much...
There was a problem hiding this comment.
Oh, right, that was easy. Yeah, for runc I think is simpler to use the pipe, that is the main sync mechanism used tree-wide. We are using pidfd_getfd here, though: we sync with the parent to let them know when the fd is ready and send the fd plain number, then the parent gets it with pidfd_getfd
There was a problem hiding this comment.
Ah, I forgot that we are requiring Linux 5.7 or newer for the feature -- pidfd_getfd isn't available for the earlier kernels with seccomp notify support.
There was a problem hiding this comment.
And 5.7 is also enforced by libseccomp-golang, that only supports notification with tsync (requires libseccomp api level 6). This is as the vast majority of go programs will need tsync too. Here is the conversation we had with upstream about this when writing the patches for libseccomp-golang
Before 5.7, notify and tsync where mutually exclusive (torvalds/linux@7a0df7f). From Linux 5.7 introduced a new API to enable seccomp notification with tsync together in this commit torvalds/linux@5189149.
The notify support has been merged in libseccomp-golang in this PR: seccomp/libseccomp-golang#59 Also, we update to new API of libseccomp-golang so code doesn't break. Signed-off-by: Alban Crequy <alban@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
SendFds is a helper function for sending a set of file descriptors and a message over a unix domain socket. Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
This commit implements support for the SCMP_ACT_NOTIFY action. It requires libseccomp-2.5.0 to work but runc still works with older libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY action. A new synchronization step between runc[INIT] and runc run is introduced to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get from the runc[INIT] process and sends it to the seccomp agent using SCM_RIGHTS. As suggested by @kolyshkin, we also make writeSync() a wrapper of writeSyncWithFd() and wrap the error there. To avoid pointless errors, we made some existing code paths just return the error instead of re-wrapping it. If we don't do it, error will look like: writing syncT <act>: writing syncT: <err> By adjusting the code path, now they just look like this writing syncT <act>: <err> Signed-off-by: Alban Crequy <alban@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
Extend the SetupSeccomp tests by adding the following cases: - Test nil config - Test empty config - Test bad action and architecture - Test all possible actions Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Implement sample seccomp agent. It's also used in integration tests in the following commit. Instructions how to use it in contrib/cmd/seccompagent/README.md Signed-off-by: Alban Crequy <alban@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
Test KILL and ERRNO actions. Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Add functional test to check seccomp notify end-to-end. This test uses the sample seccomp agent from the contrib/cmd folder. Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
00772ca
bcadefb to
00772ca
Compare
|
Rebased to fix conflicts (a simple conflict with 64358c4 that moved some code we are touching) and added a comment to explain the write restriction @cyphar commented on (see #2682 (comment)). @kolyshkin @AkihiroSuda @dqminh can you please re-LGTM this? :) @cyphar thanks for the comment! I guess this is ready to merge? Or is there anything else I can do? |
kolyshkin
left a comment
There was a problem hiding this comment.
(I wish github had a way to see the difference between old and new commits, e.g. similar to what gerrit does)
Trusting the changes are mostly cosmetical, still LGTM
|
@cyphar @kolyshkin @AkihiroSuda qminh @vbatts thanks for your time and reviews! :) |
|
@kolyshkin I've updated the release notes for this. I realized I forgot to mention the seccomp agent example in the runc source code! Updated to mention that now. |
The runtime spec we are using has support for these 3 fields[1], but moby doesn't have them in its seccomp struct. This patch just adds and copies them when they are in the profile. DefaultErrnoRet is implemented in the runc version moby is using (it is implemented since runc-rc95[2]) but if we create a container without this moby patch, we don't see an error nor the expected behavior. This is not clear for the user (the profile they specify is valid, the syntax is ok, but the wrong behavior is seen). This is because the DefaultErrnoRet field is not copied to the config passed ultimately to runc (i.e. is like the field was not specified). With this patch, we see the expected behavior. The other two fileds are in the runtime-spec but not yet in runc (a PR is open and targets 1.1.0 milestone). However, I took the liberty to copy them now too for two reasons: 1. If we don't add them now and end up using a runc version that supports them, then the error that the user will see is not clear at all: docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: listenerPath is not set: unknown. And it is not obvious to debug for the user, as the field _is_ set in the profile they specify (just not copied by moby to the profile moby specifies ultimately to runc). 2. When using a runc without seccomp notify support (like today), the error we see is the same with and without this moby patch (when using a seccomp profile with the new fields): docker: Error response from daemon: OCI runtime create failed: string SCMP_ACT_NOTIFY is not a valid action for seccomp: unknown. Then, it seems like a clear win to add them now: we don't have to do it later (that implies not clear errors to the user if we forget, like we did with DefaultErrnoRet) and the user sees the exact same error when using a runc version that doesn't support these fields. [1]: Note we are vendoring version 1c3f411f041711bbeecf35ff7e93461ea6789220 and this version has these 3 fields https://github.com/opencontainers/runtime-spec/blob/1c3f411f041711bbeecf35ff7e93461ea6789220/config-linux.md#seccomp [2]: opencontainers/runc#2954 [3]: opencontainers/runc#2682 Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Add support for notify seccomp actions. Seccomp profiles can now specify the SCMP_ACT_NOTIFY action that notifies a user space application (seccomp agent) when such syscall is performed. The seccomp agent can, in turn, decide to perform the syscall on behalf of the container and set the syscall return values or even inject file descriptors to the container. This, for example, allows for unprivileged containers to run some privileged syscalls that the seccomp agent performs on its behalf.
The format and more details can be found in the OCI runtime specification, in the seccomp section. If you are a developer and want more information, check out the example agent in runc, the linux kernel documentation, libseccomp-golang and Christian Brauner's blog post. To use this feature you need Linux >= 5.7 and libseccomp >= 2.5.0.