
runc-shim-v2: emit init exit after exec exits#9775

Closed
laurazard wants to merge 1 commit into containerd:main from laurazard:fix-exec-exit-order

Conversation

@laurazard
Member

@laurazard laurazard commented Feb 6, 2024

The Problem

TLDR: #9719

Long form: Communication is hard. In 5cd6210, we reworked the v2 runc shim event handling to reduce lock contention, since before, in order to preserve exit event order, we were just putting big locks around s.Create and s.processExits, which broke down under load (see #8557).

In reality, the difficult part of preserving exit event order is "early exits": we never want a TaskExit event to be emitted for a task before its TaskStarted or TaskExecStarted, but if a process exits very quickly after start, it's quite likely that we will receive an exit event from runc and emit it before we've had time to emit the start event for the same task. To deal with that without incurring the performance hit we saw before, we started keeping track of "running" processes (processes for which we've emitted a start event and that have not yet exited), which let us change our exit processing logic: if we receive an exit event for a PID which we're not tracking in the running processes map, it's likely an early exit, and we can handle it properly by having s.Start calls subscribe to exits, so that when they are about to emit a TaskStart event, they can check whether we've received an exit event in the meantime and handle it appropriately.

This worked great, but as we found in #9719, the extra processing we introduced for exec exit handling means that, in certain circumstances, we end up emitting the Init's exit first and only emitting the exec's exit afterwards, even though we always receive the exit event for execs before we receive the Init process's (this is relevant, and the causes are discussed in #9719 (comment)). This happens because we don't hold off on processing the Init's exit until the exec's Start event has been emitted, and it causes issues (and does not represent reality).

The Fix

We introduce more :'( state into the shim in the form of s.pendingExecs, which holds a map of (init) PID -> *sync.WaitGroup. Whenever we add an exec, in the call to s.preStart, we first add a WaitGroup for the Init's PID if one does not already exist, and then call Add(1) on that WaitGroup. When handleStarted() is called, we call Done() on that WaitGroup.

Whenever we are about to process an exit for a running process (i.e. we can find the process in s.running) and the process is an InitProcess, we check whether there are pending execs for this container in s.pendingExecs. If there is a WaitGroup for our PID, we're still processing an exec Start and need to wait for that to finish, and for the exec's TaskExit to be emitted, before processing this particular exit; so we launch a goroutine that waits on the WaitGroup. Once the pending execs have been processed, the WaitGroup unblocks and we process the Init process's exit. After emitting the exit for the Init process, we delete the entry for that Init PID from the map.


Note

As an aside, my initial intuition was that, since we know that we will always get the exits for the execs before the init, we'd be able to restrict this change to s.processExits(), by doing something like:

  • Receive exit for exec, exec isn't currently running, so register somewhere that we should delay processing this container's InitProcess until we process this exit

But instead, I had to make changes to s.preStart() and register the WaitGroup there. This is because the issue we're addressing here is caused precisely by early exits, which are characterised by the fact that they're not in s.running: we can't map the exit's PID to a container/process, so we can't find its container's init process (or its PID) and can't make the connections we need. As such, we need to register the fact that we're starting an exec for a given container in s.preStart().
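To make that lookup failure concrete, here is a toy stand-in for the `s.running`-style mapping (names invented for illustration):

```go
package main

import "fmt"

// canAttribute reports whether an exit for pid can be mapped back to a
// container via a running-process map (a stand-in for s.running).
func canAttribute(running map[int]string, pid int) bool {
	_, ok := running[pid]
	return ok
}

func main() {
	// The init (PID 1234) has started and is tracked; an exec (PID 5678)
	// exited before its start event was handled, so it was never added.
	running := map[int]string{1234: "container-1"}
	fmt.Println(canAttribute(running, 1234)) // prints: true
	fmt.Println(canAttribute(running, 5678)) // prints: false
	// The early-exited exec can't be tied back to a container at exit
	// time, so the exec has to be registered earlier, in s.preStart,
	// before it is launched.
}
```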

@laurazard laurazard self-assigned this Feb 6, 2024
@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

For a given container, as long as the init process is the init process
of that PID namespace, we always receive the exits for execs before we
receive them for the init process.

It's important that we uphold this invariant for the outside world by
always emitting a TaskExit event for a container's exec before we emit
one for the init process, because this is the expected behavior from
callers, and changing this creates issues - such as Docker, which will
delete the container after receiving a TaskExit for the init process,
and then not be able to handle the exec's exit after having deleted
the container (see: containerd#9719).

Since 5cd6210, if an exec is starting at the same time that an init
exits, and the exec is an "early exit" (i.e. we haven't emitted a
TaskStart for it or put it in `s.running` by the time we receive its
exit), we notify concurrent calls to `s.Start()` of the exit and
continue processing exits. This causes us to process and emit the
Init's exit before the exec's, which we don't want to do.

This commit introduces `s.pendingExecs` where we can register that we're
going to start an exec before we do so, and when processing an exit for
an Init process, we can block emitting that exit until we've processed
the exec.

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
@laurazard laurazard marked this pull request as ready for review February 13, 2024 18:16
@laurazard
Member Author

cc @corhere

@laurazard laurazard added cherry-pick/1.6.x cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Feb 13, 2024
// lifecycleMu.
exitSubscribers map[*map[int][]runcC.Exit]struct{}

pendingExecsLock sync.Mutex
Member


Why can't lifecycleMu be used to guard pendingExecs? If the new mutex is required, what's the lock hierarchy w.r.t. mu and lifecycleMu? How can we be sure deadlocks can't happen?

Comment on lines +163 to +170
if !init {
s.pendingExecsLock.Lock()
if _, ok := s.pendingExecs[c.Pid()]; !ok {
s.pendingExecs[c.Pid()] = &sync.WaitGroup{}
}
s.pendingExecs[c.Pid()].Add(1)
s.pendingExecsLock.Unlock()
}
Member


I despise boolean flag parameters which alter the behaviour of a function. It's usually a smell that it's really two distinct functions in a trenchcoat. And the flag is not necessary. The handleStarted closure does not care about the flag so all that matters is that the pendingExecs[c.Pid()] is incremented while lifecycleMu is held, before the exec has been launched. That's easy enough: instead of locking lifecycleMu inside preStart, require that callers hold lifecycleMu when calling preStart. Then Start can be written like so:

	s.lifecycleMu.Lock()
	var cinit *runc.Container
	if r.ExecID == "" {
		cinit = container
	} else {
		s.pendingExecsLock.Lock()
		wg, ok := s.pendingExecs[container.Pid()]
		if !ok {
			wg = &sync.WaitGroup{}
			s.pendingExecs[container.Pid()] = wg
		}
		wg.Add(1)
		s.pendingExecsLock.Unlock()
	}
	handleStarted, cleanup := s.preStart(cinit)
	s.lifecycleMu.Unlock()

And to prove my claim that preStart with the flag parameter is really two functions in a trenchcoat:

func (s *service) preStartExec(cinit *runc.Container) (handleStarted func(*runc.Container, process.Process, bool), cleanup func()) {
	s.lifecycleMu.Lock()
	defer s.lifecycleMu.Unlock()

	s.pendingExecsLock.Lock()
	wg, ok := s.pendingExecs[cinit.Pid()]
	if !ok {
		wg = &sync.WaitGroup{}
		s.pendingExecs[cinit.Pid()] = wg
	}
	wg.Add(1)
	s.pendingExecsLock.Unlock()
	return s.preStart(nil)
}

Comment on lines +686 to +687
go func() {
wg.Wait()
Member


I'm not entirely sold on the use of goroutines and WaitGroups for this. The handleStarted closure which wakes this goroutine with wg.Done() could instead check that it is the last pending exec for an exited init process and synchronously dispatch the init process' TaskExit event. (Is the use of goroutines and WaitGroups the reason why pendingExecs can't be guarded with lifecycleMu?)

The handleStarted closure knows the init process' PID and is subscribed to exit events already; it could trivially check if the init process had exited while the exec was starting. The only new information it needs to know is whether or not it is handling an early exit for the last remaining exec in the container. Similarly, processExit could use the same information to know to suppress the handleProcessExit call for an init process with outstanding execs. That could be implemented e.g. by adding a counter field to the struct currently named containerProcess. I think it's feasible to fix exit ordering without any new goroutines, mutexes, or service struct fields, by leveraging the existing exitSubscribers and running fields.

Member Author

@laurazard laurazard Feb 13, 2024


The handleStarted closure knows the init process' PID and is subscribed to exit events already; it could trivially check if the init process had exited while the exec was starting

But the problem isn't letting the handleStarted closure check if the init process has exited while the exec is starting, the problem is knowing whether to delay emitting the init process exit or not/know whether there might be pending execs.

Similarly, processExit could use the same information to know to suppress the handleProcessExit call for an init process with outstanding execs. That could be implemented e.g. by adding a counter field to the struct currently named containerProcess.

That might be viable, I guess we'd increment containerProcessPendingExecs before starting and decrement after processing the exec, and we could use that in processExits to know whether we can process that exit or not (is this a WaitGroup with extra steps?).

However, I don't understand how we can do that without any new goroutines, .... We have to launch a goroutine for that init process exit so we can block until it can be processed while continuing to process other exits in the meantime, no? Otherwise we have to make severe modifications to this code. Another question: if we remove the WaitGroup, will we hand-craft some implementation using channels/a sync.Cond/whatever to signal our waiting init process exits that they can proceed, or do something else entirely?

Ohhhh nevermind, I see! I had a vague idea while looking at all of this that there should be a better way but I didn't figure it out, but this sounds viable! I'll try it out.

Comment on lines +670 to 676
func (s *service) handleProcessExitL(e runcC.Exit, c *runc.Container, p process.Process) {
s.mu.Lock()
defer s.mu.Unlock()
s.handleProcessExit(e, c, p)
}

// s.mu must be locked when calling handleProcessExit
Member


No, it does not. I'm probably not going to win the argument in this PR, but that won't stop me from reminding everyone in every PR which touches this file.

Member Author


Oh! I'd missed and/or forgotten that discussion. Yes, this was something that looked off to me while working on this, but I got distracted with other things and ended up forgetting about it. I'd rather not block this change on that (apparently contentious) topic, but I'd be happy to open another PR so we can revisit this lock.

s.handleProcessExitL(e, c, p)

s.pendingExecsLock.Lock()
delete(s.pendingExecs, p.Pid())
Member

@dmcgowan dmcgowan Feb 15, 2024


Why is the delete done after the s.handleProcessExitL? If the pid is still in s.pendingExecs, what's stopping this condition from getting hit multiple times?

laurazard added a commit to laurazard/containerd that referenced this pull request Feb 15, 2024
alternative to containerd#9775

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
@laurazard
Member Author

superseded by #9828

@laurazard laurazard closed this Feb 15, 2024
@austinvazquez austinvazquez removed cherry-pick/1.6.x cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Nov 20, 2024