runc-shim-v2: emit init exit after exec exits #9775
laurazard wants to merge 1 commit into containerd:main
Conversation
For a given container, as long as the init process is the init process of that PID namespace, we always receive the exits for execs before we receive them for the init process. It's important that we uphold this invariant for the outside world by always emitting a TaskExit event for a container's exec before we emit one for the init process, because this is the behavior callers expect, and changing it creates issues. For example, Docker deletes the container after receiving a TaskExit for the init process, and is then unable to handle the exec's exit after having deleted the container (see: containerd#9719).

Since 5cd6210, if an exec is starting at the same time that an init exits, and the exec is an "early exit" (i.e. we haven't emitted a TaskStart for it/put it in `s.running` by the time we receive its exit), we notify concurrent calls to `s.Start()` of the exit and continue processing exits. This causes us to process the init's exit before the exec's, and emit it, which we don't want to do. This commit introduces `s.pendingExecs`, where we can register that we're going to start an exec before we do so; when processing an exit for an init process, we can then block emitting that exit until we've processed the exec's.

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
cc @corhere
```go
	// lifecycleMu.
	exitSubscribers map[*map[int][]runcC.Exit]struct{}

	pendingExecsLock sync.Mutex
```
Why can't lifecycleMu be used to guard pendingExecs? If the new mutex is required, what's the lock hierarchy w.r.t. mu and lifecycleMu? How can we be sure deadlocks can't happen?
```go
	if !init {
		s.pendingExecsLock.Lock()
		if _, ok := s.pendingExecs[c.Pid()]; !ok {
			s.pendingExecs[c.Pid()] = &sync.WaitGroup{}
		}
		s.pendingExecs[c.Pid()].Add(1)
		s.pendingExecsLock.Unlock()
	}
```
I despise boolean flag parameters which alter the behaviour of a function. It's usually a smell that it's really two distinct functions in a trenchcoat. And the flag is not necessary. The handleStarted closure does not care about the flag so all that matters is that the pendingExecs[c.Pid()] is incremented while lifecycleMu is held, before the exec has been launched. That's easy enough: instead of locking lifecycleMu inside preStart, require that callers hold lifecycleMu when calling preStart. Then Start can be written like so:
```go
s.lifecycleMu.Lock()
var cinit *runc.Container
if r.ExecID == "" {
	cinit = container
} else {
	s.pendingExecsLock.Lock()
	wg, ok := s.pendingExecs[container.Pid()]
	if !ok {
		wg = &sync.WaitGroup{}
		s.pendingExecs[container.Pid()] = wg
	}
	wg.Add(1)
	s.pendingExecsLock.Unlock()
}
handleStarted, cleanup := s.preStart(cinit)
s.lifecycleMu.Unlock()
```

And to prove my claim that `preStart` with the flag parameter is really two functions in a trenchcoat:
```go
func (s *service) preStartExec(cinit *runc.Container) (handleStarted func(*runc.Container, process.Process, bool), cleanup func()) {
	s.lifecycleMu.Lock()
	defer s.lifecycleMu.Unlock()
	s.pendingExecsLock.Lock()
	wg, ok := s.pendingExecs[cinit.Pid()]
	if !ok {
		wg = &sync.WaitGroup{}
		s.pendingExecs[cinit.Pid()] = wg
	}
	wg.Add(1)
	s.pendingExecsLock.Unlock()
	return s.preStart(nil)
}
```

```go
	go func() {
		wg.Wait()
```
I'm not entirely sold on the use of goroutines and WaitGroups for this. The handleStarted closure which wakes this goroutine with wg.Done() could instead check that it is the last pending exec for an exited init process and synchronously dispatch the init process' TaskExit event. (Is the use of goroutines and WaitGroups the reason why pendingExecs can't be guarded with lifecycleMu?)
The handleStarted closure knows the init process' PID and is subscribed to exit events already; it could trivially check if the init process had exited while the exec was starting. The only new information it needs to know is whether or not it is handling an early exit for the last remaining exec in the container. Similarly, processExit could use the same information to know to suppress the handleProcessExit call for an init process with outstanding execs. That could be implemented e.g. by adding a counter field to the struct currently named containerProcess. I think it's feasible to fix exit ordering without any new goroutines, mutexes, or service struct fields, by leveraging the existing exitSubscribers and running fields.
> The `handleStarted` closure knows the init process' PID and is subscribed to exit events already; it could trivially check if the init process had exited while the exec was starting
But the problem isn't letting the `handleStarted` closure check whether the init process has exited while the exec is starting; the problem is knowing whether to delay emitting the init process exit, i.e. knowing whether there might be pending execs.
> Similarly, `processExit` could use the same information to know to suppress the handleProcessExit call for an init process with outstanding execs. That could be implemented e.g. by adding a counter field to the struct currently named containerProcess.
That might be viable, I guess we'd increment containerProcessPendingExecs before starting and decrement after processing the exec, and we could use that in processExits to know whether we can process that exit or not (is this a WaitGroup with extra steps?).
However, I don't understand how we can do that without any new goroutines. We have to launch a goroutine for that init process exit so we can block until it can be processed while continuing to process other exits in the meantime, no? Otherwise we have to make severe modifications to this code. Another question: if we remove the WaitGroup, will we hand-craft some implementation using channels/a sync.Cond/whatever to let our waiting init process exits know that they can proceed, or do something else entirely?
Ohhhh nevermind, I see! I had a vague idea while looking at all of this that there should be a better way but I didn't figure it out, but this sounds viable! I'll try it out.
```go
func (s *service) handleProcessExitL(e runcC.Exit, c *runc.Container, p process.Process) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.handleProcessExit(e, c, p)
}

// s.mu must be locked when calling handleProcessExit
```
No, it does not. I'm probably not going to win the argument in this PR, but that won't stop me from reminding everyone in every PR which touches this file.
Oh! I've missed and/or forgotten that discussion. Yes, this was something that looked off to me again while working on this, but I got distracted with other things and ended up forgetting about it. I'd rather not block this change on that (apparently contentious) topic, but I'd be happy to open another PR so we can revisit this lock.
```go
	s.handleProcessExitL(e, c, p)

	s.pendingExecsLock.Lock()
	delete(s.pendingExecs, p.Pid())
```
Why is the delete done after the `s.handleProcessExitL`? If the pid is still in `s.pendingExecs`, what's stopping this condition from getting hit multiple times?
alternative to containerd#9775 Signed-off-by: Laura Brehm <laurabrehm@hey.com>
superseded by #9828
The Problem
TLDR: #9719
Long form: Communication is hard. In 5cd6210, we reworked the v2 runc shim event handling to reduce lock contention: before, in order to preserve exit event order, we were just putting big locks around `s.Create` and `s.processExits`, which broke down under load (see #8557).

In reality, the difficult part of preserving exit event order is "early exits": we never want a `TaskExit` event to be emitted for a task before its `TaskStarted` or `TaskExecStarted`, but if a process exits very quickly after start, it's incredibly likely that we will receive an exit event from runc and emit it before we've had time to emit the start event for the same task. To deal with that without incurring such a performance hit as we did before, we started keeping track of "running" processes (processes for which we've emitted a start event and that have not yet exited), which let us change our exit processing logic: if we receive an exit event for a PID which we're not tracking in the running processes map, it's likely due to an early exit and we can handle it properly, in this case by having `s.Start` calls subscribe to exits, so that when they are about to emit a `TaskStart` event they can check whether we've received an exit event in the meantime and handle it appropriately.

This worked great, but as we found in #9719, the extra processing we introduced for exec exit handling means that in certain circumstances, even though we always receive an exit event for execs before we receive the init process's (this is relevant, and the causes for this are discussed in #9719 (comment)), we end up emitting the init's exit first (since we're not holding off on processing that before we emit a start event) and only emitting the exec exit afterwards, which causes issues (and does not represent reality).
The Fix
We introduce more :'( state into the shim in the form of `s.pendingExecs`, which holds a map of (init) PID -> `*sync.WaitGroup`. Whenever we add an exec, in the call to `s.preStart`, we first add a `WaitGroup` for the init's PID if one does not already exist, and then call `Add(1)` on that `WaitGroup`. When `handleStarted()` is called, we call `Done` on that `WaitGroup`.

Whenever we are about to process an exit for a running process (we can find our process in `s.running`), if the process is an `InitProcess`, we check whether there are pending execs for this container in `s.pendingExecs`: if there is a `WaitGroup` for our PID, we're still processing an exec start and need to wait for that to finish, and for that exec's `TaskExit` to be emitted, before processing this particular exit, so we launch a goroutine and wait on the `WaitGroup`. When the pending execs are processed, the `WaitGroup` is unblocked and we process the init process exit. After emitting the exit for the init process, we delete the entry for that init PID from the map.

Note
As an aside, my initial intuition was that, since we know we will always get the exits for the execs before the init's, we'd be able to restrict this change to `s.processExits()`, by doing something like:

But instead, I had to make changes to `s.preStart()` and register the WaitGroup there. This is because the issue we're addressing here is caused exactly by early exits, which are characterised by the fact that they're not in `s.running`, so we can't connect the exit's PID to a container/process, and therefore can't find its container's init process/its PID and can't make the connections we need. As such, we need to register the fact that we're starting an exec for a given container in `s.preStart()`.