
runc-shim-v2: emit init exit after exec exits#9775

Closed
laurazard wants to merge 1 commit into containerd:main from laurazard:fix-exec-exit-order

Conversation

@laurazard
Member

@laurazard laurazard commented Feb 6, 2024

The Problem

TLDR: #9719

Long form: Communication is hard. In 5cd6210, we reworked the v2 runc shim event handling to reduce lock contention, since before, in order to preserve exit event order, we were just putting big locks around s.Create and s.processExits, which broke down under load (see #8557).

In reality, the difficult part of preserving exit event order is "early exits": we never want a TaskExit event to be emitted for a task before its TaskStarted or TaskExecStarted, but if a process exits very quickly after start, it's quite likely that we will receive an exit event from runc and emit it before we've had time to emit the start event for the same task. To deal with that without incurring the performance hit we saw before, we started keeping track of "running" processes (processes for which we've emitted a start event and that have not yet exited), which let us change our exit processing logic: if we receive an exit event for a PID which we're not tracking in the running processes map, it's likely an early exit, and we can handle it properly by having s.Start calls subscribe to exits, so that when they are about to emit a TaskStart event, they can check whether we've received an exit event in the meantime and handle it appropriately.

This worked great, but as we found in #9719, the extra processing we introduced for exec exit handling means that, in certain circumstances, we end up emitting the Init's exit first and only emitting the exec's exit afterwards, even though we always receive the exit event for execs before we receive the Init process's (this is relevant, and the causes are discussed in #9719 (comment)). This happens because we don't hold off on processing the Init's exit until the exec's Start event has been emitted, and it causes issues (and does not represent reality).

The Fix

We introduce more :'( state into the shim in the form of s.pendingExecs, which holds a map of (init) PID -> *sync.WaitGroup. Whenever we add an exec, in the call to s.preStart, we first add a WaitGroup for the Init's PID if one does not already exist, and then call Add(1) on that WaitGroup. When handleStarted() is called, we call Done() on that WaitGroup.

Whenever we are about to process an exit for a running process (i.e. we can find the process in s.running) and the process is an InitProcess, we check whether there are pending execs for this container in s.pendingExecs. If there is a WaitGroup for our PID, we're still processing an exec Start and need to wait for that to finish, and for the exec's TaskExit to be emitted, before processing this particular exit; so we launch a goroutine that waits on the WaitGroup. Once the pending execs have been processed, the WaitGroup unblocks and we process the Init process's exit. After emitting the exit for the Init process, we delete the entry for that Init PID from the map.


Note

As an aside, my initial intuition was that, since we know that we will always get the exits for the execs before the init, we'd be able to restrict this change to s.processExits(), by doing something like:

  • Receive exit for exec, exec isn't currently running, so register somewhere that we should delay processing this container's InitProcess until we process this exit

But instead, I had to make changes to s.preStart() and register the WaitGroup there. This is because the issue we're addressing here is caused precisely by early exits, which are characterised by the fact that they're not in s.running: we can't map the exit's PID to a container/process, so we can't find its container's init process (or its PID) and can't make the connections we need. As such, we need to register the fact that we're starting an exec for a given container in s.preStart().
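To make that lookup failure concrete, here is a toy stand-in for the `s.running`-style mapping (names invented for illustration):

```go
package main

import "fmt"

// canAttribute reports whether an exit for pid can be mapped back to a
// container via a running-process map (a stand-in for s.running).
func canAttribute(running map[int]string, pid int) bool {
	_, ok := running[pid]
	return ok
}

func main() {
	// The init (PID 1234) has started and is tracked; an exec (PID 5678)
	// exited before its start event was handled, so it was never added.
	running := map[int]string{1234: "container-1"}
	fmt.Println(canAttribute(running, 1234)) // prints: true
	fmt.Println(canAttribute(running, 5678)) // prints: false
	// The early-exited exec can't be tied back to a container at exit
	// time, so the exec has to be registered earlier, in s.preStart,
	// before it is launched.
}
```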

@laurazard laurazard self-assigned this Feb 6, 2024
@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

For a given container, as long as the init process is the init process
of that PID namespace, we always receive the exits for execs before we
receive them for the init process.

It's important that we uphold this invariant for the outside world by
always emitting a TaskExit event for a container's exec before we emit
one for the init process, because this is the expected behavior from
callers, and changing this creates issues - such as Docker, which will
delete the container after receiving a TaskExit for the init process,
and then not be able to handle the exec's exit after having deleted
the container (see: containerd#9719).

Since 5cd6210, if an exec is starting at the same time that an init
exits, and the exec is an "early exit" (i.e. we haven't emitted a
TaskStart for it or put it in `s.running` by the time we receive its
exit), we notify concurrent calls to `s.Start()` of the exit and
continue processing exits. This causes us to process and emit the
Init's exit before the exec's, which we don't want to do.

This commit introduces `s.pendingExecs` where we can register that we're
going to start an exec before we do so, and when processing an exit for
an Init process, we can block emitting that exit until we've processed
the exec.

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
@laurazard laurazard marked this pull request as ready for review February 13, 2024 18:16
@laurazard
Member Author

cc @corhere

@laurazard laurazard added cherry-pick/1.6.x cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Feb 13, 2024
// lifecycleMu.
exitSubscribers map[*map[int][]runcC.Exit]struct{}

pendingExecsLock sync.Mutex
Member


Why can't lifecycleMu be used to guard pendingExecs? If the new mutex is required, what's the lock hierarchy w.r.t. mu and lifecycleMu? How can we be sure deadlocks can't happen?

Comment on lines +163 to +170
if !init {
s.pendingExecsLock.Lock()
if _, ok := s.pendingExecs[c.Pid()]; !ok {
s.pendingExecs[c.Pid()] = &sync.WaitGroup{}
}
s.pendingExecs[c.Pid()].Add(1)
s.pendingExecsLock.Unlock()
}
Member


I despise boolean flag parameters which alter the behaviour of a function. It's usually a smell that it's really two distinct functions in a trenchcoat. And the flag is not necessary. The handleStarted closure does not care about the flag so all that matters is that the pendingExecs[c.Pid()] is incremented while lifecycleMu is held, before the exec has been launched. That's easy enough: instead of locking lifecycleMu inside preStart, require that callers hold lifecycleMu when calling preStart. Then Start can be written like so:

	s.lifecycleMu.Lock()
	var cinit *runc.Container
	if r.ExecID == "" {
		cinit = container
	} else {
		s.pendingExecsLock.Lock()
		wg, ok := s.pendingExecs[container.Pid()]
		if !ok {
			wg = &sync.WaitGroup{}
			s.pendingExecs[container.Pid()] = wg
		}
		wg.Add(1)
		s.pendingExecsLock.Unlock()
	}
	handleStarted, cleanup := s.preStart(cinit)
	s.lifecycleMu.Unlock()

And to prove my claim that preStart with the flag parameter is really two functions in a trenchcoat:

func (s *service) preStartExec(cinit *runc.Container) (handleStarted func(*runc.Container, process.Process, bool), cleanup func()) {
	s.lifecycleMu.Lock()
	defer s.lifecycleMu.Unlock()

	s.pendingExecsLock.Lock()
	wg, ok := s.pendingExecs[cinit.Pid()]
	if !ok {
		wg = &sync.WaitGroup{}
		s.pendingExecs[cinit.Pid()] = wg
	}
	wg.Add(1)
	s.pendingExecsLock.Unlock()
	return s.preStart(nil)
}

Comment on lines +686 to +687
go func() {
wg.Wait()
Member


I'm not entirely sold on the use of goroutines and WaitGroups for this. The handleStarted closure which wakes this goroutine with wg.Done() could instead check that it is the last pending exec for an exited init process and synchronously dispatch the init process' TaskExit event. (Is the use of goroutines and WaitGroups the reason why pendingExecs can't be guarded with lifecycleMu?)

The handleStarted closure knows the init process' PID and is subscribed to exit events already; it could trivially check if the init process had exited while the exec was starting. The only new information it needs to know is whether or not it is handling an early exit for the last remaining exec in the container. Similarly, processExit could use the same information to know to suppress the handleProcessExit call for an init process with outstanding execs. That could be implemented e.g. by adding a counter field to the struct currently named containerProcess. I think it's feasible to fix exit ordering without any new goroutines, mutexes, or service struct fields, by leveraging the existing exitSubscribers and running fields.

Member Author

@laurazard laurazard Feb 13, 2024


The handleStarted closure knows the init process' PID and is subscribed to exit events already; it could trivially check if the init process had exited while the exec was starting

But the problem isn't letting the handleStarted closure check if the init process has exited while the exec is starting, the problem is knowing whether to delay emitting the init process exit or not/know whether there might be pending execs.

Similarly, processExit could use the same information to know to suppress the handleProcessExit call for an init process with outstanding execs. That could be implemented e.g. by adding a counter field to the struct currently named containerProcess.

That might be viable, I guess we'd increment containerProcessPendingExecs before starting and decrement after processing the exec, and we could use that in processExits to know whether we can process that exit or not (is this a WaitGroup with extra steps?).

However, I don't understand how we can do that without any new goroutines, .... We have to launch a goroutine for that init process exit so we can block until it can be processed while continuing to process other exits in the meantime, no? Otherwise we have to make severe modifications to this code. Another question: if we remove the WaitGroup, will we hand-craft some implementation using channels/a sync.Cond/whatever to signal our waiting init process exits that they can proceed, or do something else entirely?

Ohhhh nevermind, I see! I had a vague idea while looking at all of this that there should be a better way but I didn't figure it out, but this sounds viable! I'll try it out.

Comment on lines +670 to 676
func (s *service) handleProcessExitL(e runcC.Exit, c *runc.Container, p process.Process) {
s.mu.Lock()
defer s.mu.Unlock()
s.handleProcessExit(e, c, p)
}

// s.mu must be locked when calling handleProcessExit
Member


No, it does not. I'm probably not going to win the argument in this PR, but that won't stop me from reminding everyone in every PR which touches this file.

Member Author


Oh! I'd missed and/or forgotten that discussion. Yes, this was something that looked off to me while working on this, but I got distracted with other things and ended up forgetting about it. I'd rather not block this change on that (apparently contentious) topic, but I'd be happy to open another PR so we can revisit this lock.

s.handleProcessExitL(e, c, p)

s.pendingExecsLock.Lock()
delete(s.pendingExecs, p.Pid())
Member

@dmcgowan dmcgowan Feb 15, 2024


Why is the delete done after the s.handleProcessExitL? If the pid is still in s.pendingExecs, what's stopping this condition from getting hit multiple times?

laurazard added a commit to laurazard/containerd that referenced this pull request Feb 15, 2024
alternative to containerd#9775

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
@laurazard
Member Author

superseded by #9828

@laurazard laurazard closed this Feb 15, 2024
@austinvazquez austinvazquez removed cherry-pick/1.6.x cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Nov 20, 2024