runc-shim: process exec exits before init (w/o extra locks!)#9828
Merged
dmcgowan merged 1 commit intocontainerd:mainfrom Mar 4, 2024
Merged
runc-shim: process exec exits before init (w/o extra locks!)#9828dmcgowan merged 1 commit intocontainerd:mainfrom
dmcgowan merged 1 commit intocontainerd:mainfrom
Conversation
|
Skipping CI for Draft Pull Request. |
Member
Author
|
/test all |
5983800 to
6321806
Compare
Member
Author
|
(I still have a couple of refactors pointed out in #9775 left to apply) |
corhere
reviewed
Feb 15, 2024
6321806 to
f781fa6
Compare
dmcgowan
reviewed
Feb 16, 2024
dmcgowan
reviewed
Feb 16, 2024
dmcgowan
reviewed
Feb 16, 2024
f781fa6 to
4e9b8f7
Compare
4e9b8f7 to
013f612
Compare
corhere
reviewed
Feb 20, 2024
013f612 to
b21a96f
Compare
Member
Author
|
/retest |
3557417 to
a831ee7
Compare
Member
Author
|
@thaJeztah @dmcgowan can you TAL? |
Member
Author
|
/retest |
cpuguy83
requested changes
Mar 1, 2024
For a given container, as long as the init process is the init process of that PID namespace, we always receive the exits for execs before we receive them for the init process. It's important that we uphold this invariant for the outside world by always emitting a TastExit event for a container's exec before we emit one for the init process because this is the expected behavior from callers, and changing this creates issues - such as Docker, which will delete the container after receiving a TaskExit for the init process, and then not be able to handle the exec's exit after having deleted the container (see: containerd#9719). Since 5cd6210, if an exec is starting at the same time that an init exits, if the exec is an "early exit" i.e. we haven't emitted a TaskStart for it/put it in `s.running` by the time we receive it's exit, we notify concurrent calls to `s.Start()` of the exit and continue processing exits, which will cause us to process the Init's exit before the exec, and emit it, which we don't want to do. This commit introduces a map `s.pendingExecs` to keep track of the number of pending execs keyed by container, which allows us to skip processing exits for inits if there are pending execs, and instead have the closure returned by `s.preStart` handle the init exit after emitting the exec's exit. Signed-off-by: Laura Brehm <laurabrehm@hey.com>
a831ee7 to
892dc54
Compare
dmcgowan
approved these changes
Mar 1, 2024
cpuguy83
approved these changes
Mar 1, 2024
Member
Author
|
/retest |
This was referenced Mar 5, 2024
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
fixes #9719
alternative to #9775 without extra locks/goroutines/etc (ty for the idea @corhere!).
The Problem
TLDR: #9719
Long form: Communication is hard. In 5cd6210, we reworked the v2 runc shim event handling to reduce lock contention, since before, in order to preserve exit event order, we were just putting big locks around
s.Createands.processExits, which broke down under load (see #8557).In reality, the difficult part of preserving exit event order is "early exits" – we never want a
TaskExitevent to be emitted for a task before it'sTaskStartedorTaskExecStarted, but if a process exits very quickly after start, it's incredibly likely that we will receive an exit event fromruncand emit it before we've had the time to emit the start event for the same task. To deal with that without incurring such a performance hit as we did before, we started keeping track of "running" processes (processes for which we've emitted a start event and that have not yet exited), which let us change our exit processing logic – if we receive an exit event for a PID which we're not tracking in the running processes map, it's likely due to an early exit and we can handle it properly – in this case, havings.Startcalls subscribe to exits, so that when they are about to emit aTaskStartevent, they can check if we've received an exit event in the meantime and handle it appropriately.This worked great, but as we found in #9719, the extra processing that we introduced for exec exits handling made it so that in certain circumstances, even though we always receive an exit event for Execs before we receive the Init process's (this is relevant, and the causes for this are discussed in #9719 (comment)), we end up emitting the Init's exit first (since we're not holding off on processing that before we emit a Start event) and only emitting the exec exit afterwards, which causes issues (and does not represent reality).
The Fix
Credits to @corhere who came up with this approach to ensuring exit ordering – more info in #9775 (comment).
The fix boils down to:
PendingExecCountertocontainerProcesswhich we use to keep track of how many execs we're currently processing (i.e. for which we have calleds.preStart()but have not yet called the returnedhandleStarted()closure).s.processExits(), if a process has pending execs, we skip processing the exit.handleStarted()closure, if we're processing an exec and this exec is the last pending exec for an init process, check whether the init process has exited in the meantime, and if so process it's exit (after processing the relevant exec's exit).