Skip to content

runc-shim: process exec exits before init (w/o extra locks!)#9828

Merged
dmcgowan merged 1 commit intocontainerd:mainfrom
laurazard:fix-exec-exit-better
Mar 4, 2024
Merged

runc-shim: process exec exits before init (w/o extra locks!)#9828
dmcgowan merged 1 commit intocontainerd:mainfrom
laurazard:fix-exec-exit-better

Conversation

@laurazard
Copy link
Copy Markdown
Member

@laurazard laurazard commented Feb 15, 2024

fixes #9719

alternative to #9775 without extra locks/goroutines/etc (ty for the idea @corhere!).


The Problem

TLDR: #9719

Long form: Communication is hard. In 5cd6210, we reworked the v2 runc shim event handling to reduce lock contention, since before, in order to preserve exit event order, we were just putting big locks around s.Create and s.processExits, which broke down under load (see #8557).

In reality, the difficult part of preserving exit event order is "early exits" – we never want a TaskExit event to be emitted for a task before it's TaskStarted or TaskExecStarted, but if a process exits very quickly after start, it's incredibly likely that we will receive an exit event from runc and emit it before we've had the time to emit the start event for the same task. To deal with that without incurring such a performance hit as we did before, we started keeping track of "running" processes (processes for which we've emitted a start event and that have not yet exited), which let us change our exit processing logic – if we receive an exit event for a PID which we're not tracking in the running processes map, it's likely due to an early exit and we can handle it properly – in this case, having s.Start calls subscribe to exits, so that when they are about to emit a TaskStart event, they can check if we've received an exit event in the meantime and handle it appropriately.

This worked great, but as we found in #9719, the extra processing that we introduced for exec exits handling made it so that in certain circumstances, even though we always receive an exit event for Execs before we receive the Init process's (this is relevant, and the causes for this are discussed in #9719 (comment)), we end up emitting the Init's exit first (since we're not holding off on processing that before we emit a Start event) and only emitting the exec exit afterwards, which causes issues (and does not represent reality).

The Fix

Credits to @corhere who came up with this approach to ensuring exit ordering – more info in #9775 (comment).

The fix boils down to:

  • Add a field PendingExecCounter to containerProcess which we use to keep track of how many execs we're currently processing (i.e. for which we have called s.preStart() but have not yet called the returned handleStarted() closure).
  • In s.processExits(), if a process has pending execs, we skip processing the exit.
  • In the returned handleStarted() closure, if we're processing an exec and this exec is the last pending exec for an init process, check whether the init process has exited in the meantime, and if so process it's exit (after processing the relevant exec's exit).

@k8s-ci-robot
Copy link
Copy Markdown

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@laurazard
Copy link
Copy Markdown
Member Author

/test all

@laurazard laurazard force-pushed the fix-exec-exit-better branch 4 times, most recently from 5983800 to 6321806 Compare February 15, 2024 14:17
@laurazard laurazard changed the title ⚠️ DNM - runc-shim: process exec exits before init (alternative w/o locks!) runc-shim: process exec exits before init (w/o extra locks!) Feb 15, 2024
@laurazard laurazard marked this pull request as ready for review February 15, 2024 14:19
@laurazard laurazard self-assigned this Feb 15, 2024
@laurazard
Copy link
Copy Markdown
Member Author

(I still have a couple of refactors pointed out in #9775 left to apply)

@laurazard laurazard added cherry-pick/1.6.x cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Feb 15, 2024
@laurazard laurazard force-pushed the fix-exec-exit-better branch from 6321806 to f781fa6 Compare February 15, 2024 17:31
@laurazard laurazard force-pushed the fix-exec-exit-better branch from 013f612 to b21a96f Compare February 21, 2024 03:31
@laurazard
Copy link
Copy Markdown
Member Author

/retest

@laurazard laurazard requested a review from corhere February 21, 2024 13:53
@laurazard laurazard requested a review from dmcgowan February 21, 2024 13:53
Copy link
Copy Markdown
Member

@corhere corhere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🎉

@laurazard laurazard force-pushed the fix-exec-exit-better branch 2 times, most recently from 3557417 to a831ee7 Compare March 1, 2024 14:35
@laurazard
Copy link
Copy Markdown
Member Author

@thaJeztah @dmcgowan can you TAL?

@laurazard
Copy link
Copy Markdown
Member Author

/retest

@containerd containerd deleted a comment from k8s-ci-robot Mar 1, 2024
For a given container, as long as the init process is the init process
of that PID namespace, we always receive the exits for execs before we
receive them for the init process.

It's important that we uphold this invariant for the outside world by
always emitting a TastExit event for a container's exec before we emit
one for the init process because this is the expected behavior from
callers, and changing this creates issues - such as Docker, which will
delete the container after receiving a TaskExit for the init process,
and then not be able to handle the exec's exit after having deleted
the container (see: containerd#9719).

Since 5cd6210, if an exec is starting
at the same time that an init exits, if the exec is an "early exit"
i.e. we haven't emitted a TaskStart for it/put it in `s.running` by the
time we receive it's exit, we notify concurrent calls to `s.Start()` of
the exit and continue processing exits, which will cause us to process
the Init's exit before the exec, and emit it, which we don't want to do.

This commit introduces a map `s.pendingExecs` to keep track of the
number of pending execs keyed by container, which allows us to skip
processing exits for inits if there are pending execs, and instead
have the closure returned by `s.preStart` handle the init exit after
emitting the exec's exit.

Signed-off-by: Laura Brehm <laurabrehm@hey.com>
@laurazard laurazard force-pushed the fix-exec-exit-better branch from a831ee7 to 892dc54 Compare March 1, 2024 16:43
@laurazard
Copy link
Copy Markdown
Member Author

/retest

@dmcgowan dmcgowan added this pull request to the merge queue Mar 4, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 4, 2024
@dmcgowan dmcgowan added this pull request to the merge queue Mar 4, 2024
Merged via the queue into containerd:main with commit d3d4c5d Mar 4, 2024
@laurazard laurazard added cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch and removed cherry-pick/1.6.x cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TaskExit event can be sent for an exec process after TaskExit is sent for the init process

5 participants