This repository was archived by the owner on May 12, 2021. It is now read-only.

virtcontainers: Return the appropriate container status #950

Merged
jodh-intel merged 1 commit into kata-containers:master from sboeuf:sboeuf/fix_docker_18_09
Nov 29, 2018

Conversation


sboeuf commented Nov 28, 2018

When our runtime is asked for the container status, we also handle
the scenario where the container is stopped if the shim process for
that container on the host has terminated.

In the current implementation, we retrieve the container status
before stopping the container, causing a wrong status to be returned.
The wait for the original go-routine's completion was done in a defer
within the caller of statusContainer(), resulting in
statusContainer() returning the pre-stop status.

This bug was first observed when updating to docker v18.09/containerd
v1.2.0. With the current implementation, containerd-shim receives the
TaskExit event when it detects kata-shim is terminating. When checking
the container state, however, it does not get the expected "stopped" value.

The following commit resolves the described issue by simplifying the
locking used around the status container calls. Originally
StatusContainer would request a read lock. If we needed to update the
container status in statusContainer, we'd start a go-routine which
would request a read-write lock, waiting for the original read lock to
be released. Can't imagine a bug could linger in this logic. We now
just request a read-write lock in the caller (StatusContainer),
skipping the need for a separate go-routine and defer. This greatly
simplifies the logic, and removes the original bug.

Fixes #926

Signed-off-by: Sebastien Boeuf sebastien.boeuf@intel.com
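The locking change described above can be sketched as follows. This is a minimal, hypothetical reconstruction for illustration: the names (sandbox, State, StatusContainer, shimAlive) mirror the discussion but are assumptions, not the actual virtcontainers code.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a simplified stand-in for the container status record.
type State struct {
	Status string
}

// sandbox is a simplified stand-in for the structure guarding
// container state with a read-write mutex.
type sandbox struct {
	mu    sync.RWMutex
	state State
}

// Before the fix, the status call took only a read lock and spawned a
// go-routine that had to wait (via a defer in the caller) for a
// read-write lock before it could record the "stopped" state, so the
// status was read and returned before the update ran.
//
// After the fix, the caller takes the read-write lock up front,
// updates the state in place if the shim has terminated, and only
// then reads the status, so the returned value is always current.
func (s *sandbox) StatusContainer(shimAlive bool) State {
	s.mu.Lock()
	defer s.mu.Unlock()

	if !shimAlive && s.state.Status == "running" {
		s.state.Status = "stopped" // stop recorded before the status is read
	}
	return s.state
}

func main() {
	s := &sandbox{state: State{Status: "running"}}
	// The shim for this container has terminated on the host:
	// the returned status must already reflect the stop.
	fmt.Println(s.StatusContainer(false).Status) // stopped
}
```

Taking the write lock directly in the caller also avoids the read-to-write lock handoff, which `sync.RWMutex` cannot do atomically: a goroutine holding `RLock` that wants `Lock` must release first, leaving a window for stale reads.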


sboeuf commented Nov 28, 2018

@jodh-intel @devimc @amshinde @chavafg PTAL, this PR fixes the issue seen with Docker 18.09. Nothing else needs to change: the fact that containerd removed the call to kill --all is not a big deal, but we were returning the wrong container status, which containerd did not accept, and as a result delete was never called.


sboeuf commented Nov 28, 2018

/test


devimc left a comment


nice finding!

lgtm


egernst commented Nov 29, 2018

Suggested commit message edit:

When our runtime is asked for the container status, we also handle the
scenario where the container is stopped if the shim process for that
container on the host has terminated.

In the current implementation, we retrieve the container status before stopping
the container, causing a wrong status to be returned. The wait for the original
go-routine's completion was done in a defer within the caller of statusContainer(),
resulting in statusContainer() returning the pre-stop status.

This bug was first observed when updating to docker v18.09/containerd v1.2.0. With
the current implementation, containerd-shim receives the TaskExit event when it detects
kata-shim is terminating. When checking the container state, however, it does not get
the expected "stopped" value.

The following commit resolves the described issue by simplifying the locking used
around the status container calls. Originally StatusContainer would request a read
lock.  If we needed to update the container status in statusContainer, we'd start a
go-routine which would request a read-write lock, waiting for the original read lock
to be released.  Can't imagine a bug could linger in this logic.  We now just request a
read-write lock in the caller (StatusContainer), skipping the need for a separate go-routine
and defer. This greatly simplifies the logic, and removes the original bug.

Member

egernst left a comment


Looks good, but please update the commit message for a bit more detail.

egernst approved these changes Nov 29, 2018
When our runtime is asked for the container status, we also handle
the scenario where the container is stopped if the shim process for
that container on the host has terminated.

In the current implementation, we retrieve the container status
before stopping the container, causing a wrong status to be returned.
The wait for the original go-routine's completion was done in a defer
within the caller of statusContainer(), resulting in
statusContainer() returning the pre-stop status.

This bug was first observed when updating to docker v18.09/containerd
v1.2.0. With the current implementation, containerd-shim receives the
TaskExit event when it detects kata-shim is terminating. When checking
the container state, however, it does not get the expected "stopped" value.

The following commit resolves the described issue by simplifying the
locking used around the status container calls. Originally
StatusContainer would request a read lock. If we needed to update the
container status in statusContainer, we'd start a go-routine which
would request a read-write lock, waiting for the original read lock to
be released.  Can't imagine a bug could linger in this logic. We now
just request a read-write lock in the caller (StatusContainer),
skipping the need for a separate go-routine and defer. This greatly
simplifies the logic, and removes the original bug.

Fixes kata-containers#926

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
sboeuf force-pushed the sboeuf/fix_docker_18_09 branch from b88ed15 to fa9b15d on November 29, 2018 04:15

sboeuf commented Nov 29, 2018

/test


sboeuf commented Nov 29, 2018

@egernst commit message updated! Thanks!

@jodh-intel

@sboeuf - nice!!

Can you find a way to add a unit test for this change though? Given the importance of this fix, it would be good to know we have a test to assert the correct behaviour.

@jodh-intel

@sboeuf - my bad - s/unit test/any type of test/ ;)

@jodh-intel

For the record, I can confirm this PR fixes the issue with Docker 18.09 (tested on Ubuntu Bionic). The CI will be testing using docker 18.06.1 (from https://github.com/kata-containers/runtime/blob/master/versions.yaml#L154), so we can say that it works with both versions.


sboeuf commented Nov 29, 2018

@jodh-intel yes, that's a good idea. I think we can write a simple integration test using kata-runtime state after we run a simple workload that naturally exits after a few seconds. The result should be "stopped" directly, where before it would have returned "running".
Reference: kata-containers/tests#956

@jodh-intel

Thanks @sboeuf - let's land this to allow you to write that test... ;)

@jodh-intel jodh-intel merged commit 5857523 into kata-containers:master Nov 29, 2018

sboeuf commented Nov 29, 2018

Thanks @jodh-intel! I'll write this today!


Development

Successfully merging this pull request may close these issues.

Moving to Docker 18.09 does not delete the VM and proxy when the container terminates

4 participants