This repository was archived by the owner on May 12, 2021. It is now read-only.

virtcontainers: Return the appropriate container status #950

Merged
jodh-intel merged 1 commit into kata-containers:master from sboeuf:sboeuf/fix_docker_18_09
Nov 29, 2018

Conversation


sboeuf commented Nov 28, 2018

When our runtime is asked for the container status, we also handle
the scenario where the container is stopped if the shim process for
that container on the host has terminated.

In the current implementation, we retrieve the container status
before stopping the container, causing a wrong status to be returned.
The wait for the original go-routine's completion was done in a defer
within the caller of statusContainer(), resulting in
statusContainer() returning the pre-stop status.

This bug was first observed when updating to docker v18.09/containerd
v1.2.0. With the current implementation, containerd-shim receives the
TaskExit event when it detects kata-shim is terminating. When checking
the container state, however, it does not get the expected "stopped" value.

The following commit resolves the described issue by simplifying the
locking used around the status container calls. Originally
StatusContainer would request a read lock. If we needed to update the
container status in statusContainer, we'd start a go-routine which
would request a read-write lock, waiting for the original read lock to
be released. Can't imagine a bug could linger in this logic. We now
just request a read-write lock in the caller (StatusContainer),
skipping the need for a separate go-routine and defer. This greatly
simplifies the logic, and removes the original bug.

Fixes #926

Signed-off-by: Sebastien Boeuf sebastien.boeuf@intel.com
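The locking change described above can be sketched as follows. This is a minimal, hypothetical reconstruction for illustration: the names (sandbox, State, StatusContainer, shimAlive) mirror the discussion but are assumptions, not the actual virtcontainers code.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a simplified stand-in for the container status record.
type State struct {
	Status string
}

// sandbox is a simplified stand-in for the structure guarding
// container state with a read-write mutex.
type sandbox struct {
	mu    sync.RWMutex
	state State
}

// Before the fix, the status call took only a read lock and spawned a
// go-routine that had to wait (via a defer in the caller) for a
// read-write lock before it could record the "stopped" state, so the
// status was read and returned before the update ran.
//
// After the fix, the caller takes the read-write lock up front,
// updates the state in place if the shim has terminated, and only
// then reads the status, so the returned value is always current.
func (s *sandbox) StatusContainer(shimAlive bool) State {
	s.mu.Lock()
	defer s.mu.Unlock()

	if !shimAlive && s.state.Status == "running" {
		s.state.Status = "stopped" // stop recorded before the status is read
	}
	return s.state
}

func main() {
	s := &sandbox{state: State{Status: "running"}}
	// The shim for this container has terminated on the host:
	// the returned status must already reflect the stop.
	fmt.Println(s.StatusContainer(false).Status) // stopped
}
```

Taking the write lock directly in the caller also avoids the read-to-write lock handoff, which `sync.RWMutex` cannot do atomically: a goroutine holding `RLock` that wants `Lock` must release first, leaving a window for stale reads.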


sboeuf commented Nov 28, 2018

@jodh-intel @devimc @amshinde @chavafg PTAL, this PR fixes the issue seen with Docker 18.09. Nothing else needs to change: the fact that containerd removed the call to kill --all is not a big deal, but we were returning the wrong container status, which containerd did not accept, and as a result delete was never called.


sboeuf commented Nov 28, 2018

/test


devimc left a comment


nice finding!

lgtm


egernst commented Nov 29, 2018

Suggested commit message edit:

When our runtime is asked for the container status, we also handle the
scenario where the container is stopped if the shim process for that
container on the host has terminated.

In the current implementation, we retrieve the container status before stopping
the container, causing a wrong status to be returned. The wait for the original
go-routine's completion was done in a defer within the caller of statusContainer(),
resulting in statusContainer() returning the pre-stop status.

This bug was first observed when updating to docker v18.09/containerd v1.2.0. With
the current implementation, containerd-shim receives the TaskExit event when it detects
kata-shim is terminating. When checking the container state, however, it does not get
the expected "stopped" value.

The following commit resolves the described issue by simplifying the locking used
around the status container calls. Originally StatusContainer would request a read
lock.  If we needed to update the container status in statusContainer, we'd start a
go-routine which would request a read-write lock, waiting for the original read lock
to be released.  Can't imagine a bug could linger in this logic.  We now just request a
read-write lock in the caller (StatusContainer), skipping the need for a separate go-routine
and defer. This greatly simplifies the logic, and removes the original bug.

Member

egernst left a comment


Looks good, but please update the commit message for a bit more detail.

egernst approved these changes Nov 29, 2018
When our runtime is asked for the container status, we also handle
the scenario where the container is stopped if the shim process for
that container on the host has terminated.

In the current implementation, we retrieve the container status
before stopping the container, causing a wrong status to be returned.
The wait for the original go-routine's completion was done in a defer
within the caller of statusContainer(), resulting in
statusContainer() returning the pre-stop status.

This bug was first observed when updating to docker v18.09/containerd
v1.2.0. With the current implementation, containerd-shim receives the
TaskExit event when it detects kata-shim is terminating. When checking
the container state, however, it does not get the expected "stopped" value.

The following commit resolves the described issue by simplifying the
locking used around the status container calls. Originally
StatusContainer would request a read lock. If we needed to update the
container status in statusContainer, we'd start a go-routine which
would request a read-write lock, waiting for the original read lock to
be released.  Can't imagine a bug could linger in this logic. We now
just request a read-write lock in the caller (StatusContainer),
skipping the need for a separate go-routine and defer. This greatly
simplifies the logic, and removes the original bug.

Fixes kata-containers#926

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
sboeuf force-pushed the sboeuf/fix_docker_18_09 branch from b88ed15 to fa9b15d on November 29, 2018 04:15

sboeuf commented Nov 29, 2018

/test


sboeuf commented Nov 29, 2018

@egernst commit message updated! Thanks!

@jodh-intel

@sboeuf - nice!!

Can you find a way to add a unit test for this change though? Given the importance of this fix, it would be good to know we have a test to assert the correct behaviour.

@jodh-intel

@sboeuf - my bad - s/unit test/any type of test/ ;)

@jodh-intel

For the record, I can confirm this PR fixes the issue with Docker 18.09 (tested on Ubuntu Bionic). The CI will be testing using docker 18.06.1 (from https://github.com/kata-containers/runtime/blob/master/versions.yaml#L154), so we can say that it works with both versions.


sboeuf commented Nov 29, 2018

@jodh-intel yes, that's a good idea. I think we can write a simple integration test using kata-runtime state after we run a simple workload that naturally exits after a few seconds. The result should be "stopped" directly, where before it would have returned "running".
Reference: kata-containers/tests#956

@jodh-intel

Thanks @sboeuf - let's land this to allow you to write that test... ;)

@jodh-intel jodh-intel merged commit 5857523 into kata-containers:master Nov 29, 2018

sboeuf commented Nov 29, 2018

Thanks @jodh-intel! I'll write this today!


Development

Successfully merging this pull request may close these issues.

Moving to Docker 18.09 does not delete the VM and proxy when the container terminates

4 participants