virtcontainers: Return the appropriate container status#950
virtcontainers: Return the appropriate container status#950jodh-intel merged 1 commit intokata-containers:masterfrom
Conversation
|
@jodh-intel @devimc @amshinde @chavafg PTAL, this PR fixes the issue seen with Docker 18.09. Other than that, we don't need to modify anything else, the fact that containerd removed the call to |
|
/test |
|
Suggested commit message edit: |
egernst
left a comment
There was a problem hiding this comment.
Looks good, but please update the commit message for a bit more detail.
When our runtime is asked for the container status, we also handle the scenario where the container is stopped if the shim process for that container on the host has terminated. In the current implementation, we retrieve the container status before stopping the container, causing a wrong status to be returned. The wait for the original go-routine's completion was done in a defer within the caller of statusContainers(), resulting in the statusContainer()'s values to return the pre-stopped value. This bug is first observed when updating to docker v18.09/containerd v1.2.0. With the current implementation, containerd-shim receives the TaskExit when it detects kata-shim is terminating. When checking the container state, however, it does not get the expected "stopped" value. The following commit resolves the described issue by simplifying the locking used around the status container calls. Originally StatusContainer would request a read lock. If we needed to update the container status in statusContainer, we'd start a go-routine which would request a read-write lock, waiting for the original read lock to be released. Can't imagine a bug could linger in this logic. We now just request a read-write lock in the caller (StatusContainer), skipping the need for a separate go-routine and defer. This greatly simplifies the logic, and removes the original bug. Fixes kata-containers#926 Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
b88ed15 to
fa9b15d
Compare
|
/test |
|
@egernst commit message updated! Thanks! |
|
@sboeuf - nice!! Can you find a way to add a unit test for this change though? Given the importance of this fix, it would be good to know we have a test to assert the correct behaviour. |
|
@sboeuf - my bad - s/unit test/any type of test/ ;) |
|
ftr, I can confirm this PR fixes the issue with Docker 18.09 (tested on ubuntu bionic). The CI will be testing using docker 18.06.1 (from https://github.com/kata-containers/runtime/blob/master/versions.yaml#L154) so we can say that it works with both versions. |
|
@jodh-intel yes that's a good idea, I think we can write a simple integration test using |
|
Thanks @sboeuf - let's land this to allow you to write that test... ;) |
|
Thanks @jodh-intel! I'll write this today! |
When our runtime is asked for the container status, we also handle
the scenario where the container is stopped if the shim process for
that container on the host has terminated.
In the current implementation, we retrieve the container status
before stopping the container, causing a wrong status to be returned.
The wait for the original go-routine's completion was done in a defer
within the caller of statusContainers(), resulting in the
statusContainer()'s values to return the pre-stopped value.
This bug is first observed when updating to docker v18.09/containerd
v1.2.0. With the current implementation, containerd-shim receives the
TaskExit when it detects kata-shim is terminating. When checking the
container state, however, it does not get the expected "stopped" value.
The following commit resolves the described issue by simplifying the
locking used around the status container calls. Originally
StatusContainer would request a read lock. If we needed to update the
container status in statusContainer, we'd start a go-routine which
would request a read-write lock, waiting for the original read lock to
be released. Can't imagine a bug could linger in this logic. We now
just request a read-write lock in the caller (StatusContainer),
skipping the need for a separate go-routine and defer. This greatly
simplifies the logic, and removes the original bug.
Fixes #926
Signed-off-by: Sebastien Boeuf sebastien.boeuf@intel.com