libcontainer/container_linux: Consider process state (running, zombie, etc.) in runType#1489
Conversation
So we can extract more than the start time with a single read. Signed-off-by: W. Trevor King <wking@tremily.us>
107bc2c to
407e33a
Compare
|
Hooray, Travis is happy :). Can someone more familiar with the checkpoint tooling comment on whether the appropriate post-checkpoint state is |
|
@wking Thank you for the patch. I looked at it without details and it looks good for me. I'm going to review it carefully a bit later.
I am not sure that I understand what do you mean here. Could you elaborate? |
|
On Tue, Jun 20, 2017 at 11:17:25AM -0700, Andrew Vagin wrote:
> Can someone more familiar with the checkpoint tooling comment on
> whether the appropriate post-checkpoint state is Stopped (with a
> zombie process?).
I am not sure that I understand what do you mean here. Could you
elaborate?
The change I make to checkpoint_test.go [1] is after a container has
been checkpointed (via container.Checkpoint [2]). When the state is
checked [3], the container process was a zombie, which, with this PR,
means the status was ‘Stopped’ (before this PR, a zombie container
process was reported as ‘Running’ [4]). Interestingly, there's an
earlier call to container.Checkpoint with different options in that
test [5], and in that case the container process is ‘Running’ both
before and after this PR [6]. I don't understand container.Checkpoint
well enough to know why the call in [2] (but not the call in [5])
seems to be zombi-fying the container process (the difference may be
PreDump being set in the earlier call [7]). And since I don't
understand why the container process is (at least sometimes) being
zombi-fied, I'm not sure how reliable my checkpoint_test.go change [1]
is.
[1]: https://github.com/opencontainers/runc/pull/1489/files#diff-4dcc5378706c75d50b2b7c827d8546e8L183
[2]: https://github.com/opencontainers/runc/blob/035b57895dcd7284a61aacfb9dbfd545a1a425ba/libcontainer/integration/checkpoint_test.go#L173
[3]: https://github.com/opencontainers/runc/blob/035b57895dcd7284a61aacfb9dbfd545a1a425ba/libcontainer/integration/checkpoint_test.go#L178
[4]: https://github.com/opencontainers/runc/blob/035b57895dcd7284a61aacfb9dbfd545a1a425ba/libcontainer/integration/checkpoint_test.go#L183
[5]: https://github.com/opencontainers/runc/blob/035b57895dcd7284a61aacfb9dbfd545a1a425ba/libcontainer/integration/checkpoint_test.go#L145
[6]: https://github.com/opencontainers/runc/blob/035b57895dcd7284a61aacfb9dbfd545a1a425ba/libcontainer/integration/checkpoint_test.go#L155
[7]: https://github.com/opencontainers/runc/blob/035b57895dcd7284a61aacfb9dbfd545a1a425ba/libcontainer/integration/checkpoint_test.go#L141
|
|
My reproducer for the problem from is still working. You need to stop the init process, I use gdb for that, and then start tests. |
|
Oops, I forgot to compile runc. Actually everything works as expected with this patch. |
|
@wking you changes in checkpoint_test.go looks correct.
container.Checkpoint(preDumpOpts) doesn't stop a container, this call enables memory tracker and dumps memory for processes without stopping them. Then we call container.Checkpoint() in a second time and this call stops processes and dumps them. The big part of memory is already dumped, so we have to dump only changed memory. This optimization allows us to reduce a downtime of a container, when it is migrated from one host to another host: https://criu.org/Iterative_migration |
And convert the various start-time properties from strings to uint64s. This removes all internal consumers of the deprecated GetProcessStartTime function. Signed-off-by: W. Trevor King <wking@tremily.us>
And Stat_t.PID and Stat_t.Name while we're at it. Then use the new .State property in runType to distinguish between running and zombie/dead processes, since kill(2) does not [1]. With this change we no longer claim Running status for zombie/dead processes. I've also removed the kill(2) call from runType. It was originally added in 13841ef (new-api: return the Running state only if the init process is alive, 2014-12-23), but we've been accessing /proc/[pid]/stat since 14e95b2 (Make state detection precise, 2016-07-05, opencontainers#930), and with the /stat access the kill(2) check is redundant. I also don't see much point to the previously-separate doesInitProcessExist, so I've inlined that logic in runType. It would be nice to distinguish between "/proc/[pid]/stat doesn't exist" and errors parsing its contents, but I've skipped that for the moment. The Running -> Stopped change in checkpoint_test.go is because the post-checkpoint process is a zombie, and with this commit zombie processes are Stopped (and no longer Running). [1]: opencontainers#1483 (comment) Signed-off-by: W. Trevor King <wking@tremily.us>
| exist, err := c.doesInitProcessExist(pid) | ||
| if !exist || err != nil { | ||
| return Stopped, err | ||
| if stat.StartTime != c.initProcessStartTime || stat.State == system.Zombie || stat.State == system.Dead { |
There was a problem hiding this comment.
Should there be a test with the state set as a Zombie or Dead to make sure that it gets picked off? Apologies if I missed it, and not a hold up for this PR, suggestion for later if missing.
There was a problem hiding this comment.
Should there be a test with the state set as a Zombie or Dead to make sure that it gets picked off?
I'm not sure what you mean. Can you give more detail on the case you feel is not covered?
There was a problem hiding this comment.
Should there be a test with the state set as a Zombie or Dead to make sure that it gets picked off?
Ah, I think you mean a unit test where we run this on a dead/zombie process. The checkpoint_test.go change is doing that already, although there's too much going on there for it to be a unit test. If you have another way to manufacture a zombie process for testing, I'm happy to add a more focused test.
There was a problem hiding this comment.
I was indeed thinking of a unit test, trying to ensure more code coverage. I unfortunately don't have a good way to emulate a zombie/dead process. I was just thinking of faking it out in a test to change the state of the process. I didn't see that happening for Dead/Zombie in the in the checkpoint_tests.go in particular, only that the process wasn't Stopped. Anywho, no need to address now, especially with this PR, just a point to mull.
|
LGTM, shoulda done this when I commented earlier today. |
As @avagin pointed out, we're currently failing to distinguish between running (stopped, disk-sleeping, etc.) and dead/zombie processes when determining the container state. This PR rerolls our runType approach to make that distinction.
Now that we need two values from
/proc/[pid]/stat, I've rerolledlibcontainer/system/procto use aStat_tobject (like thesys/x/unixStat_tAPI). Then we can get both the start-time and process state from the same filesystem read.Additional details in the commit messages for the curious.