daemon: containerStop: fix ordering of "stop" and "die" events#50220
daemon: containerStop: fix ordering of "stop" and "die" events#50220thaJeztah merged 1 commit intomoby:masterfrom
Conversation
353a256 to
c0486dc
Compare
|
@vvoland updated with your suggestion; PTAL 👍 |
integration/container/stop_test.go
Outdated
| inspect, err := apiClient.ContainerInspect(ctx, cID) | ||
| assert.NilError(t, err) | ||
| assert.Check(t, is.Equal(false, inspect.State.Running)) |
There was a problem hiding this comment.
IMHO, it's better to use same API (/containers/json) as mentioned in #50133 to ensure this pull request doesn't introduce regression for #50133:
| inspect, err := apiClient.ContainerInspect(ctx, cID) | |
| assert.NilError(t, err) | |
| assert.Check(t, is.Equal(false, inspect.State.Running)) | |
| containers, err := apiClient.ContainerList(ctx, containertypes.ListOptions{ | |
| All: true, | |
| Filters: filters.NewArgs(filters.Arg("id", cID)), | |
| }) | |
| assert.NilError(t, err) | |
| assert.Assert(t, is.Len(containers, 1)) | |
| assert.Check(t, containers[0].State != containertypes.StateRunning) |
There was a problem hiding this comment.
Thanks! so my intent here was to just have a sanity check for the container to be stopped before we start checking the events, not so much to verify #50133
For that one, we should probably still look at adding a test specific to that.
integration/container/stop_test.go
Outdated
| timeOut := 1 | ||
| err := apiClient.ContainerStop(ctx, cID, containertypes.StopOptions{Timeout: &timeOut}) |
There was a problem hiding this comment.
We should either set a bigger timeout for Windows (like
moby/integration/container/stop_test.go
Lines 94 to 97 in c0486dc
Otherwise this will end up being flaky.
There was a problem hiding this comment.
Hm... yup can do.
But thinking about this; it actually makes me curious now; everything in containerStop looks to be coded to have a promise that "once the function returned, the container is stopped". Or at least; there's checks in multiple places where we have an explicit "wait" for the container to reach the desired state;
Lines 98 to 101 in c0486dc
Why do we need those checks? Have we been working around a bug? If stop was meant to be async, then I would've expected it to return immediately ("we signaled it to stop; check back later"), but we don't; the api call blocks until it (should be) exited (and the state (thus docker inspect?) to be persisted to disk?)
docker run -qdit --rm --name foo busybox
time curl -XPOST --unix-socket /var/run/docker.sock http://local/v1.51/containers/foo/stop
real 0m10.366s
user 0m0.004s
sys 0m0.032s
docker run -qdit --rm --name foo busybox
time curl -XPOST --unix-socket /var/run/docker.sock 'http://local/v1.51/containers/foo/stop?t=1'
real 0m1.307s
user 0m0.010s
sys 0m0.007sc0486dc to
ff960ce
Compare
integration/container/stop_test.go
Outdated
| var pollOpts []poll.SettingOp | ||
| if testEnv.DaemonInfo.OSType == "windows" { | ||
| pollOpts = append(pollOpts, poll.WithTimeout(StopContainerWindowsPollTimeout)) | ||
| } | ||
|
|
||
| poll.WaitOn(t, container.IsStopped(ctx, apiClient, cID), pollOpts...) |
There was a problem hiding this comment.
@vvoland added the poll; PTAL
We should probably look at that comment I left; really curious now!
There was a problem hiding this comment.
☝️ per discussion on Slack; @vvoland didn't necessarily mean an actual poll here, just making sure the timeout is long enough for windows; this can probably be an inspect again, if needed at all (will have a look)
There was a problem hiding this comment.
Why do we need to care about this timeout at all? I would use default value, because this timeout is about graceful stop vs kill and in this test (if I'm not wrong) we are OK even if container is killed (i.e. was not able to stop within given/default timeout).
There was a problem hiding this comment.
Yeah, that was my initial thinking behind setting it to 1 second; either it stops gracefully, or kill it after 1 second; both should produce stop -> die -> exited in that order.
There was a problem hiding this comment.
I guess, default timeout would be better (lesser code -> better code). Just need to check if respective image stops correctly (doesn't ignore signal used for stop of container created from image).
ff960ce to
4b2081b
Compare
|
Hm... interesting; Windows still flipped the order somehow? https://github.com/moby/moby/actions/runs/15729094026/job/44326581774?pr=50220 |
282b14e to
147e2a8
Compare
|
Hi @thaJeztah,
Even on Linux I don't see any synchronization guaranteeing the order of events b/w Line 130 in 147e2a8 and Line 76 in d31f67f Both actions are performed in goroutines which can run at the same point of time and both actions are performed without holding any shared lock (like lock on container). Do you? |
|
No, you're right; I made the assumption that the code before this handled the correct order, but an initial look at the Line 199 in d31f67f Which gets unblocked through Line 313 in d31f67f Which 🫠 gets called before statecounters are updated, state is persisted to disk, and the event is sent; Lines 120 to 130 in 147e2a8 And the lock is released before that, so even So, while this PR (AFAIC) doesn't make things worse and mostly brings us back to the situation before #50136, there already was a race, and it just became more apparent with that PR merged. |
|
Let's keep the test, but add a skip with a TODO link for the ticket. |
|
Yeah, was considering that, but didn't want to add a test that we know is flaky and skip from the start; let me just move the test to a draft PR, and use that as starting point for further work; I'll also draft up a tracking ticket to capture what was discussed in this (and the other PR) for future work to look into. I knew some of this logic was convoluted, but it looks like there's just too much logic stacked on top of existing bits over the Years to fix various issues, and it's starting to show. We need to take a few steps back and make an inventory of the logic and rework some of it at least. |
147e2a8 to
7be2e74
Compare
Commit 8e6cd44 added synchronisation to wait for the container's status to be updated in memory. However, since 952902e, a defer was used to produce the container's "stop" event. As a result of the sychronisation that was added, the "die" event would now be produced before the "stop" event. This patch moves the locking inside the defer to restore the previous behavior. Unfortunately the order of events is still not guaranteed, because events are emited from multiple goroutines that don't have synchronisation between them; this is something to look at for follow ups. This patch keeps the status quo and should preserve the old behavior, which was "more" correct in most cases. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
7be2e74 to
062082e
Compare
Now hoping GitHub recovers because it's been bad whole day |
relates to;
Commit 8e6cd44 added synchronisation to wait for the container's status to be updated in memory. However, since 952902e, a defer was used to produce the container's "stop" event.
As a result of the sychronisation that was added, the "die" event would now be produced before the "stop" event.
This patch moes the locking inside the defer to restore the previous behavior.
Unfortunately the order of events is still not guaranteed, because events are emited from multiple goroutines that don't have synchronisation between them; this is something to look at for follow ups. This patch keeps the status quo and should preserve the old behavior, which was "more" correct in most cases.
- What I did
- How I did it
- How to verify it
- Human readable description for the release notes
- A picture of a cute animal (not mandatory but encouraged)