daemon: containerStop: fix ordering of "stop" and "die" events by thaJeztah · Pull Request #50220 · moby/moby

thaJeztah · 2025-06-17T16:03:47Z

Relates to:

fix: daemon: state of stopped container visible to other queries when container is stopped #50136 (comment)
pkg/compose: remove uses of ExecOptions.Detach docker/compose#12950 (comment)

Commit 8e6cd44 added synchronisation to wait for the container's status to be updated in memory. However, since 952902e, a defer was used to produce the container's "stop" event.

As a result of the synchronisation that was added, the "die" event would now be produced before the "stop" event.

This patch moves the locking inside the defer to restore the previous behavior.

Unfortunately, the order of events is still not guaranteed, because events are emitted from multiple goroutines that have no synchronisation between them; this is something to look at in follow-ups. This patch keeps the status quo and should preserve the old behavior, which was "more" correct in most cases.
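The effect of moving the lock wait inside the defer can be illustrated with a small simulation. This is a simplified sketch, not the daemon code: `eventLog`, `simulateStop`, and the `lockInsideDefer` flag are hypothetical names, and the "monitor" goroutine only models the assumption that "die" is emitted while the container lock is held.

```go
package main

import "sync"

// eventLog is a hypothetical stand-in for the daemon's event stream.
type eventLog struct {
	mu     sync.Mutex
	events []string
}

func (l *eventLog) emit(action string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.events = append(l.events, action)
}

func (l *eventLog) snapshot() []string {
	l.mu.Lock()
	defer l.mu.Unlock()
	return append([]string(nil), l.events...)
}

// simulateStop models one stop request. The monitor goroutine takes the
// container lock, signals that the container exited, emits "die", and only
// then releases the lock.
//
// With lockInsideDefer == false, the stop path waits for the lock before
// returning, so its deferred "stop" event always lands after "die".
// With lockInsideDefer == true, "stop" is emitted first inside the defer and
// the lock wait follows, so "stop" and "die" race with each other again.
func simulateStop(lockInsideDefer bool) []string {
	var (
		evLog       eventLog
		containerMu sync.Mutex
		exited      = make(chan struct{})
		monitorDone = make(chan struct{})
	)

	go func() { // monitor: handles the container's exit
		defer close(monitorDone)
		containerMu.Lock()
		close(exited)
		evLog.emit("die")
		containerMu.Unlock()
	}()

	func() { // stop path
		defer func() {
			evLog.emit("stop")
			if lockInsideDefer {
				containerMu.Lock() // wait for the monitor to finish
				containerMu.Unlock()
			}
		}()
		<-exited
		if !lockInsideDefer {
			containerMu.Lock() // wait for the monitor to finish
			containerMu.Unlock()
		}
	}()

	<-monitorDone
	return evLog.snapshot()
}
```

Calling `simulateStop(false)` always yields "die" followed by "stop" in this model, while `simulateStop(true)` yields both events in either order, which matches the description: the previous behavior is restored, but the order is still not guaranteed.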

daemon/stop.go

thaJeztah · 2025-06-17T17:00:36Z

@vvoland updated with your suggestion; PTAL 👍

mabrarov · 2025-06-17T18:12:14Z

integration/container/stop_test.go

+	inspect, err := apiClient.ContainerInspect(ctx, cID)
+	assert.NilError(t, err)
+	assert.Check(t, is.Equal(false, inspect.State.Running))


IMHO, it's better to use the same API (/containers/json) as mentioned in #50133, to ensure this pull request doesn't introduce a regression for #50133:

Suggested change

-	inspect, err := apiClient.ContainerInspect(ctx, cID)
-	assert.NilError(t, err)
-	assert.Check(t, is.Equal(false, inspect.State.Running))
+	containers, err := apiClient.ContainerList(ctx, containertypes.ListOptions{
+		All:     true,
+		Filters: filters.NewArgs(filters.Arg("id", cID)),
+	})
+	assert.NilError(t, err)
+	assert.Assert(t, is.Len(containers, 1))
+	assert.Check(t, containers[0].State != containertypes.StateRunning)

Thanks! So my intent here was just to have a sanity check that the container is stopped before we start checking the events, not so much to verify #50133.

For that one, we should probably still look at adding a test specific to that.

vvoland · 2025-06-17T18:20:47Z

integration/container/stop_test.go

+	timeOut := 1
+	err := apiClient.ContainerStop(ctx, cID, containertypes.StopOptions{Timeout: &timeOut})


We should either set a bigger timeout for Windows (like

moby/integration/container/stop_test.go

Lines 94 to 97 in c0486dc

	var pollOpts []poll.SettingOp
	if isWindows {
		pollOpts = append(pollOpts, poll.WithTimeout(StopContainerWindowsPollTimeout))
	}

) or just leave the default 10s.

Otherwise this will end up being flaky.

Hm... yup can do.

But thinking about this, it actually makes me curious now: everything in containerStop looks to be coded to promise that "once the function has returned, the container is stopped". Or at least, there are checks in multiple places where we explicitly "wait" for the container to reach the desired state:

moby/daemon/stop.go

Lines 98 to 101 in c0486dc

	if status := <-ctr.Wait(subCtx, containertypes.WaitConditionNotRunning); status.Err() == nil {
		// container did exit, so ignore any previous errors and return
		return nil
	}

Why do we need those checks? Have we been working around a bug? If stop was meant to be async, then I would've expected it to return immediately ("we signaled it to stop; check back later"), but we don't; the api call blocks until it (should be) exited (and the state (thus docker inspect?) to be persisted to disk?)

    docker run -qdit --rm --name foo busybox
    time curl -XPOST --unix-socket /var/run/docker.sock http://local/v1.51/containers/foo/stop

    real 0m10.366s
    user 0m0.004s
    sys 0m0.032s

    docker run -qdit --rm --name foo busybox
    time curl -XPOST --unix-socket /var/run/docker.sock 'http://local/v1.51/containers/foo/stop?t=1'

    real 0m1.307s
    user 0m0.010s
    sys 0m0.007s

thaJeztah · 2025-06-17T18:58:34Z

integration/container/stop_test.go

+	var pollOpts []poll.SettingOp
+	if testEnv.DaemonInfo.OSType == "windows" {
+		pollOpts = append(pollOpts, poll.WithTimeout(StopContainerWindowsPollTimeout))
+	}
+
+	poll.WaitOn(t, container.IsStopped(ctx, apiClient, cID), pollOpts...)


@vvoland added the poll; PTAL

We should probably look at that comment I left; really curious now!

☝️ per discussion on Slack; @vvoland didn't necessarily mean an actual poll here, just making sure the timeout is long enough for windows; this can probably be an inspect again, if needed at all (will have a look)

Why do we need to care about this timeout at all? I would use the default value, because this timeout is about graceful stop vs kill, and in this test (if I'm not wrong) we are OK even if the container is killed (i.e. it was not able to stop within the given/default timeout).

Yeah, that was my initial thinking behind setting it to 1 second; either it stops gracefully, or kill it after 1 second; both should produce stop -> die -> exited in that order.

I guess the default timeout would be better (less code -> better code). We just need to check that the respective image stops correctly (i.e. doesn't ignore the signal used to stop a container created from that image).
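The "graceful stop vs kill" role of the timeout discussed above can be sketched as follows. This is only an illustration under assumptions: `stopOrKill` is a hypothetical helper, not a daemon function, and it assumes the stop signal has already been sent before it is called.

```go
package main

import "time"

// stopOrKill is a hypothetical helper. It waits up to timeoutSeconds for the
// exited channel to be closed (graceful stop). If the container has not
// exited by then, it calls kill and waits for the exit.
func stopOrKill(exited <-chan struct{}, timeoutSeconds int, kill func()) string {
	timer := time.NewTimer(time.Duration(timeoutSeconds) * time.Second)
	defer timer.Stop()

	select {
	case <-exited:
		// The container reacted to the stop signal in time.
		return "stopped"
	case <-timer.C:
		// Timeout expired: escalate to kill, then wait for the exit.
		kill()
		<-exited
		return "killed"
	}
}
```

Either branch ends with the container exited, which is why the length of the timeout should not matter for a test that only looks at the events produced afterwards.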

integration/container/stop_test.go

thaJeztah · 2025-06-18T10:44:35Z

Hm... interesting; Windows still flipped the order somehow? https://github.com/moby/moby/actions/runs/15729094026/job/44326581774?pr=50220

=== Failed
=== FAIL: github.com/docker/docker/integration/container TestStopEventsOrder (2.80s)
    stop_test.go:151: assertion failed: 
        --- ←
        +++ →
          []events.Action{
        - 	"stop",
        + 	"die",
        - 	"die",
        + 	"stop",
          }

mabrarov · 2025-06-18T14:47:49Z

Hi @thaJeztah,

Even on Linux I don't see any synchronization guaranteeing the order of events between

moby/daemon/monitor.go

Line 130 in 147e2a8

daemon.LogContainerEventWithAttributes(c, events.ActionDie, attributes)

and

moby/daemon/stop.go

Line 76 in d31f67f

daemon.LogContainerEvent(ctr, events.ActionStop)

Both actions are performed in goroutines which can run at the same time, and both are performed without holding any shared lock (such as the lock on the container).

Do you?

thaJeztah · 2025-06-18T15:36:48Z

No, you're right; I made the assumption that the code before this handled the correct order, but from an initial look at the Wait function, the wait looks to depend on stopWaiters;

moby/container/state.go

Line 199 in d31f67f

s.stopWaiters = append(s.stopWaiters, waitC)

Which gets unblocked through State.SetStopped;

moby/container/state.go

Line 313 in d31f67f

s.notifyAndClear(&s.stopWaiters)

Which 🫠 gets called before statecounters are updated, state is persisted to disk, and the event is sent;

moby/daemon/monitor.go

Lines 120 to 130 in 147e2a8

    
		c.SetStopped(&ctrExitStatus)
		if !c.HasBeenManuallyRestarted {
			defer daemon.autoRemove(&cfg.Config, c)
		}
	}
	defer c.Unlock() // needs to be called before autoRemove
	daemon.setStateCounter(c)
	checkpointErr := c.CheckpointTo(context.TODO(), daemon.containersReplica)
	daemon.LogContainerEventWithAttributes(c, events.ActionDie, attributes)

And the lock is released before that, so even setStateCounter() could potentially race, as there's no lock when getting the container state (I see it calls State.StateString() without lock).

So, while this PR (AFAIC) doesn't make things worse and mostly brings us back to the situation before #50136, there already was a race, and it just became more apparent with that PR merged.
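The ordering described above (waiters are released first, the event is sent afterwards) can be sketched with a minimal model. The `state` type, `wait`, `setStopped`, and `handleExit` below are hypothetical simplifications and not the actual `State` type from moby/container; they only show why a goroutine blocked on the wait can run before "die" is emitted.

```go
package main

import "sync"

// state is a simplified, hypothetical model of a container state that keeps
// a list of waiters to notify once the container has stopped.
type state struct {
	mu          sync.Mutex
	running     bool
	stopWaiters []chan struct{}
}

// wait returns a channel that is closed once the state is marked stopped.
func (s *state) wait() <-chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	c := make(chan struct{})
	if !s.running {
		close(c)
		return c
	}
	s.stopWaiters = append(s.stopWaiters, c)
	return c
}

// setStopped marks the state as stopped and unblocks all waiters.
func (s *state) setStopped() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.running = false
	for _, c := range s.stopWaiters {
		close(c)
	}
	s.stopWaiters = nil
}

// handleExit follows the order discussed above: waiters are released first,
// and only afterwards is the "die" event emitted.
func handleExit(s *state, emit func(action string)) {
	s.setStopped()
	// A goroutine that was blocked on wait() may already be running here
	// and could emit its own "stop" event before the line below executes.
	emit("die")
}
```

In this model, any channel returned by `wait()` is already closed by the time `emit("die")` is called, so nothing orders a waiter's follow-up work relative to the "die" event.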

thaJeztah · 2025-06-18T15:39:05Z

Based on the above, perhaps we should consider removing the test for now, and re-apply it once we have looked at the underlying issue in more depth.

I think the change in this PR still is good to take, as it would keep the status quo from before #50136

let me know what you think @vvoland @mabrarov

vvoland · 2025-06-18T18:17:19Z

Let's keep the test, but add a skip with a TODO link for the ticket.

thaJeztah · 2025-06-18T18:40:19Z

Yeah, was considering that, but didn't want to add a test that we know is flaky and skip from the start; let me just move the test to a draft PR, and use that as starting point for further work; I'll also draft up a tracking ticket to capture what was discussed in this (and the other PR) for future work to look into.

I knew some of this logic was convoluted, but it looks like there's just too much logic stacked on top of existing bits over the years to fix various issues, and it's starting to show. We need to take a few steps back, make an inventory of the logic, and rework at least some of it.

Commit 8e6cd44 added synchronisation to wait for the container's status to be updated in memory. However, since 952902e, a defer was used to produce the container's "stop" event.

As a result of the synchronisation that was added, the "die" event would now be produced before the "stop" event.

This patch moves the locking inside the defer to restore the previous behavior.

Unfortunately, the order of events is still not guaranteed, because events are emitted from multiple goroutines that have no synchronisation between them; this is something to look at in follow-ups. This patch keeps the status quo and should preserve the old behavior, which was "more" correct in most cases.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>

thaJeztah · 2025-06-18T18:55:13Z

rebased and moved the test to WIP: daemon: add test for order of events on container stop #50227

Now hoping GitHub recovers, because it's been bad the whole day

thaJeztah · 2025-06-19T10:44:05Z

@robmry @vvoland ptal (it's green now 🎉)

thaJeztah added this to the 29.0.0 milestone Jun 17, 2025

thaJeztah added status/2-code-review area/daemon Core Engine kind/bugfix PR's that fix bugs process/cherry-pick/28.x labels Jun 17, 2025

This was referenced Jun 17, 2025

fix: daemon: state of stopped container visible to other queries when container is stopped #50136

Merged

pkg/compose: remove uses of ExecOptions.Detach docker/compose#12950

Merged

vvoland reviewed Jun 17, 2025

thaJeztah force-pushed the fix_event_ordering branch from 353a256 to c0486dc Compare June 17, 2025 16:59

thaJeztah mentioned this pull request Jun 17, 2025

daemon logs show (*service).Write failed errors when pulling images with compose #50223

Open

mabrarov reviewed Jun 17, 2025

vvoland reviewed Jun 17, 2025

thaJeztah force-pushed the fix_event_ordering branch from c0486dc to ff960ce Compare June 17, 2025 18:57

thaJeztah commented Jun 17, 2025

thaJeztah force-pushed the fix_event_ordering branch from ff960ce to 4b2081b Compare June 18, 2025 09:26

thaJeztah force-pushed the fix_event_ordering branch 2 times, most recently from 282b14e to 147e2a8 Compare June 18, 2025 11:08

vvoland approved these changes Jun 18, 2025

thaJeztah force-pushed the fix_event_ordering branch from 147e2a8 to 7be2e74 Compare June 18, 2025 18:42

thaJeztah force-pushed the fix_event_ordering branch from 7be2e74 to 062082e Compare June 18, 2025 18:43

thaJeztah mentioned this pull request Jun 18, 2025

WIP: daemon: add test for order of events on container stop #50227

Draft

vvoland approved these changes Jun 20, 2025

thaJeztah merged commit a0f36cc into moby:master Jun 20, 2025
332 of 370 checks passed

thaJeztah deleted the fix_event_ordering branch June 20, 2025 10:44

thaJeztah mentioned this pull request Jun 20, 2025

[28.x backport] daemon: containerStop: fix ordering of "stop" and "die" events #50242

Merged

thaJeztah added process/cherry-picked and removed process/cherry-pick/28.x labels Jun 20, 2025

3 participants
