fix: daemon: state of stopped container visible to other queries when container is stopped (#50136)
Conversation
Force-pushed ead31cc to bdc5cac
daemon/stop.go (outdated)
```go
if err := ctr.CommitInMemory(daemon.containersReplica); err != nil {
	log.G(ctx).WithError(err).Warn("could not make state of container visible to other queries")
}
```
The CommitInMemory function was added specifically for health checks, as the health-check status is ephemeral and doesn't have to be persisted to disk.
The other function that persists container state, both to disk and in memory, is container.CheckpointTo;
Lines 223 to 236 in ee69d8a
I see that it's called in various places, such as on container start, pause/unpause, update, and some operations such as connecting to and disconnecting from networks (for pause and unpause, I'm actually wondering whether that's state that needs to be persisted to disk, but that's a separate discussion).
For container stop, it looks like the handling is a bit more involved; container.CheckpointTo is called in daemon.handleContainerExit (which is called in daemon.killWithSignal and daemon.ProcessEvent), and is also called in daemon.ProcessEvent itself;
Line 32 in ee69d8a
Within daemon.ProcessEvent, that code is handled conditionally, based on whether the event we received from containerd was for the container's main process (not for exec processes inside the container);
Lines 190 to 193 in ee69d8a
I wonder (but would have to look at traces, events or logs) if there's anything in the events processing that could explain why none of those paths were called by the moment the function returns, or if there's possibly some concurrency where multiple calls to container.CheckpointTo are racing (possibly overwriting state with the wrong state?)
Yes, the current state of this PR is quick & dirty.
I'm curious whether there is a chance that daemon.Daemon.ContainerStop completes (and the respective API endpoint replies, so that the subsequent request is sent by the client) before the call of (before the return from) container.CheckpointTo in daemon.Daemon.handleContainerExit - refer to
Line 120 in 01f442b
and
Line 128 in 01f442b
where the first call includes:
Line 322 in 01f442b
which should trigger:
Line 95 in 01f442b
The lock acquired at daemon/stop.go:97 just helps to enforce ordering between daemon.Daemon.handleContainerExit (which holds the lock of the same mutex from the beginning of the method) and daemon.Daemon.containerStop.
... I can remove daemon/stop.go:99-101 and check if it still works, to test my hypothesis (note that a fix for a race condition can not be 100% proven that way 😄)
I removed CommitInMemory from this PR because it doesn't seem to be needed (refer to the previous comments).
Here are the results of the test ("Reproduce" section from the description of #50133) on the master branch.
Debugger console output (details about breakpoints where these messages are logged into debugger console are provided at the end of this comment):
Before Container.SetStop in Daemon.handleContainerExit
After container stopped in Daemon.containerStop
Start of containerRouter.getContainersJSON
After Daemon.containersReplica in Daemon.Containers
After Container.CheckpointTo in Daemon.handleContainerExit
Start of containerRouter.getContainersJSON
After Daemon.containersReplica in Daemon.Containers
Test output (test failed):
* Trying /var/run/docker.sock:0...
* Connected to localhost (/var/run/docker.sock) port 80 (#0)
> POST /containers/3bd6f9fa938bbac06b546d6a7e8513736a0a1be66ef7a481d71d24f9a5f9075a/stop HTTP/1.1
> Host: localhost
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 204 No Content
< Api-Version: 1.50
< Docker-Experimental: false
< Ostype: linux
< Server: Docker/dev (linux)
< Date: Tue, 03 Jun 2025 16:58:27 GMT
<
* Connection #0 to host localhost left intact
* Found bundle for host: 0x5636e4b76440 [serially]
* Can not multiplex, even if we wanted to
* Re-using existing connection #0 with host localhost
> GET /containers/json?all=1&filters=%7B%22name%22%3A%7B%22%5Emy-nginx%24%22%3Atrue%7D%7D HTTP/1.1
> Host: localhost
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Api-Version: 1.50
< Content-Type: application/json
< Docker-Experimental: false
< Ostype: linux
< Server: Docker/dev (linux)
< Date: Tue, 03 Jun 2025 16:58:27 GMT
< Content-Length: 954
<
[{"Id":"3bd6f9fa938bbac06b546d6a7e8513736a0a1be66ef7a481d71d24f9a5f9075a","Names":["/my-nginx"],"Image":"nginx:alpine","ImageID":"sha256:6769dc3a703c719c1d2756bda113659be28ae16cf0da58dd5fd823d6b9a050ea","Command":"/docker-entrypoint.sh nginx -g 'daemon off;'","Created":1748969897,"Ports":[{"PrivatePort":80,"Type":"tcp"}],"Labels":{"maintainer":"NGINX Docker Maintainers <docker-maint@nginx.com>"},"State":"running","Status":"Up 10 seconds","HostConfig":{"NetworkMode":"bridge"},"NetworkSettings":{"Networks":{"bridge":{"IPAMConfig":null,"Links":null,"Aliases":null,"MacAddress":"72:d0:71:04:d1:e9","DriverOpts":null,"GwPriority":0,"NetworkID":"b563c5c9b89c462cf82a852a7075743a80e491b15d5c6fc270cd73cd9ef97245","EndpointID":"391be0e2c6ef0fdaaa60af41cfb13937f66c8700c7d10e19313162b1b9639864","Gateway":"172.18.0.1","IPAddress":"172.18.0.2","IPPrefixLen":16,"IPv6Gateway":"","GlobalIPv6Address":"","GlobalIPv6PrefixLen":0,"DNSNames":null}}},"Mounts":[]}]
* Connection #0 to host localhost left intact
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3bd6f9fa938b nginx:alpine "/docker-entrypoint.…" 10 seconds ago Exited (0) Less than a second ago my-nginx
3bd6f9fa938bbac06b546d6a7e8513736a0a1be66ef7a481d71d24f9a5f9075a
Here are the results of the test ("Reproduce" section from the description of #50133) on the source branch of this pull request.
Debugger console output (details about breakpoints where these messages are logged into debugger console are provided at the end of this comment):
Before Container.SetStop in Daemon.handleContainerExit
After Container.CheckpointTo in Daemon.handleContainerExit
After container stopped in Daemon.containerStop
Start of containerRouter.getContainersJSON
After Daemon.containersReplica in Daemon.Containers
Start of containerRouter.getContainersJSON
After Daemon.containersReplica in Daemon.Containers
Test output (test succeeded):
* Trying /var/run/docker.sock:0...
* Connected to localhost (/var/run/docker.sock) port 80 (#0)
> POST /containers/d18b4b0aff9fe6b577a34050492782f41267ad1d36cf78e96ba7b197842d5968/stop HTTP/1.1
> Host: localhost
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 204 No Content
< Api-Version: 1.50
< Docker-Experimental: false
< Ostype: linux
< Server: Docker/dev (linux)
< Date: Tue, 03 Jun 2025 17:01:52 GMT
<
* Connection #0 to host localhost left intact
* Found bundle for host: 0x5621f5ac4440 [serially]
* Can not multiplex, even if we wanted to
* Re-using existing connection #0 with host localhost
> GET /containers/json?all=1&filters=%7B%22name%22%3A%7B%22%5Emy-nginx%24%22%3Atrue%7D%7D HTTP/1.1
> Host: localhost
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Api-Version: 1.50
< Content-Type: application/json
< Docker-Experimental: false
< Ostype: linux
< Server: Docker/dev (linux)
< Date: Tue, 03 Jun 2025 17:01:52 GMT
< Content-Length: 840
<
[{"Id":"d18b4b0aff9fe6b577a34050492782f41267ad1d36cf78e96ba7b197842d5968","Names":["/my-nginx"],"Image":"nginx:alpine","ImageID":"sha256:6769dc3a703c719c1d2756bda113659be28ae16cf0da58dd5fd823d6b9a050ea","Command":"/docker-entrypoint.sh nginx -g 'daemon off;'","Created":1748970102,"Ports":[],"Labels":{"maintainer":"NGINX Docker Maintainers <docker-maint@nginx.com>"},"State":"exited","Status":"Exited (0) Less than a second ago","HostConfig":{"NetworkMode":"bridge"},"NetworkSettings":{"Networks":{"bridge":{"IPAMConfig":null,"Links":null,"Aliases":null,"MacAddress":"","DriverOpts":null,"GwPriority":0,"NetworkID":"e6072ab17d3e68d3b47d9c6935a69b0133e47c9fb61f2864646c899de1ba29d8","EndpointID":"","Gateway":"","IPAddress":"","IPPrefixLen":0,"IPv6Gateway":"","GlobalIPv6Address":"","GlobalIPv6PrefixLen":0,"DNSNames":null}}},"Mounts":[]}]
* Connection #0 to host localhost left intact
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d18b4b0aff9f nginx:alpine "/docker-entrypoint.…" 10 seconds ago Exited (0) Less than a second ago my-nginx
d18b4b0aff9fe6b577a34050492782f41267ad1d36cf78e96ba7b197842d5968
Breakpoints (without suspend, just output to debugger console):
| Tested branch | Location of breakpoint | Message written to debugger console |
|---|---|---|
| master | daemon/monitor.go:120 | Before Container.SetStop in Daemon.handleContainerExit |
| master | daemon/monitor.go:130 | After Container.CheckpointTo in Daemon.handleContainerExit |
| master | daemon/stop.go:97 | After container stopped in Daemon.containerStop |
| master | api/server/router/container/container_routes.go:77 | Start of containerRouter.getContainersJSON |
| master | daemon/list.go:111 | After Daemon.containersReplica in Daemon.Containers |
| 50133-container_stop_state_sync | daemon/monitor.go:120 | Before Container.SetStop in Daemon.handleContainerExit |
| 50133-container_stop_state_sync | daemon/monitor.go:130 | After Container.CheckpointTo in Daemon.handleContainerExit |
| 50133-container_stop_state_sync | daemon/stop.go:100 | After container stopped in Daemon.containerStop |
| 50133-container_stop_state_sync | api/server/router/container/container_routes.go:77 | Start of containerRouter.getContainersJSON |
| 50133-container_stop_state_sync | daemon/list.go:111 | After Daemon.containersReplica in Daemon.Containers |
I understand that this kind of test is far from the best one, but it is the one I can do quickly.
Yes, feel free to use this PR for experimentation and to push commits for testing.
I know there have historically been quite a few changes around these things; unfortunately, some handling has become more complex than desirable (not surprising after 10+ years of development). Possibly some bits should be rewritten from scratch, but many corner cases may have found their way into the existing code, so changing it is risky.
Some of the refactoring resulted in the "in memory" store, which was added to prevent some gnarly situations with locking (which could cause, e.g., "docker ps" to get blocked by "docker info" calls), so we need to take care with such cases.
(marked as unresolved to keep the discussion visible)
Force-pushed bdc5cac to b71bac4
Force-pushed 7c6fcac to e7ba4cb
Force-pushed e7ba4cb to 3d44846
Hi @thaJeztah, could you please take a look at this PR? Feel free to check #50136 (comment) for details about the root cause of the issue which this PR fixes (#50133). I'm not sure why some CI tests fail. If possible, I would appreciate it if someone could trigger them one more time. On my local environment (I use a development container), some unit and integration tests fail even for the master branch (and the results differ from run to run), so I can not rely on them for regression testing. Thank you.
Force-pushed 3d44846 to ed75954
```go
if status := <-ctr.Wait(subCtx, containertypes.WaitConditionNotRunning); status.Err() == nil {
	// Ensure container status changes are committed by handler of container exit before returning control to the caller
	ctr.Lock()
	defer ctr.Unlock()
```
Just a quick blurb - I started to dig into this PR again to check code-paths, but didn't finish, so let me jot down some thoughts I had while looking;
I started to wonder if there's a bug in ctr.Wait, or at least it looks odd that we have a function that returns a channel (likely with the purpose of handling synchronisation), but that contract appears to be "broken" (i.e., it returns before it's fully done).
Perhaps it's "ok" to use the ctr.Lock() -> Unlock() to work around that issue as a short-term solution; I'm mostly wondering whether we're still papering over the underlying issue, and whether this is something ideally fixed in Wait (or elsewhere).
Hi @thaJeztah,
I was thinking about the same, but I didn't find a better solution for container.State.Wait.
Ideally, all code which wants to access (e.g. read) the state of a container should use ctr.Lock() + ctr.Unlock(). It's a pretty clear requirement which was already met (even before this PR). The daemon.Daemon.containerStop method (in its current implementation) doesn't need to access the state of the container directly - all it needs is to wait for the container to be stopped, and this requirement was in place even before this PR (please note that with this PR we make the /containers/{id}/stop Docker Engine API slower by introducing additional synchronization - that's the reason I'm worried about regression).
IMHO, the issue is located not in container.State.Wait, but in the daemon.Daemon.Containers method and in the daemon.Daemon.containersReplica field (the "read-only view" used by the daemon.Daemon.Containers method). I understand the need for the daemon.Daemon.containersReplica field, but it looks like this read-only view introduced some cases which lack synchronization with the daemon.Daemon.containers field. #50133 is just one of these cases; others can exist in the /containers/{id}/kill and /containers/{id}/wait APIs (just replace /containers/{id}/stop with these APIs in #50133). Please take a look at testcontainers/testcontainers-go#3197, which uses the /containers/{id}/json API (which reads the state of the container from daemon.Daemon.containers and not from daemon.Daemon.containersReplica) - this workaround demonstrates that there is no issue (with container.State.Wait) when using the daemon.Daemon.containers field with ctr.Lock() + ctr.Unlock().
Maybe we should accept that any Docker Engine API which uses the daemon.Daemon.containersReplica field can return an outdated (delayed) state of containers, and highlight that in the API documentation (e.g. split the read-only API from the read-write API, then describe the synchronization between API calls). Otherwise, even without concurrent access to the Docker Engine API, we leave users of this API without an understanding of:
- Synchronization points (i.e. where slowness / lock can happen).
- Visibility of state changes.
An alternative solution (the way this PR goes, but only for the /containers/{id}/stop API endpoint) is to ensure that daemon.Daemon.containersReplica is updated before completion of any API call which modifies the state of container(s).
Thank you.
…es when container is stopped and before API response is sent (fix for moby#50133). Signed-off-by: Marat Abrarov <abrarov@gmail.com>
Force-pushed ed75954 to 8e6cd44
vvoland
left a comment
It's good to have a fix for that but... we should really take a closer look at that because this is horrifying 🙈 😅
Thanks for digging and investigating that!
thaJeztah
left a comment
LGTM, thanks!
We discussed this one in the maintainers call yesterday; it looks like the PR should be OK to bring in for now, but we should probably create a tracking ticket to dig deeper into the flow(s) and see where improvements / fixes can be applied.
```go
	// Ensure container status changes are committed by handler of container exit before returning control to the caller
	ctr.Lock()
	defer ctr.Unlock()
```
Looks like this caused a regression; see docker/compose#12950 (comment)
The ctr.Lock() makes this function wait for the container to die, but because it produces the stop event in a defer, the stop event now gets produced AFTER the die event;
Lines 74 to 78 in 8e6cd44
Looks like the defer was introduced by me in 952902e - but that change doesn't look critical; mostly cleanup; we can probably inline it.
daemon: ensure the state of a stopped container is visible to other queries when the container is stopped using the /containers/{id}/stop API, before the API response is sent back to the client. Fixes #50133.
- What I did
Made the state of a stopped container visible to other queries when the container is stopped using the /containers/{id}/stop Docker Engine API, before the response is sent back to the API client.
- How I did it
Wait until the state of the stopped container is committed to the daemon database (only when the container is stopped successfully), i.e. until the container-stop handler of the container monitor completes, before replying to the client of the /containers/{id}/stop Docker Engine API.
- How to verify it
Refer to "Reproduce" section of #50133.
- Human readable description for the release notes