
c8d: Don't remount if the container is already running #45383

Closed
rumpl wants to merge 1 commit into moby:master from rumpl:c8d-cp-mount

Conversation

Member

@rumpl rumpl commented Apr 24, 2023

- What I did

I'm opening this PR to start a discussion more than to get this code merged. The code works, but maybe someone has a better idea for this case. @AkihiroSuda @thaJeztah @vvoland

The openContainerFS function would need some more work; I'm not sure we need to unmount volumes here: https://github.com/moby/moby/blob/master/daemon/containerfs_linux.go#L69, since they are unmounted in the finalizer of the struct.

This change fixes double-mounting when running an exec or a cp on a running container with the containerd integration enabled. Mounting the same overlayfs layers twice results in an undefined state, and things start to randomly break with a "No such file or directory" error.

Graph drivers use reference counting and only mount things once; we can't do this with containerd because we don't know where it mounted the container rootfs. Instead of reference counting, we return the /proc/<pid>/root directory for exec and/or cp to read from.

- How I did it

I check the container state and set the container's BaseFS to /proc/<pid>/root if the container is running.
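The approach can be sketched as a minimal, self-contained Go program (hedged sketch only: `ContainerState` and `resolveRoot` are illustrative names, not moby's actual types or functions):

```go
package main

import "fmt"

// ContainerState is an illustrative stand-in for the daemon's container
// state; the real moby types differ.
type ContainerState struct {
	Running bool
	Pid     int
}

// resolveRoot sketches the fix: when the container is running, reuse its
// live rootfs through procfs instead of mounting the snapshot a second time.
func resolveRoot(state *ContainerState, mountedPath string) string {
	if state != nil && state.Running && state.Pid != 0 {
		return fmt.Sprintf("/proc/%d/root", state.Pid)
	}
	// Stopped container: mount the snapshot and use that path instead.
	return mountedPath
}

func main() {
	fmt.Println(resolveRoot(&ContainerState{Running: true, Pid: 42}, "/mnt/rootfs"))
	fmt.Println(resolveRoot(&ContainerState{}, "/mnt/rootfs"))
}
```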

- How to verify it

You can use this script:

#!/bin/sh
set -x

ID=$(docker run -d nginx)

docker exec $ID mount | wc -l

cat > test <<EOF
Hello cp
EOF

docker cp test $ID:/test

docker exec $ID touch /etc/default/test-file

docker exec $ID mount | wc -l

docker stop $ID

docker cp test $ID:/test2

docker start $ID

docker exec $ID mount | wc -l

docker exec $ID cat /test
docker exec $ID cat /test2

docker stop $ID
docker rm $ID

With the current master branch the output is:

./cp.sh 
++ docker run -d nginx
+ ID=88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9
+ docker exec 88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9 mount
+ wc -l
24
+ cat
+ docker cp test 88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9:/test
Successfully copied 2.05kB to 88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9:/test
+ docker exec 88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9 touch /etc/default/test-file
touch: cannot touch '/etc/default/test-file': No such file or directory
+ docker exec 88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9 mount
+ wc -l
24
+ docker stop 88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9
88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9
+ docker cp test 88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9:/test2
Successfully copied 2.05kB to 88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9:/test2
+ docker start 88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9
88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9
+ docker exec 88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9 mount
+ wc -l
24
+ docker exec 88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9 cat /test
Hello cp
+ docker exec 88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9 cat /test2
Hello cp
+ docker stop 88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9
88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9
+ docker rm 88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9
88d265e718b3adf0cc9e4c52e1bb363c95a27c91096ac9f1097927eb0772e0a9

Notice the line `touch: cannot touch '/etc/default/test-file': No such file or directory`: the rootfs is in an undefined state.

With this change the error goes away:

$ ./cp.sh
++ docker run -d nginx
+ ID=7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89
+ docker exec 7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89 mount
+ wc -l
24
+ cat
+ docker cp test 7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89:/test
Successfully copied 2.05kB to 7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89:/test
+ docker exec 7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89 touch /etc/default/test-file
+ docker exec 7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89 mount
+ wc -l
24
+ docker stop 7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89
7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89
+ docker cp test 7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89:/test2
Successfully copied 2.05kB to 7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89:/test2
+ docker start 7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89
7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89
+ docker exec 7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89 mount
+ wc -l
24
+ docker exec 7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89 cat /test
Hello cp
+ docker exec 7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89 cat /test2
Hello cp
+ docker stop 7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89
7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89
+ docker rm 7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89
7e1d0a21f5a743e7b3769bfff16705db6e3c306ba6c79427db7a637a7670ec89

You can also watch sudo dmesg -w in parallel; there shouldn't be any overlayfs warnings after this change.

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

@rumpl rumpl added the area/runtime and containerd-integration labels Apr 24, 2023
@rumpl rumpl added this to the 24.0.0 milestone Apr 24, 2023
Contributor

@vvoland vvoland left a comment


Looks good to me, I don't have a better idea for now, and this is already an improvement.

I was wondering if starting a stopped container or stopping a running container during the cp would cause any problems, but it seems that the stop/start doesn't happen until the cp actually finishes.

@vvoland vvoland added the kind/bugfix label Apr 24, 2023
This change fixes double-mounting when running an exec or a cp on a
running container with the containerd integration enabled. Sharing
layers with overlayfs results in an undefined state and things start to
randomly break with a "No such file or directory" error.

Graph drivers use reference counting and only mount things once, we
can't do this with containerd because we don't know where it mounted the
container rootfs. Instead of reference counting we return the
`/proc/<pid>/root` directory for exec and/or cp to read from.

Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>
@rumpl rumpl modified the milestones: 24.0.0, v-future Apr 24, 2023
@tianon
Member

tianon commented Apr 24, 2023

What if the container pid1 dies while the copy is in progress? 😬

(Thinking something like a docker run --name foo ... sleep 1 + docker cp ... foo:... on a huge file 🙈)

@rumpl
Member Author

rumpl commented May 3, 2023

Looks good to me, I don't have a better idea for now, and this is already an improvement.

I was wondering if starting a stopped container or stopping a running container during the cp would cause any problems, but it seems that the stop/start doesn't happen until the cp actually finishes.

The container is locked, so it won't exit until the cp finishes.

@rumpl rumpl requested a review from AkihiroSuda May 3, 2023 14:33
if err := mount.All(mounts, root); err != nil {
return fmt.Errorf("failed to mount %s: %w", root, err)
root := ""
if container.State != nil && !container.State.Running && container.Pid == 0 {
Member

Just a (theoretical?) exercise; looking at what this condition looks like for the "else" branch:

if container.State == nil || (container.State != nil && container.State.Running) || container.Pid != 0 {
	root = fmt.Sprintf("/proc/%d/root", container.Pid)
}

which can effectively be reduced to:

if container.Pid != 0 {
	root = fmt.Sprintf("/proc/%d/root", container.Pid)
}

But (see the first example) the container would also be treated as running if container.State == nil and container.Pid has any value; can those conditions happen? Should we handle them?

Member

^^ Actually:

  • container.Pid is container.State.Pid, so currently this would panic if container.State is nil
  • Given that Running and Paused are not mutually exclusive, which branch should Paused fall into?

Member

So perhaps we need an early return (error) at least if container.State is nil
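The suggested shape, with the nil check made an explicit error instead of a silent branch, might look roughly like this (a hedged sketch; `State` and `rootFor` are illustrative names, not the actual daemon code):

```go
package main

import (
	"errors"
	"fmt"
)

// State loosely mirrors the fields discussed in this thread; names are
// illustrative, not moby's actual types.
type State struct {
	Running bool
	Paused  bool // not mutually exclusive with Running, per the thread
	Pid     int
}

// rootFor sketches the reviewer's suggestion: fail fast on a nil state
// (instead of panicking on state.Pid), then branch on Pid alone.
func rootFor(state *State, mountedPath string) (string, error) {
	if state == nil {
		return "", errors.New("container has no state")
	}
	if state.Pid != 0 {
		// Running (or paused) container: reuse the live rootfs.
		return fmt.Sprintf("/proc/%d/root", state.Pid), nil
	}
	return mountedPath, nil
}

func main() {
	root, err := rootFor(&State{Running: true, Pid: 7}, "/mnt/rootfs")
	fmt.Println(root, err)
}
```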

if err := mount.UnmountAll(root, 0); err != nil {
return fmt.Errorf("failed to unmount %s: %w", root, err)
}
if strings.HasSuffix(root, rootfsMountSuffix) {
Member

In this case we don't unmount, but still reset container.BaseFS; should this be an error condition? (Should it be logged?)

done := make(chan error)
err = unshare.Go(unix.CLONE_NEWNS,
func() error {
if container.State != nil && container.State.Running && container.Pid != 0 {
Member

Same condition as mentioned earlier (just checking if ok)

@tianon
Member

tianon commented May 3, 2023

Verified this on 45f6f0e (from refs/pull/45383/merge, which thus includes 7995ffa from this PR), and it seems to work as expected:

/ # ( set -eux; time dd if=/dev/urandom of=/tmp/big bs=1G count=10; ls -lh /tmp/big; docker run -dit --name foo bash time sleep 1; date -Is; time docker cp /tmp/big foo:/tmp/; date -Is; docker logs -f foo; docker container inspect --format '{{ .State.StartedAt }}{{ "\n" }}{{ .State.FinishedAt }}' foo; docker start foo; docker exec foo ls -lh /tmp/big )
+ time dd 'if=/dev/urandom' 'of=/tmp/big' 'bs=1G' 'count=10'
10+0 records in
10+0 records out
real	0m 29.70s
user	0m 0.00s
sys	0m 29.67s
+ ls -lh /tmp/big
-rw-r--r--    1 root     root       10.0G May  3 19:00 /tmp/big
+ docker run -dit --name foo bash time sleep 1
417c4d15f4e4472f07fcf0d74fc93cd95ecd27cc42d3e9594b5582b860a2c4f2
+ date -Is
2023-05-03T19:00:03+00:00
+ time docker cp /tmp/big foo:/tmp/
Successfully copied 10.7GB to foo:/tmp/
real	0m 11.05s
user	0m 3.34s
sys	0m 6.41s
+ date -Is
2023-05-03T19:00:14+00:00
+ docker logs -f foo
real	0m 1.00s
user	0m 0.00s
sys	0m 0.00s
+ docker container inspect --format '{{ .State.StartedAt }}{{ "\n" }}{{ .State.FinishedAt }}' foo
2023-05-03T19:00:03.246792584Z
2023-05-03T19:00:04.249160906Z
+ docker start foo
foo
+ docker exec foo ls -lh /tmp/big
-rw-r--r--    1 root     root       10.0G May  3 19:00 /tmp/big

@rumpl rumpl requested a review from corhere May 3, 2023 21:14
@cpuguy83
Member

cpuguy83 commented May 3, 2023

Hey, this fixes an issue I'm seeing in my tests where I start a container, cp some stuff to it, then exec in and try to do things.
The upper layer is totally messed up.

@cpuguy83
Member

cpuguy83 commented May 3, 2023

Maybe we can have a mount service that ref-counts all this.
We could possibly fix some nested mounting issues we have while doing this as well.

@cpuguy83
Member

cpuguy83 commented May 3, 2023

s/mount service/rootfs service/

@corhere
Contributor

corhere commented May 8, 2023

/proc/<pid>/root is controlled by the container—it can be manipulated with the chroot and pivot_root syscalls—so I'm uneasy about relying on it. Some daemons chroot themselves for security reasons, which could result in docker cp paths while the container is running not lining up with paths when the container is stopped. And I worry that some enterprising security researcher could find a way to chain this with some other weakness in container sandboxing to trick the daemon into reading or writing outside the container rootfs when an unsuspecting administrator runs a docker cp command.

Graph drivers use reference counting and only mount things once; we can't do this with containerd because we don't know where it mounted the container rootfs.

We do know where the container rootfs is mounted because dockerd is doing the mounting, even when daemon.UsesSnapshotter().

  • (*Daemon).ContainerCreate
  • (*Daemon).containerCreate
  • (*Daemon).createContainerOSSpecificSettings
  • (*Daemon).Mount
  • (*containerd.ImageService).Mount
  • mount.All

The daemon certainly could use reference counting and only mount things once in a snapshotter world, same as in a graphdriver world. The only difference is that containerd.ImageService doesn't have the luxury of deferring the reference-counting responsibilities to the graph driver implementations.
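A ref-counted mount wrapper along those lines could be sketched like this (illustrative only; `refCountedMounter` is a made-up name, and the callbacks stand in for something like containerd's mount.All / mount.UnmountAll):

```go
package main

import (
	"fmt"
	"sync"
)

// refCountedMounter mounts a target on the first reference and unmounts it
// on the last, the way graph drivers do internally. This is a sketch of the
// idea discussed in the thread, not daemon code.
type refCountedMounter struct {
	mu      sync.Mutex
	counts  map[string]int
	mount   func(target string) error // stand-in for mount.All
	unmount func(target string) error // stand-in for mount.UnmountAll
}

func (m *refCountedMounter) Acquire(target string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.counts[target] == 0 {
		if err := m.mount(target); err != nil {
			return err
		}
	}
	m.counts[target]++
	return nil
}

func (m *refCountedMounter) Release(target string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.counts[target] == 0 {
		return fmt.Errorf("release without acquire: %s", target)
	}
	m.counts[target]--
	if m.counts[target] == 0 {
		return m.unmount(target)
	}
	return nil
}

func main() {
	mounted := 0
	m := &refCountedMounter{
		counts:  map[string]int{},
		mount:   func(string) error { mounted++; return nil },
		unmount: func(string) error { mounted--; return nil },
	}
	_ = m.Acquire("/run/rootfs") // first reference: mounts
	_ = m.Acquire("/run/rootfs") // second reference: only counts
	_ = m.Release("/run/rootfs")
	_ = m.Release("/run/rootfs") // last reference: unmounts
	fmt.Println(mounted)
}
```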

The openContainerFS function would need some more work; I'm not sure we need to unmount volumes here: https://github.com/moby/moby/blob/master/daemon/containerfs_linux.go#L69, since they are unmounted in the finalizer of the struct.

The finalizer is only set on the containerFSView struct when the containerFSView is constructed, which only happens after all the fallible operations succeed. That deferred call is needed to decrement the reference counts of the volumes in the event that one of the aforementioned operations fails, before the finalizer is set. And finalizers should not be relied upon as the runtime makes no guarantees as to when (or whether) the finalizer is invoked. I only put the finalizer in there as a last resort to make it less likely for the mounts to leak if there is a bug which results in the containerFSView becoming garbage without having been explicitly closed.
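The pattern described here (explicit Close as the real cleanup path, with the finalizer only as a last-resort safety net) can be sketched as follows; `fsView` is an illustrative type, not moby's actual containerFSView:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// fsView sketches the cleanup pattern: callers must call Close explicitly;
// the finalizer only catches views that become garbage without being closed,
// and the runtime gives no guarantee about when (or whether) it runs.
type fsView struct {
	once    sync.Once
	cleanup func()
}

func newFSView(cleanup func()) *fsView {
	v := &fsView{cleanup: cleanup}
	// Safety net only: do not rely on this for correctness.
	runtime.SetFinalizer(v, (*fsView).Close)
	return v
}

// Close runs the cleanup exactly once and clears the finalizer.
func (v *fsView) Close() {
	v.once.Do(v.cleanup)
	runtime.SetFinalizer(v, nil)
}

func main() {
	closed := false
	v := newFSView(func() { closed = true })
	v.Close()
	v.Close() // idempotent: cleanup ran only once
	fmt.Println(closed)
}
```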

@thaJeztah
Member

thaJeztah commented Jun 8, 2023


Labels

area/runtime · containerd-integration · kind/bugfix · status/2-code-review

6 participants