-
Notifications
You must be signed in to change notification settings - Fork 3.8k
cri: Fix userns with container image VOLUME mounts that need copy #12218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @rata. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
87c5e5d to
4c86977
Compare
|
cc @halaney in case you are interested in this PR too :) |
422f27f to
f59f155
Compare
29a8e36 to
a3054ee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm following correctly I think this looks good!
The alpine container image already contains a "/run" directory. This
will force the code in WithVolumes() to copy its content to the new
volume created for the VOLUME directive. This copies the content as well
as the ownership.However, as we perform the mounts from the host POV without being inside
a userns, the idmap option will just shift the IDs in ways that will
screw up the ownerships when copied. We should only use the idmap option
when running the container inside a userns, so the ownerships are fine
(the userns will do a shift and the idmap another, to make it all seem
as if there was no UID/GID shift in the first place).
I think it would be useful to show the host view of /run/ before and after this change just for an easier picture. Maybe bonus points of how that looks at the temp mount with its idmap going on.
IIUC let's say the host has an empty host volume /run/ owned by 0, container has /run/foobar file which is owned by uid/gid 0/0. Let's say the {u,g}idmap is 0 on host to 65556 in container. Without this change we'd:
- Mount the temporary overlayfs with idmap'ing (resulting in the host seeing this mount's dirs/content now being owned by 65556)
- Copy that foobar file out of the overlayfs, preserving its permissions, to
/run/ - The result is the host's
/run/foobaris now owned by 65556, not 0 as expected. 0 is expected since as you mention the container's user namespace mapping will effectively undo the idmap mapping from its pov, resulting in 65556 being 0 in the actual container. When we actually do the final mount in the container proper, its also a idmap'ed bind mount onto the overlayfs right? So it will also be all messed up in the container's POV (it'll be shifted an extra 65556 resulting in user 65556 owning it within the container as well)?
Idk, after typing maybe there's not more info to add but this exercise at least convinced me of what's going on here :D
So either we need to idmap the volume too, or not do the idmap on the overlayfs. So your approach seems like the path of least work!
|
@halaney Thanks for the review, added the before/after to the PR description.
I think you had a typo here, it's 0 inside the container mapped to 65536 on the host (or whatever hostID). This is what the mapping is doing.
Not 100% sure what you mean its also idmapped bind mount onto the rootfs, but IIUC nope. There are no "nested" idmapping here, if that is what you are saying. The rootfs is overlay and is idmapped, this is a bind-mount and is idmapped, but those idmappings don't nest and no Also, overlayfs doesn't support to idmap the merged dir, what it supports is to idmap the lowerdir/upperdir. So, we can't really use AT_RECURSIVE either here.
No, the overlayfs always needs to be idmap. As I said, they are not nested, they don't interfere, nor idmap is recursive either. It could work to have "the right permissions" for the directory when you look only at this layer, but it doesn't when you look at the bigger picture. The hostUID is dynamic, you don't want to chown all the files every time you get a new mapping. You want pods to be able to enable/disable userns, but you can't unless you chown all the volumes with that solution, etc. The current design is simpler, we just idmap when running, all works as expected, no chown ever required. |
|
@rata me explaining myself below, but tldr we're saying the same thing I just wrote it confusingly earlier (and maybe still) below):
Nesting is not what I meant, your description matches what I was attempting (:P) to say. My bits about:
was trying to express that to fix the current problem either we don't want to idmap the overlayfs for this Here in i.e. before it was something like now things work since we don't shift them at the beginning. Alternatively, you could do the copy content with both locations idmapped and that would also work out consistently. Its the half doing of it that's the problem, and since there's no reason to do it in Sorry for confusion! this looks good to me, thanks for the extra details on the MR description, if I'm still not making sense just ignore me we're trying to say the same thing :D |
| // Always copy files without UID/GID transformations. | ||
| // The ID transformations are only needed when running inside a userns, that is not | ||
| // the case here. | ||
| mounts = mount.RemoveIDMapOption(mounts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mikebrow do we still need to support that volume defined in image?
Kubelet always recreates a container once the current one exits. Using an image volume only adds unnecessary I/O.
If there is no valid use case for it, we probably consider deprecating it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the data there is always lost in k8s containers, which misses the point of volumes. But it might be used just as a volume, as we create a bind mount in a regular fs usually (not overlay, or tmpfs as some emptyDir might do, etc.).
I'll be up to deprecate it in some way, if k8s exposes replacements (e.g. I think emptyDir might be possible without a tmpfs, but didn't check)
If a Dockerfile is using a `VOLUME` directive and the directory exists in the rootfs, like in this example: FROM docker.io/library/alpine:latest VOLUME [ "/run" ] The alpine container image already contains a "/run" directory. This will force the code in WithVolumes() to copy its content to the new volume created for the VOLUME directive. This copies the content as well as the ownership. However, as we perform the mounts from the host POV without being inside a userns, the idmap option will just shift the IDs in ways that will screw up the ownerships when copied. We should only use the idmap option when running the container inside a userns, so the ownerships are fine (the userns will do a shift and the idmap another, to make it all seem as if there was no UID/GID shift in the first place). This PR does just that, remove the idmap option from mounts so we copy the files without any ID transformations. It's simpler and easier to reason about if we just don't mount with the idmap option here: all files are copied just fine without ID transformations and ID transformation is applied via the idmap option at mount time when running the pod. Also, note that `VOLUME` directives that refer to directories that don't exist on the rootfs work fine (`VOLUME [ "/rata" ]` for example), as there is no copy done in that case so the permissions weren't changed. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
73d1cc3 to
d50a6e7
Compare
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
|
Does this need backporting? |
|
/cherry-pick release/2.1 release/2.0 |
|
@rata: only containerd org members may request cherry picks. If you are already part of the org, make sure to change your membership to public. Otherwise you can still do the cherry-pick manually. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
This logically reverts commit 2a53de8 but keeps 8a465db. Resolved in containerd/containerd#12218.
This PR fixes the "copy-up" used in
VOLUMEdirectives if the path in the rootfs has some contents AND the pod is using user-namespaces (it was working fine if the rootfs didn't have content for the volume path).The issue is that when doing the copy-up, we screw up the permissions in the userns case. This is explained in more detail in the commit msg (c&p below). Also, I've added an integration test that test the copy-up with userns. It's based in the existing copy-up test case, that is testing already a lot of things.
The copy-up integration test adapted for userns fails without this PR (and works with this, of course :)). I've also added to the test to check the file permissions, for completeness, although it wasn't needed to detect this bug.
Below the commit msg with the long explanation.
If a Dockerfile is using a
VOLUMEdirective and the directory existsin the rootfs, like in this example:
The alpine container image already contains a "/run" directory. This
will force the code in WithVolumes() to copy its content to the new
volume created for the VOLUME directive. This copies the content as well
as the ownership.
However, as we perform the mounts from the host POV without being inside
a userns, the idmap option will just shift the IDs in ways that will
screw up the ownerships when copied. We should only use the idmap option
when running the container inside a userns, so the ownerships are fine
(the userns will do a shift and the idmap another, to make it all seem
as if there was no UID/GID shift in the first place).
This PR does just that, remove the idmap option from mounts so we copy
the files without any ID transformations. It's simpler and easier to
reason about if we just don't mount with the idmap option here: all
files are copied just fine without ID transformations and ID
transformation is applied via the idmap option at mount time when
running the pod.
Also, note that
VOLUMEdirectives that refer to directories that don'texist on the rootfs work fine (
VOLUME [ "/rata" ]for example), asthere is no copy done in that case so the permissions weren't changed.
In case you are curious, if you run the repro on a k8s pod and also add
automountServiceAccountToken: false, the pod will start and you will see the wrong permissions.Before the PR:
Basically the UID/GIDs this is using are not mapped in the userns after all the ids transformations. That is why we can't create a directory there.
If the container userns is using the host UID 100000 mapped to 0 inside the container, then /run will be changed to be owned by that. But then that ID is not mapped inside the userns and it shows like this, mapped to the overflow uid.
After this PR, the permissions look as expected:
And we can create the mountpoint for the k8s service account token too.
The files copied and the permission for the /run directory itself is as UID 0 (as we don't use an idmap at all for the copy now), and then we use the idmap when accessing the volumes and all works as expected.
Fixes: #11852