Skip to content

Conversation

@rata
Copy link
Contributor

@rata rata commented Aug 19, 2025

This PR fixes the "copy-up" used in VOLUME directives if the path in the rootfs has some contents AND the pod is using user-namespaces (it was working fine if the rootfs didn't have content for the volume path).

The issue is that when doing the copy-up, we screw up the permissions in the userns case. This is explained in more detail in the commit msg (c&p below). Also, I've added an integration test that test the copy-up with userns. It's based in the existing copy-up test case, that is testing already a lot of things.

The copy-up integration test adapted for userns fails without this PR (and works with this, of course :)). I've also added to the test to check the file permissions, for completeness, although it wasn't needed to detect this bug.

Below the commit msg with the long explanation.


If a Dockerfile is using a VOLUME directive and the directory exists
in the rootfs, like in this example:

        FROM docker.io/library/alpine:latest
        VOLUME [ "/run" ]

The alpine container image already contains a "/run" directory. This
will force the code in WithVolumes() to copy its content to the new
volume created for the VOLUME directive. This copies the content as well
as the ownership.

However, as we perform the mounts from the host POV without being inside
a userns, the idmap option will just shift the IDs in ways that will
screw up the ownerships when copied. We should only use the idmap option
when running the container inside a userns, so the ownerships are fine
(the userns will do a shift and the idmap another, to make it all seem
as if there was no UID/GID shift in the first place).

This PR does just that, remove the idmap option from mounts so we copy
the files without any ID transformations. It's simpler and easier to
reason about if we just don't mount with the idmap option here: all
files are copied just fine without ID transformations and ID
transformation is applied via the idmap option at mount time when
running the pod.

Also, note that VOLUME directives that refer to directories that don't
exist on the rootfs work fine (VOLUME [ "/rata" ] for example), as
there is no copy done in that case so the permissions weren't changed.


In case you are curious, if you run the repro on a k8s pod and also add automountServiceAccountToken: false, the pod will start and you will see the wrong permissions.

Before the PR:

/ # ls -ldn /run
drwxr-xr-x    3 65534    65534         4096 Aug 22 12:46 /run
/ # ls -ln /run
total 4
drwxr-xr-x    2 65534    65534         4096 Jul 15 10:42 lock

Basically the UID/GIDs this is using are not mapped in the userns after all the ids transformations. That is why we can't create a directory there.

If the container userns is using the host UID 100000 mapped to 0 inside the container, then /run will be changed to be owned by that. But then that ID is not mapped inside the userns and it shows like this, mapped to the overflow uid.

After this PR, the permissions look as expected:

/ # ls -ld /run
drwxr-xr-x    3 root     root          4096 Aug 22 12:46 /run
/ # ls -ldn /run
drwxr-xr-x    3 0        0             4096 Aug 22 12:46 /run
/ # ls -ldn /run
drwxr-xr-x    3 0        0             4096 Aug 22 12:46 /run
/ # ls -ln /run
total 4
drwxr-xr-x    2 0        0             4096 Jul 15 10:42 lock

And we can create the mountpoint for the k8s service account token too.

The files copied and the permission for the /run directory itself is as UID 0 (as we don't use an idmap at all for the copy now), and then we use the idmap when accessing the volumes and all works as expected.

Fixes: #11852

@k8s-ci-robot
Copy link

Hi @rata. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@dosubot dosubot bot added the area/cri Container Runtime Interface (CRI) label Aug 19, 2025
@rata rata force-pushed the issue-11852 branch 2 times, most recently from 87c5e5d to 4c86977 Compare August 20, 2025 10:47
@rata rata changed the title WIP: cri: Fix userns with Dockerfile VOLUME mounts that need copy cri: Fix userns with Dockerfile VOLUME mounts that need copy Aug 20, 2025
@rata
Copy link
Contributor Author

rata commented Aug 20, 2025

cc @halaney in case you are interested in this PR too :)

@rata rata force-pushed the issue-11852 branch 3 times, most recently from 422f27f to f59f155 Compare August 21, 2025 11:08
@rata rata changed the title cri: Fix userns with Dockerfile VOLUME mounts that need copy cri: Fix userns with container image VOLUME mounts that need copy Aug 21, 2025
@rata rata force-pushed the issue-11852 branch 3 times, most recently from 29a8e36 to a3054ee Compare August 21, 2025 14:19
Copy link
Contributor

@halaney halaney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm following correctly I think this looks good!

The alpine container image already contains a "/run" directory. This
will force the code in WithVolumes() to copy its content to the new
volume created for the VOLUME directive. This copies the content as well
as the ownership.

However, as we perform the mounts from the host POV without being inside
a userns, the idmap option will just shift the IDs in ways that will
screw up the ownerships when copied. We should only use the idmap option
when running the container inside a userns, so the ownerships are fine
(the userns will do a shift and the idmap another, to make it all seem
as if there was no UID/GID shift in the first place).

I think it would be useful to show the host view of /run/ before and after this change just for an easier picture. Maybe bonus points of how that looks at the temp mount with its idmap going on.

IIUC let's say the host has an empty host volume /run/ owned by 0, container has /run/foobar file which is owned by uid/gid 0/0. Let's say the {u,g}idmap is 0 on host to 65556 in container. Without this change we'd:

  1. Mount the temporary overlayfs with idmap'ing (resulting in the host seeing this mount's dirs/content now being owned by 65556)
  2. Copy that foobar file out of the overlayfs, preserving its permissions, to /run/
  3. The result is the host's /run/foobar is now owned by 65556, not 0 as expected. 0 is expected since as you mention the container's user namespace mapping will effectively undo the idmap mapping from its pov, resulting in 65556 being 0 in the actual container. When we actually do the final mount in the container proper, its also a idmap'ed bind mount onto the overlayfs right? So it will also be all messed up in the container's POV (it'll be shifted an extra 65556 resulting in user 65556 owning it within the container as well)?

Idk, after typing maybe there's not more info to add but this exercise at least convinced me of what's going on here :D
So either we need to idmap the volume too, or not do the idmap on the overlayfs. So your approach seems like the path of least work!

@rata
Copy link
Contributor Author

rata commented Aug 22, 2025

@halaney Thanks for the review, added the before/after to the PR description.

IIUC let's say the host has an empty host volume /run/ owned by 0, container has /run/foobar file which is owned by uid/gid 0/0. Let's say the {u,g}idmap is 0 on host to 65556 in container. Without this change we'd:

I think you had a typo here, it's 0 inside the container mapped to 65536 on the host (or whatever hostID). This is what the mapping is doing.

  1. The result is the host's /run/foobar is now owned by 65556, not 0 as expected. 0 is expected since as you mention the container's user namespace mapping will effectively undo the idmap mapping from its pov, resulting in 65556 being 0 in the actual container. When we actually do the final mount in the container proper, its also a idmap'ed bind mount onto the overlayfs right? So it will also be all messed up in the container's POV (it'll be shifted an extra 65556 resulting in user 65556 owning it within the container as well)?

Not 100% sure what you mean its also idmapped bind mount onto the rootfs, but IIUC nope. There are no "nested" idmapping here, if that is what you are saying. The rootfs is overlay and is idmapped, this is a bind-mount and is idmapped, but those idmappings don't nest and no AT_RECURSIVE is used either.

Also, overlayfs doesn't support to idmap the merged dir, what it supports is to idmap the lowerdir/upperdir. So, we can't really use AT_RECURSIVE either here.

So either we need to idmap the volume too, or not do the idmap on the overlayfs. So your approach seems like the path of least work!

No, the overlayfs always needs to be idmap. As I said, they are not nested, they don't interfere, nor idmap is recursive either.

It could work to have "the right permissions" for the directory when you look only at this layer, but it doesn't when you look at the bigger picture. The hostUID is dynamic, you don't want to chown all the files every time you get a new mapping. You want pods to be able to enable/disable userns, but you can't unless you chown all the volumes with that solution, etc.

The current design is simpler, we just idmap when running, all works as expected, no chown ever required.
This is not done only in this part (for image volume directives), but all the way in all volume files (even configmaps, etc.).

@halaney
Copy link
Contributor

halaney commented Aug 22, 2025

@rata me explaining myself below, but tldr we're saying the same thing I just wrote it confusingly earlier (and maybe still) below):

Not 100% sure what you mean its also idmapped bind mount onto the rootfs, but IIUC nope. There are no "nested" idmapping here, if that is what you are saying. The rootfs is overlay and is idmapped, this is a bind-mount and is idmapped, but those idmappings don't nest and no AT_RECURSIVE is used either.

Nesting is not what I meant, your description matches what I was attempting (:P) to say. My bits about:

So either we need to idmap the volume too, or not do the idmap on the overlayfs. So your approach seems like the path of least work!

was trying to express that to fix the current problem either we don't want to idmap the overlayfs for this WithVolume operation (what you opted for), or we need to idmap mount both the overlayfs (really its lower dirs but just using that term for the whole thing) and the volume, copying between those two idmap'ed views to keep the users consistent. i.e. both the source and destination need to be idmapped or neither should be idmapped, not mixed.

Here in WithVolume we can just skip it as it provides no benefit, later when the container runs we need to idmap both because the from the host's pov the container is running as user 100000 in example below and needs to be given a view of the mounts that's accessible to user 100000 via the idmaps.

i.e. before it was something like

idmap mount 0 -> 100000 of overlayfs at /tmp/idmapped-overlayfs
cp -R /tmp/idmapped-overlayfs/<volume>/ /<volume>/
# this results in /<volume>/ containing files owned by 100000 instead 
# of 0 (assuming everything is originally owned by 0 before idmaps)
...
# make real container rootfs mounts:
idmap mount 0 -> 100000 of overlayfs at /var/lib/container/.../rootfs
idmap mount 0 -> 100000 of <volume> at /var/lib/container/.../rootfs/<volume>
# this results in  /var/lib/container/.../rootfs/<volume> contents being owned 
# by 200000 instead of intended 100000 and within the container this would look
# like user 100000 owns it instead of 0 as expected (or nobody user if 100000 is
# out of uidmap range) due to user namespace mapping

now things work since we don't shift them at the beginning. Alternatively, you could do the copy content with both locations idmapped and that would also work out consistently. Its the half doing of it that's the problem, and since there's no reason to do it in WithVolumes this PR the best solution to the problem.

Sorry for confusion! this looks good to me, thanks for the extra details on the MR description, if I'm still not making sense just ignore me we're trying to say the same thing :D

// Always copy files without UID/GID transformations.
// The ID transformations are only needed when running inside a userns, that is not
// the case here.
mounts = mount.RemoveIDMapOption(mounts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikebrow do we still need to support that volume defined in image?
Kubelet always recreates a container once the current one exits. Using an image volume only adds unnecessary I/O.
If there is no valid use case for it, we probably consider deprecating it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the data there is always lost in k8s containers, which misses the point of volumes. But it might be used just as a volume, as we create a bind mount in a regular fs usually (not overlay, or tmpfs as some emptyDir might do, etc.).

I'll be up to deprecate it in some way, if k8s exposes replacements (e.g. I think emptyDir might be possible without a tmpfs, but didn't check)

@github-project-automation github-project-automation bot moved this from Needs Triage to Review In Progress in Pull Request Review Aug 24, 2025
If a Dockerfile is using a `VOLUME` directive and the directory exists
in the rootfs, like in this example:

	FROM docker.io/library/alpine:latest
	VOLUME [ "/run" ]

The alpine container image already contains a "/run" directory. This
will force the code in WithVolumes() to copy its  content to the new
volume created for the VOLUME directive. This copies the content as well
as the ownership.

However, as we perform the mounts from the host POV without being inside
a userns, the idmap option will just shift the IDs in ways that will
screw up the ownerships when copied. We should only use the idmap option
when running the container inside a userns, so the ownerships are fine
(the userns will do a shift and the idmap another, to make it all seem
as if there was no UID/GID shift in the first place).

This PR does just that, remove the idmap option from mounts so we copy
the files without any ID transformations. It's simpler and easier to
reason about if we just don't mount with the idmap option here: all
files are copied just fine without ID transformations and ID
transformation is applied via the idmap option at mount time when
running the pod.

Also, note that `VOLUME` directives that refer to directories that don't
exist on the rootfs work fine (`VOLUME [ "/rata" ]` for example), as
there is no copy done in that case so the permissions weren't changed.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
@rata rata force-pushed the issue-11852 branch 2 times, most recently from 73d1cc3 to d50a6e7 Compare August 26, 2025 13:12
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
@fuweid fuweid added this pull request to the merge queue Aug 26, 2025
@fuweid fuweid added cherry-pick/2.0.x Change to be cherry picked to release/2.0 branch cherry-pick/2.1.x Change to be cherry picked to release/2.1 branch labels Aug 26, 2025
Merged via the queue into containerd:main with commit 4312e07 Aug 26, 2025
94 of 95 checks passed
@github-project-automation github-project-automation bot moved this from Review In Progress to Done in Pull Request Review Aug 26, 2025
@brandond
Copy link
Contributor

Does this need backporting?

@rata rata deleted the issue-11852 branch August 27, 2025 09:29
@rata
Copy link
Contributor Author

rata commented Aug 27, 2025

/cherry-pick release/2.1 release/2.0

@k8s-infra-cherrypick-robot

@rata: only containerd org members may request cherry picks. If you are already part of the org, make sure to change your membership to public. Otherwise you can still do the cherry-pick manually.

Details

In response to this:

/cherry-pick release/2.1 release/2.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@rata
Copy link
Contributor Author

rata commented Aug 27, 2025

Opened backport PRs manually: #12241 and #12242

@fuweid fuweid added cherry-picked/2.0.x PR commits are cherry picked into the release/2.0 branch cherry-picked/2.1.x PR commits are cherry picked into the release/2.1 branch and removed cherry-pick/2.0.x Change to be cherry picked to release/2.0 branch cherry-pick/2.1.x Change to be cherry picked to release/2.1 branch labels Aug 27, 2025
adelton added a commit to freeipa/freeipa-container that referenced this pull request Nov 6, 2025
This logically reverts commit 2a53de8
but keeps 8a465db.

Resolved in containerd/containerd#12218.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cri Container Runtime Interface (CRI) cherry-picked/2.0.x PR commits are cherry picked into the release/2.0 branch cherry-picked/2.1.x PR commits are cherry picked into the release/2.1 branch needs-ok-to-test size/L

Projects

Archived in project

7 participants