cri: Fix userns with container image VOLUME mounts that need copy #12218

rata · 2025-08-19T15:30:52Z

This PR fixes the "copy-up" used in VOLUME directives if the path in the rootfs has some contents AND the pod is using user-namespaces (it was working fine if the rootfs didn't have content for the volume path).

The issue is that when doing the copy-up, we screw up the permissions in the userns case. This is explained in more detail in the commit msg (c&p below). Also, I've added an integration test that test the copy-up with userns. It's based in the existing copy-up test case, that is testing already a lot of things.

The copy-up integration test adapted for userns fails without this PR (and works with this, of course :)). I've also added to the test to check the file permissions, for completeness, although it wasn't needed to detect this bug.

Below the commit msg with the long explanation.

If a Dockerfile is using a VOLUME directive and the directory exists
in the rootfs, like in this example:

        FROM docker.io/library/alpine:latest
        VOLUME [ "/run" ]

The alpine container image already contains a "/run" directory. This
will force the code in WithVolumes() to copy its content to the new
volume created for the VOLUME directive. This copies the content as well
as the ownership.

However, as we perform the mounts from the host POV without being inside
a userns, the idmap option will just shift the IDs in ways that will
screw up the ownerships when copied. We should only use the idmap option
when running the container inside a userns, so the ownerships are fine
(the userns will do a shift and the idmap another, to make it all seem
as if there was no UID/GID shift in the first place).

This PR does just that, remove the idmap option from mounts so we copy
the files without any ID transformations. It's simpler and easier to
reason about if we just don't mount with the idmap option here: all
files are copied just fine without ID transformations and ID
transformation is applied via the idmap option at mount time when
running the pod.

Also, note that VOLUME directives that refer to directories that don't
exist on the rootfs work fine (VOLUME [ "/rata" ] for example), as
there is no copy done in that case so the permissions weren't changed.

In case you are curious, if you run the repro on a k8s pod and also add automountServiceAccountToken: false, the pod will start and you will see the wrong permissions.

Before the PR:

/ # ls -ldn /run
drwxr-xr-x    3 65534    65534         4096 Aug 22 12:46 /run
/ # ls -ln /run
total 4
drwxr-xr-x    2 65534    65534         4096 Jul 15 10:42 lock

Basically the UID/GIDs this is using are not mapped in the userns after all the ids transformations. That is why we can't create a directory there.

If the container userns is using the host UID 100000 mapped to 0 inside the container, then /run will be changed to be owned by that. But then that ID is not mapped inside the userns and it shows like this, mapped to the overflow uid.

After this PR, the permissions look as expected:

/ # ls -ld /run
drwxr-xr-x    3 root     root          4096 Aug 22 12:46 /run
/ # ls -ldn /run
drwxr-xr-x    3 0        0             4096 Aug 22 12:46 /run
/ # ls -ldn /run
drwxr-xr-x    3 0        0             4096 Aug 22 12:46 /run
/ # ls -ln /run
total 4
drwxr-xr-x    2 0        0             4096 Jul 15 10:42 lock

And we can create the mountpoint for the k8s service account token too.

The files copied and the permission for the /run directory itself is as UID 0 (as we don't use an idmap at all for the copy now), and then we use the idmap when accessing the volumes and all works as expected.

Fixes: #11852

k8s-ci-robot · 2025-08-19T15:31:03Z

Hi @rata. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

rata · 2025-08-20T10:59:57Z

cc @halaney in case you are interested in this PR too :)

halaney

If I'm following correctly I think this looks good!

The alpine container image already contains a "/run" directory. This
will force the code in WithVolumes() to copy its content to the new
volume created for the VOLUME directive. This copies the content as well
as the ownership.

However, as we perform the mounts from the host POV without being inside
a userns, the idmap option will just shift the IDs in ways that will
screw up the ownerships when copied. We should only use the idmap option
when running the container inside a userns, so the ownerships are fine
(the userns will do a shift and the idmap another, to make it all seem
as if there was no UID/GID shift in the first place).

I think it would be useful to show the host view of /run/ before and after this change just for an easier picture. Maybe bonus points of how that looks at the temp mount with its idmap going on.

IIUC let's say the host has an empty host volume /run/ owned by 0, container has /run/foobar file which is owned by uid/gid 0/0. Let's say the {u,g}idmap is 0 on host to 65556 in container. Without this change we'd:

Mount the temporary overlayfs with idmap'ing (resulting in the host seeing this mount's dirs/content now being owned by 65556)
Copy that foobar file out of the overlayfs, preserving its permissions, to /run/
The result is the host's /run/foobar is now owned by 65556, not 0 as expected. 0 is expected since as you mention the container's user namespace mapping will effectively undo the idmap mapping from its pov, resulting in 65556 being 0 in the actual container. When we actually do the final mount in the container proper, its also a idmap'ed bind mount onto the overlayfs right? So it will also be all messed up in the container's POV (it'll be shifted an extra 65556 resulting in user 65556 owning it within the container as well)?

Idk, after typing maybe there's not more info to add but this exercise at least convinced me of what's going on here :D
So either we need to idmap the volume too, or not do the idmap on the overlayfs. So your approach seems like the path of least work!

integration/pod_userns_linux_test.go

rata · 2025-08-22T12:56:19Z

@halaney Thanks for the review, added the before/after to the PR description.

IIUC let's say the host has an empty host volume /run/ owned by 0, container has /run/foobar file which is owned by uid/gid 0/0. Let's say the {u,g}idmap is 0 on host to 65556 in container. Without this change we'd:

I think you had a typo here, it's 0 inside the container mapped to 65536 on the host (or whatever hostID). This is what the mapping is doing.

The result is the host's /run/foobar is now owned by 65556, not 0 as expected. 0 is expected since as you mention the container's user namespace mapping will effectively undo the idmap mapping from its pov, resulting in 65556 being 0 in the actual container. When we actually do the final mount in the container proper, its also a idmap'ed bind mount onto the overlayfs right? So it will also be all messed up in the container's POV (it'll be shifted an extra 65556 resulting in user 65556 owning it within the container as well)?

Not 100% sure what you mean its also idmapped bind mount onto the rootfs, but IIUC nope. There are no "nested" idmapping here, if that is what you are saying. The rootfs is overlay and is idmapped, this is a bind-mount and is idmapped, but those idmappings don't nest and no AT_RECURSIVE is used either.

Also, overlayfs doesn't support to idmap the merged dir, what it supports is to idmap the lowerdir/upperdir. So, we can't really use AT_RECURSIVE either here.

So either we need to idmap the volume too, or not do the idmap on the overlayfs. So your approach seems like the path of least work!

No, the overlayfs always needs to be idmap. As I said, they are not nested, they don't interfere, nor idmap is recursive either.

It could work to have "the right permissions" for the directory when you look only at this layer, but it doesn't when you look at the bigger picture. The hostUID is dynamic, you don't want to chown all the files every time you get a new mapping. You want pods to be able to enable/disable userns, but you can't unless you chown all the volumes with that solution, etc.

The current design is simpler, we just idmap when running, all works as expected, no chown ever required.
This is not done only in this part (for image volume directives), but all the way in all volume files (even configmaps, etc.).

halaney · 2025-08-22T18:12:44Z

@rata me explaining myself below, but tldr we're saying the same thing I just wrote it confusingly earlier (and maybe still) below):

Not 100% sure what you mean its also idmapped bind mount onto the rootfs, but IIUC nope. There are no "nested" idmapping here, if that is what you are saying. The rootfs is overlay and is idmapped, this is a bind-mount and is idmapped, but those idmappings don't nest and no AT_RECURSIVE is used either.

Nesting is not what I meant, your description matches what I was attempting (:P) to say. My bits about:

So either we need to idmap the volume too, or not do the idmap on the overlayfs. So your approach seems like the path of least work!

was trying to express that to fix the current problem either we don't want to idmap the overlayfs for this WithVolume operation (what you opted for), or we need to idmap mount both the overlayfs (really its lower dirs but just using that term for the whole thing) and the volume, copying between those two idmap'ed views to keep the users consistent. i.e. both the source and destination need to be idmapped or neither should be idmapped, not mixed.

Here in WithVolume we can just skip it as it provides no benefit, later when the container runs we need to idmap both because the from the host's pov the container is running as user 100000 in example below and needs to be given a view of the mounts that's accessible to user 100000 via the idmaps.

i.e. before it was something like

idmap mount 0 -> 100000 of overlayfs at /tmp/idmapped-overlayfs
cp -R /tmp/idmapped-overlayfs/<volume>/ /<volume>/
# this results in /<volume>/ containing files owned by 100000 instead 
# of 0 (assuming everything is originally owned by 0 before idmaps)
...
# make real container rootfs mounts:
idmap mount 0 -> 100000 of overlayfs at /var/lib/container/.../rootfs
idmap mount 0 -> 100000 of <volume> at /var/lib/container/.../rootfs/<volume>
# this results in  /var/lib/container/.../rootfs/<volume> contents being owned 
# by 200000 instead of intended 100000 and within the container this would look
# like user 100000 owns it instead of 0 as expected (or nobody user if 100000 is
# out of uidmap range) due to user namespace mapping

now things work since we don't shift them at the beginning. Alternatively, you could do the copy content with both locations idmapped and that would also work out consistently. Its the half doing of it that's the problem, and since there's no reason to do it in WithVolumes this PR the best solution to the problem.

Sorry for confusion! this looks good to me, thanks for the extra details on the MR description, if I'm still not making sense just ignore me we're trying to say the same thing :D

fuweid · 2025-08-24T19:45:04Z

internal/cri/opts/container.go

+		// Always copy files without UID/GID transformations.
+		// The ID transformations are only needed when running inside a userns, that is not
+		// the case here.
+		mounts = mount.RemoveIDMapOption(mounts)


@mikebrow do we still need to support that volume defined in image?
Kubelet always recreates a container once the current one exits. Using an image volume only adds unnecessary I/O.
If there is no valid use case for it, we probably consider deprecating it.

I think the data there is always lost in k8s containers, which misses the point of volumes. But it might be used just as a volume, as we create a bind mount in a regular fs usually (not overlay, or tmpfs as some emptyDir might do, etc.).

I'll be up to deprecate it in some way, if k8s exposes replacements (e.g. I think emptyDir might be possible without a tmpfs, but didn't check)

If a Dockerfile is using a `VOLUME` directive and the directory exists in the rootfs, like in this example: FROM docker.io/library/alpine:latest VOLUME [ "/run" ] The alpine container image already contains a "/run" directory. This will force the code in WithVolumes() to copy its content to the new volume created for the VOLUME directive. This copies the content as well as the ownership. However, as we perform the mounts from the host POV without being inside a userns, the idmap option will just shift the IDs in ways that will screw up the ownerships when copied. We should only use the idmap option when running the container inside a userns, so the ownerships are fine (the userns will do a shift and the idmap another, to make it all seem as if there was no UID/GID shift in the first place). This PR does just that, remove the idmap option from mounts so we copy the files without any ID transformations. It's simpler and easier to reason about if we just don't mount with the idmap option here: all files are copied just fine without ID transformations and ID transformation is applied via the idmap option at mount time when running the pod. Also, note that `VOLUME` directives that refer to directories that don't exist on the rootfs work fine (`VOLUME [ "/rata" ]` for example), as there is no copy done in that case so the permissions weren't changed. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>

brandond · 2025-08-27T01:25:53Z

Does this need backporting?

rata · 2025-08-27T09:31:17Z

/cherry-pick release/2.1 release/2.0

k8s-infra-cherrypick-robot · 2025-08-27T09:31:27Z

@rata: only containerd org members may request cherry picks. If you are already part of the org, make sure to change your membership to public. Otherwise you can still do the cherry-pick manually.

Details

In response to this:

/cherry-pick release/2.1 release/2.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

rata · 2025-08-27T10:04:28Z

Opened backport PRs manually: #12241 and #12242

This logically reverts commit 2a53de8 but keeps 8a465db. Resolved in containerd/containerd#12218.

github-project-automation bot added this to Pull Request Review Aug 19, 2025

k8s-ci-robot added the do-not-merge/work-in-progress label Aug 19, 2025

github-project-automation bot moved this to Needs Triage in Pull Request Review Aug 19, 2025

k8s-ci-robot added size/L needs-ok-to-test labels Aug 19, 2025

dosubot bot added the area/cri Container Runtime Interface (CRI) label Aug 19, 2025

rata force-pushed the issue-11852 branch 2 times, most recently from 87c5e5d to 4c86977 Compare August 20, 2025 10:47

rata changed the title ~~WIP: cri: Fix userns with Dockerfile VOLUME mounts that need copy~~ cri: Fix userns with Dockerfile VOLUME mounts that need copy Aug 20, 2025

k8s-ci-robot removed the do-not-merge/work-in-progress label Aug 20, 2025

rata force-pushed the issue-11852 branch from 4c86977 to c5b3818 Compare August 20, 2025 10:58

rata force-pushed the issue-11852 branch 3 times, most recently from 422f27f to f59f155 Compare August 21, 2025 11:08

rata changed the title ~~cri: Fix userns with Dockerfile VOLUME mounts that need copy~~ cri: Fix userns with container image VOLUME mounts that need copy Aug 21, 2025

rata force-pushed the issue-11852 branch 3 times, most recently from 29a8e36 to a3054ee Compare August 21, 2025 14:19

halaney approved these changes Aug 21, 2025

View reviewed changes

integration/pod_userns_linux_test.go Outdated Show resolved Hide resolved

estesp approved these changes Aug 21, 2025

View reviewed changes

fuweid reviewed Aug 24, 2025

View reviewed changes

fuweid approved these changes Aug 24, 2025

View reviewed changes

github-project-automation bot moved this from Needs Triage to Review In Progress in Pull Request Review Aug 24, 2025

rata force-pushed the issue-11852 branch from a3054ee to e7722ef Compare August 25, 2025 14:09

rata force-pushed the issue-11852 branch 2 times, most recently from 73d1cc3 to d50a6e7 Compare August 26, 2025 13:12

integration: Add test for directives with userns

f0ee598

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>

rata force-pushed the issue-11852 branch from d50a6e7 to f0ee598 Compare August 26, 2025 14:29

fuweid approved these changes Aug 26, 2025

View reviewed changes

fuweid added this pull request to the merge queue Aug 26, 2025

fuweid added cherry-pick/2.0.x Change to be cherry picked to release/2.0 branch cherry-pick/2.1.x Change to be cherry picked to release/2.1 branch labels Aug 26, 2025

Merged via the queue into containerd:main with commit 4312e07 Aug 26, 2025
94 of 95 checks passed

github-project-automation bot moved this from Review In Progress to Done in Pull Request Review Aug 26, 2025

rata deleted the issue-11852 branch August 27, 2025 09:29

This was referenced Aug 27, 2025

[release/2.0] Fix userns with container image VOLUME mounts that need copy #12241

Merged

[release/2.1] Fix userns with container image VOLUME mounts that need copy #12242

Merged

adelton added a commit to freeipa/freeipa-container that referenced this pull request Nov 6, 2025

Revert "Workaround containerd/containerd#11852."

771ec34

This logically reverts commit 2a53de8 but keeps 8a465db. Resolved in containerd/containerd#12218.

cri: Fix userns with container image VOLUME mounts that need copy #12218

cri: Fix userns with container image VOLUME mounts that need copy #12218

Uh oh!

Conversation

rata commented Aug 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-ci-robot commented Aug 19, 2025

Uh oh!

rata commented Aug 20, 2025

Uh oh!

halaney left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rata commented Aug 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

halaney commented Aug 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fuweid Aug 24, 2025

Choose a reason for hiding this comment

Uh oh!

rata Aug 27, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

brandond commented Aug 27, 2025

Uh oh!

rata commented Aug 27, 2025

Uh oh!

k8s-infra-cherrypick-robot commented Aug 27, 2025

Uh oh!

rata commented Aug 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

rata commented Aug 19, 2025 •

edited

Loading

halaney left a comment •

edited

Loading

rata commented Aug 22, 2025 •

edited

Loading

halaney commented Aug 22, 2025 •

edited

Loading