Skip to content

Conversation

@giuseppe
Copy link
Contributor

@giuseppe giuseppe commented Oct 8, 2018

More details about the issue: genuinetools/img#170

@AkihiroSuda
Copy link

seems related to torvalds/linux@41c21e3

cc @amluto

AkihiroSuda added a commit to AkihiroSuda/img that referenced this pull request Oct 9, 2018
Applies shadow-maint/shadow#132 so that we
don't need to have CAP_SYS_ADMIN.

See also genuinetools#170 .

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
jessfraz pushed a commit to genuinetools/img that referenced this pull request Oct 9, 2018
* use Giuseppe's forked newuidmap/newgidmap

Applies shadow-maint/shadow#132 so that we
don't need to have CAP_SYS_ADMIN.

See also #170 .

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>

* shut up codacy

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
@rhatdan
Copy link

rhatdan commented Oct 9, 2018

@giuseppe Great find. Does this for CAP_SETPCAP to be added to newuidmap and newgidmap? Could these utils be run with filecaps rather then having to be setuid?

@rhatdan
Copy link

rhatdan commented Oct 9, 2018

@t8m PTAL

@rhatdan
Copy link

rhatdan commented Oct 9, 2018

@vbatts FYI

@vbatts
Copy link

vbatts commented Oct 9, 2018

Wait. I'm already having an updated shadow build that has a newxidmap. We just talked about this

@giuseppe
Copy link
Contributor Author

giuseppe commented Oct 9, 2018

@rhatdan I've not tried but I think filecaps should work as well

@vbatts
Copy link

vbatts commented Oct 9, 2018

oh oh good. I was seeing the forked shadow, but that is a narrow bypass for this CAP issue.

@giuseppe
Copy link
Contributor Author

/cc @hallyn @ebiederm

AkihiroSuda added a commit to AkihiroSuda/buildkit_poc that referenced this pull request Oct 15, 2018
Apply shadow-maint/shadow#132 so that newuidmap/newgidmap
doesn't require CAP_SYS_ADMIN

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
AkihiroSuda added a commit to AkihiroSuda/buildkit_poc that referenced this pull request Oct 15, 2018
Apply shadow-maint/shadow#132 so that newuidmap/newgidmap
doesn't require CAP_SYS_ADMIN

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
AkihiroSuda added a commit to AkihiroSuda/buildkit_poc that referenced this pull request Oct 16, 2018
Apply shadow-maint/shadow#132 so that newuidmap/newgidmap
doesn't require CAP_SYS_ADMIN

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
AkihiroSuda added a commit to AkihiroSuda/buildkit_poc that referenced this pull request Oct 16, 2018
Apply shadow-maint/shadow#132 so that newuidmap/newgidmap
doesn't require CAP_SYS_ADMIN

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
AkihiroSuda added a commit to AkihiroSuda/buildkit_poc that referenced this pull request Oct 16, 2018
Apply shadow-maint/shadow#132 so that newuidmap/newgidmap
doesn't require CAP_SYS_ADMIN

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
@hallyn
Copy link
Member

hallyn commented Oct 18, 2018

Hi,

can you give a concise example of a way to trigger this?

The patch is mostly fine, except build should absolutely not fail if libcap is not available.

@hallyn
Copy link
Member

hallyn commented Oct 18, 2018

Hey @brauner , just wondering, is this something you've already thought through?

@brauner
Copy link
Collaborator

brauner commented Oct 18, 2018

So, I would really like to see reproducer for this scenario to better reason about it. :)

Copy link
Collaborator

@brauner brauner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the euid!=owner of the userns, the kernel returns EPERM when trying
to write the uidmap and there is no CAP_SYS_ADMIN in the parent
namespace.

So that analysis is partially wrong and for what it's worth my first one [1] was too.

What actually fails is the CAP_SYS_ADMIN in the target user namespace in the first check with file_ns_capable() fails in the setuid case. Here is my full analysis of this:

Here is how I think this goes.

  1. newuidmap is setuid, only caps that are kept are CAP_SETUID and CAP_SETGID
unshare -U sleep infinity &
newuidmap $? 0 100000 65536

You first hit file_ns_capable(file, ns, CAP_SYS_ADMIN). This calls into cap_capable() and hits the loop

for (;;) {
        /* Do we have the necessary capabilities? */
        if (ns == cred->user_ns)
                return cap_raised(cred->cap_effective, cap) ? 0 : -EPERM;

        /*
         * If we're already at a lower level than we're looking for,
         * we're done searching.
         */
        if (ns->level <= cred->user_ns->level)
                return -EPERM;

        /*
         * The owner of the user namespace in the parent of the
         * user namespace has all caps.
        */
        if ((ns->parent == cred->user_ns) && uid_eq(ns->owner, cred->euid))
                return 0;

        /*
         * If you have a capability in a parent user ns, then you have
         * it over all children user namespaces as well.
        */
        ns = ns->parent;
}

The first check fails and falls through to the end of the loop and retrieves the parent user namespace and checks whether CAP_SYS_ADMIN is available there which isn't.

  1. newuidmap has CAP_SETUID as fscaps set CAP_SETUID and all other caps other than CAP_SETUID have been dropped.
unshare -U sleep infinity &
newuidmap $? 0 100000 65536

You pass the first file_ns_capable() check for CAP_SYS_ADMIN since the euid hasn't been changed and thus pass:

if ((ns->parent == cred->user_ns) && uid_eq(ns->owner, cred->euid))
        return 0;

Now this hits new_idmap_permitted() and ns_capable(ns->parent, CAP_SETUID) which is passed since CAP_SETUID is available in the parent user namespace. Now the right hand side of the conjunct file_ns_capable(file, ns->parent, CAP_SETUID) is hit and the cap_capable() loop (see above) is entered again. This passes

if (ns == cred->user_ns)
        return cap_raised(cred->cap_effective, cap) ? 0 : -EPERM;

since CAP_SETUID is available in the parent user namespace. Hence you can write the mapping and are done.

[1]:
The caller doesn't need to have CAP_SYS_ADMIN in the parent user namespace it needs to have CAP_SYS_ADMIN in the target user namespace. That is if you call open() on /proc/<pid>/{g,u}id_map it will hit proc_id_map_open() which will retrieve (via a few more intermediate steps) the user namespace based on get_proc_task() which takes the pid associated with the inode of the proc file into account aka the user namespace you're seeing is not your own but that of the target task you're trying to operate on. You should always have CAP_SYS_ADMIN in there when you have unshared and or cloned it before you have written a mapping and before you have explicitly dropped any caps. If that weren't the case you would never get past the initial unconditional:

if (cap_valid(cap_setid) && !file_ns_capable(file, ns, CAP_SYS_ADMIN))
		goto out;

check. You would never be able to write any idmappings as an unprivileged user because you don't have CAP_SYS_ADMIN as uid != 0 in the initial user namespace.

So the problem you're hitting in the issue you've linked in is a specific corner case:
- you're uid != 0 (e.g. 1000)
- you're euid != uid

~~So you're hitting:

if ((new_map->nr_extents == 1) && (new_map->extent[0].count == 1) && uid_eq(ns->owner, cred->euid))

since you don't fulfill the latter requirement you're falling into the case were you now need

  • CAP_SETUID or CAP_SETGID in your user namespace
  • CAP_SETUID or CAP_SETGID in the target's parent user namespace wrt to the /proc/<pid>/{g,u}id_map file~~

So to handle the corner case you want to handle all you should need is to do the seteuid() (It's late and I'm jotting down the analysis so please do test.). Also please note that this won't help you in the general case where you want to write more than a single mapping of a uid pair that is always allowed. To write more complex mappings you will need privileges that's why I don't see the need for the cap dance here.

Last thing that I find odd. The PRs of @AkihiroSuda that are linked here that switch to @giuseppe's vendored new{g,u}idmap do not show that this actually works and are misleading because the setuid is set on them. So it's trivial that they work.~~

@AkihiroSuda
Copy link

misleading because the setuid is set on them.

setuid is set but CAP_SYS_ADMIN is dropped by default in Docker/Kubernetes.
Without enabling CAP_SYS_ADMIN explicitly, newuidmap fails with newuidmap: write to uid_map failed: Operation not permitted genuinetools/img#170

@giuseppe giuseppe changed the title [RFC] newuidmap/newgidmap: do not require CAP_SYS_ADMIN in the parent user namespace newuidmap/newgidmap: do not require CAP_SYS_ADMIN in the parent user namespace Oct 22, 2018
@hallyn
Copy link
Member

hallyn commented Oct 22, 2018

Gah - sorry about that the repeated comments.

github was wonky lastnight.

configure.ac Outdated
LIBCAP=-lcap
AC_DEFINE(HAVE_LIBCAP, 1, [Defined if you have libcap.])],
[LIBCAP=""])],
[LIBCAP=""])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you checking for libcap? You're only using the raw syscalls cap{g,s}et() so there's no use in linking new{g,u}idmap against libcap, I think.

exit(EXIT_FAILURE);
}

data[0].permitted |= CAP_TO_MASK(uid_map ? CAP_SETUID : CAP_SETGID);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clear all other capabilities. You really only need CAP_SET*ID and set it in effective too otherwise this will not work. :)

if the euid!=owner of the userns, the kernel returns EPERM when trying
to write the uidmap and there is no CAP_SYS_ADMIN in the parent
namespace.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@ebiederm
Copy link

I am trying to undetstand what is being proposed and what is going on here.

  • newuidmap is required when the caller does not have CAPSETID. It limits the caller to just what is allowed in /etc/subuid.

  • Making newuidmap setcap instead of setuid root is reasonable, and takes no code changes and should just work. The downside is it is called as someone other than the creator of the user namespace it will fail.

  • Why would you call newuidmap with a set of credentials different from how the namespace was created?

  • If you are already changing your uid in the caller of newuidmap than the caller should be the one making the changes to uid.

  • newuidmap should be as simple as absolutely possible because it is designed to be a setuid root executable. It should have no bells and no whistles because every extra line of code is a possible bug which could be a possible security hole.

So I see a great argument for making newuidmap use file capabilities to gain SETUID and newgidmap to use file capabilities to gain SETGID. Beyond that everything appears to be the callers problem.

I don't see a justification for unsharing a user namespace as one euid and calling newuidmap with another euid. As the caller is doing both then the caller can take care of whatever needs to be take care of.

@giuseppe
Copy link
Contributor Author

newuidmap/newgidmap are not going to be called with a different set of credentials, we only need to set the euid to the caller of the tools. If it makes things clearer I can use getuid() instead of pw->pw_uid when calling write_mapping(...). This patch is only useful when newuidmap/newgidmap are installed as suid binaries and root doesn't have CAP_SYS_ADMIN but only CAP_SETUID/CAP_SETGID.

File caps work fine in this case, the euid is not changed to root and we don't need the UID->suid()->root->seteuid()->UID dance when root doesn't have CAP_SYS_ADMIN as it usually happens in a container. We also avoid gaining other capabilities that are not needed.

@rhatdan
Copy link

rhatdan commented Oct 23, 2018

@giuseppe You are saying that this is needed for when you run podman inside of a container that does not have access to SYS_ADMIN privs, But does have SETUID/SETGID. Would this also happen if a user account dropped SYS_ADMIN from the bounding set and then attempted to run podman?

@giuseppe
Copy link
Contributor Author

@rhatdan without CAP_SYS_ADMIN podman won't work, both when running as root or as rootless with CAP_SYS_ADMIN gained in a new user namespace

@ebiederm
Copy link

@giuseppe Why does it make sense to change the code of newuidmap and newgidmap to support this environment where setuid does not grant CAP_SYS_ADMIN? Why not change the spec file or Makefile to install newuidmap and newgidmap with file capabilities?

It seems like changing how permissions are granted would be the simpler and much less risky change.

@giuseppe
Copy link
Contributor Author

@giuseppe Why does it make sense to change the code of newuidmap and newgidmap to support this environment where setuid does not grant CAP_SYS_ADMIN? Why not change the spec file or Makefile to install newuidmap and newgidmap with file capabilities?

It seems like changing how permissions are granted would be the simpler and much less risky change.

I am completely fine with using file capabilities.

I've used the seteuid approach as all other programs that need capabilities are installed as suid binaries and changing newuidmap/newgidmap to use file caps seemed to me like a bigger change than a slightly different code path that is used only when CAP_SYS_ADMIN is not available (so in an already "safe" environment).

I've opened another PR where file capabilities are used so we can more easily compare the two options: #136

@ebiederm
Copy link

If make the goal here to always make the code in newuidmap and newgidmap always run with the minimal set of capabilities and uids. As if the utilities were exclusively using file capabilities I think there are security improvements to be had. Something like this (untested):

0001-newuidmap-newgidmap-Only-run-with-the-minimal-necess.txt

@hallyn
Copy link
Member

hallyn commented Oct 24, 2018

@ebiederm part of me feels like "the minimal amount of work we do, the better", but please feel free to post that as a PR. note that it can't be literally what you posted, since we cannot assume that libcap is available, but I can do the autoconf tweaking if you prefer.

I'm closing this request, we'll go with one of the filecaps PRs instead.

Thanks, everyone.

@hallyn hallyn closed this Oct 24, 2018
@AkihiroSuda
Copy link

As mentioned in #136 , file capabilities does not work well with user namespaces for older kernels.

@ebiederm Could you open PR?

@hallyn
Copy link
Member

hallyn commented Oct 25, 2018

That may be, but remember all this hubub is abuot a pretty silly context - where you do not have cap_sys_admin but have cap_setuid. So now for nsfscaps to come into play, we're talking about a case where you are in a user namespace, run docker in there, and now want to run unprivileged builds inside that?

@ebiederm
Copy link

@hallyn @AkihiroSuda I am going to pass at opening a PR at this time. I don't currently care enough to take the code I have posted any farther.

I just wanted to point out that it is possible and without conditionals, and with just the capset system call that glibc supports to lockdown newuidmap and newgidmap and in so doing happen to fix this issue.

@giuseppe
Copy link
Contributor Author

@hallyn since file caps are not going to be the default anyway for a while, could we please reconsider this patch for now?

@hallyn hallyn reopened this Oct 27, 2018
@hallyn
Copy link
Member

hallyn commented Oct 27, 2018

What about verifying, along with

  •   if (!(data[0].effective & CAP_TO_MASK(CAP_SYS_ADMIN)) &&
    
  •       uid != geteuid()) {
    

that ruid is either 0 or uid ? Otherwise if two users somehow share a uid delegation, I think this would let them mess with each other's namespaces.

@hallyn
Copy link
Member

hallyn commented Oct 27, 2018

No, I guess we already check for that.

@hallyn hallyn merged commit d5255da into shadow-maint:master Oct 27, 2018
@hallyn
Copy link
Member

hallyn commented Oct 27, 2018

(... well, the title was still wrong, but too late now)

@brauner
Copy link
Collaborator

brauner commented Oct 27, 2018

Uhm... I was about to send a proper patch for this. Let me send an improved version.

@giuseppe
Copy link
Contributor Author

Uhm... I was about to send a proper patch for this.

was any comment left unaddressed code wise? I don't see any functional difference in the other PR, in facts the new code block seems to duplicate what is already done below it and doesn't build when <sys/capability.h> is present

@brauner
Copy link
Collaborator

brauner commented Oct 27, 2018

Forgot, to push over my missing rebase. The functional changes should be pretty obvious though:

  • unconditionally drop all unneeded caps
  • check whether we can safely seteuid()
  • allocate memory after we have done all capability handling

Further simplifcations:

  • do not use a bool use a cap and pass the cap directly instead of needing to check an intermediate bool variable
  • move everything additional under one single ifdefine

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants