Skip to content

Proposal - Allow the container to share the PID namespace with the host#9339

Closed
rhatdan wants to merge 1 commit intomoby:masterfrom
rhatdan:pid
Closed

Proposal - Allow the container to share the PID namespace with the host#9339
rhatdan wants to merge 1 commit intomoby:masterfrom
rhatdan:pid

Conversation

@rhatdan
Copy link
Copy Markdown
Contributor

@rhatdan rhatdan commented Nov 25, 2014

We want to be able to use container without the PID namespace. We basically
want containers that can manage the host os, which I call Super Privileged
Containers. We eventually would like to get to the point where the only
namespace we use is the MNT namespace to bring the Apps userspace with it.

By eliminating the PID namespace we can get better communication between the
host and the clients and potentially tools like strace and gdb become easier
to use. We also see tools like libvirtd running within a container telling
systemd to place a VM in a particular cgroup, we need to have communications of the PID.

I don't see us needing to share PID namespaces between containers, since this
is really what docker exec does.

So currently I see us just needing docker run --pid=host

Docker-DCO-1.1-Signed-off-by: Dan Walsh dwalsh@redhat.com (github: rhatdan)

@icecrime
Copy link
Copy Markdown
Contributor

Flagging as proposal, thanks for the "docs first" approach!

@icecrime icecrime changed the title Allow the container to share the PID namespace with the host Proposal - Allow the container to share the PID namespace with the host Nov 26, 2014
@SvenDowideit
Copy link
Copy Markdown
Contributor

I'd feel more comfortable if there was a default --pid=namespaced (or something), making it a little more obvious that this flag changes the norm.

@jessfraz
Copy link
Copy Markdown
Contributor

I don't understand the point

@jessfraz
Copy link
Copy Markdown
Contributor

Why use a container if you don't want to be contained?

@SvenDowideit
Copy link
Copy Markdown
Contributor

it allows the use of container images as a distribution mechanism - ie, will make the rpm format obsolete 🎉 :)

and you get a magical way to uninstall, or up/down grade a host's daemons

@rhatdan
Copy link
Copy Markdown
Contributor Author

rhatdan commented Nov 30, 2014

I would not go as far as saying make RPM obsolete. As you still need a packageing format for building your image. But yes, the point is to use docker images as a packaging format.

Turning off all namespaces except the mnt namespace is the goal. BTW CoreOS is doing this now but they just use the docker format and then run systemd-nspawn on the exploded image.

I would prefer to use native docker commands.

@SvenDowideit
Copy link
Copy Markdown
Contributor

@rhatdan one thing that --pid=container:someother might be useful for, is to compose a set of services that might be managed by some kind of supervisor process with might have been coded container unaware (but with configurable spawn commands). But hey, i also want to have --fs=container:someother to let me compose a server with dynamic plugins.

@sdake
Copy link
Copy Markdown

sdake commented Dec 1, 2014

This feature would allow the entire OS image to be modified on the fly without modifying running applications. If applications were smart enough to handle a restart in this situation, the base images and most of the applications could be restarted leaving key processes behind.

In one use case, we want to run OpenStack Nova and Libvirt in an upgrade, but keep all of the KVM processes between container restarts. This would allow libvirt to reconnect to the existing KVM processes, even after what would appear like a full container restart ;)

@crosbymichael
Copy link
Copy Markdown
Contributor

@sdake idk, i would take the opposite approach and think that docker should inspect the cgroup and kill off any tasks when the container stops or else ( if you do not understand what this means ) you will have processes that live on after docker reports that your container has stopped.

idk, what do you think?

@sdake
Copy link
Copy Markdown

sdake commented Dec 1, 2014

@crosbymichael The reason to share PID namespaces between containers and between executions of the container tech is to allow the rest of the OS to upgrade seamlessly while critical applications launched from container daemons remain untouched. I get it is not very intuitive, which is why I think Dan suggested a special flag to get Docker to execute in this way.

@rhatdan
Copy link
Copy Markdown
Contributor Author

rhatdan commented Dec 1, 2014

I would argue you want both. I want all of the processes running in the container to exit on stopping the container, but I also could see allowing moving certain pids out of the container cgroup and letting them live on until killed.

Specifically we are looking at libvirt again running iwithin the container. We might want to reboot the libvirtd container but allow the virtual machines to continue running.

@sdake
Copy link
Copy Markdown

sdake commented Dec 1, 2014

@rhatdan That WFM, would that be a different feature, or had you intended this feature to handle the "both" model?

@rhatdan
Copy link
Copy Markdown
Contributor Author

rhatdan commented Jan 7, 2015

@sdake I wanted this feature to handle both.

@rhatdan
Copy link
Copy Markdown
Contributor Author

rhatdan commented Jan 7, 2015

@crosbymichael

The problem I am seeing is if I do

docker run -ti --pid=host fedora /bin/sh

sleep 6000 &

^d

The process just hangs, and the container never exits. What I would like to do is realize that the primary pid has died and then send kill signals to all of the processes in the cgroup.

I think the problem is docker does not watch for a pid1 to die, it watches for the cgroup to get removed when all of the processes exit.

    oomKillNotification, err := libcontainer.NotifyOnOOM(state)

Is there a place where go gets a sigchild from Pid1 of the container?

@crosbymichael
Copy link
Copy Markdown
Contributor

Actually the cmd.Wait() will return after PID1 dies. We will then have to use the cgroup to freeze, send a sigkill to all remaining processes, unfreeze and wait for them to die.

@crosbymichael
Copy link
Copy Markdown
Contributor

Do you think it should be conditional that we nuke the cgroup or is it safe to do it every time?

Do you think that setting a parent death signal to SIGKILL work be enough if we using the host's PID namespace?

@rhatdan
Copy link
Copy Markdown
Contributor Author

rhatdan commented Jan 7, 2015

I found the place and updated the pull request, it now kills all processes if you are using the Host PID Namespace.

@rhatdan
Copy link
Copy Markdown
Contributor Author

rhatdan commented Jan 7, 2015

The patch I have is currently racy in that a ForkBomb could leave a process running.
I am not sure if the cgroup freeze will fix the problem, but I guess it is better then what I have now.

@crosbymichael
Copy link
Copy Markdown
Contributor

I'm updating docker with the latest changes from libcontainer with this change so we can move forward.

#10049

@rhatdan rhatdan force-pushed the pid branch 3 times, most recently from 4996709 to 091b271 Compare January 12, 2015 19:54
@rhatdan
Copy link
Copy Markdown
Contributor Author

rhatdan commented Jan 12, 2015

@crosbymichael updated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should just be HostPid to match Ipc above?

@crosbymichael
Copy link
Copy Markdown
Contributor

@icecrime can you review this one also?

@crosbymichael
Copy link
Copy Markdown
Contributor

This is looking good so far.

root@f9b5d1691ec2:/go/src/github.com/docker/docker# docker run --rm busybox ps aux
PID   USER     COMMAND
    1 root     ps aux
root@f9b5d1691ec2:/go/src/github.com/docker/docker# docker run --rm --pid host busybox ps aux
PID   USER     COMMAND
    1 root     bash
  496 root     docker -d -s vfs
  600 root     docker run --rm --pid host busybox ps aux
  615 root     ps aux
root@f9b5d1691ec2:/go/src/github.com/docker/docker#

@crosbymichael
Copy link
Copy Markdown
Contributor

LGTM

It would be nice to remove the NS suffix from the var names to make it more consistent with Ipc.

After being created on 2014-04-08 14:16:33, we can finally run crosbymichael/htop with this PR

@rhatdan rhatdan force-pushed the pid branch 2 times, most recently from 3212d9f to 9330d0a Compare January 12, 2015 22:01
@rhatdan
Copy link
Copy Markdown
Contributor Author

rhatdan commented Jan 12, 2015

sed s/pidns/pid done

@unclejack
Copy link
Copy Markdown
Contributor

LGTM for carry @crosbymichael
I don't know if the docs are OK from @fredlf's, @SvenDowideit's and @jamtur01's point of view.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap line to 80?

@jamtur01
Copy link
Copy Markdown
Contributor

Minor Docs comments but otherwise LGTM.

@rhatdan
Copy link
Copy Markdown
Contributor Author

rhatdan commented Jan 13, 2015

Fixed docs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/allows processes ids/allows process ids/ ?

We want to be able to use container without the PID namespace.  We basically
want containers that can manage the host os, which I call Super Privileged
Containers.  We eventually would like to get to the point where the only
namespace we use is the MNT namespace to bring the Apps userspace with it.

By eliminating the PID namespace we can get better communication between the
host and the clients and potentially tools like strace and gdb become easier
to use.  We also see tools like libvirtd running within a container telling
systemd to place a VM in a particular cgroup, we need to have communications of the PID.

I don't see us needing to share PID namespaces between containers, since this
is really what docker exec does.

So currently I see us just needing docker run --pid=host

Docker-DCO-1.1-Signed-off-by: Dan Walsh <dwalsh@redhat.com> (github: rhatdan)
@crosbymichael
Copy link
Copy Markdown
Contributor

Closing in favor of #10080

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.