Shims related to the same container should be spawned into the same PID namespace

I was experimenting with Docker recently, starting a container from a first shell:
```bash
docker run --runtime=runc --rm -it busybox
```
and exec'ing an extra process with the following command:
```bash
docker exec -it $CONTAINER_ID sh
```
I noticed that sometimes (I would say 50% of the time) the exec'ed process did not return after an `exit` from the container process, and its shell stayed completely stuck until I restarted my Docker service.

I investigated by comparing the logs of a working run with those of a failing run. The failure described above happens whenever the shim corresponding to the container process exits before the shim related to the exec'ed process (basically a child process).

The cause seems logical to me: because Docker expects those processes to run inside a PID namespace, it expects the container process to be the last one to exit. Indeed, the container process is the "init" process of the PID namespace, so when it terminates for any reason, the kernel sends SIGKILL to all the other processes inside the PID namespace, and the "init" process returns.

Now, why are we getting issues in our case? When our container process inside the VM terminates, the guest kernel kills the other processes inside the same PID namespace and they all return `137` as their exit code. But we are still inside the VM at that point, and we have to pass this information back to the shims (through a proxy in some cases), which means we have no control over which shim (there is one per process) exits first.
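For reference, `137` follows the usual convention of reporting a process killed by a signal as 128 plus the signal number, and SIGKILL is signal 9. A minimal sketch of that arithmetic, where `exitCodeForSignal` is a hypothetical helper for illustration and not a function from the agent:

```go
package main

import (
	"fmt"
	"syscall"
)

// exitCodeForSignal is a hypothetical helper: it returns the conventional
// exit code reported for a process terminated by sig (128 + signal number).
func exitCodeForSignal(sig syscall.Signal) int {
	return 128 + int(sig)
}

func main() {
	// SIGKILL is signal 9 on Linux, hence 128 + 9.
	fmt.Println(exitCodeForSignal(syscall.SIGKILL))
}
```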

I confirmed this by adding a 5s `sleep()` [here](https://github.com/clearcontainers/agent/blob/master/agent.go#L1127) when the process is the container process, which makes this test pass every time. This forces the container process to wait 5s before returning its exit code to its shim, leaving time for the other shims to return before the container process.

My suggestion is to solve this issue in 2 steps:
- The agent has to ensure that every child process started with exec is waited for by the container process, so that no process is left behind after the container process returns its exit code.
- Virtcontainers should start every shim related to the same container in the same PID namespace. This way, when the container process terminates, the host kernel can kill the shim processes related to exec'ed processes, all returning exit code `137`, or we can get the same exit code from the agent (in case the shim received the exit code before being killed).
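
To illustrate the ordering the first step asks for, here is a rough sketch in which the container process exit code is reported only after every exec'ed process has reported its own. All names (`container`, `startExec`, `finishInit`, `runDemo`) are assumptions made for this example and do not come from the agent code, and the sleeps merely stand in for processes being reaped:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// exitReport is a hypothetical record of one exit code sent back to a shim.
type exitReport struct {
	name string
	code int
}

// container is a hypothetical tracker, not the agent's real type.
type container struct {
	execs   sync.WaitGroup // exec'ed processes whose exit code is not reported yet
	mu      sync.Mutex
	reports []exitReport // order in which exit codes were reported
}

// report records an exit code in the order it would reach the shims.
func (c *container) report(name string, code int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.reports = append(c.reports, exitReport{name, code})
}

// startExec registers an exec'ed process and reports its exit code once it ends.
func (c *container) startExec(name string, delay time.Duration, code int) {
	c.execs.Add(1)
	go func() {
		defer c.execs.Done()
		time.Sleep(delay) // stands in for the process being killed and reaped
		c.report(name, code)
	}()
}

// finishInit reports the container process exit code only after every
// exec'ed process has been reported.
func (c *container) finishInit(code int) {
	c.execs.Wait()
	c.report("init", code)
}

// runDemo simulates two exec'ed processes killed with SIGKILL (137)
// and returns the order in which exit codes were reported.
func runDemo() []exitReport {
	c := &container{}
	c.startExec("exec-1", 20*time.Millisecond, 137)
	c.startExec("exec-2", 5*time.Millisecond, 137)
	c.finishInit(0)
	return c.reports
}

func main() {
	for _, r := range runDemo() {
		fmt.Printf("%s exited with %d\n", r.name, r.code)
	}
}
```

With this ordering, the entry for the container process is always the last one reported, regardless of how long each exec'ed process takes to be reaped.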
