Race condition in exec() exists waiting for fifo from container process #2183

Kubernetes e2e runs see container failures downstream when very short-lived containers are created and complete in quick succession: https://github.com/kubernetes/kubernetes/issues/86312

The following error is returned from the container runtime:
```
Terminated:&ContainerStateTerminated{ExitCode:128,Signal:0,Reason:ContainerCannotRun,Message:OCI runtime start failed: container process is already dead: unknown
```

This error is returned even though the log output shows that the container actually ran successfully.

@random-liu [tracked this down](https://github.com/kubernetes/kubernetes/issues/86312#issuecomment-566722798) to a race condition in the code that checks if a process is dead/zombie while waiting to open the fifo successfully. A sequence of events that could cause this is:

1. process starts, opens fifo in write mode
2. awaitFifoOpen opens fifo in read mode
3. process completes
4. 100ms timeout in awaitProcessExit fires, stats pid, gets error, closes isDead channel
5. select block in exec() takes the awaitProcessExit branch and returns a "container process is already dead" error
6. goroutine in awaitFifoOpen propagates the open fifo to the fifoOpened channel

We only see this under fairly heavy load, where the usual assumptions about the speed and ordering of goroutine execution may not hold.
