Race condition in exec() exists waiting for fifo from container process #2183

@liggitt

Description

Container failures are seen downstream in Kubernetes e2e runs when very short-lived containers are quickly created/completed. kubernetes/kubernetes#86312

The following error is returned from the container runtime:

Terminated:&ContainerStateTerminated{ExitCode:128,Signal:0,Reason:ContainerCannotRun,Message:OCI runtime start failed: container process is already dead: unknown

This happens even though the container actually ran successfully, based on log output.

@Random-Liu tracked this down to a race condition in the code that checks if a process is dead/zombie while waiting to open the fifo successfully. A sequence of events that could cause this is:

  1. process starts, opens fifo in write mode
  2. awaitFifoOpen opens fifo in read mode
  3. process completes
  4. 100ms timeout in awaitProcessExit fires, stats pid, gets error, closes isDead channel
  5. select block in exec() takes the awaitProcessExit branch and returns a "container process is already dead" error
  6. goroutine in awaitFifoOpen propagates the open fifo to the fifoOpened channel

We only see this under fairly heavy load, where the usual assumptions about the speed and ordering of goroutine execution may not hold.
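The sequence of events listed above can be modeled deterministically with two channels. The following is only a minimal sketch, not the actual runc code or its fix: `awaitStart`, `recheck`, `fifoReady`, and `errProcessDead` are hypothetical names made up for illustration, while `fifoOpened` and `isDead` stand in for the channels mentioned in the steps. It assumes that one possible mitigation is to look at the fifo directly one more time before reporting the process as dead.

```go
package main

import (
	"errors"
	"fmt"
)

// errProcessDead is a hypothetical error value that mirrors the message
// quoted in this issue.
var errProcessDead = errors.New("container process is already dead")

// awaitStart is a hypothetical stand-in for the select block in exec().
// fifoOpened models the channel fed by awaitFifoOpen, and isDead models the
// channel closed by awaitProcessExit. If recheck is non-nil, it is called
// once in the "dead" branch to inspect the fifo directly (an assumed
// mitigation, not necessarily what runc does).
func awaitStart(fifoOpened, isDead <-chan struct{}, recheck func() bool) error {
	select {
	case <-fifoOpened:
		return nil
	case <-isDead:
		if recheck != nil && recheck() {
			// The process ran and exited before the fifo result was
			// propagated, so this is a success rather than a failure.
			return nil
		}
		return errProcessDead
	}
}

func main() {
	// Steps 1-3: the process opened the fifo and then completed.
	fifoReady := true

	// Step 6 has not happened yet: nothing was sent on fifoOpened.
	fifoOpened := make(chan struct{})

	// Step 4: the exit check fired and closed isDead.
	isDead := make(chan struct{})
	close(isDead)

	// Step 5 without a recheck: the spurious error from this issue.
	fmt.Println(awaitStart(fifoOpened, isDead, nil))

	// Same ordering, but the dead branch looks at the fifo once more.
	fmt.Println(awaitStart(fifoOpened, isDead, func() bool { return fifoReady }))
}
```

In this model the first call returns the "already dead" error because only `isDead` is ready when the select runs, and the second call returns `nil` because the recheck observes that the fifo was in fact opened. A process that died without ever opening the fifo would still be reported as dead, since `recheck` would return false.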
