Closed
Description
Container failures are seen downstream in Kubernetes e2e runs when very short-lived containers are created and complete in quick succession (see kubernetes/kubernetes#86312).
The following error is returned from the container runtime:
Terminated:&ContainerStateTerminated{ExitCode:128,Signal:0,Reason:ContainerCannotRun,Message:OCI runtime start failed: container process is already dead: unknown
This error is returned even though the log output shows that the container actually ran successfully.
@Random-Liu tracked this down to a race condition in the code that checks if a process is dead/zombie while waiting to open the fifo successfully. A sequence of events that could cause this is:
- process starts, opens fifo in write mode
- awaitFifoOpen opens fifo in read mode
- process completes
- 100ms timeout in awaitProcessExit fires, stats pid, gets error, closes isDead channel
- select block in exec() takes the awaitProcessExit branch and returns a "container process is already dead" error
- goroutine in awaitFifoOpen propagates the open fifo to the fifoOpened channel, but exec() has already returned the error by then
We only see this under fairly heavy load, where the usual assumptions about how quickly and in what order goroutines run no longer hold.
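The sequence of events listed above can be modelled with two channels and a select. The following is only a minimal sketch, not the actual runc code: `racyWait`, `waitWithRecheck`, and the `recheckFifo` callback are hypothetical names, and the callback merely stands in for some way of looking at the fifo again once the process is found to be gone. It shows one possible way to avoid the false failure, assuming such a re-check is available.

```go
package main

import (
	"errors"
	"fmt"
)

var errAlreadyDead = errors.New("container process is already dead")

// racyWait models the select described in this issue: if isDead becomes
// ready before fifoOpened has been signalled, the caller reports a failure
// even though the process may have run to completion.
func racyWait(fifoOpened, isDead <-chan struct{}) error {
	select {
	case <-fifoOpened:
		return nil
	case <-isDead:
		return errAlreadyDead
	}
}

// waitWithRecheck is a hypothetical variant. When the process is found dead,
// it consults recheckFifo (an assumed callback, not a runc API) before
// concluding that the container never ran.
func waitWithRecheck(fifoOpened, isDead <-chan struct{}, recheckFifo func() bool) error {
	select {
	case <-fifoOpened:
		return nil
	case <-isDead:
		// The process may have started, run, and exited before the
		// fifo-open notification was delivered, so look again.
		if recheckFifo() {
			return nil
		}
		return errAlreadyDead
	}
}

func main() {
	// Reproduce the problematic ordering deterministically: the process is
	// already gone and the fifo-open notification has not arrived yet.
	fifoOpened := make(chan struct{})
	isDead := make(chan struct{})
	close(isDead)

	fmt.Println("racy:", racyWait(fifoOpened, isDead))
	fmt.Println("recheck:", waitWithRecheck(fifoOpened, isDead, func() bool { return true }))
}
```

In this model the racy variant returns the "already dead" error, while the re-checking variant returns nil whenever the callback reports that the fifo handshake did in fact happen.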