Closed
Description
Hi, we recently experienced an issue where many of our nodes were failing to create pods (CreateContainerConfigError).
We noticed that containerd had spawned many runc init processes, which I gather is normal, except that they never reached execve and were instead hanging in openat() on the execFifo pipe (exec.fifo).
# lsof -p 129677 -w
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
runc:[2:I 129677 nobody cwd DIR 0,2119 4096 12563672 /app
runc:[2:I 129677 nobody rtd DIR 0,2119 4096 8206649 /
runc:[2:I 129677 nobody txt REG 259,1 11049264 9946 /
runc:[2:I 129677 nobody mem REG 259,1 2030928 2237 /lib/x86_64-linux-gnu/libc-2.27.so
runc:[2:I 129677 nobody mem REG 259,1 129312 2209 /lib/x86_64-linux-gnu/libseccomp.so.2.5.1
runc:[2:I 129677 nobody mem REG 259,1 144976 2263 /lib/x86_64-linux-gnu/libpthread-2.27.so
runc:[2:I 129677 nobody mem REG 259,1 179152 2233 /lib/x86_64-linux-gnu/ld-2.27.so
runc:[2:I 129677 nobody 0u CHR 1,3 0t0 7 /dev/null
runc:[2:I 129677 nobody 1w FIFO 0,13 0t0 2416908 pipe
runc:[2:I 129677 nobody 2w FIFO 0,13 0t0 2416909 pipe
runc:[2:I 129677 nobody 5u FIFO 0,25 0t0 4889 /run/containerd/runc/k8s.io/d0136625f29d1ab1b14c4180fe69816c90d2641100caaa19a1d515d05a78f408/exec.fifo
runc:[2:I 129677 nobody 7u a_inode 0,14 0 10761 [eventpoll]
runc:[2:I 129677 nobody 8r FIFO 0,13 0t0 2412940 pipe
runc:[2:I 129677 nobody 9w FIFO 0,13 0t0 2412940 pipe
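To check how many processes on a node are in the same state, rather than inspecting one PID at a time with lsof, the open file descriptors under /proc can be scanned for links that end in exec.fifo. The following is a minimal sketch, not part of runc or containerd; the function name `find_exec_fifo_holders` and its parameters are hypothetical, and it assumes a Linux /proc layout and enough privileges (typically root) to read other processes' fd directories.

```python
import os


def find_exec_fifo_holders(proc_root="/proc", suffix="/exec.fifo"):
    """Return sorted (pid, fd, target) tuples for open fds whose link target ends with suffix.

    Hypothetical helper for illustration only.
    """
    hits = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "sys"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to look
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd was closed between listdir and readlink
            if target.endswith(suffix):
                hits.append((int(entry), int(fd), target))
    return sorted(hits)


if __name__ == "__main__":
    for pid, fd, target in find_exec_fifo_holders():
        print(f"pid={pid} fd={fd} -> {target}")
```

A process holding an exec.fifo is not necessarily stuck; the list only narrows down which PIDs are worth attaching strace to.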
# strace -p 129677
strace: Process 129677 attached
openat(AT_FDCWD, "/proc/self/fd/5", O_WRONLY|O_CLOEXEC
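The unfinished openat() above is what ordinary FIFO semantics predict: opening a FIFO with O_WRONLY blocks until some process opens the same FIFO for reading, which, as far as I understand, is meant to happen when the container is started. If no reader ever shows up, the open never returns. The sketch below reproduces only that kernel behaviour with a throwaway FIFO; it is not runc code, and the function name `demo_fifo_open_blocks` and the temporary path are hypothetical stand-ins.

```python
import os
import tempfile
import threading


def demo_fifo_open_blocks(wait_seconds=0.5):
    """Show that open(O_WRONLY) on a FIFO blocks until a reader opens it.

    Returns (blocked_before_reader, finished_after_reader, data).
    """
    with tempfile.TemporaryDirectory() as tmp:
        fifo_path = os.path.join(tmp, "exec.fifo")  # stand-in, not a real runc path
        os.mkfifo(fifo_path)
        opened = threading.Event()

        def writer():
            # Blocks inside open() until the FIFO has a reader.
            fd = os.open(fifo_path, os.O_WRONLY | os.O_CLOEXEC)
            opened.set()
            os.write(fd, b"0")
            os.close(fd)

        thread = threading.Thread(target=writer, daemon=True)
        thread.start()

        # No reader exists yet, so the writer should still be stuck in open().
        blocked_before_reader = not opened.wait(wait_seconds)

        # Opening the read side is what releases the blocked writer.
        read_fd = os.open(fifo_path, os.O_RDONLY)
        finished_after_reader = opened.wait(5)
        data = os.read(read_fd, 1)
        os.close(read_fd)
        thread.join(5)
        return blocked_before_reader, finished_after_reader, data


if __name__ == "__main__":
    print(demo_fifo_open_blocks())
```

Seen this way, the hanging runc init processes look like writers whose reader never arrived, so the open question is why the read side was never opened for these containers.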
The logs are full of entries like the one below, but there are no smoking guns.
Apr 03 00:44:52 $host containerd[77726]: {"error":"failed to set removing state for container \"e312392e9d198e3db585bfe80473765291d471f86017cb04c4a24c6475250852\": container is in starting state, can't be removed","level":"error","msg":"RemoveContainer for \"e312392e9d198e3db585bfe80473765291d471f86017cb04c4a24c6475250852\" failed","time":"2022-04-03T00:44:52.821862269Z"}
The issue seems very similar to #2828, except that we are on 1.0.3. Also, I am able to attach strace to the runc process repeatedly, and detaching does not cause the process to exit.
uname: 5.4.0-1071-aws #76~18.04.1-Ubuntu
Any help would be greatly appreciated.