runc init hanging on openat() #3448

@alam0rt

Description

Hi, we recently experienced an issue where many of our nodes were failing to create pods (CreateContainerConfigError).

We noticed that containerd had spawned many runc init processes, which I gather is normal, except that they never reached execve and were instead hanging in openat() on the execFifo pipe.

# lsof -p 129677 -w
COMMAND      PID   USER   FD      TYPE DEVICE SIZE/OFF     NODE NAME
runc:[2:I 129677 nobody  cwd       DIR 0,2119     4096 12563672 /app
runc:[2:I 129677 nobody  rtd       DIR 0,2119     4096  8206649 /
runc:[2:I 129677 nobody  txt       REG  259,1 11049264     9946 /
runc:[2:I 129677 nobody  mem       REG  259,1  2030928     2237 /lib/x86_64-linux-gnu/libc-2.27.so
runc:[2:I 129677 nobody  mem       REG  259,1   129312     2209 /lib/x86_64-linux-gnu/libseccomp.so.2.5.1
runc:[2:I 129677 nobody  mem       REG  259,1   144976     2263 /lib/x86_64-linux-gnu/libpthread-2.27.so
runc:[2:I 129677 nobody  mem       REG  259,1   179152     2233 /lib/x86_64-linux-gnu/ld-2.27.so
runc:[2:I 129677 nobody    0u      CHR    1,3      0t0        7 /dev/null
runc:[2:I 129677 nobody    1w     FIFO   0,13      0t0  2416908 pipe
runc:[2:I 129677 nobody    2w     FIFO   0,13      0t0  2416909 pipe
runc:[2:I 129677 nobody    5u     FIFO   0,25      0t0     4889 /run/containerd/runc/k8s.io/d0136625f29d1ab1b14c4180fe69816c90d2641100caaa19a1d515d05a78f408/exec.fifo
runc:[2:I 129677 nobody    7u  a_inode   0,14        0    10761 [eventpoll]
runc:[2:I 129677 nobody    8r     FIFO   0,13      0t0  2412940 pipe
runc:[2:I 129677 nobody    9w     FIFO   0,13      0t0  2412940 pipe
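To check how many processes on a node are in this state without running lsof per PID, something like the following could be used. It is only a rough Python sketch, not part of runc or containerd: the function name `find_fifo_holders` and its parameters are made up for illustration, and it assumes a Linux-style `/proc` layout where each `/proc/<pid>/fd/<n>` entry is a symlink to the open file.

```python
import os


def find_fifo_holders(proc_root="/proc", name="exec.fifo"):
    """Return sorted (pid, fd, target) tuples for every open fd whose
    symlink target has the basename `name`.

    Hypothetical helper for illustration; `proc_root` is a parameter
    only so the function can be pointed at a test directory.
    """
    holders = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd was closed while we were scanning
            if os.path.basename(target) == name:
                holders.append((int(entry), int(fd), target))
    return sorted(holders)


if __name__ == "__main__":
    for pid, fd, target in find_fifo_holders():
        print(pid, fd, target)
```

It needs to run as root to see other users' file descriptors, and it only reports which processes hold the FIFO, not whether they are actually blocked.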

# strace -p 129677
strace: Process 129677 attached
openat(AT_FDCWD, "/proc/self/fd/5", O_WRONLY|O_CLOEXEC
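For context on why this openat() can hang indefinitely: on Linux, a blocking O_WRONLY open of a FIFO does not return until some process opens the same FIFO for reading. The small Python sketch below (not runc code; `fifo_has_reader` is a hypothetical name) shows the same condition without blocking, using O_NONBLOCK so that the open fails with ENXIO when no reader is present.

```python
import errno
import os
import tempfile


def fifo_has_reader(path):
    """Return True if some process has the FIFO at `path` open for reading.

    Hypothetical helper for illustration. A blocking O_WRONLY open would
    hang here when there is no reader; with O_NONBLOCK the kernel returns
    ENXIO instead.
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
    except OSError as exc:
        if exc.errno == errno.ENXIO:
            return False  # nobody has the read end open
        raise
    os.close(fd)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fifo = os.path.join(tmp, "exec.fifo")
        os.mkfifo(fifo)
        print(fifo_has_reader(fifo))  # no reader yet
        # A non-blocking read-only open of a FIFO succeeds immediately.
        reader = os.open(fifo, os.O_RDONLY | os.O_NONBLOCK)
        print(fifo_has_reader(fifo))  # reader is now attached
        os.close(reader)
```

If that is what is happening here, the stuck runc init processes would be waiting for a reader on exec.fifo that never arrives, but that is a guess based on the strace output above.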

The logs are full of messages like the one below, but there are no smoking guns.

Apr 03 00:44:52 $host containerd[77726]: {"error":"failed to set removing state for container \"e312392e9d198e3db585bfe80473765291d471f86017cb04c4a24c6475250852\": container is in starting state, can't be removed","level":"error","msg":"RemoveContainer for \"e312392e9d198e3db585bfe80473765291d471f86017cb04c4a24c6475250852\" failed","time":"2022-04-03T00:44:52.821862269Z"}

The issue seems very similar to #2828, except that we are on 1.0.3. Also, I am able to attach strace to the runc process again, and detaching does not cause it to exit.

uname: 5.4.0-1071-aws #76~18.04.1-Ubuntu

Any help would be greatly appreciated.
