retry unix.EINTR for container init process#3045
retry unix.EINTR for container init process#3045kolyshkin merged 1 commit intoopencontainers:masterfrom
Conversation
8e4cf81 to
419ae14
Compare
cyphar
left a comment
There was a problem hiding this comment.
Please fix the lint warnings (most notably you aren't actually using maxEINTRRetryCount in your loop -- also please rename this to maxExecvRetries).
oh, bummer. will do. |
c01fc83 to
3ab53d2
Compare
|
@cyphar, done. please, take another look. |
|
@msscotb, @kevpar, @katiewasnothere, @dcantah, @ambarve FYI |
There was a problem hiding this comment.
By definition, EINTR is a temporary error, so no "max retry counter" is needed.
I would do something like this instead:
diff --git a/libcontainer/standard_init_linux.go b/libcontainer/standard_init_linux.go
index ce9117fa..7cfcc8ca 100644
--- a/libcontainer/standard_init_linux.go
+++ b/libcontainer/standard_init_linux.go
@@ -226,7 +226,7 @@ func (l *linuxStandardInit) Init() error {
return err
}
- if err := unix.Exec(name, l.config.Args[0:], os.Environ()); err != nil {
+ if err := system.Exec(name, l.config.Args[0:], os.Environ()); err != nil {
return newSystemErrorWithCause(err, "exec user process")
}
return nil
diff --git a/libcontainer/system/linux.go b/libcontainer/system/linux.go
index 4379a207..fa1b442c 100644
--- a/libcontainer/system/linux.go
+++ b/libcontainer/system/linux.go
@@ -35,7 +35,16 @@ func Execv(cmd string, args []string, env []string) error {
return err
}
- return unix.Exec(name, args, env)
+ return Exec(name, args, env)
+}
+
+func Exec(cmd string, args []string, env []string) error {
+ for {
+ err := unix.Exec(cmd, args, env)
+ if err != unix.EINTR { //nolint:errorlint // unix errors are bare
+ return err
+ }
+ }
}
func Prlimit(pid, resource int, limit unix.Rlimit) error {
thanks, sounds good to me. the reason I added the max retry counter is that during testing, sometimes we'd end up in an infinite loop, where syscall never succeeded. |
3ab53d2 to
9087833
Compare
This is interesting; can you provide more details? |
9087833 to
e7b9a28
Compare
nothing special here, I replaced the runc binary inside a VM, started a container, inside that container I mounted a cifs share (e.g. at |
It was possibly a process stuck in a syscall. Usually the kernel prints a backtrace if a process is stuck for more than 2 minutes. If you will be able to repro, I suggest waiting 2 minutes and checking the kernel log ( In any case, I don't think this is EINTR. |
e7b9a28 to
b2629a5
Compare
Thanks for suggestions, will try this next time. |
|
I did some background research. It looks like neither golang nor glibc wrap Although it's totally fine to wrap a call in a retry-on-eintr loop (as if it never happens it's a no-op), I have to ask @anmaxvl -- did you really see |
|
I did a quick research, and it looks like
The last two are sort of expected as EINTR is poorly documented in general. Now, it is perfectly fine to wrap a syscall into a retry-on-eintr loop. Yet, I will appreciate if you @anmaxvl can reproduce it and show us the logs (or show the logs that you saw earlier), to make sure it is definitely EINTR returned from execve(2). I believe the error (for |
Yes, that's what I was seeing. Here's an old exec command that I had saved. I just logged an error to stdout when the error was returned. In this case it was only 1 retry, but I've observed cases with multiple retries as well. |
|
@anmaxvl can you please also share the patch you made to show the error? |
When running a script from an azure file share interrupted syscall occurs quite frequently, to remedy this add retries around execve syscall, when EINTR is returned. Signed-off-by: Maksim An <maksiman@microsoft.com>
here's the patch anmaxvl@e07095d, which was applied on top of this PR |
When running a script from an azure file share interrupted syscall
occurs quite frequently, to remedy this add retries around execve
syscall, when EINTR is returned.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1917662
1.0 backport: #3074