In centos 7.4/7.5/7.6, runc may return EINVAL when an other process read the runc status file. by zzzzzzzzzy9 · Pull Request #3705 · opencontainers/runc

zzzzzzzzzy9 · 2023-01-19T15:49:27Z

On CentOS 7.4/7.5/7.6, if another process reads the /proc/pid/status file of the runc: [1: CHILD] process while that process is in its unshare stage, the unshare syscall fails with EINVAL. This is because reading the status file calls the kernel function get_task_mm(), which increments task->mm->mm_users. The unshare syscall then finds that mm_users does not satisfy the condition mm_users <= 1, and the kernel returns EINVAL. To work around this, I think we can simply retry.
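The race described above can be illustrated with a small user-space model. This is only a sketch and not kernel code: the `mm_model` struct and the `reader_get_mm`, `reader_put_mm`, and `model_unshare` functions are hypothetical names invented for illustration, and the check is reduced to the single `mm_users > 1` comparison the description refers to.

```c
#include <errno.h>

/* Hypothetical stand-in for the kernel's count of users of a process's mm. */
struct mm_model {
	int mm_users;
};

/* Models a reader of /proc/<pid>/status taking a temporary reference. */
static void reader_get_mm(struct mm_model *mm)
{
	mm->mm_users++;
}

/* Models the reader dropping that reference once it is done. */
static void reader_put_mm(struct mm_model *mm)
{
	mm->mm_users--;
}

/*
 * Models the check described above: unshare is refused while anyone
 * other than the process itself holds a reference to its mm.
 */
static int model_unshare(const struct mm_model *mm)
{
	if (mm->mm_users > 1)
		return -EINVAL;
	return 0;
}
```

In this model the failure is transient: once the reader drops its reference, the same call succeeds, which is why a plain retry is enough.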

We can use this Python script to reproduce the problem.

import os


def clean_fd(fds):
	"""Close every descriptor in fds, reporting (not raising) failures."""
	for fd in fds:
		try:
			os.close(fd)
		except OSError as e:
			print("os.close err:", e)


while True:
	# Busy-poll (no sleep) for a "docker-runc create" process.
	docker_runc = 'ps -ef | grep "docker-runc create" | grep -v grep'
	ps_docker_runc = os.popen(docker_runc).read().strip("\n").split("\n")
	if ps_docker_runc[0] == '':
		continue
	print("ps docker-runc: ", ps_docker_runc)
	docker_runc_pid = ps_docker_runc[0].split()[1]
	# Find the related processes (e.g. runc: [1: CHILD]) that mention this PID.
	command = 'ps -ef | grep ' + docker_runc_pid + ' | grep -v docker-runc | grep -v grep'
	ps_origin = os.popen(command).read().strip("\n").split("\n")
	if ps_origin[0] == '':
		continue
	fd = []
	for line in ps_origin:
		print("ps is", line)
		pid = line.split()[1]
		print(pid)
		file_name = "/proc/" + pid + "/status"
		try:
			# Only reads are performed, so read-only access is enough.
			fd.append(os.open(file_name, os.O_RDONLY))
		except OSError as e:
			print("os.open err:", e)
			continue
	# Keep re-reading every status file until a read fails or the process exits.
	while fd:
		try:
			for f in fd:
				os.read(f, 1)
				os.lseek(f, 0, os.SEEK_SET)
		except OSError as e:
			print("os.read err:", e)
			break
		try:
			os.stat(file_name)
		except OSError:
			break
	clean_fd(fd)

In addition, because Python runs slowly, it may be necessary to add a delay in the runc code to widen the race window, like this:

			sleep(1);
			if (unshare(config.cloneflags & ~CLONE_NEWCGROUP) < 0)
				bail("failed to unshare remaining namespaces (except cgroupns)");

kolyshkin · 2023-01-19T18:48:51Z

libcontainer/nsenter/nsexec.c

+			int i;
+			const int retry_times = 5;


You can just do

int retries = 5; ... for (; retries > 0; retries--) { ...

Done. Thank you for your suggestion.

zzzzzzzzzy9 · 2023-01-28T11:05:16Z

@AkihiroSuda

AkihiroSuda · 2023-02-01T12:41:47Z

libcontainer/nsenter/nsexec.c

 			write_log(DEBUG, "unshare remaining namespace (except cgroupns)");
-			if (unshare(config.cloneflags & ~CLONE_NEWCGROUP) < 0)
-				bail("failed to unshare remaining namespaces (except cgroupns)");
+			for (; retries > 0; retries--) {


Please add a comment line to explain why this retry loop is needed

Done. Thank you for your suggestion.

AkihiroSuda · 2023-02-01T12:42:14Z

libcontainer/nsenter/nsexec.c

+					bail("failed to unshare remaining namespaces (except cgroupns)");
+				if (retries == 1)
+					bail("failed to unshare remaining namespaces (except cgroupns), please retry");
+			}


Don't we need to add delay ?

I don't think we need to add a delay. With a delay, every recurrence of the bug would slow runc down, so I think we should just retry immediately.

kolyshkin · 2023-02-09T04:06:12Z

I think we need to backport it to release-1.1 once merged.

kolyshkin · 2023-02-09T23:05:09Z

libcontainer/nsenter/nsexec.c

+			 * kernel throws an EINVAL error. For this, I think we can try again.
+			 */
+			for (; retries > 0; retries--) {
+				if (unshare(config.cloneflags & ~CLONE_NEWCGROUP) >= 0)


Doesn't matter much, but this should be == 0 (the function returns 0 on success and -1 on error).

Done, thanks for the suggestion. I used >= 0 because the original code used if (unshare(config.cloneflags & ~CLONE_NEWCGROUP) < 0). Strictly speaking, the check should be == 0.

man 2 unshare:
RETURN VALUE
On success, zero returned. On failure, -1 is returned and errno is set
to indicate the error.

kolyshkin

LGTM

kolyshkin · 2023-02-09T23:10:08Z

@zzyyzte can you please fix the subject of the commit? It should say something like

libct/nsenter: retry unshare on EINVAL

rata · 2023-02-10T18:21:08Z

libcontainer/nsenter/nsexec.c

+			/*
+			 * In centos 7.4/7.5/7.6, when other processes read the /proc/pid/status
+			 * file of the runc: [1: CHILD] process, and if the runc: [1: CHILD]
+			 * process happens to be in the unshare stage, the unshare syscall will
+			 * report an error EINVAL. This is because when other processes read
+			 * the status file, they will call the kernel function get_ task_ mm(),
+			 * which sets the decision condition task->mm->mm_users +1. In the
+			 * unshare syscall, mm does not meet the condition of mm<=1, and the
+			 * kernel throws an EINVAL error. For this, I think we can try again.
+			 */


Doesn't this affect the rest of the unshare() calls here too?

The retry only happens when unshare fails; a successful unshare executes only once.

Exactly, but my question is why don't we need the retry mechanism on the other unshare() calls we do here. For example, just a few lines above this, still in the runc CHILD, we unshare the userns: https://github.com/opencontainers/runc/pull/3705/files#diff-6383238247e090d88ade6343c0ef59dd09b3c10634bf0584e78445b843c55ab0R1175

There are other unshare calls too. Don't we want to cover with this retry mechanism all of them?

@zzyyzte btw, IIRC there are more calls to unshare than these two

I would also write a wrapper for unshare that will retry and use it from all places that call unshare now. That should not create any measurable overhead I guess.

Thanks for the suggestions. I have created a new function that retries the unshare syscall.
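For readers following the thread, a retrying wrapper along the lines discussed might look like the sketch below. It is not the code from this PR or from #3772: the names `unshare_fn`, `unshare_retry`, and the `fake_unshare` test double are assumptions made for illustration, and the syscall is passed in as a function pointer so the retry logic can be exercised without privileges.

```c
#include <errno.h>

/* Same shape as unshare(2): returns 0 on success, -1 with errno set on failure. */
typedef int (*unshare_fn)(int flags);

/*
 * Hypothetical wrapper: call fn(flags) up to `retries` times, retrying
 * immediately (no delay) but only when the failure is EINVAL.
 * Returns 0 on success, or -1 with errno left as set by the last call.
 * `retries` is expected to be at least 1.
 */
static int unshare_retry(unshare_fn fn, int flags, int retries)
{
	for (; retries > 0; retries--) {
		if (fn(flags) == 0)
			return 0;
		if (errno != EINVAL)
			break;	/* a different error will not go away on retry */
	}
	return -1;
}

/* Test double (hypothetical): fails with EINVAL a set number of times, then succeeds. */
static int fake_einval_left;
static int fake_calls;

static int fake_unshare(int flags)
{
	(void)flags;
	fake_calls++;
	if (fake_einval_left > 0) {
		fake_einval_left--;
		errno = EINVAL;
		return -1;
	}
	return 0;
}
```

In real code the first argument would be `unshare` from `<sched.h>` (with `_GNU_SOURCE` defined), and every existing call site could go through the wrapper, as suggested above. Retrying only on EINVAL keeps unrelated failures, such as EPERM, from being repeated pointlessly.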

zzzzzzzzzy9 · 2023-03-02T02:07:08Z

@zzyyzte can you please fix the subject of the commit? It should say something like
libct/nsenter: retry unshare on EINVAL

Done @kolyshkin

rata · 2023-03-03T10:08:59Z

libcontainer/nsenter/nsexec.c

-			if (unshare(config.cloneflags & ~CLONE_NEWCGROUP) < 0)
-				bail("failed to unshare remaining namespaces (except cgroupns)");
+			/*
+			 * In centos 7.4/7.5/7.6, when other processes read the /proc/pid/status


Is this something that only affects old CentOS, or does it still happen on modern (non-CentOS) kernels? I guess the latter?

If it is the former, I'd clarify that this is a workaround for those kernels; if it is the latter, I'd not tie the comment so closely to these specific CentOS versions.

It's not just CentOS that has this bug; it's a kernel bug that was fixed in torvalds/linux@12c641a. The issue exists in older kernels such as v3.10.

Oh, okay, can you then make the comment not centos specific and mention that it was fixed there with kernel XX?

So we can drop it in the future if we want/need.

kolyshkin · 2023-03-16T17:48:14Z

I don't like the current patch either and it was easier for me to write it from scratch rather than to explain what needs to be fixed here. In the process I've also found the kernel commit that fixes things, and updated the comment accordingly.

Let's continue in #3772.

zzzzzzzzzy9 force-pushed the main branch from 6fb0d26 to 0b87fb9 Compare January 19, 2023 15:56

kolyshkin reviewed Jan 19, 2023

zzzzzzzzzy9 force-pushed the main branch from 0b87fb9 to bfc1d31 Compare January 20, 2023 16:48

AkihiroSuda reviewed Feb 1, 2023

zzzzzzzzzy9 force-pushed the main branch 2 times, most recently from dccd897 to de91c5e Compare February 5, 2023 14:28

kolyshkin added the backport/1.1-todo label (a PR in main branch which needs to be backported to release-1.1) Feb 9, 2023

AkihiroSuda previously approved these changes Feb 9, 2023

kolyshkin force-pushed the main branch from de91c5e to 53069e9 Compare February 9, 2023 23:03

kolyshkin reviewed Feb 9, 2023

kolyshkin previously approved these changes Feb 9, 2023

kolyshkin added this to the 1.2.0 milestone Feb 9, 2023

kolyshkin dismissed AkihiroSuda’s stale review via 53069e9 February 10, 2023 02:03

rata reviewed Feb 10, 2023

zzzzzzzzzy9 dismissed kolyshkin’s stale review via fa21f04 February 12, 2023 15:12

zzzzzzzzzy9 force-pushed the main branch 2 times, most recently from fa21f04 to c63dc96 Compare February 13, 2023 14:10

zzzzzzzzzy9 force-pushed the main branch from c63dc96 to 2746aee Compare March 2, 2023 14:43

rata reviewed Mar 3, 2023

zzzzzzzzzy9 force-pushed the main branch from 2746aee to 3d80766 Compare March 16, 2023 06:34

libct/nsenter: retry unshare on EINVAL

9ee0e3b

In centos 7.4/7.5/7.6, runc may return EINVAL when an other process read the runc status file. Signed-off-by: zzyyzte <zhang.yu58@zte.com.cn>

zzzzzzzzzy9 force-pushed the main branch from 3d80766 to 9ee0e3b Compare March 16, 2023 06:40

kolyshkin mentioned this pull request Mar 16, 2023

nsexec: retry unshare on EINVAL #3772

Merged

kolyshkin closed this Mar 16, 2023

kolyshkin removed the backport/1.1-todo label (a PR in main branch which needs to be backported to release-1.1) Apr 3, 2023
