Skip to content

Conversation

@thaJeztah
Copy link
Member

@thaJeztah thaJeztah commented Oct 10, 2023


After pr #8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Although init process wouldn't quit so early in normal circumstances.
But if this screnario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

So I add handleProcessExitNoLock() function which will not lock s.mu.
It can safely be called in create handler without deadlock.

@thaJeztah
Copy link
Member Author

thaJeztah commented Oct 10, 2023

hm.. interesting; vendor check is failing;

+ verify-vendor
go: downloading github.com/Microsoft/hcsshim/test v0.0.0-20210408205431-da33ecd607e1
Files /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/integration/client/go.mod and /tmp/tmp.H87C9Zd1Ht/containerd/integration/client/go.mod differ
Files /home/runner/work/containerd/containerd/src/github.com/containerd/containerd/integration/client/go.sum and /tmp/tmp.H87C9Zd1Ht/containerd/integration/client/go.sum differ
make: *** [Makefile:458: verify-vendor] Error 1
Error: Process completed with exit code 2.

edit: looks like it's due to "testify" now being used;

git diff
diff --git a/integration/client/go.mod b/integration/client/go.mod
index 9c5a55c34..d15b6a311 100644
--- a/integration/client/go.mod
+++ b/integration/client/go.mod
@@ -17,6 +17,7 @@ require (
        github.com/opencontainers/image-spec v1.1.0-rc2.0.20221005185240-3a7f492d3f1b
        github.com/opencontainers/runtime-spec v1.0.3-0.20210326190908-1c3f411f0417
        github.com/sirupsen/logrus v1.9.3
+       github.com/stretchr/testify v1.8.4
        golang.org/x/sys v0.7.0
        gotest.tools/v3 v3.5.0
 )
@@ -28,6 +29,7 @@ require (
        github.com/containerd/fifo v1.0.0 // indirect
        github.com/containerd/log v0.1.0 // indirect
        github.com/coreos/go-systemd/v22 v22.3.2 // indirect
+       github.com/davecgh/go-spew v1.1.1 // indirect
        github.com/docker/go-events v0.0.0-20190806004212-e31b211e4f1c // indirect
        github.com/docker/go-units v0.4.0 // indirect
        github.com/godbus/dbus/v5 v5.0.6 // indirect
@@ -45,6 +47,7 @@ require (
        github.com/opencontainers/selinux v1.10.1 // indirect
        github.com/pelletier/go-toml v1.9.5 // indirect
        github.com/pkg/errors v0.9.1 // indirect
+       github.com/pmezard/go-difflib v1.0.0 // indirect
        go.opencensus.io v0.23.0 // indirect
        golang.org/x/net v0.8.0 // indirect
        golang.org/x/sync v0.1.0 // indirect
@@ -52,6 +55,7 @@ require (
        google.golang.org/genproto v0.0.0-20220502173005-c8bf987b8c21 // indirect
        google.golang.org/grpc v1.50.1 // indirect
        google.golang.org/protobuf v1.28.1 // indirect
+       gopkg.in/yaml.v3 v3.0.1 // indirect
 )

 replace (
diff --git a/integration/client/go.sum b/integration/client/go.sum
index 683f1380c..c2cfc9407 100644
--- a/integration/client/go.sum
+++ b/integration/client/go.sum
@@ -1040,6 +1040,7 @@ gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8
 gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
 gopkg.in/check.v1 v1.0.0-20190902080502-41f04d3bba15/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
 gopkg.in/check.v1 v1.0.0-20200227125254-8fa46927fb4f/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
+gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk=
 gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q=
 gopkg.in/errgo.v2 v2.1.0/go.mod h1:hNsd1EY+bozCKY1Ytp96fpM3vjJbqLJn88ws8XvfDNI=
 gopkg.in/fsnotify.v1 v1.4.7/go.mod h1:Tz8NjZHkW78fSQdbUxIjBTcgA1z1m8ZHf0WmKUhAMys=

@thaJeztah thaJeztah force-pushed the 1.6_backport_fix_deadlock branch from eb928c3 to 069aab2 Compare October 10, 2023 08:19
@thaJeztah
Copy link
Member Author

@thaJeztah thaJeztah force-pushed the 1.6_backport_fix_deadlock branch from 069aab2 to dfac9ff Compare October 10, 2023 10:13
Comment on lines -104 to +103
err = os.WriteFile(filepath.Join(b.Path, configFilename), spec, 0666)
err = os.WriteFile(filepath.Join(b.Path, oci.ConfigFilename), spec, 0666)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a reduced set of the changes from

if spec := spec.GetValue(); spec != nil {
// write the spec to the bundle
specPath := filepath.Join(b.Path, oci.ConfigFilename)
err = os.WriteFile(specPath, spec, 0666)
if err != nil {
return nil, fmt.Errorf("failed to write bundle spec: %w", err)
}
}
return b, nil
(code diverged on main), only taking the oci.ConfigFilename from that commit.

@thaJeztah thaJeztah force-pushed the 1.6_backport_fix_deadlock branch from dfac9ff to ab3735e Compare October 10, 2023 14:54
@estesp
Copy link
Member

estesp commented Oct 10, 2023

The install failpoint binaries step was never scripted into (via backport) the Vagrantfile in release/1.6, so the vagrant-based tests are failing unable to find the new runc-fp binary.

@fuweid
Copy link
Member

fuweid commented Oct 11, 2023

I think we can add one line

${GOPATH}/src/github.com/containerd/containerd/script/setup/install-failpoint-binaries

at

${GOPATH}/src/github.com/containerd/containerd/script/setup/install-runc

@akhilerm
Copy link
Member

The install failpoint binaries step was never scripted into (via backport) the Vagrantfile in release/1.6, so the vagrant-based tests are failing unable to find the new runc-fp binary.

The vagrant tests on runc/fedora-37 is failing in the release/1.6 branch also. But think that it is an unrelated error. Ref: https://github.com/containerd/containerd/actions/workflows/ci.yml?query=branch%3Arelease%2F1.6

thaJeztah and others added 5 commits October 11, 2023 13:20
Reduced patch, based on 419b5ab for
the 1.6 branch.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Marat Radchenko <marat@slonopotamus.org>
(cherry picked from commit 9e34b8b)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Previous code has already called `getContainer()`, just pass it into
`s.getContainerPids` to reduce unnecessary lock and map lookup.

Signed-off-by: Chen Yiyang <cyyzero@qq.com>
(cherry picked from commit 6604ff6)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
After pr containerd#8617, create handler of containerd-shim-runc-v2 will
call handleStarted() to record the init process and handle its exit.
Init process wouldn't quit so early in normal circumstances. But if
this screnario occurs, handleStarted() will call
handleProcessExit(), which will cause deadlock because create() had
acquired s.mu, and handleProcessExit() will try to lock it again.

So, I added a parameter muLocked to handleStarted to indicate whether
or not s.mu is currently locked, and thus deciding whether or not to
lock it when calling handleProcessExit.

Fix: containerd#9103
Signed-off-by: Chen Yiyang <cyyzero@qq.com>
(cherry picked from commit 68dd47e)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
(cherry picked from commit 11a7751)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@thaJeztah thaJeztah force-pushed the 1.6_backport_fix_deadlock branch from ab3735e to 7b61862 Compare October 11, 2023 11:21
@thaJeztah
Copy link
Member Author

I applied a reduced set of 419b5ab (as a separate commit); let's see if CI likes it.

Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on green

@thaJeztah
Copy link
Member Author

Failure looks to be in one of the new tests that was added 😢

=== FAIL: . TestIssue9103/should_be_stopped_status_if_init_has_been_killed (3.10s)
    container_linux_test.go:1575: 
        	Error Trace:	/home/runner/work/containerd/containerd/integration/client/container_linux_test.go:1575
        	Error:      	Not equal: 
        	            	expected: "created"
        	            	actual  : "stopped"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(containerd.ProcessStatus) (len=7) "created"
        	            	+(containerd.ProcessStatus) (len=7) "stopped"
        	            	 
        	Test:       	TestIssue9103/should_be_stopped_status_if_init_has_been_killed
    --- FAIL: TestIssue9103/should_be_stopped_status_if_init_has_been_killed (3.10s)

wondering if it's flaky; let me try restarting it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants