
Make killing shims more resilient#4204

Merged
crosbymichael merged 1 commit into containerd:master from ashrayjain:aj/add-kill-retry
Jun 3, 2020

Conversation

@ashrayjain
Contributor

ashrayjain commented Apr 22, 2020

Currently, we send a single SIGKILL to the shim process
once and then we spin in a loop where we use kill(pid, 0)
to detect when the pid has disappeared completely.

Unfortunately, this has a race condition since pids can be reused causing us
to spin in an infinite loop when that happens.

This adds a timeout to this loop which logs a warning and exits the infinite loop.

This fixes containerd/cri#1427

@ashrayjain
Contributor Author

ashrayjain commented Apr 22, 2020

Attaching some more relevant information that we found when debugging this issue

$ pidof containerd
1567
$ strace -fp 1567 -e trace=kill
strace: Process 1567 attached with 301 threads
[pid 13252] kill(13737, SIG_0)          = 0
[pid 11864] kill(13737, SIG_0)          = 0
[pid  6042] kill(13737, SIG_0)          = 0
[pid 11864] kill(13737, SIG_0)          = 0
[pid  6042] kill(13737, SIG_0)          = 0
[pid  7318] kill(13737, SIG_0)          = 0
[pid  8004] kill(13737, SIG_0)          = 0
[pid 11864] kill(13737, SIG_0)          = 0
[pid  8004] kill(13737, SIG_0)          = 0
[pid 20957] kill(13737, SIG_0)          = 0
[pid 11864] kill(13737, SIG_0)          = 0
[pid 11864] kill(13737, SIG_0)          = 0
[pid  6042] kill(13737, SIG_0)          = 0
[pid  6042] kill(13737, SIG_0)          = 0
...

and

$ ps aux | grep 13737
root     13737  0.0  0.0 110356  6576 ?        Sl   07:18   0:00 containerd-shim -namespace k8s.io -workdir /var/lib/container-runtime/containerd/io.containerd.runtime.v1.linux/k8s.io/051b2f90455d240b29f3130682e851c0e7fd6bc0e64121b5e77bcfb516c49b18 -address /run/containerd/containerd.sock -containerd-binary /usr/local/bin/containerd

@ashrayjain
Contributor Author

Containerd logs are full of
ERROR [2020-04-22T12:43:34.234344073Z] containerd.io/containerd/containerd: "StopPodSandbox for \"d50318c73c890f6a829fab36a980332b87b8133ff5f5b6c72f8693c3e1140d49\" failed" error="rpc error: code = Canceled desc = failed to stop container \"4f1dd102b497be8256c290e4da16128dd36c340575a8d0c36edd2ccae51e2f05\": an error occurs during waiting for container \"4f1dd102b497be8256c290e4da16128dd36c340575a8d0c36edd2ccae51e2f05\" to be killed: wait container \"4f1dd102b497be8256c290e4da16128dd36c340575a8d0c36edd2ccae51e2f05\": context canceled"

@ashrayjain
Contributor Author

I uploaded the stack trace dump from a containerd in this state here: https://gist.github.com/ashrayjain/f1bac2cc5bec2af5445268a3e1bc7fef

Contributor

@Zyqsempai left a comment

Please sign your commit, otherwise, CI will fail.

@ashrayjain
Contributor Author

Hey @Zyqsempai , I think I signed my commit already. Is there anything else needed to satisfy CI?

@estesp
Member

estesp commented Apr 22, 2020

You will need to rebase on master to pass CI once we merge #4206. Sorry--upstream change in golangci-lint broke our dev-tool-install script.

@ashrayjain
Contributor Author

@estesp no problem, done!

@theopenlab-ci

theopenlab-ci bot commented Apr 26, 2020

Build succeeded.

@codecov-io

Codecov Report

Merging #4204 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #4204   +/-   ##
=======================================
  Coverage   38.34%   38.34%           
=======================================
  Files          90       90           
  Lines       12728    12728           
=======================================
  Hits         4881     4881           
  Misses       7181     7181           
  Partials      666      666           
Flag Coverage Δ
#windows 38.34% <ø> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e094d36...c3d0845.

@BenTheElder
Contributor

BenTheElder commented Apr 27, 2020

Is this problem also relevant to the v2 shim?

@estesp
Member

estesp commented Apr 27, 2020

I spent some time trying to look into that late last week; my understanding is that the v2/ code doesn't have this same flow of waiting for a complete exit.

My problem with the PR is that it effectively is changing the semantics of "waitForExit" to "keepTryingToKillForAwhile" :) I'm not sure if @crosbymichael has had a chance to look at this as he is more intimately familiar with this code.

@ashrayjain
Contributor Author

@estesp to provide some more info here: we ran the killsnoop bpf tool (https://github.com/iovisor/bcc/blob/master/tools/killsnoop.py) to inspect and track all the kill signals being sent by containerd and the responses they were getting from the kernel.

Here is the output from an occurrence of this issue.

$ journalctl -u killsnoop | grep 10516
killsnoop.sh[922]: TIME      PID    COMM             SIG  TPID   RESULT
killsnoop.sh[922]: 08:39:02  24136  containerd       9    10516  0
killsnoop.sh[922]: 08:39:02  24136  containerd       0    10516  0
killsnoop.sh[922]: 08:39:02  24136  containerd       0    10516  0
killsnoop.sh[922]: 08:39:02  24357  containerd       0    10516  0
killsnoop.sh[922]: 08:39:02  419    containerd       0    10516  0
killsnoop.sh[922]: 08:39:02  30071  containerd       0    10516  0
$ strace -fp $(pidof containerd) -e trace=kill
strace: Process 1595 attached with 224 threads
[pid   699] kill(10516, SIG_0)          = 0
[pid   699] kill(10516, SIG_0)          = 0
[pid   699] kill(10516, SIG_0)          = 0
[pid  4991] kill(10516, SIG_0)          = 0
[pid 26715] kill(10516, SIG_0)          = 0
[pid 14796] kill(10516, SIG_0)          = 0
[pid 14796] kill(10516, SIG_0)          = 0
[pid   699] kill(10516, SIG_0)          = 0
[pid   699] kill(10516, SIG_0)          = 0
[pid   699] kill(10516, SIG_0)          = 0
[pid 26715] kill(10516, SIG_0)          = 0
[pid 26715] kill(10516, SIG_0)          = 0
[pid 14796] kill(10516, SIG_0)          = 0
[pid 14796] kill(10516, SIG_0)          = 0
[pid 26715] kill(10516, SIG_0)          = 0

As you can see, containerd sent a kill -9 to the pid 10516 and got back 0 as the result, however this process was not killed and kept running until manual action was taken. :(

Would this PR be acceptable if we made it clear that the method was retrying the kill as opposed to just waiting for exit?

Member

If kill doesn't work at that moment, could the caller retry killing the shim? It seems reasonable to retry if the first attempt fails.

For the case mentioned: if kill -9 fails the first time, can you kill it manually? And could this be a reused pid (the same pid, but belonging to a different container)?

Member

Would you mind checking the wchan result from ps axo pid,cmd,wchan if the shim is still hanging there?

Contributor Author

@fuweid yea, we have seen retrying the kill succeed when the first kill fails.

The manual kill succeeds.

31745 containerd-shim -namespace  futex_wait_queue_me

is the output for one such shim.

Member

Thanks for the information. And is it the same shim? I mean, the pid could have been reused.

Contributor Author

Yes, I confirmed the shim is the same based on the container id it was associated with.

@ashrayjain
Contributor Author

@estesp @BenTheElder @crosbymichael @fuweid is there anything else I can do to help push this forward?

@estesp
Member

estesp commented May 6, 2020

I spent a few minutes looking at the call path (services/tasks/local.go Delete -> runtime/v1 tasks Delete calls into the shim.KillShim which leads to signalShim with SIGKILL) trying to understand if there is a reasonable place to capture a failed kill and insert some form of retry. If the ctx in that flow had a timeout then that would be another route, but apparently it doesn't by default, hence the original problem of spinning forever on the unix.Kill with signal "zero".

It does turn out that this code (signalShim) is only called with SIGKILL as the StopShim function is never used from any caller in containerd at the moment. Otherwise, I assume the same problem might be observed with StopShim using SIGTERM; an infinite wait if the process doesn't respond/exit on SIGTERM.

Seems maybe the right path is to make signalShim a time-out based kill function (since all it is used for is killing pids) that retries itself until success, or exits with an error after a timeout or a specified # of retries? Thoughts @fuweid?

@fuweid
Member

fuweid commented May 7, 2020

I am not sure why kill -9 doesn't work and still want to find the root cause. :)

Back to the issue: for shim v1, containerd doesn't retry kill -9.
In the CRI plugin, task.Delete is called by the event handler when the container's init task has exited. task.Delete then calls t.shim.KillShim(ctx), and that is where we see containerd send the kill -9 signal.

If task.Delete fails, the container status will not be updated to exited and the CRI plugin will retry it. Check out the following code: if t.shim.Delete returns an error or c.ShimInfo returns an error, kill -9 will not be called.

// Delete the task and return the exit status
func (t *Task) Delete(ctx context.Context) (*runtime.Exit, error) {
        rsp, shimErr := t.shim.Delete(ctx, empty)
        if shimErr != nil {
                shimErr = errdefs.FromGRPC(shimErr)
                if !errdefs.IsNotFound(shimErr) {
                        return nil, shimErr
                }
        }
        t.tasks.Delete(ctx, t.id)
        if err := t.shim.KillShim(ctx); err != nil {
                log.G(ctx).WithError(err).Error("failed to kill shim")
        }
        if err := t.bundle.Delete(); err != nil {
                log.G(ctx).WithError(err).Error("failed to delete bundle")
        }
        if shimErr != nil {
                return nil, shimErr
        }
        t.events.Publish(ctx, runtime.TaskDeleteEventTopic, &eventstypes.TaskDelete{
                ContainerID: t.id,
                ExitStatus:  rsp.ExitStatus,
                ExitedAt:    rsp.ExitedAt,
                Pid:         rsp.Pid,
        })
        return &runtime.Exit{
                Status:    rsp.ExitStatus,
                Timestamp: rsp.ExitedAt,
                Pid:       rsp.Pid,
        }, nil
}

func (c *Client) signalShim(ctx context.Context, sig syscall.Signal) error {
        info, err := c.ShimInfo(ctx, empty)
        if err != nil {
                return err
        }
        pid := int(info.ShimPid)
        // make sure we don't kill ourselves if we are running a local shim
        if os.Getpid() == pid {
                return nil
        }
        if err := unix.Kill(pid, sig); err != nil && err != unix.ESRCH {
                return err
        }
        // wait for shim to die after being signaled
        select {
        case <-ctx.Done():
                return ctx.Err()
        case <-c.waitForExit(pid):
                return nil
        }
}

So I think we should check the event handler in the CRI plugin and see what error we hit.

Maybe related to #4198, because I checked the log from https://gist.github.com/ashrayjain/f1bac2cc5bec2af5445268a3e1bc7fef and found

# seems that the client ttrpc is hanging on close, but not sure which shim this ttrpc client belongs to

# [semacquire, 139 minutes] can be a clue

goroutine 3324776 [semacquire, 139 minutes]:
sync.runtime_Semacquire(0xc00185b618)
	/home/travis/.gimme/versions/go1.12.16.linux.amd64/src/runtime/sema.go:56 +0x3b
sync.(*WaitGroup).Wait(0xc00185b610)
	/home/travis/.gimme/versions/go1.12.16.linux.amd64/src/sync/waitgroup.go:130 +0x67
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Client).run.func2(0xc00185b610, 0xc002015b00)
	/home/travis/gopath/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:246 +0x2d
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Client).run
	/home/travis/gopath/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:245 +0x1a7

@ashrayjain would you mind providing more information from the CRI plugin event log? Something like failed to handle container TaskExit event? Thanks

cc @estesp

@fuweid
Member

fuweid commented May 9, 2020

ping @ashrayjain

@ashrayjain
Contributor Author

Hey @fuweid, apologies for the delay.

I'm waiting for another repro of this issue on our cluster to provide you with some more info on this.

@ashrayjain
Contributor Author

@fuweid @estesp So we just had another repro of this issue.

Here is what we found:
Containerd was stuck trying to kill pid 9750

$ strace -fp $(pidof containerd) -e trace=kill
strace: Process 1603 attached with 131 threads
[pid  6034] kill(9750, SIG_0)           = 0
[pid  6034] kill(9750, SIG_0)           = 0
[pid  6034] kill(9750, SIG_0)           = 0
[pid 28558] kill(9750, SIG_0)           = 0
[pid 28558] kill(9750, SIG_0)           = 0
[pid 30413] kill(9750, SIG_0)           = 0
[pid 30413] kill(9750, SIG_0)           = 0
[pid 30413] kill(9750, SIG_0)           = 0
[pid 30413] kill(9750, SIG_0)           = 0
[pid 30413] kill(9750, SIG_0)           = 0

Based on the killsnoop tool, we know that containerd sent a kill -9 to this pid and got back a 0 return value

11:21:14  12062  containerd       9    9750   0      R (running)
11:21:14  12062  containerd       0    9750   0      R (running)
11:21:14  5450   containerd       0    9750   0      S (sleeping)
11:21:14  28560  containerd       0    9750   0      S (sleeping)
11:21:14  28560  containerd       0    9750   0      S (sleeping)

On trying to look at the containerd logs, we saw

"shim containerd-shim started" address=/containerd-shim/21e89a11376d496299190e55578927b75988ab9664374ee924299553fdbfe45d.sock debug=false pid=9750

Based on how the code generates 21e89a11376d496299190e55578927b75988ab9664374ee924299553fdbfe45d, we were able to confirm that this was for sandbox id 3c6ced659501c69a1b0342c46cbb2c47c9dde045cf92f22fe82e96a4a1e22068

func (b *bundle) shimAddress(namespace string) string {
	d := sha256.Sum256([]byte(filepath.Join(namespace, b.id)))
	return filepath.Join(string(filepath.Separator), "containerd-shim", fmt.Sprintf("%x.sock", d))
}

and k8s.io/3c6ced659501c69a1b0342c46cbb2c47c9dde045cf92f22fe82e96a4a1e22068 hashes to 21e89a11376d496299190e55578927b75988ab9664374ee924299553fdbfe45d
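As a standalone check, that mapping can be reproduced outside containerd. This sketch mirrors the quoted shimAddress method (the function name here is made up; the real one hangs off the bundle struct):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"path/filepath"
)

// shimSocketAddress mirrors the shimAddress logic quoted above: the
// socket name is the sha256 of "<namespace>/<sandbox id>".
func shimSocketAddress(namespace, id string) string {
	d := sha256.Sum256([]byte(filepath.Join(namespace, id)))
	return filepath.Join(string(filepath.Separator), "containerd-shim", fmt.Sprintf("%x.sock", d))
}

func main() {
	fmt.Println(shimSocketAddress("k8s.io",
		"3c6ced659501c69a1b0342c46cbb2c47c9dde045cf92f22fe82e96a4a1e22068"))
}
```

Run against the sandbox id above, this should print the /containerd-shim/21e89a….sock address seen in the shim startup log.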

For this pod, if we run crictl inspectp, we get the following redacted info

{
  "status": {
    "id": "3c6ced659501c69a1b0342c46cbb2c47c9dde045cf92f22fe82e96a4a1e22068",
...
    "state": "SANDBOX_READY",
  },
  "info": {
    "pid": 9768,
    "processStatus": "deleted",
...
}

There was no process on this host with pid 9768 at this time.

Additionally, for the sandbox in question, we also have

11:21:14.276625531Z] containerd.io/containerd/containerd: "shim reaped" id=3c6ced659501c69a1b0342c46cbb2c47c9dde045cf92f22fe82e96a4a1e22068

which is at the exact time we observe the first kill -9 in killsnoop above.

Trying to find the "current" pid 9750 on the host, we find

$ pstree -pa -G -s 9750
systemd,1
  └─containerd,1603
      └─containerd-shim,9371 -namespace k8s.io -workdir ...
          └─tini,9438 -s -- ...
              └─java,9661 ...
                  └─{java},9750

So this is actually a thread being used by an unrelated container.
From this container's inspect output, we have

    "createdAt": "2020-05-23T11:21:13.699473306Z",
    "startedAt": "2020-05-23T11:21:14.141006424Z",

So in summary, what appears to have happened here is: containerd tried to kill the shim with pid 9768/9750 for sandbox 3c6ced659501c69a1b0342c46cbb2c47c9dde045cf92f22fe82e96a4a1e22068 (where one of those pids was probably a thread?) and succeeded in doing so. However, at roughly the same time, a different container started (the java container above) and one of the threads in that container got the same pid (9750). This meant containerd got stuck in its kill -0 loop, since from the kernel's perspective that pid exists, even though containerd had just deleted it.

@fuweid it seems when you first suggested that the pid might be getting reused, you were on to something ;)

Do you folks have ideas on how to handle this situation better?

@tedyu
Contributor

tedyu commented May 23, 2020

Was the

    "startedAt": "2020-05-23T11:21:14.141006424Z",

part of the output of 'crictl inspectp'?
(I ran the same command on a node with a cri-o container but didn't see this field.)

I wonder if the start time could be obtained to disambiguate whether the same process id was reused.

@ashrayjain
Contributor Author

It was part of the crictl inspect output for the container

@fuweid
Member

fuweid commented May 24, 2020

@ashrayjain OK, it is a pid-reuse issue... 😂

I think @tedyu's idea is a reasonable way to prevent the kill -0 dead loop.

But is there any log related to failed to handle container TaskExit event? That is the key to knowing why the StopContainer action always times out. Thanks!

@ashrayjain force-pushed the aj/add-kill-retry branch 2 times, most recently from fba79c8 to e2c200a on May 30, 2020 17:42
@theopenlab-ci

theopenlab-ci bot commented May 30, 2020

Build succeeded.

@ashrayjain force-pushed the aj/add-kill-retry branch from e2c200a to 23e069e on May 30, 2020 17:45
@theopenlab-ci

theopenlab-ci bot commented May 30, 2020

Build succeeded.

@ashrayjain
Contributor Author

@fuweid would appreciate a look here when you get a chance.

Thanks for continuing to push this forward, I think we are almost there with the fix :)

Member

@estesp left a comment

LGTM

Currently, we send a single SIGKILL to the shim process
once and then we spin in a loop where we use kill(pid, 0)
to detect when the pid has disappeared completely.

Unfortunately, this has a race condition since pids can be reused causing us
to spin in an infinite loop when that happens.

This adds a timeout to this loop which logs a warning and exits the
infinite loop.

Signed-off-by: Ashray Jain <ashrayj@palantir.com>
@ashrayjain force-pushed the aj/add-kill-retry branch from 23e069e to 3e95727 on June 3, 2020 11:57
@ashrayjain
Contributor Author

@mikebrow @fuweid please take a look again. I switched it to rely on the timeout already present on the context.

I still kept the ticker because I think that's more idiomatic Go than using a time.Sleep for this.

@ashrayjain
Contributor Author

Additionally, I've fixed the initial delay issue, so now there won't be an initial 10ms wait.

@theopenlab-ci

theopenlab-ci bot commented Jun 3, 2020

Build succeeded.

Member

@mikebrow left a comment

LGTM

@crosbymichael
Member

LGTM

This is fine for v1 but the proper fix is to move to v2 for the runtime shim :)

@crosbymichael merged commit 7ce8a9d into containerd:master on Jun 3, 2020
@crosbymichael
Member

Thanks for your first PR @ashrayjain !

@fuweid
Member

fuweid commented Jun 3, 2020

@ashrayjain Thanks for your patience.

@fuweid
Member

fuweid commented Jun 3, 2020

@containerd/containerd-release is it good to cherry-pick into release/1.3?

@ashrayjain deleted the aj/add-kill-retry branch on June 3, 2020 21:57
@ashrayjain
Contributor Author

@fuweid is there another 1.3.x release scheduled in the near future?

@fuweid
Member

fuweid commented Jun 16, 2020

@ashrayjain #4307 has been merged. Not sure about the 1.3.5 release schedule :)


Development

Successfully merging this pull request may close these issues.

CRI stops receiving events, causes timeouts in StopContainer and StopSandboxContainer
