
Make killing shims more resilient#4204

Merged
crosbymichael merged 1 commit into containerd:master from ashrayjain:aj/add-kill-retry
Jun 3, 2020

Conversation

@ashrayjain
Contributor

ashrayjain commented Apr 22, 2020

Currently, we send a single SIGKILL to the shim process
once and then we spin in a loop where we use kill(pid, 0)
to detect when the pid has disappeared completely.

Unfortunately, this has a race condition since pids can be reused causing us
to spin in an infinite loop when that happens.

This adds a timeout to this loop which logs a warning and exits the infinite loop.

This fixes containerd/cri#1427

@ashrayjain
Contributor Author

ashrayjain commented Apr 22, 2020

Attaching some more relevant information that we found when debugging this issue

$ pidof containerd
1567
$ strace -fp 1567 -e trace=kill
strace: Process 1567 attached with 301 threads
[pid 13252] kill(13737, SIG_0)          = 0
[pid 11864] kill(13737, SIG_0)          = 0
[pid  6042] kill(13737, SIG_0)          = 0
[pid 11864] kill(13737, SIG_0)          = 0
[pid  6042] kill(13737, SIG_0)          = 0
[pid  7318] kill(13737, SIG_0)          = 0
[pid  8004] kill(13737, SIG_0)          = 0
[pid 11864] kill(13737, SIG_0)          = 0
[pid  8004] kill(13737, SIG_0)          = 0
[pid 20957] kill(13737, SIG_0)          = 0
[pid 11864] kill(13737, SIG_0)          = 0
[pid 11864] kill(13737, SIG_0)          = 0
[pid  6042] kill(13737, SIG_0)          = 0
[pid  6042] kill(13737, SIG_0)          = 0
...

and

$ ps aux | grep 13737
root     13737  0.0  0.0 110356  6576 ?        Sl   07:18   0:00 containerd-shim -namespace k8s.io -workdir /var/lib/container-runtime/containerd/io.containerd.runtime.v1.linux/k8s.io/051b2f90455d240b29f3130682e851c0e7fd6bc0e64121b5e77bcfb516c49b18 -address /run/containerd/containerd.sock -containerd-binary /usr/local/bin/containerd

@ashrayjain
Contributor Author

Containerd logs are full of
ERROR [2020-04-22T12:43:34.234344073Z] containerd.io/containerd/containerd: "StopPodSandbox for \"d50318c73c890f6a829fab36a980332b87b8133ff5f5b6c72f8693c3e1140d49\" failed" error="rpc error: code = Canceled desc = failed to stop container \"4f1dd102b497be8256c290e4da16128dd36c340575a8d0c36edd2ccae51e2f05\": an error occurs during waiting for container \"4f1dd102b497be8256c290e4da16128dd36c340575a8d0c36edd2ccae51e2f05\" to be killed: wait container \"4f1dd102b497be8256c290e4da16128dd36c340575a8d0c36edd2ccae51e2f05\": context canceled"

@ashrayjain
Contributor Author

I uploaded the stack trace dump from a containerd in this state here: https://gist.github.com/ashrayjain/f1bac2cc5bec2af5445268a3e1bc7fef

Contributor

@Zyqsempai left a comment

Please sign your commit, otherwise, CI will fail.

@ashrayjain
Contributor Author

Hey @Zyqsempai , I think I signed my commit already. Is there anything else needed to satisfy CI?

@estesp
Member

estesp commented Apr 22, 2020

You will need to rebase on master to pass CI once we merge #4206. Sorry--upstream change in golangci-lint broke our dev-tool-install script.

@ashrayjain
Contributor Author

@estesp no problem, done!

@theopenlab-ci

theopenlab-ci bot commented Apr 26, 2020

Build succeeded.

@codecov-io

Codecov Report

Merging #4204 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #4204   +/-   ##
=======================================
  Coverage   38.34%   38.34%           
=======================================
  Files          90       90           
  Lines       12728    12728           
=======================================
  Hits         4881     4881           
  Misses       7181     7181           
  Partials      666      666           
Flag Coverage Δ
#windows 38.34% <ø> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e094d36...c3d0845.

@BenTheElder
Contributor

BenTheElder commented Apr 27, 2020

Is this problem also relevant to the v2 shim?

@estesp
Member

estesp commented Apr 27, 2020

I spent some time trying to look into that late last week; my understanding is that the v2/ code doesn't have this same flow of waiting for a complete exit.

My problem with the PR is that it effectively is changing the semantics of "waitForExit" to "keepTryingToKillForAwhile" :) I'm not sure if @crosbymichael has had a chance to look at this as he is more intimately familiar with this code.

@ashrayjain
Contributor Author

@estesp to provide some more info here: we ran the killsnoop bpf tool (https://github.com/iovisor/bcc/blob/master/tools/killsnoop.py) to inspect and track all the kill signals being sent by containerd and the responses they were getting from the kernel.

Here is the output from an occurrence of this issue.

$ journalctl -u killsnoop | grep 10516
killsnoop.sh[922]: TIME      PID    COMM             SIG  TPID   RESULT
killsnoop.sh[922]: 08:39:02  24136  containerd       9    10516  0
killsnoop.sh[922]: 08:39:02  24136  containerd       0    10516  0
killsnoop.sh[922]: 08:39:02  24136  containerd       0    10516  0
killsnoop.sh[922]: 08:39:02  24357  containerd       0    10516  0
killsnoop.sh[922]: 08:39:02  419    containerd       0    10516  0
killsnoop.sh[922]: 08:39:02  30071  containerd       0    10516  0
$ strace -fp $(pidof containerd) -e trace=kill
strace: Process 1595 attached with 224 threads
[pid   699] kill(10516, SIG_0)          = 0
[pid   699] kill(10516, SIG_0)          = 0
[pid   699] kill(10516, SIG_0)          = 0
[pid  4991] kill(10516, SIG_0)          = 0
[pid 26715] kill(10516, SIG_0)          = 0
[pid 14796] kill(10516, SIG_0)          = 0
[pid 14796] kill(10516, SIG_0)          = 0
[pid   699] kill(10516, SIG_0)          = 0
[pid   699] kill(10516, SIG_0)          = 0
[pid   699] kill(10516, SIG_0)          = 0
[pid 26715] kill(10516, SIG_0)          = 0
[pid 26715] kill(10516, SIG_0)          = 0
[pid 14796] kill(10516, SIG_0)          = 0
[pid 14796] kill(10516, SIG_0)          = 0
[pid 26715] kill(10516, SIG_0)          = 0

As you can see, containerd sent a kill -9 to the pid 10516 and got back 0 as the result, however this process was not killed and kept running until manual action was taken. :(

Would this PR be acceptable if we made it clear that the method was retrying the kill as opposed to just waiting for exit?

Member

If kill doesn't work at that moment, could the caller retry killing the shim? It seems reasonable to retry if the first attempt fails.

For the case mentioned: if kill -9 fails the first time, can you kill it manually? And could this be a reused pid (the same pid, but belonging to a different container)?

Member

Would you mind checking the wchan result from ps axo pid,cmd,wchan if the shim is still hanging there?

Contributor Author

@fuweid yea, we have seen retrying the kill succeed when the first kill fails.

The manual kill succeeds.

31745 containerd-shim -namespace  futex_wait_queue_me

is the output for one such shim.

Member

Thanks for the information. And is it the same shim? I mean, the pid could have been reused.

Contributor Author

Yes, I confirmed the shim is the same based on the container id it was associated with.

@ashrayjain
Contributor Author

@estesp @BenTheElder @crosbymichael @fuweid is there anything else I can do to help push this forward?

@estesp
Member

estesp commented May 6, 2020

I spent a few minutes looking at the call path (services/tasks/local.go Delete -> runtime/v1 tasks Delete calls into the shim.KillShim which leads to signalShim with SIGKILL) trying to understand if there is a reasonable place to capture a failed kill and insert some form of retry. If the ctx in that flow had a timeout then that would be another route, but apparently it doesn't by default, hence the original problem of spinning forever on the unix.Kill with signal "zero".

It does turn out that this code (signalShim) is only called with SIGKILL as the StopShim function is never used from any caller in containerd at the moment. Otherwise, I assume the same problem might be observed with StopShim using SIGTERM; an infinite wait if the process doesn't respond/exit on SIGTERM.

Seems maybe the right path is to make signalShim a time-out based kill function (since all it is used for is killing pids) that retries itself until success, or exits with an error after a timeout or a specified # of retries? Thoughts @fuweid?

@fuweid
Member

fuweid commented May 7, 2020

I am not sure why kill -9 doesn't work and still want to find the root cause. :)

Back to the issue: for shim v1, containerd doesn't retry kill -9.
In the CRI plugin, task.Delete is called by the event handler when the container's init task has exited. task.Delete then calls t.shim.KillShim(ctx), and that is where we see containerd send the kill -9 signal.

If task.Delete fails, the container status will not be updated to exited and the CRI plugin will retry it. Check out the following code: if t.shim.Delete returns an error or c.ShimInfo returns an error, kill -9 will not be called.

// Delete the task and return the exit status
func (t *Task) Delete(ctx context.Context) (*runtime.Exit, error) {
        rsp, shimErr := t.shim.Delete(ctx, empty)
        if shimErr != nil {
                shimErr = errdefs.FromGRPC(shimErr)
                if !errdefs.IsNotFound(shimErr) {
                        return nil, shimErr
                }
        }
        t.tasks.Delete(ctx, t.id)
        if err := t.shim.KillShim(ctx); err != nil {
                log.G(ctx).WithError(err).Error("failed to kill shim")
        }
        if err := t.bundle.Delete(); err != nil {
                log.G(ctx).WithError(err).Error("failed to delete bundle")
        }
        if shimErr != nil {
                return nil, shimErr
        }
        t.events.Publish(ctx, runtime.TaskDeleteEventTopic, &eventstypes.TaskDelete{
                ContainerID: t.id,
                ExitStatus:  rsp.ExitStatus,
                ExitedAt:    rsp.ExitedAt,
                Pid:         rsp.Pid,
        })
        return &runtime.Exit{
                Status:    rsp.ExitStatus,
                Timestamp: rsp.ExitedAt,
                Pid:       rsp.Pid,
        }, nil
}

func (c *Client) signalShim(ctx context.Context, sig syscall.Signal) error {
        info, err := c.ShimInfo(ctx, empty)
        if err != nil {
                return err
        }
        pid := int(info.ShimPid)
        // make sure we don't kill ourselves if we are running a local shim
        if os.Getpid() == pid {
                return nil
        }
        if err := unix.Kill(pid, sig); err != nil && err != unix.ESRCH {
                return err
        }
        // wait for shim to die after being signaled
        select {
        case <-ctx.Done():
                return ctx.Err()
        case <-c.waitForExit(pid):
                return nil
        }
}

So I think we should check the event handler in the CRI plugin and see what error we hit.

Maybe related to #4198, because I checked the log from https://gist.github.com/ashrayjain/f1bac2cc5bec2af5445268a3e1bc7fef and found

# seems that the client ttrpc is hanging on close, but not sure which shim this ttrpc client belongs to

# [semacquire, 139 minutes] can be a clue

goroutine 3324776 [semacquire, 139 minutes]:
sync.runtime_Semacquire(0xc00185b618)
	/home/travis/.gimme/versions/go1.12.16.linux.amd64/src/runtime/sema.go:56 +0x3b
sync.(*WaitGroup).Wait(0xc00185b610)
	/home/travis/.gimme/versions/go1.12.16.linux.amd64/src/sync/waitgroup.go:130 +0x67
github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Client).run.func2(0xc00185b610, 0xc002015b00)
	/home/travis/gopath/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:246 +0x2d
created by github.com/containerd/containerd/vendor/github.com/containerd/ttrpc.(*Client).run
	/home/travis/gopath/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:245 +0x1a7

@ashrayjain would you mind providing more information from the CRI plugin event log? Something like failed to handle container TaskExit event? Thanks

cc @estesp

@fuweid
Member

fuweid commented May 9, 2020

ping @ashrayjain

@ashrayjain
Contributor Author

Hey @fuweid, apologies for the delay.

I'm waiting for another repro of this issue on our cluster to provide you with some more info on this.

@ashrayjain
Contributor Author

@fuweid @estesp So we just had another repro of this issue.

Here is what we found:
Containerd was stuck trying to kill pid 9750

$ strace -fp $(pidof containerd) -e trace=kill
strace: Process 1603 attached with 131 threads
[pid  6034] kill(9750, SIG_0)           = 0
[pid  6034] kill(9750, SIG_0)           = 0
[pid  6034] kill(9750, SIG_0)           = 0
[pid 28558] kill(9750, SIG_0)           = 0
[pid 28558] kill(9750, SIG_0)           = 0
[pid 30413] kill(9750, SIG_0)           = 0
[pid 30413] kill(9750, SIG_0)           = 0
[pid 30413] kill(9750, SIG_0)           = 0
[pid 30413] kill(9750, SIG_0)           = 0
[pid 30413] kill(9750, SIG_0)           = 0

Based on the killsnoop tool, we know that containerd sent a kill -9 to this pid and got back a 0 return value

11:21:14  12062  containerd       9    9750   0      R (running)
11:21:14  12062  containerd       0    9750   0      R (running)
11:21:14  5450   containerd       0    9750   0      S (sleeping)
11:21:14  28560  containerd       0    9750   0      S (sleeping)
11:21:14  28560  containerd       0    9750   0      S (sleeping)

On trying to look at the containerd logs, we saw

"shim containerd-shim started" address=/containerd-shim/21e89a11376d496299190e55578927b75988ab9664374ee924299553fdbfe45d.sock debug=false pid=9750

Based on how the code generates 21e89a11376d496299190e55578927b75988ab9664374ee924299553fdbfe45d, we were able to confirm that this was for sandbox id 3c6ced659501c69a1b0342c46cbb2c47c9dde045cf92f22fe82e96a4a1e22068

func (b *bundle) shimAddress(namespace string) string {
	d := sha256.Sum256([]byte(filepath.Join(namespace, b.id)))
	return filepath.Join(string(filepath.Separator), "containerd-shim", fmt.Sprintf("%x.sock", d))
}

and k8s.io/3c6ced659501c69a1b0342c46cbb2c47c9dde045cf92f22fe82e96a4a1e22068 hashes to 21e89a11376d496299190e55578927b75988ab9664374ee924299553fdbfe45d
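As a standalone check, that mapping can be reproduced outside containerd. This sketch mirrors the quoted shimAddress method (the function name here is made up; the real one hangs off the bundle struct):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"path/filepath"
)

// shimSocketAddress mirrors the shimAddress logic quoted above: the
// socket name is the sha256 of "<namespace>/<sandbox id>".
func shimSocketAddress(namespace, id string) string {
	d := sha256.Sum256([]byte(filepath.Join(namespace, id)))
	return filepath.Join(string(filepath.Separator), "containerd-shim", fmt.Sprintf("%x.sock", d))
}

func main() {
	fmt.Println(shimSocketAddress("k8s.io",
		"3c6ced659501c69a1b0342c46cbb2c47c9dde045cf92f22fe82e96a4a1e22068"))
}
```

Run against the sandbox id above, this should print the /containerd-shim/21e89a….sock address seen in the shim startup log.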

For this pod, if we run crictl inspectp, we get the following redacted info

{
  "status": {
    "id": "3c6ced659501c69a1b0342c46cbb2c47c9dde045cf92f22fe82e96a4a1e22068",
...
    "state": "SANDBOX_READY",
  },
  "info": {
    "pid": 9768,
    "processStatus": "deleted",
...
}

There was no process on this host with pid 9768 at this time.

Additionally, for the sandbox in question, we also have

11:21:14.276625531Z] containerd.io/containerd/containerd: "shim reaped" id=3c6ced659501c69a1b0342c46cbb2c47c9dde045cf92f22fe82e96a4a1e22068

which is at the exact time we observe the first kill -9 in killsnoop above.

Trying to find the "current" pid 9750 on the host, we find

$ pstree -pa -G -s 9750
systemd,1
  └─containerd,1603
      └─containerd-shim,9371 -namespace k8s.io -workdir ...
          └─tini,9438 -s -- ...
              └─java,9661 ...
                  └─{java},9750

So this is actually a thread being used by an unrelated container.
From this container's inspect output, we have

    "createdAt": "2020-05-23T11:21:13.699473306Z",
    "startedAt": "2020-05-23T11:21:14.141006424Z",

So in summary, what appears to have happened here is: containerd tried to kill the shim with pid 9768/9750 for sandbox 3c6ced659501c69a1b0342c46cbb2c47c9dde045cf92f22fe82e96a4a1e22068 (where one of those pids was probably a thread?) and succeeded in doing so. However, at roughly the same time, a different container started (the java container above) and one of the threads in that container got the same pid (9750). This meant containerd got stuck in its kill -0 loop, since from the kernel's perspective that pid exists, even though containerd had just deleted it.

@fuweid it seems when you first suggested that the pid might be getting reused, you were on to something ;)

Do you folks have ideas on how to handle this situation better?

@tedyu
Contributor

tedyu commented May 23, 2020

Was the

    "startedAt": "2020-05-23T11:21:14.141006424Z",

part of the output of 'crictl inspectp'?
(I ran the same command on a node with a cri-o container but didn't see this field.)

I wonder if the start time could be obtained to disambiguate whether the same process id was reused.

@ashrayjain
Contributor Author

It was part of the crictl inspect output for the container

@fuweid
Member

fuweid commented May 24, 2020

@ashrayjain OK, it is a pid-reuse issue... 😂

I think @tedyu's idea is a reasonable way to prevent the kill -0 dead loop.

But is there any log related to failed to handle container TaskExit event? That is the key to knowing why the StopContainer action always times out. Thanks!

@ashrayjain force-pushed the aj/add-kill-retry branch 2 times, most recently from fba79c8 to e2c200a on May 30, 2020 17:42
@theopenlab-ci

theopenlab-ci bot commented May 30, 2020

Build succeeded.

@ashrayjain force-pushed the aj/add-kill-retry branch from e2c200a to 23e069e on May 30, 2020 17:45
@theopenlab-ci

theopenlab-ci bot commented May 30, 2020

Build succeeded.

@ashrayjain
Contributor Author

@fuweid would appreciate a look here when you get a chance.

Thanks for continuing to push this forward, I think we are almost there with the fix :)

Member

@estesp left a comment

LGTM

Currently, we send a single SIGKILL to the shim process
once and then we spin in a loop where we use kill(pid, 0)
to detect when the pid has disappeared completely.

Unfortunately, this has a race condition since pids can be reused causing us
to spin in an infinite loop when that happens.

This adds a timeout to this loop which logs a warning and exits the
infinite loop.

Signed-off-by: Ashray Jain <ashrayj@palantir.com>
@ashrayjain force-pushed the aj/add-kill-retry branch from 23e069e to 3e95727 on June 3, 2020 11:57
@ashrayjain
Contributor Author

@mikebrow @fuweid please take a look again. I switched it to rely on the timeout already present on the context.

I still kept the ticker because I think that's more idiomatic Go than using a time.Sleep for this.

@ashrayjain
Contributor Author

Additionally, I've fixed the initial delay issue, so now there won't be an initial 10ms wait.

@theopenlab-ci

theopenlab-ci bot commented Jun 3, 2020

Build succeeded.

Member

@mikebrow left a comment

LGTM

@crosbymichael
Member

LGTM

This is fine for v1 but the proper fix is to move to v2 for the runtime shim :)

@crosbymichael merged commit 7ce8a9d into containerd:master on Jun 3, 2020
@crosbymichael
Member

Thanks for your first PR @ashrayjain !

@fuweid
Member

fuweid commented Jun 3, 2020

@ashrayjain Thanks for your patience.

@fuweid
Member

fuweid commented Jun 3, 2020

@containerd/containerd-release is it good to cherry-pick into release/1.3?

@ashrayjain deleted the aj/add-kill-retry branch on June 3, 2020 21:57
@ashrayjain
Contributor Author

@fuweid is there another 1.3.x release scheduled in the near future?

@fuweid
Member

fuweid commented Jun 16, 2020

@ashrayjain #4307 has been merged. Not sure about the 1.3.5 release schedule :)


Development

Successfully merging this pull request may close these issues.

CRI stops receiving events, causes timeouts in StopContainer and StopSandboxContainer
