This repository was archived by the owner on May 12, 2021. It is now read-only.

vsock semantics #445

@jodh-intel

Background

Whilst working on the agent tracing (#415), I noticed that occasionally the agent's gRPC server was not shutting down. This is a big problem for tracing since we need to guarantee a clean agent shutdown. Since the gRPC server is not shutting down, it is stopping the agent from finalising trace spans, meaning only partial trace information is being sent back to the Jaeger agent on the host.

Findings

From what I can see, the hang has nothing to do with the gRPC library, the existing agent code, or the new agent tracing code. The problem appears to be vsock: setting use_vsock = false in configuration.toml stops the hang.

Observed vsock behaviour in kata-agent

Tracing the agent code shows that deep in the bowels of the gRPC server, the following is happening:

lis, err := vsock.Listen(vSockPort)
err = grpcServer.Serve(lis)

And grpcServer.Serve() is doing this:

for {
    rawConn, err := lis.Accept()
    if err != nil {
        // (simplified) an Accept() error is the only way out of this loop
        return err
    }

    s.handleRawConn(rawConn)
}

Using a serial device or a unix domain socket, that Accept() call will fail when the client (the kata-runtime) disconnects, returning an error (a Yamux ErrSessionShutdown).

But when kata-runtime connects over vsock, Accept() seemingly never fails at this point and always returns a new connection. As a result, the infinite loop never exits, hence the hang.

It may be that we are mis-using vsock somehow or it may be that we're hitting a kernel bug with the vsock module.

Recreating the problem

To help debug this issue, I've created a branch that:

  • Allows the agent to be run stand-alone (it also includes a ton of debug messages :)
  • Adds a kata-agent-ctl command that provides just enough functionality to trigger the problem.

Agent

Testing

Test using a unix domain socket

  • Start kata-agent (server)

    $ cd $GOPATH/src/github.com/kata-containers/agent
    $ ./kata-agent --debug --grpc-trace --channelPath unix:///tmp/kata-agent.sock --no-udev
  • Start client

    $ cli/kata-agent-ctl --keep-alive --debug --agentAddress unix:///tmp/kata-agent.sock --enable-yamux

Running the client should display some output and exit. It should also cause the server to exit.

Test using a vsock socket

  • Setup vsock on host

    $ sudo modprobe vhost_vsock
    $ sudo chmod 666 /dev/vhost-vsock /dev/vsock
  • Start qemu with -device vhost-vsock-pci,id=vhost-vsock-pci0,guest-cid=3

  • Start kata-agent server (inside qemu)

    $ sudo ./kata-agent --debug --grpc-trace --no-udev
  • Start client (on host)

    $ sudo cli/kata-agent-ctl --keep-alive --debug

    (Note: the --keep-alive option doesn't appear to improve matters as far as I can see, so it may be optional.)

What you will probably find is that the behaviour is now erratic:

  • Sometimes the server will exit when the client exits (as expected).
  • Sometimes the server will not exit when the client exits, but re-running the client will cause the server and client to exit.
  • Sometimes the server will not exit when the client exits and re-running the client will cause the server to immediately exit, but the client will then hang waiting for the server.

@stefanha - could you take a look and sanity check the above? Is there anything obvious we're doing or not doing that might trigger behaviour like this? The agent file dealing with vsock is:

/cc @sboeuf, @mcastelino
