vsock semantics #445
Description
Background
Whilst working on the agent tracing (#415), I noticed that occasionally the agent's gRPC server was not shutting down. This is a big problem for tracing since we need to guarantee a clean agent shutdown. Since the gRPC server is not shutting down, it is stopping the agent from finalising trace spans, meaning only partial trace information is being sent back to the Jaeger agent on the host.
Findings
From what I can see, the hang has nothing to do with the gRPC library, the existing agent code, or the new agent tracing code. The problem appears to be vsock: setting use_vsock = false in configuration.toml stops the hang.
Observed vsock behaviour in kata-agent
Tracing the agent code shows that deep in the bowels of the gRPC server, the following is happening:
lis, err := vsock.Listen(vSockPort)
err = grpcServer.Serve(lis)
And grpcServer.Serve() is doing this (simplified):
for {
rawConn, err := lis.Accept()
if err != nil {
return err // the only way out of the loop
}
s.handleRawConn(rawConn)
}
Using a serial device or a unix domain socket, that Accept() call will fail when the client (the kata-runtime) disconnects, returning an error (a Yamux ErrSessionShutdown).
But with vsock, the Accept() call seemingly never fails at this point when kata-runtime disconnects: it always returns a new connection. As a result, the infinite loop never exits, hence the hang.
It may be that we are mis-using vsock somehow or it may be that we're hitting a kernel bug with the vsock module.
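To make the loop semantics concrete, here is a minimal sketch using only the Go standard library. It uses a plain TCP listener on the loopback interface as a stand-in (an assumption: it is neither vsock nor Yamux), and the names serve and runDemo are hypothetical. It only illustrates that, for an ordinary listening socket, a client disconnecting does not make Accept() fail; the loop exits only once the listener itself is closed. Whether the same reasoning explains the vsock behaviour above is not established here.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// serve mimics the accept loop above: it returns only when Accept fails.
func serve(lis net.Listener, accepted chan<- struct{}) error {
	for {
		conn, err := lis.Accept()
		if err != nil {
			return err
		}
		conn.Close()
		accepted <- struct{}{}
	}
}

// runDemo reports whether the loop exited after a client disconnected,
// and whether it exited after the listener itself was closed.
func runDemo() (exitedOnDisconnect, exitedOnClose bool, err error) {
	// Stand-in listener (assumption): loopback TCP instead of vsock.
	lis, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, false, err
	}
	accepted := make(chan struct{}, 1)
	done := make(chan error, 1)
	go func() { done <- serve(lis, accepted) }()

	// A client connects and immediately disconnects.
	c, err := net.Dial("tcp", lis.Addr().String())
	if err != nil {
		lis.Close()
		return false, false, err
	}
	c.Close()
	<-accepted

	// Give the loop a moment: a disconnect alone does not make Accept fail.
	select {
	case <-done:
		exitedOnDisconnect = true
	case <-time.After(200 * time.Millisecond):
	}

	// Closing the listener is what makes Accept return an error.
	lis.Close()
	select {
	case <-done:
		exitedOnClose = true
	case <-time.After(2 * time.Second):
	}
	return exitedOnDisconnect, exitedOnClose, nil
}

func main() {
	onDisconnect, onClose, err := runDemo()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("loop exited after client disconnect:", onDisconnect)
	fmt.Println("loop exited after listener close:", onClose)
}
```

In the serial and unix domain socket cases described above, the listener passed to Serve() is reported to fail with a Yamux ErrSessionShutdown when the client goes away, which is what ends the loop; a bare listening socket has no such per-client session to shut down.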
Recreating the problem
To help debug this issue, I've created a branch that:
- Allows the agent to be run stand-alone (and it also includes a ton of debug messages :)
- Provides a kata-agent-ctl command that provides just enough functionality to trigger the problem.
Agent
- Clone https://github.com/jodh-intel/agent-1/tree/run-on-host to $GOPATH/src/github.com/kata-containers/agent.
- Then:
$ make clean && make
Testing
Test using a unix domain socket
- Start kata-agent (server)
$ cd $GOPATH/src/github.com/kata-containers/agent
$ ./kata-agent --debug --grpc-trace --channelPath unix:///tmp/kata-agent.sock --no-udev
- Start client
$ cli/kata-agent-ctl --keep-alive --debug --agentAddress unix:///tmp/kata-agent.sock --enable-yamux
Running the client should display some output and exit. It should also cause the server to exit.
Test using a vsock socket
- Set up vsock on host
$ sudo modprobe vhost_vsock
$ sudo chmod 666 /dev/vhost-vsock /dev/vsock
- Start qemu with
-device vhost-vsock-pci,id=vhost-vsock-pci0,guest-cid=3
- Start kata-agent server (inside qemu)
$ sudo ./kata-agent --debug --grpc-trace --no-udev
- Start client (on host)
$ sudo cli/kata-agent-ctl --keep-alive --debug
(Note: the --keep-alive option doesn't appear to make any difference from what I can see, so it may be optional.)
What you will probably find is that the behaviour is now erratic:
- Sometimes the server will exit when the client exits (as expected).
- Sometimes the server will not exit when the client exits, but re-running the client will cause the server and client to exit.
- Sometimes the server will not exit when the client exits and re-running the client will cause the server to immediately exit, but the client will then hang waiting for the server.
@stefanha - could you take a look and sanity check the above? Is there anything obvious we're doing or not doing that might trigger behaviour like this? The agent file dealing with vsock is:
/cc @sboeuf, @mcastelino