-
Notifications
You must be signed in to change notification settings - Fork 109
uevent listening goroutine is not (always) working #372
Description
As I was adding network hotplug support for vm factory and found out that the uevent listening goroutine is not working for me. After hotplugging a new nic, however long I wait in UpdateInterface, the uevent listening goroutine cannot get the new device sysfs uevent.
Then I tested the network monitor process (kata-containers/runtime#534), and it failed on getting the new uevent as well:
Sep 13 09:37:20 macbeth kata-proxy[54869]: time="2018-09-13T09:37:20.532400733+08:00" level=info msg="time="2018-09-13T01:37:20.511006103Z" level=debug msg="new request" name=kata-agent pid=128 req="interface:<device:"eth1" name:"eth1" IPAddresses:<family:2 address:"172.18.0.2" mask:"16" > mtu:1500 hwAddr:"02:42:ac:12:00:02" pciAddr:"02/01" > " request=/grpc.AgentService/UpdateInterface sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=agent\n" name=kata-proxy pid=54869 sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=agent
Sep 13 09:37:20 macbeth kata-proxy[54869]: time="2018-09-13T09:37:20.533607734+08:00" level=info msg="time="2018-09-13T01:37:20.512336531Z" level=info msg="Fetched PCI address for device" completePCIAddr="0000:00:02.0/0000:01:01.0" name=kata-agent pid=128 sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=agent\n" name=kata-proxy pid=54869 sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=agent
Sep 13 09:37:20 macbeth kata-proxy[54869]: time="2018-09-13T09:37:20.535074125+08:00" level=info msg="time="2018-09-13T01:37:20.513595652Z" level=info msg="Waiting on channel for device notification" name=kata-agent pciID=02/01 pid=128 sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=agent\n" name=kata-proxy pid=54869 sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=agent
Sep 13 09:37:23 macbeth kata-proxy[54869]: time="2018-09-13T09:37:23.537970764+08:00" level=info msg="time="2018-09-13T01:37:23.51786459Z" level=debug msg="request end" duration=3.005565104s error="rpc error: code = DeadlineExceeded desc = Timeout reached after 3s waiting for device 0000:00:02.0/0000:01:01.0" name=kata-agent pid=128 request=/grpc.AgentService/UpdateInterface resp="" sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=agent\n" name=kata-proxy pid=54869 sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=agent
Sep 13 09:37:23 macbeth kata-runtime[54984]: time="2018-09-13T09:37:23.540837194+08:00" level=error msg="update interface request failed" arch=amd64 command=kata-network container=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 error="rpc error: code = DeadlineExceeded desc = Timeout reached after 3s waiting for device 0000:00:02.0/0000:01:01.0" interface-requested="device:"eth1" name:"eth1" IPAddresses:<family:2 address:"172.18.0.2" mask:"16" > mtu:1500 hwAddr:"02:42:ac:12:00:02" pciAddr:"02/01" " name=kata-runtime pid=54984 resulting-interface="" sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=virtcontainers subsystem=kata_agent
Sep 13 09:37:23 macbeth kata-runtime[54984]: time="2018-09-13T09:37:23.540955376+08:00" level=error msg="add interface failed" arch=amd64 command=kata-network container=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 error="rpc error: code = DeadlineExceeded desc = Timeout reached after 3s waiting for device 0000:00:02.0/0000:01:01.0" name=kata-runtime pid=54984 resulting-interface="" sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=runtime
Sep 13 09:37:23 macbeth kata-runtime[54984]: time="2018-09-13T09:37:23.541177053+08:00" level=error msg="rpc error: code = DeadlineExceeded desc = Timeout reached after 3s waiting for device 0000:00:02.0/0000:01:01.0" arch=amd64 command=kata-network container=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 name=kata-runtime pid=54984 sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=runtime
It's not that the uevent is not sent to the agent. If I remove the waiting and ignore UpdateInterface error, the device shows up very quickly and the uevent listening goroutine is able to pick up uevents for it. So it's like the goroutine is not woke up when agent is waiting in UpdateInterface call. But I don't know why that is happening.
Also since netmon works in @sboeuf 's local environment, I put up a testing PR (kata-containers/tests#743) to get a baseline for environment comparison. And it turns out that two times out of three, it failed on getting the IP of new hotplugged nic. So the failure/success do not seem constant even for CI.
It's not clear to me what has caused the failure/success. @sboeuf @devimc @amshinde @caoruidong help to take a look.