Bug report
We found that when upgrading cilium-agent from 1.7.4 to 1.8.4, both node traffic and Pod traffic are interrupted for a few seconds (roughly 1 to 10 s in our observations). The problem is 100% reproducible.
Further tests show that even just restarting the cilium-agent container reproduces the interruption.
General Information
- Cilium version:
1.8.4
- Kernel version:
4.19.118
- NIC:
mlx5, Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
Note that the problem only appears on nodes equipped with Mellanox NICs.
Trace the problem
We noticed that each time the problem appears, a Link up message shows up in dmesg:
[Tue Oct 13 10:12:03 2020] mlx5_core 0000:08:00.0 eth0: Link up
[Tue Oct 13 10:12:03 2020] mlx5_core 0000:08:00.1 eth1: Link up
Digging further, we found that the link reset originates from init.sh, which runs on every agent start/restart and unconditionally performs XDP resets:
https://github.com/cilium/cilium/blob/master/bpf/init.sh#L658
# Remove bpf_xdp.o from previously used devices
for iface in $(ip -o -a l | awk '{print $2}' | cut -d: -f1 | cut -d@ -f1 | grep -v cilium); do
    [ "$iface" == "$XDP_DEV" ] && continue
    for mode in xdpdrv xdpgeneric; do
        xdp_unload "$iface" "$mode"
    done
done

function xdp_unload()
{
    DEV=$1
    MODE=$2
    ip link set dev $DEV $MODE off 2> /dev/null || true
}
It seems this works fine for NICs from many vendors (e.g. Intel), but not for Mellanox ones.
As a quick test, commenting out the xdp_unload call above makes the problem disappear.
@jaffcheng is working on a patch to fix this.
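One possible direction, shown only as a sketch and not necessarily what the actual patch does, is to skip the unload when no XDP program is attached, so the driver is never asked to reconfigure the link needlessly. The helper names `link_has_xdp` and `xdp_unload_if_attached` are hypothetical. The sketch assumes that `ip -o link show` prints a token such as `xdp`, `xdpgeneric`, `xdpoffload` or `xdpmulti` (or `prog/xdp` in detailed output) when a program is attached. That assumption should be verified against the iproute2 version in the agent image.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds (returns 0) if a line of `ip -o link show`
# output reports an attached XDP program.
# Assumption: iproute2 marks attachment with one of the tokens matched below.
link_has_xdp()
{
    local line=$1
    case " $line " in
        *" xdp "*|*" xdpgeneric "*|*" xdpoffload "*|*" xdpmulti "*|*" prog/xdp "*)
            return 0
            ;;
    esac
    return 1
}

# Hypothetical replacement for xdp_unload: only touch the device if
# something is actually attached to it.
xdp_unload_if_attached()
{
    local dev=$1
    local mode=$2
    local line

    # If the device cannot be queried, there is nothing to unload.
    line=$(ip -o link show dev "$dev" 2> /dev/null) || return 0

    if link_has_xdp "$line"; then
        ip link set dev "$dev" "$mode" off 2> /dev/null || true
    fi
}
```

With this in place, the loop in init.sh would call `xdp_unload_if_attached "$iface" "$mode"` instead of `xdp_unload`. On nodes where no XDP program was ever loaded, no `ip link set ... off` would be issued.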