Traffic disruption during cilium-agent upgrade/restart with Mellanox NICs #13526

@ArthurChiao

Description

Bug report

We found that when upgrading cilium-agent from 1.7.4 to 1.8.4, both node traffic and Pod traffic get interrupted for a few seconds (1~10s in our observations). The problem is 100% reproducible.

Further tests show that even restarting the cilium-agent container reproduces this interruption.
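For reference, this is roughly how we measured the disruption window: run a timestamped ping against a pod while restarting the agent, and report any pause in replies longer than one second. This is a sketch, not our exact tooling; the pod IP below is a placeholder.

```shell
# detect_gaps reads one epoch timestamp per line (as produced by
# `ping -D` after stripping the brackets) and reports pauses > 1s.
detect_gaps()
{
	awk '{ t = $1 + 0; if (prev && t - prev > 1) printf "gap: %.1fs\n", t - prev; prev = t }'
}

# Example usage while restarting cilium-agent (10.0.0.10 is a placeholder pod IP):
#   ping -D -i 0.2 10.0.0.10 | sed -n 's/^\[\([0-9.]*\)\].*/\1/p' | detect_gaps
```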

General Information

  • Cilium version: 1.8.4
  • Kernel version: 4.19.118
  • NIC: mlx5, Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

Note that we find this problem only appears on nodes that shipped with Mellanox NICs.

Trace the problem

We notice that each time the problem appears, there will be a Link up message in dmesg:

[Tue Oct 13 10:12:03 2020] mlx5_core 0000:08:00.0 eth0: Link up
[Tue Oct 13 10:12:03 2020] mlx5_core 0000:08:00.1 eth1: Link up

Further digging shows that it originates from init.sh, which unconditionally performs XDP resets on agent start/restart:

https://github.com/cilium/cilium/blob/master/bpf/init.sh#L658

function xdp_unload()
{
	DEV=$1
	MODE=$2

	ip link set dev $DEV $MODE off 2> /dev/null || true
}

# Remove bpf_xdp.o from previously used devices
for iface in $(ip -o -a l | awk '{print $2}' | cut -d: -f1 | cut -d@ -f1 | grep -v cilium); do
	[ "$iface" == "$XDP_DEV" ] && continue
	for mode in xdpdrv xdpgeneric; do
		xdp_unload "$iface" "$mode"
	done
done

It seems this works fine for NICs from many vendors (e.g. Intel), but not for Mellanox ones.

As a quick test, commenting out the xdp_unload calls above makes the problem disappear.
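One possible direction (a sketch only, not the actual patch) is to detach XDP only when a program is actually attached, so that NICs whose driver resets the link on `xdp off` (like mlx5) are left alone on a clean start. `has_xdp` below is a hypothetical helper that checks `ip link` output for an attached XDP program:

```shell
# has_xdp: succeed if its stdin (ip link output) mentions an attached
# XDP program in any mode (xdp, xdpgeneric, xdpdrv, xdpoffload).
has_xdp()
{
	grep -q -E '\bxdp(generic|drv|offload)?\b'
}

# xdp_unload_if_attached: like xdp_unload, but a no-op when no XDP
# program is attached, avoiding the needless link reset on mlx5.
xdp_unload_if_attached()
{
	DEV=$1
	MODE=$2

	if ip link show dev "$DEV" 2> /dev/null | has_xdp; then
		ip link set dev "$DEV" "$MODE" off 2> /dev/null || true
	fi
}
```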

@jaffcheng is working on a patch to fix this.

Labels: kind/bug (This is a bug in the Cilium logic.)