Traffic disruption during cilium-agent upgrade/restart with Mellanox NICs #13526

@ArthurChiao

Description

Bug report

We found that when upgrading cilium-agent from 1.7.4 to 1.8.4, both node traffic and Pod traffic get interrupted for a few seconds (1~10s in our observations). The problem is 100% reproducible.

Further tests show that even restarting the cilium-agent container reproduces this interruption.
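For reference, this is roughly how we measured the disruption window: run a timestamped ping against a pod while restarting the agent, and report any pause in replies longer than one second. This is a sketch, not our exact tooling; the pod IP below is a placeholder.

```shell
# detect_gaps reads one epoch timestamp per line (as produced by
# `ping -D` after stripping the brackets) and reports pauses > 1s.
detect_gaps()
{
	awk '{ t = $1 + 0; if (prev && t - prev > 1) printf "gap: %.1fs\n", t - prev; prev = t }'
}

# Example usage while restarting cilium-agent (10.0.0.10 is a placeholder pod IP):
#   ping -D -i 0.2 10.0.0.10 | sed -n 's/^\[\([0-9.]*\)\].*/\1/p' | detect_gaps
```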

General Information

  • Cilium version: 1.8.4
  • Kernel version: 4.19.118
  • NIC: mlx5, Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

Note that we find this problem only appears on nodes that shipped with Mellanox NICs.

Trace the problem

We notice that each time the problem appears, there will be a Link up message in dmesg:

[Tue Oct 13 10:12:03 2020] mlx5_core 0000:08:00.0 eth0: Link up
[Tue Oct 13 10:12:03 2020] mlx5_core 0000:08:00.1 eth1: Link up

Further digging shows that it originates from init.sh, which unconditionally performs XDP resets on agent start/restart:

https://github.com/cilium/cilium/blob/master/bpf/init.sh#L658

function xdp_unload()
{
	DEV=$1
	MODE=$2

	ip link set dev $DEV $MODE off 2> /dev/null || true
}

# Remove bpf_xdp.o from previously used devices
for iface in $(ip -o -a l | awk '{print $2}' | cut -d: -f1 | cut -d@ -f1 | grep -v cilium); do
	[ "$iface" == "$XDP_DEV" ] && continue
	for mode in xdpdrv xdpgeneric; do
		xdp_unload "$iface" "$mode"
	done
done

It seems this works fine for NICs from many vendors (e.g. Intel), but not for Mellanox ones.

As a quick test, commenting out the xdp_unload calls above makes the problem disappear.
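One possible direction (a sketch only, not the actual patch) is to detach XDP only when a program is actually attached, so that NICs whose driver resets the link on `xdp off` (like mlx5) are left alone on a clean start. `has_xdp` below is a hypothetical helper that checks `ip link` output for an attached XDP program:

```shell
# has_xdp: succeed if its stdin (ip link output) mentions an attached
# XDP program in any mode (xdp, xdpgeneric, xdpdrv, xdpoffload).
has_xdp()
{
	grep -q -E '\bxdp(generic|drv|offload)?\b'
}

# xdp_unload_if_attached: like xdp_unload, but a no-op when no XDP
# program is attached, avoiding the needless link reset on mlx5.
xdp_unload_if_attached()
{
	DEV=$1
	MODE=$2

	if ip link show dev "$DEV" 2> /dev/null | has_xdp; then
		ip link set dev "$DEV" "$MODE" off 2> /dev/null || true
	fi
}
```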

@jaffcheng is working on a patch to fix this.

Labels: kind/bug (This is a bug in the Cilium logic.)