[Azure IPAM] Pods scheduled before Cilium don't get state cleaned up #14233

@ti-mo

Description

Note: this currently impacts Azure IPAM (so, without CNI chaining enabled), but it's theoretically possible for this to affect other managed K8s products as well. As Azure IPAM is currently marked beta, this is not critical, but it is a fundamental issue that needs to be fixed before Azure IPAM can be recommended for production use.

Bug report

This affects Cilium versions 1.9 and earlier.

How to reproduce the issue

Consider the following scenario:

  1. A default AKS cluster is created. It comes with some default Deployments like coredns, metrics-server and tunnelfront.
  2. The default azure-vnet CNI plugin installs ebtables rules, routes and neighbor entries for each Pod on the node it's scheduled on.
  3. When the API and nodepools are up, Cilium is installed in Azure IPAM mode. The agent gets scheduled on the node and installs a CNI configuration that no longer includes azure-vnet.
  4. Now, the Pods get restarted in order to become managed by Cilium. Since azure-vnet is no longer in the CNI chain, ebtables rules and neigh entries are never cleaned up. Examples:
# ebtables-save
-A PREROUTING -p ARP --arp-op Request --arp-ip-dst 10.240.0.32 -j arpreply --arpreply-mac a2:96:88:20:cb:c6
-A PREROUTING -p IPv4 -i eth0 --ip-dst 10.240.0.32 -j dnat --to-dst a2:96:88:20:cb:c6 --dnat-target ACCEPT

and

# ip neigh
10.240.0.57 dev lxc_health lladdr 3a:d3:5d:06:09:0f REACHABLE
10.240.0.57 dev azure0  FAILED
  5. In the example above, the node no longer accepts inbound packets for IP 10.240.0.32, since the destination MAC is rewritten to a2:96:88:20:cb:c6, a MAC Cilium does not know about.
  6. Doubly problematic: if the lxc_health endpoint is assigned an address that was previously held by a non-Cilium-managed Pod, health checks performed by Cilium start failing, since inbound packets are dropped.
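For reference, the stale state in the example above can be cleared by hand roughly as follows. This is only a sketch: the rule arguments are copied from the ebtables-save output above, the variable names and the run helper are illustrative, and the arpreply/dnat targets are assumed to live in the nat table (the save output does not show a table header). The script only prints the commands unless run as root with DRY_RUN=0.

```shell
#!/bin/sh
# Illustrative manual cleanup of stale azure-vnet state for one Pod IP.
# Defaults are taken from the example above; override via environment.
POD_IP=${POD_IP:-10.240.0.32}
STALE_MAC=${STALE_MAC:-a2:96:88:20:cb:c6}
BRIDGE_IF=${BRIDGE_IF:-azure0}

# Print commands by default; set DRY_RUN=0 to actually execute them (root).
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$@"; else "$@"; fi; }

# arpreply and dnat are nat-table targets, so delete from nat PREROUTING.
run ebtables -t nat -D PREROUTING -p ARP --arp-op Request \
    --arp-ip-dst "$POD_IP" -j arpreply --arpreply-mac "$STALE_MAC"
run ebtables -t nat -D PREROUTING -p IPv4 -i eth0 --ip-dst "$POD_IP" \
    -j dnat --to-dst "$STALE_MAC" --dnat-target ACCEPT

# Drop the stale neighbour entry on the Azure bridge interface.
run ip neigh del "$POD_IP" dev "$BRIDGE_IF"
```

Running this per leaked Pod IP restores inbound connectivity for that address, but of course does not address the race itself.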

Possible Solutions

  • Flushing ebtables rules when the Cilium agent starts. Might cause issues for users that rely on additional node configurations.
  • Ask the user to explicitly flush ebtables rules after installing Cilium. Still leaves a race condition when scaling out a node pool, if Cilium isn't the first to be scheduled on the new node.
  • Schedule an additional DaemonSet to repeat the ebtables flushing periodically.
  • Implement targeted neigh and ebtables cleanup in Cilium for Pod IPs managed by Cilium.
  • Contribute chaining support to the azure-vnet plugin. This would allow azure-vnet to get invoked during CNI DEL events to clean up the resources it created. Will not buy us compatibility with older AKS clusters.
  • Work with Microsoft to be able to deploy AKS clusters with Cilium embedded, without ever installing azure-vnet.
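A minimal sketch of the flush approach from the first two options above, assuming azure-vnet's arpreply/dnat rules live in the nat table and its neighbour entries sit on the azure0 bridge (both assumptions, per the example output). As noted, this clobbers any other ebtables rules the operator relies on. It prints the commands by default; set DRY_RUN=0 to execute as root on the node.

```shell
#!/bin/sh
# Illustrative one-shot flush of azure-vnet leftovers after Cilium install.
BRIDGE_IF=${BRIDGE_IF:-azure0}

# Print commands by default; set DRY_RUN=0 to actually execute them (root).
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$@"; else "$@"; fi; }

# Flush the nat table (where arpreply/dnat rules live) and, for good
# measure, the filter table. This removes ALL rules, not just azure-vnet's.
run ebtables -t nat -F
run ebtables -t filter -F

# Flush all neighbour entries learned on the Azure bridge.
run ip neigh flush dev "$BRIDGE_IF"
```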

Alternatively, today, users that want to run Azure IPAM on AKS could roll their own nodepool images (which is not uncommon) without azure-vnet installed, or with its CNI configuration disabled. This would prevent the failure on fresh clusters, as well as during scale-out.

Metadata

Labels

area/azure: Impacts Azure based IPAM.
area/cni: Impacts the Container Networking Interface between Cilium and the orchestrator.
kind/bug: This is a bug in the Cilium logic.
