Note: this currently impacts Azure IPAM mode (i.e. without CNI chaining enabled), but it's theoretically possible for this to affect other managed K8s products as well. As Azure IPAM is currently marked beta, this is not critical, but it is a fundamental issue that needs to be fixed before Azure IPAM can be recommended for production use.
Bug report
This affects Cilium versions 1.9 and earlier.
How to reproduce the issue
Consider the following scenario:
- A default AKS cluster is created. It comes with some default Deployments like `coredns`, `metrics-server` and `tunnelfront`.
- The default `azure-vnet` CNI plugin installs ebtables rules, routes and neigh entries on the node for each Pod scheduled on it.
- When the API server and nodepools are up, Cilium is installed in Azure IPAM mode. The agent gets scheduled on the node and installs a CNI configuration that no longer includes `azure-vnet`.
- Now, the Pods get restarted in order to become managed by Cilium. Since `azure-vnet` is no longer in the CNI chain, its ebtables rules and neigh entries are never cleaned up. Examples:
```
# ebtables-save
-A PREROUTING -p ARP --arp-op Request --arp-ip-dst 10.240.0.32 -j arpreply --arpreply-mac a2:96:88:20:cb:c6
-A PREROUTING -p IPv4 -i eth0 --ip-dst 10.240.0.32 -j dnat --to-dst a2:96:88:20:cb:c6 --dnat-target ACCEPT
```
and
```
# ip neigh
10.240.0.57 dev lxc_health lladdr 3a:d3:5d:06:09:0f REACHABLE
10.240.0.57 dev azure0 FAILED
```
- In the example above, the node no longer accepts inbound packets for IP `10.240.0.32` since the destination MAC is overridden to `a2:96:88:20:cb:c6`, a MAC Cilium does not know about (see the verification sketch after this list).
- Doubly problematic: if the `lxc_health` endpoint gets an address assigned that was previously assigned to a non-Cilium-managed Pod, health checks performed by Cilium start failing since inbound packets are dropped.
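To confirm that the dropped traffic is caused by the stale entries rather than by Cilium itself, one can check whether the MAC the `dnat` rule rewrites to still exists on the node. A minimal verification sketch, using the illustrative addresses from the example above:

```sh
# Check whether the MAC from the stale dnat rule still belongs to any
# interface on the node. After the Pod was recreated under Cilium, it
# typically does not, so rewritten frames are delivered nowhere.
ip -o link | grep -i 'a2:96:88:20:cb:c6' \
  || echo "stale MAC: no interface on this node carries it"

# List the leftover rules for the Pod IP (arpreply/dnat live in the nat table).
ebtables -t nat -L PREROUTING | grep 10.240.0.32
```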
Possible Solutions
- Flush `ebtables` rules when the Cilium agent starts. Might cause issues for users that rely on additional node configuration.
- Ask the user to explicitly flush `ebtables` rules after installing Cilium. This still leaves a race condition when scaling out a node pool, if Cilium isn't the first workload scheduled on the new node.
- Schedule an additional DaemonSet to repeat the `ebtables` flushing periodically.
- Implement targeted neigh and ebtables cleanup in Cilium for Pod IPs managed by Cilium (a sketch of the node-side cleanup follows this list).
- Contribute chaining support to the `azure-vnet` plugin. This would allow `azure-vnet` to get invoked during CNI `DEL` events to clean up the resources it created. This will not buy us compatibility with older AKS clusters, though.
- Work with Microsoft to be able to deploy AKS clusters with Cilium embedded, without ever installing `azure-vnet`.
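For the targeted-cleanup option above, a minimal sketch of what the agent (or a privileged DaemonSet) would have to do per reclaimed Pod IP. The IP and MAC are the illustrative values from the example, and the exact rule specs would need to match whatever `azure-vnet` actually installed:

```sh
#!/bin/sh
# Targeted cleanup sketch (illustrative values): delete the stale azure-vnet
# entries for a single Pod IP that is now managed by Cilium. Requires
# CAP_NET_ADMIN in the host network namespace.
POD_IP=10.240.0.32
STALE_MAC=a2:96:88:20:cb:c6

# The arpreply and dnat targets live in the nat table; -D deletes the first
# rule matching the given spec.
ebtables -t nat -D PREROUTING -p ARP --arp-op Request --arp-ip-dst "$POD_IP" \
  -j arpreply --arpreply-mac "$STALE_MAC"
ebtables -t nat -D PREROUTING -p IPv4 -i eth0 --ip-dst "$POD_IP" \
  -j dnat --to-dst "$STALE_MAC" --dnat-target ACCEPT

# Drop the stale neigh entry on the azure0 bridge as well.
ip neigh del "$POD_IP" dev azure0 2>/dev/null || true
```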
Alternatively, today, users that want to run Azure IPAM on AKS could roll their own nodepool images (which is not uncommon) without `azure-vnet` installed, or with its CNI configuration disabled (see the sketch below). This would prevent the failure on fresh clusters as well as during scale-out.
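A sketch of what "disabling the CNI configuration" could look like on such a custom node image; the `10-azure.conflist` filename is an assumption and may differ between AKS versions:

```sh
# On the node image: kubelet picks the lexicographically first file in
# /etc/cni/net.d, so check what is present first.
ls -l /etc/cni/net.d/

# Rename the azure-vnet config (filename is an assumption) so kubelet
# ignores it; Cilium's own config (e.g. 05-cilium.conf) then takes over.
mv /etc/cni/net.d/10-azure.conflist /etc/cni/net.d/10-azure.conflist.disabled
```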