[Azure IPAM] Pods scheduled before Cilium don't get state cleaned up #14233

@ti-mo

Description

Note: this currently impacts Azure IPAM (so, without CNI chaining enabled), but it's theoretically possible for this to affect other managed K8s products as well. As Azure IPAM is currently marked beta, this is not critical, but it is a fundamental issue that needs to be fixed before Azure IPAM can be recommended for production use.

Bug report

This affects Cilium versions 1.9 and earlier.

How to reproduce the issue

Consider the following scenario:

  1. A default AKS cluster is created. It comes with some default Deployments like coredns, metrics-server and tunnelfront.
  2. The default azure-vnet CNI plugin installs ebtables rules, routes and neighbor entries for each Pod on the node it's scheduled on.
  3. When the API and nodepools are up, Cilium is installed in Azure IPAM mode. The agent gets scheduled on the node and installs a CNI configuration that no longer includes azure-vnet.
  4. Now, the Pods get restarted in order to become managed by Cilium. Since azure-vnet is no longer in the CNI chain, ebtables rules and neigh entries are never cleaned up. Examples:
# ebtables-save
-A PREROUTING -p ARP --arp-op Request --arp-ip-dst 10.240.0.32 -j arpreply --arpreply-mac a2:96:88:20:cb:c6
-A PREROUTING -p IPv4 -i eth0 --ip-dst 10.240.0.32 -j dnat --to-dst a2:96:88:20:cb:c6 --dnat-target ACCEPT

and

# ip neigh
10.240.0.57 dev lxc_health lladdr 3a:d3:5d:06:09:0f REACHABLE
10.240.0.57 dev azure0  FAILED
  5. In the example above, the node no longer accepts inbound packets for IP 10.240.0.32, since the destination MAC is rewritten to a2:96:88:20:cb:c6, a MAC Cilium does not know about.
  6. Doubly problematic: if the lxc_health endpoint is assigned an address that was previously held by a non-Cilium-managed Pod, health checks performed by Cilium start failing, since inbound packets are dropped.
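For reference, the stale state in the example above can be cleared by hand roughly as follows. This is only a sketch: the rule arguments are copied from the ebtables-save output above, the variable names and the run helper are illustrative, and the arpreply/dnat targets are assumed to live in the nat table (the save output does not show a table header). The script only prints the commands unless run as root with DRY_RUN=0.

```shell
#!/bin/sh
# Illustrative manual cleanup of stale azure-vnet state for one Pod IP.
# Defaults are taken from the example above; override via environment.
POD_IP=${POD_IP:-10.240.0.32}
STALE_MAC=${STALE_MAC:-a2:96:88:20:cb:c6}
BRIDGE_IF=${BRIDGE_IF:-azure0}

# Print commands by default; set DRY_RUN=0 to actually execute them (root).
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$@"; else "$@"; fi; }

# arpreply and dnat are nat-table targets, so delete from nat PREROUTING.
run ebtables -t nat -D PREROUTING -p ARP --arp-op Request \
    --arp-ip-dst "$POD_IP" -j arpreply --arpreply-mac "$STALE_MAC"
run ebtables -t nat -D PREROUTING -p IPv4 -i eth0 --ip-dst "$POD_IP" \
    -j dnat --to-dst "$STALE_MAC" --dnat-target ACCEPT

# Drop the stale neighbour entry on the Azure bridge interface.
run ip neigh del "$POD_IP" dev "$BRIDGE_IF"
```

Running this per leaked Pod IP restores inbound connectivity for that address, but of course does not address the race itself.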

Possible Solutions

  • Flushing ebtables rules when the Cilium agent starts. Might cause issues for users that rely on additional node configurations.
  • Ask the user to explicitly flush ebtables rules after installing Cilium. Still leaves a race condition when scaling out a node pool, if Cilium isn't the first to be scheduled on the new node.
  • Schedule an additional DaemonSet to repeat the ebtables flushing periodically.
  • Implement targeted neigh and ebtables cleanup in Cilium for Pod IPs managed by Cilium.
  • Contribute chaining support to the azure-vnet plugin. This would allow azure-vnet to get invoked during CNI DEL events to clean up the resources it created. Will not buy us compatibility with older AKS clusters.
  • Work with Microsoft to be able to deploy AKS clusters with Cilium embedded, without ever installing azure-vnet.
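A minimal sketch of the flush approach from the first two options above, assuming azure-vnet's arpreply/dnat rules live in the nat table and its neighbour entries sit on the azure0 bridge (both assumptions, per the example output). As noted, this clobbers any other ebtables rules the operator relies on. It prints the commands by default; set DRY_RUN=0 to execute as root on the node.

```shell
#!/bin/sh
# Illustrative one-shot flush of azure-vnet leftovers after Cilium install.
BRIDGE_IF=${BRIDGE_IF:-azure0}

# Print commands by default; set DRY_RUN=0 to actually execute them (root).
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$@"; else "$@"; fi; }

# Flush the nat table (where arpreply/dnat rules live) and, for good
# measure, the filter table. This removes ALL rules, not just azure-vnet's.
run ebtables -t nat -F
run ebtables -t filter -F

# Flush all neighbour entries learned on the Azure bridge.
run ip neigh flush dev "$BRIDGE_IF"
```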

Alternatively, today, users that want to run Azure IPAM on AKS could roll their own nodepool images (which is not uncommon) without azure-vnet installed, or with its CNI configuration disabled. This would prevent the failure on fresh clusters, as well as during scale-out.

Metadata

Labels

area/azure: Impacts Azure based IPAM.
area/cni: Impacts the Container Networking Interface between Cilium and the orchestrator.
kind/bug: This is a bug in the Cilium logic.
