Is there an existing issue for this?
Version
equal or higher than v1.17.1 and lower than v1.18.0
What happened?
In a cluster of nearly 3,000 nodes, we modified the label of a namespace containing about 20,000 pods and more than 5,000 CiliumIdentity objects. This instantly generated a large amount of CiliumIdentity and CiliumEndpoint traffic, which caused the apiserver to crash.
- Currently, in clusters where Cilium is deployed, changing a namespace label instantly generates many new CiliumIdentity objects when the namespace contains many pods with distinct label sets.
- After these CiliumIdentity events are pushed to the API server, they are distributed in full to every node.
- These CiliumIdentity changes in turn trigger a large number of CiliumEndpoint update events.
Therefore, when a namespace contains many pods with distinct label sets and the cluster has a large number of nodes, changing the namespace label can easily put significant pressure on the API server and, in extreme cases, crash it.
How can we reproduce the issue?
- Use a cluster with as many nodes as possible.
- Place as many pods with distinct label sets as possible in a single namespace.
When the namespace label is modified, the traffic impact on the apiserver is proportional to the product of the two factors above.
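To give a rough sense of scale, here is a back-of-the-envelope sketch using the figures reported in this issue. The helper `estimate_fanout` and its field names are hypothetical, not part of Cilium. It assumes one new CiliumIdentity per existing identity in the namespace, every node watching all CiliumIdentity objects, and at least one CiliumEndpoint update per pod. Real numbers will differ (for example, deletions of old identities are not counted).

```python
def estimate_fanout(nodes: int, identities: int, pods: int) -> dict:
    """Rough lower-bound estimate of apiserver load after a namespace label change.

    Assumptions (not taken from Cilium source code):
      - each existing identity in the namespace is replaced by one new identity
      - every node receives every CiliumIdentity create event
      - each pod's CiliumEndpoint is updated at least once
    """
    return {
        # New CiliumIdentity objects written to the apiserver.
        "identity_writes": identities,
        # Watch events delivered: one per identity per watching node.
        "identity_watch_deliveries": identities * nodes,
        # CiliumEndpoint updates, one per pod that receives a new identity.
        "endpoint_updates": pods,
    }


if __name__ == "__main__":
    estimate = estimate_fanout(nodes=3000, identities=5000, pods=20000)
    for name, value in estimate.items():
        print(f"{name}: {value:,}")
```

Under these assumptions, the identity watch fan-out alone is 5,000 x 3,000 = 15,000,000 event deliveries from a single label change.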
Cilium Version
We use v1.13.11, but all versions have the same problem.
Kernel Version
5.10
Kubernetes Version
v1.30
Regression
No response
Sysdump
No response
Relevant log output
Anything else?
No response