Bug reports
Title
On a Kubernetes cluster with 200+ nodes, Cilium can't handle the creation of 4000+ pods.
General Information
- Cilium version: v1.2.3
- Kernel version: Linux 4.14.67
- Orchestration system version in use: Kubernetes 1.10.7
- etcd version: 3.1.11
Cluster Configuration
|         | Instances | CPU per Instance | Memory per Instance |
|---------|-----------|------------------|---------------------|
| Masters | 3         | 4                | 16GB                |
| Nodes   | 200       | 8                | 15GB                |
How to reproduce the issue
- Deploy a Kubernetes cluster with 3 master nodes and 200 worker nodes
- Create 800 server pods and 5200 client pods, each client making 10 requests per second.
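For scale, the repro steps above imply the following aggregate load (a back-of-the-envelope sketch; the variable names are ours, the figures come from the steps above):

```python
# Pod counts and request rate taken from the reproduction steps above.
server_pods = 800
client_pods = 5200
requests_per_client_per_sec = 10

# Total pods the test tries to schedule across the 200 worker nodes.
total_pods = server_pods + client_pods

# Aggregate request rate generated by all client pods combined.
aggregate_rps = client_pods * requests_per_client_per_sec

print(f"total pods: {total_pods}")               # 6000
print(f"aggregate load: {aggregate_rps} req/s")  # 52000 req/s
```

That is roughly 30 pods and 260 requests per second per worker node when everything is evenly spread.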
Results
The following chart shows how the cluster behaves under this test scenario.

The gray line shows the number of Ready nodes in the cluster. It is extremely unstable due to metrics-server outages during the tests.
After reaching 3000 pods, the cluster started behaving erratically: some pods died, and the rate of pod creation declined until no new pods were created at all. The remaining 1500 pods were never created, and the cluster was practically unusable. See this section of the chart above.

The most recurrent error was:
```
Warning FailedCreatePodSandBox kubelet, ip-172-151-106-95.ec2.internal Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "ubuntu-curl-5b9f864c79-mtdj6_ubuntu-curl" network: add cmd: failed to assign an IP address to container
```
We also collected some resource usage metrics before metrics-server crashed. The blue, red, and orange lines represent the master nodes.

CPU usage increased sharply during the test, making the Kubernetes API unavailable. We suspect this behavior may be due to a misconfiguration, since running the same test with two other CNI plugins, Calico and Amazon VPC, produced very different results.
|                          | Calico | Amazon VPC | Cilium      |
|--------------------------|--------|------------|-------------|
| Masters max CPU usage    | 80%    | 80%        | 93%*        |
| Masters max memory usage | 25%    | 25%        | 28%*        |
| Nodes max CPU usage      | 14%    | 35%        | 43%*        |
| Nodes max memory usage   | 14%    | 16%        | 16%*        |
| Test duration            | 5 min  | 12 min     | interrupted |
\* reported before metrics-server broke
Is any specific configuration required for a large-scale cluster to handle this many pods and this much network traffic?