
Cilium can't handle pods massive creation on high scale Kubernetes cluster #5913

@luanguimaraesla

Description


Bug reports

Title

On a Kubernetes cluster with 200+ nodes, Cilium can't handle the creation of 4000+ pods.

General Information

  • Cilium version: v1.2.3
  • Kernel version: Linux 4.14.67
  • Orchestration system version in use: Kubernetes 1.10.7
  • etcd version: 3.1.11

Cluster Configuration

|         | Instances | CPU per instance | Memory per instance |
|---------|-----------|------------------|---------------------|
| Masters | 3         | 4                | 16GB                |
| Nodes   | 200       | 8                | 15GB                |

How to reproduce the issue

  1. Deploy a Kubernetes cluster with 3 master nodes and 200 worker nodes.
  2. Create 800 server pods and 5200 client pods, each client making 10 requests per second (roughly 52,000 requests per second in aggregate); a sketch of one client's workload is shown below.
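
For reference, a minimal sketch of what one client pod's loop might look like, assuming the 800 servers sit behind a single in-cluster Service; the Service name and namespace below are hypothetical, since the report does not give them:

```go
// client.go: sketch of one client pod issuing ~10 HTTP requests per second.
// The target URL is an assumption; the actual Service used in the test is not
// stated in this report.
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	const target = "http://server.default.svc.cluster.local/" // hypothetical Service DNS name

	// A 100ms ticker gives roughly 10 requests per second per client,
	// i.e. about 52,000 requests per second across all 5200 clients.
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	for range ticker.C {
		resp, err := http.Get(target)
		if err != nil {
			log.Printf("request failed: %v", err)
			continue
		}
		// Drain and close the body so keep-alive connections can be reused.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
}
```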

Results

The following chart shows how the cluster behaves under this test scenario.

[chart: pods]

The gray line counts how many Ready nodes the cluster has. It is extremely unstable due to metrics-server outages during the tests.

After reaching about 3000 pods, the cluster starts to behave unexpectedly: some pods die, and the rate of pod creation declines until no new pods are created at all. Roughly 1500 of the expected pods were never created, and the cluster was practically unusable. See the zoomed-in section of the chart below.

[chart: zoom_pods]

The most recurrent error was:

Warning   FailedCreatePodSandBox   kubelet, ip-172-151-106-95.ec2.internal    Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "ubuntu-curl-5b9f864c79-mtdj6_ubuntu-curl" network: add cmd: failed to assign an IP address to container

We also collected some resource usage metrics before metrics-server crashed. The blue, red, and orange lines represent the master nodes.

[chart: resources]

CPU usage increased critically during the test, making the Kubernetes API unavailable. We believe this behavior may be due to some misconfiguration, since running the same test with two other CNI plugins, Calico and the Amazon VPC CNI, produced very different results.

|                          | Calico | Amazon VPC | Cilium      |
|--------------------------|--------|------------|-------------|
| Masters max CPU usage    | 80%    | 80%        | 93%*        |
| Masters max memory usage | 25%    | 25%        | 28%*        |
| Nodes max CPU usage      | 14%    | 35%        | 43%*        |
| Nodes max memory usage   | 14%    | 16%        | 16%*        |
| Test duration            | 5 min  | 12 min     | interrupted |

* reported before metrics-server broke

Is any specific configuration required for a high-scale cluster to work with a large number of pods and heavy network traffic?

Labels

kind/bug (This is a bug in the Cilium logic), kind/community-report (This was reported by a user in the Cilium community, e.g. via Slack)
