A test framework to support large scale performance test

(This is a task description for the Centaurus Summer Camp 2021 event)

**Goal**:

Develop a test framework (including a lightweight node simulation) to reduce the cost of Arktos control plane scalability testing. It should be much more lightweight than Kubemark/HollowNode, since it does not need to simulate a full kubelet.

**Motivation**:
An important metric for Arktos is scalability, which indicates how many hosts/services can run in a single cluster. Community Kubernetes currently supports 10K nodes in a single cluster, and Arktos supports 20K to 50K nodes depending on the settings. The goal of Arktos is to support 100K nodes in a single cluster.

However, the current scalability performance test, which uses Kubemark to simulate kubelets with hollow nodes, is very costly. A hollow node simulates a full-fledged kubelet except for the runtime part. Running 50K hollow nodes requires hundreds of VMs.

We have observed that the system bottleneck is often in the master components, such as ETCD, the API server, the scheduler, and the controller manager, rather than in the kubelet. A test framework that lets test cases simulate real client usage of the Arktos control plane with a lightweight node simulation tool would save a lot of money in scalability testing.

The idea is to develop a test framework that plugs a lightweight node simulation tool into the existing Arktos cluster master components and runs the original performance test tool against the cluster. The node simulation tool shall be able to simulate the original kubelet and communicate with the API server with fake but configurable latencies.
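To make the "fake but configurable latencies" idea concrete, here is a minimal sketch in Python. It is not an Arktos or kubelet API: `LatencyConfig`, `SimulatedNode`, and the `send` callback are hypothetical names chosen for illustration, and the node advances a virtual clock instead of sleeping or talking to a real API server.

```python
import random
from dataclasses import dataclass


@dataclass
class LatencyConfig:
    """Hypothetical latency settings for one simulated node, in milliseconds."""
    mean_ms: float = 50.0
    jitter_ms: float = 10.0

    def sample(self, rng: random.Random) -> float:
        # Draw one latency value; clamp at zero so jitter never makes it negative.
        return max(0.0, rng.gauss(self.mean_ms, self.jitter_ms))


class SimulatedNode:
    """Stands in for a kubelet: it only reports status and runs no containers."""

    def __init__(self, name: str, latency: LatencyConfig, seed: int = 0):
        self.name = name
        self.latency = latency
        self.rng = random.Random(seed)
        self.clock_ms = 0.0  # virtual clock, so the sketch finishes instantly

    def heartbeat(self, send) -> float:
        # Advance the virtual clock by the injected latency, then report Ready.
        delay = self.latency.sample(self.rng)
        self.clock_ms += delay
        send({"node": self.name, "status": "Ready", "at_ms": self.clock_ms})
        return delay


# Usage: collect heartbeats in a list instead of sending them anywhere.
received = []
node = SimulatedNode("sim-node-0", LatencyConfig(mean_ms=50.0, jitter_ms=0.0))
for _ in range(3):
    node.heartbeat(received.append)
```

In an actual tool, `send` would presumably be replaced by a client call that updates the node status on the API server, and the virtual clock by real waiting. Keeping the latency model in its own small config object is one way to make it adjustable per test run.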

**Deliverables**:

* A test framework that has a lightweight node simulation tool but runs a cluster with a real API server, ETCD, controller manager, and scheduler
* The current performance test tool shall be able to run against the test framework without changes, i.e., the framework is client agnostic
* The test framework shall be able to simulate a cluster with at least a few thousand nodes

**Requirements**:
* The test framework shall be able to run on a single machine, start with a single command, and simulate a cluster with several thousand nodes.
* The test framework shall use the existing Arktos code, ideally without any code changes to the API server, controller manager, scheduler, or kubelet, i.e., it shall not affect the existing system
* After bringing the cluster up in the test framework, we shall see the simulated nodes in the local cluster, and kubectl/API server requests shall get the same responses as in the original cluster
* The node simulation tool in the framework shall be able to do the following:
    - Configure and simulate client latencies
    - Simulate some random node failure events
* The test framework shall be able to run a 50K-node test with only a handful of machines
* The Arktos perf test shall be able to run on this test framework
* We will use the Arktos perf test to drive this test framework and check how many nodes an API server can handle
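The "random node failure events" requirement above could be prototyped along these lines. This is only a sketch under stated assumptions: `simulate_failures`, its parameters, and the `sim-node-N` naming are hypothetical, and each node is assumed to fail independently with a fixed probability per round and to stay failed afterwards.

```python
import random


def simulate_failures(node_count, failure_prob, rounds, seed=0):
    """Mark simulated nodes NotReady at random and return the resulting events.

    Assumption for illustration: a node that fails never recovers.
    """
    rng = random.Random(seed)  # seeded so a test run can be reproduced
    ready = [True] * node_count
    events = []
    for round_no in range(rounds):
        for i in range(node_count):
            if ready[i] and rng.random() < failure_prob:
                ready[i] = False
                events.append((round_no, f"sim-node-{i}", "NotReady"))
    return ready, events


# Usage: a few thousand simulated nodes, each with a 1% failure chance per round.
ready, events = simulate_failures(node_count=5000, failure_prob=0.01, rounds=10)
print(f"{len(events)} of {len(ready)} simulated nodes failed")
```

A fixed seed keeps failure patterns repeatable between runs, which may help when comparing control plane behavior across configurations.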

**Advisor(s)**: @Sindica (Ying Huang)

**Resource Links:**
* The script to bring up a local cluster. By default the cluster has only one node: https://github.com/CentaurusInfra/arktos/blob/master/hack/arktos-up.sh
* The perf test tool that Arktos uses: https://github.com/CentaurusInfra/arktos/blob/master/perf-tests/clusterloader2/run-e2e.sh

A test framework to support large scale performance test #1085