kvserver: benchmark different cpu load based split thresholds #96869

@kvoli

Description

#96128 adds support for splitting a range once its leaseholder replica uses more CPU than kv.range_split.load_cpu_threshold. The default value of kv.range_split.load_cpu_threshold is 250ms of CPU time per second, or 1/4 of a CPU core.

This issue is to benchmark performance with different kv.range_split.load_cpu_threshold values set. The results should then inform a default value.

More specifically, benchmark ycsb, kv0, and kv95 on three nodes and bisect for the value that achieves the highest throughput.

The current value was selected by observing the performance of the cluster from a rebalancing perspective. The specific criterion was to limit how often a store is overfull relative to the mean but has no actions available to resolve being overfull. When running TPCE (50k), CPU splitting with a 250ms threshold performed 1 load-based split, whilst QPS splitting (2500) performed 12.5.

When running the allocbench/*/kv roachtest suite, CPU splitting (250ms) tended to make between 33% and 100% more load-based splits than QPS splitting (2500) on workloads involving reads (usually large scans), whilst on the write-heavy workloads the number of load-based splits was equally low for both.

Here's a comparison of splits when running TPCE on master (QPS splits) versus this branch with a 250ms threshold:

image.png

The same for allocbench (5 runs of each type; the order is r=0/access=skew, r=0/ops=skew, r=50/ops=skew, r=95/access=skew, r=95/ops=skew):
image copy 1.png

Jira issue: CRDB-24382

Labels

A-kv-distribution: Relating to rebalancing and leasing.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv: KV Team
