kvserver: benchmark different cpu load based split thresholds #96869
Description
#96128 adds support for splitting a range once its leaseholder replica uses more CPU than kv.range_split.load_cpu_threshold. The default value of kv.range_split.load_cpu_threshold is 250ms of CPU time per second, i.e. 1/4 of a CPU core.
This issue is to benchmark performance with different kv.range_split.load_cpu_threshold values set. The results should then inform a default value.
More specifically, benchmark ycsb, kv0, and kv95 on three nodes and bisect a value that achieves the highest throughput.
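The search procedure above can be sketched as follows. This is an illustrative harness, not real roachtest code: bench stands in for running one of the workloads against a cluster configured with a candidate threshold and returning its throughput, and the search assumes throughput is roughly unimodal in the threshold.

```go
package main

import "fmt"

// bisectThreshold narrows in on the kv.range_split.load_cpu_threshold value
// (in milliseconds) with the highest throughput, assuming bench(ms) is
// roughly unimodal. It is a ternary search: each step discards the third of
// the interval on the side of the worse probe.
func bisectThreshold(lo, hi int, bench func(ms int) float64) int {
	for hi-lo > 10 { // stop once the window is within 10ms
		m1 := lo + (hi-lo)/3
		m2 := hi - (hi-lo)/3
		if bench(m1) < bench(m2) {
			lo = m1
		} else {
			hi = m2
		}
	}
	return (lo + hi) / 2
}

func main() {
	// Stand-in for running kv95 against a cluster: a synthetic throughput
	// curve peaking near a 300ms threshold.
	fakeBench := func(ms int) float64 {
		d := float64(ms - 300)
		return 10000 - d*d
	}
	fmt.Println(bisectThreshold(50, 1000, fakeBench))
}
```

In practice each bench call is a full benchmark run, so the 10ms stopping window trades precision against the number of runs.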
The current value was selected by observing the performance of the cluster from a rebalancing perspective. The specific criterion was to constrain how often a store was overfull relative to the mean while having no actions available to resolve being overfull. When running TPCE (50k), CPU splitting with a 250ms threshold performed 1 load based split, whilst QPS splitting (2500) performed 12.5.
When running the allocbench/*/kv roachtest suite, CPU splitting (250ms) tended to make between 33% and 100% more load based splits than QPS splitting (2500) on workloads involving reads (usually large scans), whilst on the write-heavy workloads the number of load based splits was identically low.
Here's a comparison of splits when running TPCE between master (QPS splits) and this branch (250ms CPU splits):
The same for allocbench (5 runs of each type; order is r=0/access=skew, r=0/ops=skew, r=50/ops=skew, r=95/access=skew, r=95/ops=skew):
Jira issue: CRDB-24382

