storage: Rebalance replicas based on resource utilization not completed QPS

The load-based rebalancing system uses QPS as its metric. This is subject to an interesting kind of negative feedback: when a node is overloaded, it starts to slow down, which reduces its QPS and, with it, the apparent urgency of rebalancing. We recently saw one cluster where this effect was so severe that the overloaded node actually had below-average QPS for the cluster, so ranges weren't getting rebalanced away from it.

Instead of QPS, we should be tracking lower-level metrics like CPU and disk utilization. In this case the workload was write-heavy and was being throttled by the disk (I think disk I/O may have increased super-linearly with the query load because of the write amplification caused by compactions).
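
To make the failure mode concrete, here is a minimal sketch in Go. It is not the actual allocator code: `StoreLoad`, `overloadedBy`, `byQPS`, `byUtilization`, the 10% threshold, and all of the numbers are hypothetical and chosen only for illustration. It flags stores whose metric exceeds the cluster mean by a threshold, once using completed QPS and once using the busiest of CPU and disk.

```go
package main

import "fmt"

// StoreLoad is a hypothetical snapshot of one store's load signals.
// These field names are assumptions for illustration only.
type StoreLoad struct {
	StoreID  int
	QPS      float64 // completed queries per second
	CPUUtil  float64 // fraction of CPU in use, 0..1
	DiskUtil float64 // fraction of disk bandwidth in use, 0..1
}

// byQPS ranks a store by completed QPS, the metric described above.
func byQPS(s StoreLoad) float64 { return s.QPS }

// byUtilization ranks a store by its most saturated resource.
func byUtilization(s StoreLoad) float64 {
	if s.CPUUtil > s.DiskUtil {
		return s.CPUUtil
	}
	return s.DiskUtil
}

// overloadedBy returns the IDs of stores whose metric exceeds the cluster
// mean by more than threshold (a fraction, e.g. 0.1 for 10%).
func overloadedBy(stores []StoreLoad, metric func(StoreLoad) float64, threshold float64) []int {
	if len(stores) == 0 {
		return nil
	}
	var sum float64
	for _, s := range stores {
		sum += metric(s)
	}
	mean := sum / float64(len(stores))

	var ids []int
	for _, s := range stores {
		if metric(s) > mean*(1+threshold) {
			ids = append(ids, s.StoreID)
		}
	}
	return ids
}

// exampleStores returns made-up data: store 4 is disk-throttled, so it
// completes fewer queries than its peers even though it is the busiest.
func exampleStores() []StoreLoad {
	return []StoreLoad{
		{StoreID: 1, QPS: 1000, CPUUtil: 0.50, DiskUtil: 0.40},
		{StoreID: 2, QPS: 1100, CPUUtil: 0.55, DiskUtil: 0.45},
		{StoreID: 3, QPS: 900, CPUUtil: 0.45, DiskUtil: 0.35},
		{StoreID: 4, QPS: 600, CPUUtil: 0.30, DiskUtil: 0.98},
	}
}

func main() {
	stores := exampleStores()
	fmt.Println("overloaded by QPS:        ", overloadedBy(stores, byQPS, 0.1))
	fmt.Println("overloaded by utilization:", overloadedBy(stores, byUtilization, 0.1))
}
```

With this made-up data, the QPS metric flags stores 1 and 2 and never considers store 4, while the utilization metric flags only store 4. A real implementation would also need to decide how to combine CPU and disk signals; taking the maximum is just one simple choice.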

storage: Rebalance replicas based on resource utilization not completed QPS #34590
