storage: Rebalance replicas based on resource utilization not completed QPS #34590
Closed
Labels
A-admission-control
A-kv-distribution: Relating to rebalancing and leasing.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv: KV Team
Description
The load-based rebalancing system uses QPS as its metric. This is subject to an interesting kind of negative feedback: When a node is overloaded, it starts to slow down, reducing its QPS and the urgency of rebalancing. We recently saw one cluster where this effect was so severe that the overloaded node actually had below-average QPS for the cluster, so ranges weren't getting rebalanced away from it.
Instead of QPS, we should be tracking lower-level metrics like CPU and disk utilization. In this case the workload was write-heavy and was being throttled by the disk (I think disk I/O may have increased super-linearly with the query load because of the write amplification caused by compactions).
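To make the failure mode concrete, here is a rough Python sketch of the two signals side by side. It is not the allocator's actual logic: the `StoreLoad` type, the "max of CPU and disk" score, the 1.1x-of-mean threshold, and all of the numbers are assumptions made up for illustration. The point is only that a disk-throttled store can sit below the cluster's mean QPS while clearly standing out on utilization.

```python
from dataclasses import dataclass


@dataclass
class StoreLoad:
    """Hypothetical per-store load sample; field names are illustrative only."""
    name: str
    qps: float        # completed queries per second
    cpu_util: float   # fraction of CPU in use, 0.0 to 1.0
    disk_util: float  # fraction of disk bandwidth in use, 0.0 to 1.0


def utilization(store: StoreLoad) -> float:
    """Score a store by its most constrained resource (an assumed heuristic)."""
    return max(store.cpu_util, store.disk_util)


def overloaded_by_qps(stores, threshold=1.1):
    """Names of stores whose QPS exceeds threshold * cluster mean QPS."""
    mean = sum(s.qps for s in stores) / len(stores)
    return [s.name for s in stores if s.qps > threshold * mean]


def overloaded_by_utilization(stores, threshold=1.1):
    """Names of stores whose utilization exceeds threshold * cluster mean."""
    mean = sum(utilization(s) for s in stores) / len(stores)
    return [s.name for s in stores if utilization(s) > threshold * mean]


# Made-up sample: n1 is saturated on disk, so it completes fewer queries.
stores = [
    StoreLoad("n1", qps=900.0, cpu_util=0.40, disk_util=0.98),
    StoreLoad("n2", qps=1200.0, cpu_util=0.45, disk_util=0.50),
    StoreLoad("n3", qps=1300.0, cpu_util=0.50, disk_util=0.55),
]

print(overloaded_by_qps(stores))          # ['n3']
print(overloaded_by_utilization(stores))  # ['n1']
```

With these sample values the QPS signal flags the healthy store n3 and misses n1 entirely, while the utilization signal picks out n1, the store that is actually throttled.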