storage: Rebalance replicas based on resource utilization not completed QPS #34590

@bdarnell

Description

The load-based rebalancing system uses QPS as its metric. This is subject to an interesting kind of negative feedback: when a node is overloaded, it starts to slow down, which reduces its QPS and, with it, the apparent urgency of rebalancing. We recently saw one cluster where this effect was so severe that the overloaded node actually had below-average QPS for the cluster, so ranges weren't getting rebalanced away from it.

Instead of QPS, we should be tracking lower-level metrics like CPU and disk utilization. In this case the workload was write-heavy and was being throttled by the disk (I think disk I/O may have increased super-linearly with the query load because of the write amplification caused by compactions).
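
To make the failure mode concrete, here is a minimal sketch with made-up numbers. It is not CockroachDB's allocator code: the `StoreLoad` type, both functions, and the 1.1x threshold are hypothetical and exist only for illustration. It compares flagging stores by completed QPS against flagging them by their most-saturated resource.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StoreLoad:
    """Hypothetical per-store load sample; field names are illustrative."""
    store_id: int
    qps: float        # completed queries per second
    cpu_util: float   # fraction of CPU in use, 0.0 to 1.0
    disk_util: float  # fraction of disk bandwidth in use, 0.0 to 1.0


def overloaded_by_qps(stores, threshold=1.1):
    """Flag stores whose completed QPS exceeds threshold * cluster mean."""
    avg = mean(s.qps for s in stores)
    return [s.store_id for s in stores if s.qps > threshold * avg]


def overloaded_by_utilization(stores, threshold=1.1):
    """Flag stores whose most-saturated resource exceeds threshold * cluster mean."""
    def bottleneck(s):
        return max(s.cpu_util, s.disk_util)

    avg = mean(bottleneck(s) for s in stores)
    return [s.store_id for s in stores if bottleneck(s) > threshold * avg]


# Assumed numbers: store 3 is disk-bound, so it completes fewer queries.
stores = [
    StoreLoad(1, qps=1000, cpu_util=0.40, disk_util=0.35),
    StoreLoad(2, qps=1100, cpu_util=0.45, disk_util=0.40),
    StoreLoad(3, qps=700, cpu_util=0.30, disk_util=0.98),
]

print(overloaded_by_qps(stores))          # [2]
print(overloaded_by_utilization(stores))  # [3]
```

With these numbers the QPS-based check flags the healthy store 2 and misses the disk-saturated store 3, while the utilization-based check flags store 3. A real implementation would presumably need to weigh CPU and disk differently and smooth the samples over time; taking the maximum is just the simplest choice for a sketch.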

Metadata

Assignees

No one assigned

    Labels

    A-admission-control
    A-kv-distribution: Relating to rebalancing and leasing.
    C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
    T-kv: KV Team
