kvserver: store-rebalancer can get blocked on load-based replica rebalances

The `StoreRebalancer` goroutine synchronously executes [load-based lease transfers](https://github.com/cockroachdb/cockroach/blob/371466cf9c379af879e53845cdb5925bc85dc285/pkg/kv/kvserver/store_rebalancer.go#L304-L322) and [load-based replica rebalances](https://github.com/cockroachdb/cockroach/blob/371466cf9c379af879e53845cdb5925bc85dc285/pkg/kv/kvserver/store_rebalancer.go#L386-L397) of the hottest ranges in a loop. 

This means that when a cluster is under duress and load-based replica rebalances are slow, they can _block the store rebalancer goroutine_ (and with it cheaper actions like load-based lease transfers) for an inordinate amount of time, until the `AdminRelocateRange` call for each "hot range" being processed either fails or hits its [timeout](https://github.com/cockroachdb/cockroach/blob/371466cf9c379af879e53845cdb5925bc85dc285/pkg/kv/kvserver/store_rebalancer.go#L385). In other words, if the `StoreRebalancer` tries to rebalance away 1 replica each for 100 ranges, and those rebalances are bound to hit their timeout, we won't see any load-based rebalancing on this store for ~100 minutes at a minimum.

We noticed this during an escalation where a single store on a hot node couldn't shed its load because of this. The logs indicated that the `StoreRebalancer` goroutine was simply blocked on a _ton_ of `AdminRelocateRange` calls that eventually timed out:
<img width="700" alt="image" src="https://user-images.githubusercontent.com/10788754/161325459-befb8a85-4f64-4685-8d6a-b3ff069b3d73.png">

Nodes `173` and `159` in the logs above both had extremely high read amplification during this incident.

@cockroachdb/kv-notifications 

Jira issue: CRDB-14656

kvserver: store-rebalancer can get blocked on load-based replica rebalances #79249