kv, storage: rebalance replicas when disk throughput / IOPS drops #62168
Description
We occasionally see instances of large production clusters where one node inexplicably got a slower disk (often an AWS/GCP local SSD), and the replicas on that node kept falling further and further behind in writes relative to the rest of the cluster. Since storage-level compactions also consume disk write throughput, the most obvious symptom is often compactions backing up and Pebble read amplification increasing.
When a node is disproportionately slower at committing to disk than other nodes, replicas on that node need to be rebalanced away so that its disk doesn't continue to be overloaded with writes.
One metric that can be observed to identify disk slowness is command commit latency; since a LogData will have to wait for batches ahead of it to be written to the WAL, an increase in the latency of a LogData call would signal a slow disk. We already leverage LogData as part of node liveness heartbeats; before a node responds to a heartbeat request, it does a LogData to each store's engine. Here's the associated comment from liveness.go, which suggests that we already move leases when this latency increases:
```go
// We synchronously write to all disks before updating liveness because we
// don't want any excessively slow disks to prevent leases from being
// shifted to other nodes. A slow/stalled disk would block here and cause
// the node to lose its leases.
```
Other candidate metrics to react to include changes in a node's disk write ops over time, or cross-node differences in disk write ops; the production instances of this issue that we've observed tend to show significantly lower disk write ops on the affected node than on the other nodes (in what is still an IO-bound workload).
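The cross-node comparison could look something like the sketch below: flag any node whose disk write-ops rate falls well below the cluster mean. The function name, the input shape, and the cutoff fraction are all hypothetical, chosen only to illustrate the heuristic.

```go
package main

import "fmt"

// flagSlowNodes returns the IDs of nodes whose disk write-ops rate is
// below `fraction` of the mean rate across all nodes. Purely an
// illustrative heuristic for the cross-node comparison; not CRDB code.
func flagSlowNodes(writeOps map[string]float64, fraction float64) []string {
	if len(writeOps) == 0 {
		return nil
	}
	var sum float64
	for _, v := range writeOps {
		sum += v
	}
	mean := sum / float64(len(writeOps))

	var slow []string
	for id, v := range writeOps {
		if v < fraction*mean {
			slow = append(slow, id)
		}
	}
	return slow
}

func main() {
	// Under an IO-bound workload, n3's write-ops rate lags the others,
	// suggesting its disk is the bottleneck.
	ops := map[string]float64{"n1": 900, "n2": 950, "n3": 300, "n4": 880}
	fmt.Println(flagSlowNodes(ops, 0.5)) // [n3]
}
```

A relative cutoff against the cluster mean (rather than an absolute rate) matters here: under an IO-bound workload every node's write ops are high, so only the ratio between nodes identifies the outlier.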
gz#9005
Jira issue: CRDB-2831