kv, storage: rebalance replicas when disk throughput / IOPS drops #62168

@itsbilal

We occasionally see large production clusters where one node inexplicably ends up with a slower disk (often an AWS/GCP local SSD), and the replicas on that node fall further and further behind the rest of the cluster in writes. Since storage-level compactions also consume disk write throughput, the most obvious symptom is often compactions backing up and Pebble read amplification increasing.

When a node is disproportionately slower at committing to disk than other nodes, replicas on that node need to be rebalanced away so that its disk doesn't continue to be overloaded with writes.

One metric that can be observed to identify disk slowness is command commit latency. Since a LogData has to wait for batches ahead of it to be written to the WAL, an increase in the latency of a LogData call would signal a slow disk. We already leverage LogData as part of node liveness heartbeats: before a node responds to a heartbeat request, it does a LogData to each store's engine. Here's the associated comment from liveness.go, which suggests that we already move leases when this latency increases:

			// We synchronously write to all disks before updating liveness because we
			// don't want any excessively slow disks to prevent leases from being
			// shifted to other nodes. A slow/stalled disk would block here and cause
			// the node to lose its leases.

Other metrics to react to could include changes in disk write ops, or cross-node differences in disk write ops. The production instances of this issue that we've observed tend to show significantly lower disk write ops on the affected node than on other nodes (in what is still an IO-bound workload).
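
As a rough illustration of the cross-node comparison, the sketch below flags stores whose disk write ops fall well below the cluster median. The function names, the use of the median as the baseline, and the `fraction` cutoff are all assumptions made for this example. It also assumes the caller has already established that the workload is IO-bound, since low write ops on an idle node mean nothing.

```go
package main

import "sort"

// median returns the median of vals, or 0 for an empty slice.
// It sorts a copy so the caller's slice is left untouched.
func median(vals []float64) float64 {
	s := append([]float64(nil), vals...)
	sort.Float64s(s)
	n := len(s)
	if n == 0 {
		return 0
	}
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// laggingStores returns, in ascending order, the IDs of stores whose
// write ops per second are below fraction times the cluster median.
// writeOps maps a (hypothetical) store ID to its observed write ops/sec.
func laggingStores(writeOps map[int]float64, fraction float64) []int {
	vals := make([]float64, 0, len(writeOps))
	for _, v := range writeOps {
		vals = append(vals, v)
	}
	cutoff := fraction * median(vals)

	var out []int
	for id, v := range writeOps {
		if v < cutoff {
			out = append(out, id)
		}
	}
	sort.Ints(out)
	return out
}
```

For example, with write ops of 1000, 980, 1020 and 300 on stores 1 through 4 and a `fraction` of 0.5, the median is 990, the cutoff is 495, and only store 4 is flagged. The median is used here rather than the mean so that the slow store itself does not drag the baseline down much.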

gz#9005

Jira issue: CRDB-2831

Metadata

    Labels

    A-kv-distribution: Relating to rebalancing and leasing.
    A-storage: Relating to our storage engine (Pebble) on-disk storage.
    C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
    T-kv: KV Team
