storage: backpressure writes for large ranges until they split #21777
nvb merged 3 commits into cockroachdb:master
Conversation
Force-pushed from 6b60d0e to a4aebc1.
Reviewed 7 of 7 files at r1, 8 of 8 files at r2.

pkg/storage/metrics.go, line 459 at r1 (raw file):
Let's be more specific here (and maybe in the rest of the variable names).

pkg/storage/metrics.go, line 460 at r1 (raw file):
This description sounds like it refers to a gauge, not a counter. A gauge is probably what we want here, so we can see the backlog growing and then clearing.

pkg/storage/queue.go, line 760 at r1 (raw file):
Don't we need to clear the callbacks after this?

pkg/storage/replica_backpressure.go, line 34 at r1 (raw file):
I'd rather use a cluster setting than an env var for this.

pkg/storage/replica_backpressure.go, line 62 at r1 (raw file):
I'm OK with looking at the name. I'd rather do this based on the transaction than the keys (we could maybe add an explicit no-backpressure flag if the name feels too hacky).

pkg/storage/replica_backpressure.go, line 82 at r1 (raw file):
Finish this sentence.
Super cool
Testing has demonstrated that hotspot workloads which write large amounts of data to a small range of keys can overload the `splitQueue` and create excessively large ranges. Previous efforts to parallelize replica splits have helped improve the queue's ability to keep up with these workloads, but speeding up splits alone will never fully prevent unbounded range growth. Large ranges (those in the GB range) are problematic for multiple reasons, including that they slow down snapshots, consistency checks, and other processes that scan over entire ranges and weren't built with ranges of this size in mind. In the past we've attempted to prevent these processes from running when ranges get too large, but this has created other issues like cockroachdb#20589.

This change introduces a proactive backpressure mechanism that delays writes once a range gets too large, while we wait for its split to succeed. This prevents ranges from getting too large because it stops the range from growing and keeps new writes from getting in the way of the split attempt. This has been shown to create an effective soft limit on the maximum size of a range. By default, this limit is set to twice the `range_max_bytes` setting, but this change also introduces an environment variable for configuring the limit.

Release note (performance improvement): Writes will now be backpressured when ranges grow too large, until the range is successfully able to split. This prevents unbounded range growth and improves a cluster's ability to stay healthy under hotspot workloads.
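The size check at the heart of this mechanism can be sketched as follows. This is an illustrative Go sketch, not the actual `pkg/storage` code: the function name `backpressurable` and the hardcoded `rangeMaxBytes` constant are assumptions made for the example.

```go
package main

import "fmt"

// rangeMaxBytes stands in for the zone config's range_max_bytes;
// 64 MB is just an example value.
const rangeMaxBytes = 64 << 20

// backpressurable reports whether writes to a range of the given size
// should be delayed until the range manages to split. By default the
// soft limit is twice range_max_bytes (multiplier = 2).
func backpressurable(rangeSize int64, multiplier float64) bool {
	return float64(rangeSize) > multiplier*float64(rangeMaxBytes)
}

func main() {
	fmt.Println(backpressurable(32<<20, 2))  // false: well under the soft limit
	fmt.Println(backpressurable(200<<20, 2)) // true: over 2x range_max_bytes
}
```

Because the check only delays writes (reads are unaffected), the range stops growing while the split attempt runs, which is what turns the multiplier into a soft cap on range size.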
Now that we are able to effectively limit the size to which ranges grow, this is no longer necessary.
This changes the env var `COCKROACH_BACKPRESSURE_RANGE_SIZE_MULTIPLIER` to a cluster setting called `kv.range.backpressure_range_size_multiplier`.

Release note: None
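The practical difference is that a cluster setting can be updated at runtime (via `SET CLUSTER SETTING`) and propagates to every node, whereas an env var is fixed per process at startup. A minimal self-contained sketch of that shape, deliberately not using CockroachDB's real settings package (the `floatSetting` type and its methods are invented for illustration):

```go
package main

import "fmt"

// floatSetting is a toy stand-in for a cluster setting: it has a
// compiled-in default that can be overridden at runtime.
type floatSetting struct {
	key   string
	def   float64
	value *float64 // nil means "use the default"
}

// Get returns the override if one has been set, else the default.
func (s *floatSetting) Get() float64 {
	if s.value != nil {
		return *s.value
	}
	return s.def
}

// Set installs a runtime override, as SET CLUSTER SETTING would.
func (s *floatSetting) Set(v float64) { s.value = &v }

var backpressureRangeSizeMultiplier = floatSetting{
	key: "kv.range.backpressure_range_size_multiplier",
	def: 2.0,
}

func main() {
	fmt.Println(backpressureRangeSizeMultiplier.Get()) // 2 (the default)
	backpressureRangeSizeMultiplier.Set(4.0)
	fmt.Println(backpressureRangeSizeMultiplier.Get()) // 4
}
```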
Force-pushed from a4aebc1 to ddee298.
TFTR!

Review status: 0 of 11 files reviewed at latest revision, 6 unresolved discussions.

pkg/storage/metrics.go, line 459 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/metrics.go, line 460 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Good point! Done. Why does…

pkg/storage/queue.go, line 760 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
No, the item is thrown away after this.

pkg/storage/replica_backpressure.go, line 34 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/replica_backpressure.go, line 62 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
SGTM.

pkg/storage/replica_backpressure.go, line 82 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.
Review status: 0 of 11 files reviewed at latest revision, 6 unresolved discussions.

pkg/storage/metrics.go, line 460 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Follow up: #21903
Because we aren't using the Prometheus client library. The one we're using predates Prometheus by quite a bit and has different opinions on decrement methods.
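This is the counter-vs-gauge distinction raised earlier in the review: a counter only ever increases, so it can show that backpressure happened but not whether the backlog has since cleared, while a gauge can be decremented when a blocked write proceeds. A tiny illustrative sketch (these `counter` and `gauge` types are invented for the example, not the metrics library actually used):

```go
package main

import "fmt"

// counter models a monotonically increasing metric: no Dec method.
type counter struct{ n int64 }

func (c *counter) Inc(d int64) { c.n += d }

// gauge models a metric that can go both up and down, so readers can
// watch a backlog grow and then clear.
type gauge struct{ n int64 }

func (g *gauge) Inc(d int64) { g.n += d }
func (g *gauge) Dec(d int64) { g.n -= d }

func main() {
	var backpressuredWrites gauge
	backpressuredWrites.Inc(3)         // three writes currently blocked
	backpressuredWrites.Dec(2)         // two unblocked after the split
	fmt.Println(backpressuredWrites.n) // 1: the remaining backlog is visible
}
```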
Fixes #21357.
Analysis
I did some testing similar to what was done previously. Here, I spun up a four-node cluster and dropped the range size down to 4MB. I then ran 4 instances of the following command against the cluster, with and without backpressure enabled:

`kv --min-block-bytes=100000 --max-block-bytes=120000 --splits=0 --concurrency=20`

For each test, I monitored the number of splits that were pending and that succeeded, the largest range size in the cluster, and the number of writes that were backpressured. The results are below:
[Graphs omitted: without backpressure]

[Graphs omitted: with backpressure]
As the tests show, without backpressure the split queue was unable to keep up with the hotspot workload and the maximum range size blew up to more than 750MB, 180 times the `range_max_bytes`. With backpressure, things were much better. The max range size consistently stayed below 10MB and gradually dropped below twice the `range_max_bytes`, the threshold at which we began backpressuring writes. The third panel shows that backpressure wasn't even needed for very long, as the range count quickly grew and the load spread out to the point where the `splitQueue` could keep up on its own. This is exactly what we were hoping to see, and I attribute most of the difference we saw between this test and this one to the fix to replica prioritization in the queues made by #21673.