
storage: backpressure writes for large ranges until they split #21777

Merged
nvb merged 3 commits into cockroachdb:master from nvb:nvanbenschoten/backpressureSplits
Jan 30, 2018

Conversation

@nvb
Contributor

@nvb nvb commented Jan 25, 2018

Fixes #21357.

Testing has demonstrated that hotspot workloads which write large amounts
of data to a small range of keys can overload the splitQueue and create
excessively large ranges. Previous efforts to parallelize replica splits
have helped improve the queue's ability to keep up with these workloads,
but speeding up splits alone will never be able to fully prevent unbounded
range growth. Large ranges (those in the GB range) are problematic for
multiple reasons including that they slow down snapshots, consistency checks,
and other processes that scan over entire ranges and haven't been built with
these size ranges in mind. In the past we've attempted to prevent these
processes from running when ranges get too large, but this has created
other issues like #20589.

This change introduces a proactive backpressure mechanism to delay writes
once a range gets too large while we wait for its split to succeed. This
accomplishes the task of preventing ranges from getting too large because
it stops the range from growing and prevents new writes from getting in
the way of the split attempt. This has been shown to create an effective
soft limit on the maximum size of the range. By default, this limit is set
to twice the range_max_bytes setting, but this change also introduces an
environment variable for configuring the limit.
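The mechanism described above boils down to a simple size check before each write. The sketch below illustrates that check; the constants and function name are hypothetical stand-ins, not the actual `pkg/storage` implementation:

```go
package main

import "fmt"

// rangeMaxBytes stands in for the range_max_bytes split threshold
// (illustrative value only).
const rangeMaxBytes = 64 << 20 // 64 MiB

// backpressureRangeSizeMultiplier mirrors the soft limit described
// above: writes are delayed once a range exceeds
// multiplier * rangeMaxBytes. A value of 0 disables backpressure.
const backpressureRangeSizeMultiplier = 2.0

// shouldBackpressureWrites reports whether a write to a range of the
// given size should wait for a pending split to succeed.
func shouldBackpressureWrites(rangeSizeBytes int64) bool {
	if backpressureRangeSizeMultiplier == 0 {
		return false
	}
	return float64(rangeSizeBytes) > backpressureRangeSizeMultiplier*float64(rangeMaxBytes)
}

func main() {
	fmt.Println(shouldBackpressureWrites(32 << 20))  // well under the limit
	fmt.Println(shouldBackpressureWrites(200 << 20)) // past 2x the split size
}
```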

Analysis

I did some testing similar to what was done previously. Here, I spun up a four-node
cluster and dropped the range size down to 4MB. I then ran 4 instances of
`kv --min-block-bytes=100000 --max-block-bytes=120000 --splits=0 --concurrency=20`
against the cluster, with and without backpressure enabled. For each test, I
monitored the number of splits that were pending and that succeeded, the largest
range size in the cluster, and the number of writes that were backpressured. The
results are below:

Without backpressure

[graph: without-backpressure]

With backpressure

[graph: with-backpressure]

As the tests show, without backpressure the split queue was unable to keep up
with the hotspot workload and the maximum range size blew up to more than
750MB, 180 times the range_max_bytes. With backpressure, things were much
better. The max range size consistently stayed below 10MB and gradually dropped
below twice the range_max_bytes, the threshold at which we began backpressuring
writes. The third panel shows that backpressure wasn't even needed for very
long, as the range count quickly grew and the load spread out to the point where
the splitQueue could keep up on its own. This is exactly what we were hoping
to see, and I attribute most of the difference between this test and the earlier
one to the fix to replica prioritization in the queues made by #21673.

@nvb nvb requested a review from a team January 25, 2018 08:45
@cockroach-teamcity
Member

This change is Reviewable

@nvb nvb force-pushed the nvanbenschoten/backpressureSplits branch 2 times, most recently from 6b60d0e to a4aebc1 Compare January 25, 2018 18:29
@bdarnell
Contributor

:lgtm:


Reviewed 7 of 7 files at r1, 8 of 8 files at r2.
Review status: all files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/storage/metrics.go, line 459 at r1 (raw file):

	// Backpressure metrics.
	metaBackpressuredRequests = metric.Metadata{
		Name: "requests.backpressure",

Let's be more specific here (and maybe in the rest of the variable names): requests.backpressure.split.


pkg/storage/metrics.go, line 460 at r1 (raw file):

	metaBackpressuredRequests = metric.Metadata{
		Name: "requests.backpressure",
		Help: "Number of backpressured writes waiting on a Range split"}

This description sounds like it refers to a gauge, not a counter. A gauge is probably what we want here, so we can see the backlog growing and then clearing.
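The gauge semantics suggested here amount to an increment when a write starts waiting and a decrement once it clears, so the metric tracks the live backlog. A minimal sketch, with illustrative names rather than the actual metrics plumbing:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// backpressuredWrites stands in for the gauge discussed above: it rises
// while writes wait on a split and falls as they proceed.
var backpressuredWrites int64

// currentBackpressured reads the gauge value.
func currentBackpressured() int64 {
	return atomic.LoadInt64(&backpressuredWrites)
}

// waitForSplit brackets a blocked write with gauge updates; in the real
// code the callback would block until the range split completes.
func waitForSplit(blockedWrite func()) {
	atomic.AddInt64(&backpressuredWrites, 1)
	defer atomic.AddInt64(&backpressuredWrites, -1)
	blockedWrite()
}

func main() {
	waitForSplit(func() {
		fmt.Println("gauge while waiting:", currentBackpressured())
	})
	fmt.Println("gauge after:", currentBackpressured())
}
```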


pkg/storage/queue.go, line 760 at r1 (raw file):

	// Call any registered callbacks.
	for _, cb := range item.callbacks {

Don't we need to clear the callbacks after this?


pkg/storage/replica_backpressure.go, line 34 at r1 (raw file):

// to 0 to disable backpressure altogether.
var backpressureRangeSizeMultiplier = func() float64 {
	mult := envutil.EnvOrDefaultFloat64("COCKROACH_BACKPRESSURE_RANGE_SIZE_MULTIPLIER", 2)

I'd rather use a cluster setting than an env var for this.
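For context, the fallback behavior of a helper like `envutil.EnvOrDefaultFloat64` can be sketched as follows. This is a simplified stand-in, not the actual cockroachdb/cockroach helper: use the environment value when present and parseable, otherwise the default.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// parseFloatOrDefault returns the parsed value when it is present and
// valid, and the default otherwise (hypothetical helper for
// illustration).
func parseFloatOrDefault(val string, present bool, def float64) float64 {
	if present {
		if f, err := strconv.ParseFloat(val, 64); err == nil {
			return f
		}
	}
	return def
}

func main() {
	const name = "COCKROACH_BACKPRESSURE_RANGE_SIZE_MULTIPLIER"
	v, ok := os.LookupEnv(name)
	fmt.Println("multiplier:", parseFloatOrDefault(v, ok, 2))
}
```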


pkg/storage/replica_backpressure.go, line 62 at r1 (raw file):

	// Don't backpressure splits themselves.
	//
	// TODO DURING REVIEW should we look at the txn name or should we

I'm OK with looking at the name. I'd rather do this based on the transaction than the keys (we could maybe add an explicit no-backpressure flag if the name feels too hacky).


pkg/storage/replica_backpressure.go, line 82 at r1 (raw file):

// subject to backpressure. This is based on the size of the range in relation
// to the split size. If the range is more than backpressureRangeSizeMultiplier
// times larger than

Finish this sentence.


Comments from Reviewable

@spencerkimball
Member

Super cool

nvb added 3 commits January 29, 2018 19:31
Testing has demonstrated that hotspot workloads which write large amounts
of data to a small range of keys can overload the `splitQueue` and create
excessively large ranges. Previous efforts to parallelize replica splits
have helped improve the queue's ability to keep up with these workloads,
but speeding up splits alone will never be able to fully prevent unbounded
range growth. Large ranges (those in the GB range) are problematic for
multiple reasons including that they slow down snapshots, consistency checks,
and other processes that scan over entire ranges and haven't been built with
these size ranges in mind. In the past we've attempted to prevent these
processes from running when ranges get too large, but this has created
other issues like cockroachdb#20589.

This change introduces a proactive backpressure mechanism to delay writes
once a range gets too large while we wait for its split to succeed. This
accomplishes the task of preventing ranges from getting too large because
it stops the range from growing and prevents new writes from getting in
the way of the split attempt. This has been shown to create an effective
soft limit on the maximum size of the range. By default, this limit is set
to twice the `range_max_bytes` setting, but this change also introduces an
environment variable for configuring the limit.

Release note (performance improvement): Writes will now be backpressured
when ranges grow too large, until the range is successfully able to split.
This prevents unbounded range growth and improves a cluster's ability to
stay healthy under hotspot workloads.
Now that we are able to effectively limit the size to which ranges grow,
this is no longer necessary.
This changes the env var: `COCKROACH_BACKPRESSURE_RANGE_SIZE_MULTIPLIER`
to a cluster setting called `kv.range.backpressure_range_size_multiplier`.

Release note: None
@nvb nvb force-pushed the nvanbenschoten/backpressureSplits branch from a4aebc1 to ddee298 Compare January 30, 2018 02:35
@nvb
Contributor Author

nvb commented Jan 30, 2018

TFTR!


Review status: 0 of 11 files reviewed at latest revision, 6 unresolved discussions.


pkg/storage/metrics.go, line 459 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Let's be more specific here (and maybe in the rest of the variable names): requests.backpressure.split.

Done.


pkg/storage/metrics.go, line 460 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This description sounds like it refers to a gauge, not a counter. A gauge is probably what we want here, so we can see the backlog growing and then clearing.

Good point! Done.

Why does Counter have a Dec method on it in the first place? Seems like that breaks the contract from https://prometheus.io/docs/concepts/metric_types/#counter. I'll go through and see if anything can be done about that.


pkg/storage/queue.go, line 760 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Don't we need to clear the callbacks after this?

No, the item is thrown away after this.


pkg/storage/replica_backpressure.go, line 34 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'd rather use a cluster setting than an env var for this.

Done.


pkg/storage/replica_backpressure.go, line 62 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'm OK with looking at the name. I'd rather do this based on the transaction than the keys (we could maybe add an explicit no-backpressure flag if the name feels too hacky).

SGTM.


pkg/storage/replica_backpressure.go, line 82 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Finish this sentence.

Done.


Comments from Reviewable

@nvb
Contributor Author

nvb commented Jan 30, 2018

Review status: 0 of 11 files reviewed at latest revision, 6 unresolved discussions.


pkg/storage/metrics.go, line 460 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Good point! Done.

Why does Counter have a Dec method on it in the first place? Seems like that breaks the contract from https://prometheus.io/docs/concepts/metric_types/#counter. I'll go through and see if anything can be done about that.

Follow up: #21903


Comments from Reviewable

@nvb nvb merged commit ddc4160 into cockroachdb:master Jan 30, 2018
@nvb nvb deleted the nvanbenschoten/backpressureSplits branch January 30, 2018 03:45
@a-robinson
Contributor

Why does Counter have a Dec method on it in the first place? Seems like that breaks the contract from https://prometheus.io/docs/concepts/metric_types/#counter. I'll go through and see if anything can be done about that.

Because we aren't using the prometheus client library. The one we're using predates prometheus by quite a bit and has different opinions on decrement methods.



Development

Successfully merging this pull request may close these issues.

storage: backpressure writes when ranges get too large

5 participants