storage: backpressure writes for large ranges until they split #21777
nvb merged 3 commits into cockroachdb:master
Conversation
Force-pushed from 6b60d0e to a4aebc1.
Reviewed 7 of 7 files at r1, 8 of 8 files at r2.

pkg/storage/metrics.go, line 459 at r1 (raw file):
Let's be more specific here (and maybe in the rest of the variable names).

pkg/storage/metrics.go, line 460 at r1 (raw file):
This description sounds like it refers to a gauge, not a counter. A gauge is probably what we want here, so we can see the backlog growing and then clearing.

pkg/storage/queue.go, line 760 at r1 (raw file):
Don't we need to clear the callbacks after this?

pkg/storage/replica_backpressure.go, line 34 at r1 (raw file):
I'd rather use a cluster setting than an env var for this.

pkg/storage/replica_backpressure.go, line 62 at r1 (raw file):
I'm OK with looking at the name. I'd rather do this based on the transaction than the keys (we could maybe add an explicit no-backpressure flag if the name feels too hacky).

pkg/storage/replica_backpressure.go, line 82 at r1 (raw file):
Finish this sentence.
Super cool
Testing has demonstrated that hotspot workloads which write large amounts of data to a small range of keys can overload the `splitQueue` and create excessively large ranges. Previous efforts to parallelize replica splits have helped improve the queue's ability to keep up with these workloads, but speeding up splits alone will never fully prevent unbounded range growth. Large ranges (those in the GB range) are problematic for multiple reasons, including that they slow down snapshots, consistency checks, and other processes that scan over entire ranges and weren't built with ranges of this size in mind. In the past we've attempted to prevent these processes from running when ranges get too large, but this has created other issues like cockroachdb#20589.

This change introduces a proactive backpressure mechanism that delays writes once a range gets too large, while we wait for its split to succeed. This prevents ranges from getting too large because it stops the range from growing and keeps new writes from getting in the way of the split attempt. This has been shown to create an effective soft limit on the maximum size of a range. By default, this limit is set to twice the `range_max_bytes` setting, but this change also introduces an environment variable for configuring the limit.

Release note (performance improvement): Writes will now be backpressured when ranges grow too large, until the range is successfully able to split. This prevents unbounded range growth and improves a cluster's ability to stay healthy under hotspot workloads.
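The size check at the heart of this mechanism can be sketched as follows. This is an illustrative Go sketch, not the actual `pkg/storage` code: the function name `backpressurable` and the hardcoded `rangeMaxBytes` constant are assumptions made for the example.

```go
package main

import "fmt"

// rangeMaxBytes stands in for the zone config's range_max_bytes;
// 64 MB is just an example value.
const rangeMaxBytes = 64 << 20

// backpressurable reports whether writes to a range of the given size
// should be delayed until the range manages to split. By default the
// soft limit is twice range_max_bytes (multiplier = 2).
func backpressurable(rangeSize int64, multiplier float64) bool {
	return float64(rangeSize) > multiplier*float64(rangeMaxBytes)
}

func main() {
	fmt.Println(backpressurable(32<<20, 2))  // false: well under the soft limit
	fmt.Println(backpressurable(200<<20, 2)) // true: over 2x range_max_bytes
}
```

Because the check only delays writes (reads are unaffected), the range stops growing while the split attempt runs, which is what turns the multiplier into a soft cap on range size.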
Now that we are able to effectively limit the size to which ranges grow, this is no longer necessary.
This changes the env var `COCKROACH_BACKPRESSURE_RANGE_SIZE_MULTIPLIER` to a cluster setting called `kv.range.backpressure_range_size_multiplier`.

Release note: None
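The practical difference is that a cluster setting can be updated at runtime (via `SET CLUSTER SETTING`) and propagates to every node, whereas an env var is fixed per process at startup. A minimal self-contained sketch of that shape, deliberately not using CockroachDB's real settings package (the `floatSetting` type and its methods are invented for illustration):

```go
package main

import "fmt"

// floatSetting is a toy stand-in for a cluster setting: it has a
// compiled-in default that can be overridden at runtime.
type floatSetting struct {
	key   string
	def   float64
	value *float64 // nil means "use the default"
}

// Get returns the override if one has been set, else the default.
func (s *floatSetting) Get() float64 {
	if s.value != nil {
		return *s.value
	}
	return s.def
}

// Set installs a runtime override, as SET CLUSTER SETTING would.
func (s *floatSetting) Set(v float64) { s.value = &v }

var backpressureRangeSizeMultiplier = floatSetting{
	key: "kv.range.backpressure_range_size_multiplier",
	def: 2.0,
}

func main() {
	fmt.Println(backpressureRangeSizeMultiplier.Get()) // 2 (the default)
	backpressureRangeSizeMultiplier.Set(4.0)
	fmt.Println(backpressureRangeSizeMultiplier.Get()) // 4
}
```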
Force-pushed from a4aebc1 to ddee298.
TFTR!

Review status: 0 of 11 files reviewed at latest revision, 6 unresolved discussions.

pkg/storage/metrics.go, line 459 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/metrics.go, line 460 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Good point! Done. Why does…

pkg/storage/queue.go, line 760 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
No, the item is thrown away after this.

pkg/storage/replica_backpressure.go, line 34 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/replica_backpressure.go, line 62 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
SGTM.

pkg/storage/replica_backpressure.go, line 82 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.
Review status: 0 of 11 files reviewed at latest revision, 6 unresolved discussions.

pkg/storage/metrics.go, line 460 at r1 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Follow up: #21903
Because we aren't using the Prometheus client library. The one we're using predates Prometheus by quite a bit and has different opinions on decrement methods.
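This is the counter-vs-gauge distinction raised earlier in the review: a counter only ever increases, so it can show that backpressure happened but not whether the backlog has since cleared, while a gauge can be decremented when a blocked write proceeds. A tiny illustrative sketch (these `counter` and `gauge` types are invented for the example, not the metrics library actually used):

```go
package main

import "fmt"

// counter models a monotonically increasing metric: no Dec method.
type counter struct{ n int64 }

func (c *counter) Inc(d int64) { c.n += d }

// gauge models a metric that can go both up and down, so readers can
// watch a backlog grow and then clear.
type gauge struct{ n int64 }

func (g *gauge) Inc(d int64) { g.n += d }
func (g *gauge) Dec(d int64) { g.n -= d }

func main() {
	var backpressuredWrites gauge
	backpressuredWrites.Inc(3)         // three writes currently blocked
	backpressuredWrites.Dec(2)         // two unblocked after the split
	fmt.Println(backpressuredWrites.n) // 1: the remaining backlog is visible
}
```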
Fixes #21357.
Analysis
I did some testing similar to what was done previously. Here, I spun up a four-node cluster and dropped the range size down to 4MB. I then ran 4 instances of the following command against the cluster, with and without backpressure enabled:

`kv --min-block-bytes=100000 --max-block-bytes=120000 --splits=0 --concurrency=20`

For each test, I monitored the number of splits that were pending and that succeeded, the largest range size in the cluster, and the number of writes that were backpressured. The results are below:
[Graphs omitted: without backpressure]

[Graphs omitted: with backpressure]
As the tests show, without backpressure the split queue was unable to keep up with the hotspot workload and the maximum range size blew up to more than 750MB, 180 times the `range_max_bytes`. With backpressure, things were much better. The max range size consistently stayed below 10MB and gradually dropped below twice the `range_max_bytes`, the threshold at which we began backpressuring writes. The third panel shows that backpressure wasn't even needed for very long, as the range count quickly grew and the load spread out to the point where the `splitQueue` could keep up on its own. This is exactly what we were hoping to see, and I attribute most of the difference we saw between this test and this one to the fix to replica prioritization in the queues made by #21673.