storage: add permitLargeSnapshots flag to replica #20589
nvb merged 2 commits into cockroachdb:master
Conversation
Reviewed 5 of 5 files at r1. pkg/storage/replica_raftstorage.go, line 406 at r2 (raw file):
Another option would be to always allow Raft snapshots (
Reviewed 4 of 5 files at r1, 6 of 6 files at r2. pkg/storage/replica.go, line 1220 at r2 (raw file):
Remove this.
Reviewed 5 of 5 files at r1, 6 of 6 files at r2. pkg/storage/client_split_test.go, line 916 at r2 (raw file):
pkg/storage/client_split_test.go, line 970 at r2 (raw file):
At this point, you could've already experienced extra splits and so the check at the end of this method would always pass. Perhaps count the number of splits here after disabling the split queue and use that to compare later. Still racy (since the queue might still be processing after you've stopped it) but should be good enough. pkg/storage/id_alloc_test.go, line 213 at r1 (raw file):
Wanna move this up one line? Now you're doing 10ms back-to-back, which makes this longer than it needs to be.
`TxnCoordSender` made sure to track and eventually clean up intents for transactions that were put into Writing mode before seeing an error. However, it did not previously make an effort to track and clean up intents when it saw an `AmbiguousResultError` during the first request of a transaction. This change fixes that. This was giving me all kinds of trouble when trying to write a test for cockroachdb#20589. Split requests continued to leave lots of dangling intents around whenever they were canceled, which slowed down the test and required lots of intent pushing. This change seems to resolve those complications. Release note (bug fix): clean up dangling intents eagerly when `AmbiguousResultError`s are seen.
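The fix described above boils down to recording possibly-written keys even when the outcome of the first request is unknown. A minimal sketch of that idea (the `txnCoord` type and `send` signature here are illustrative, not the actual `TxnCoordSender` API):

```go
package main

import (
	"errors"
	"fmt"
)

// errAmbiguous stands in for an AmbiguousResultError: the write may or
// may not have been applied, so its intents may be dangling.
var errAmbiguous = errors.New("result is ambiguous")

// txnCoord is a toy coordinator that remembers which keys a
// transaction may have written intents to.
type txnCoord struct {
	intentKeys []string // keys to resolve when the txn is cleaned up
}

// send simulates dispatching a write batch. The keys are tracked
// unconditionally: on an ambiguous error the intents may exist, so
// they must be cleaned up eagerly rather than dropped on the floor
// (previously an ambiguous error on the first request skipped this).
func (tc *txnCoord) send(keys []string, sendErr error) error {
	tc.intentKeys = append(tc.intentKeys, keys...)
	return sendErr
}

func main() {
	tc := &txnCoord{}
	_ = tc.send([]string{"a", "b"}, errAmbiguous)
	// Both keys are tracked even though the result was ambiguous.
	fmt.Println(len(tc.intentKeys)) // 2
}
```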
Force-pushed 5c21fa6 to 69a8d21
Force-pushed aec9c98 to 5ba9cd5
Release note: None
Force-pushed 5ba9cd5 to 7ce4745
In a privately reported user issue, we've seen that [our attempts](cockroachdb#7788) at [preventing large snapshots](cockroachdb#7581) can result in replica unavailability. Our current approach to limiting large snapshots assumes that it's OK to block snapshots indefinitely while waiting for a range to first split. Unfortunately, this can create a dependency cycle where a range requires a snapshot to split (because it can't achieve an up-to-date quorum without it) but isn't allowed to perform a snapshot until its size is reduced below the threshold. This can result in unavailability even when a majority of replicas remain live. Currently, we still need this snapshot size limit because unbounded snapshots can result in OOM errors that crash entire nodes. However, once snapshots are streamed from disk to disk, never needing to buffer in-memory on the sending or receiving side, we should be able to remove any snapshot size limit (see cockroachdb#16954). As a holdover, this change introduces a `permitLargeSnapshots` flag on a replica which is set when the replica is too large to snapshot but observes splits failing. When set, the flag allows snapshots to ignore the size limit until the snapshot goes through and splits are able to succeed again. Release note (bug fix): Fixed a scenario where a range that is too big to snapshot can lose availability even with a majority of nodes alive.
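The escape hatch described in this commit message can be sketched as follows. This is a simplified illustration of the admission logic, assuming a hypothetical `maxSnapshotSize` cap; the field and method names mirror the commit message but are not the actual CockroachDB implementation:

```go
package main

import "fmt"

// maxSnapshotSize is an assumed cap for illustration only.
const maxSnapshotSize = 64 << 20 // 64 MiB

type replica struct {
	sizeBytes            int64
	permitLargeSnapshots bool
}

// canSnapshot reports whether a snapshot may be sent. Oversized ranges
// are normally refused, but once the flag is set the size limit is
// ignored so the range can regain an up-to-date quorum and split.
func (r *replica) canSnapshot() bool {
	return r.sizeBytes <= maxSnapshotSize || r.permitLargeSnapshots
}

// onSplitFailed is invoked when a split attempt fails because the
// range cannot reach quorum; it breaks the dependency cycle by
// permitting large snapshots.
func (r *replica) onSplitFailed() {
	r.permitLargeSnapshots = true
}

func main() {
	r := &replica{sizeBytes: 128 << 20} // twice the assumed cap
	fmt.Println(r.canSnapshot())        // false: too large to snapshot
	r.onSplitFailed()
	fmt.Println(r.canSnapshot())        // true: flag overrides the limit
}
```

Once the snapshot goes through and the split succeeds, the flag would be reset so the size limit applies again.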
Force-pushed 7ce4745 to 4b0232d
TFTRs. This test should no longer be flaky. Review status: 2 of 14 files reviewed at latest revision, 6 unresolved discussions, some commit checks pending. pkg/storage/client_split_test.go, line 916 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/client_split_test.go, line 970 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/client_test.go, line 887 at r5 (raw file):
@bdarnell this seems to address the in-flight message issue we discussed. The breakers are checked synchronously in pkg/storage/id_alloc_test.go, line 213 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/replica.go, line 1220 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.
Reviewed 2 of 8 files at r3, 9 of 10 files at r4, 1 of 1 files at r5.
@bdarnell how do you feel about cherry-picking this into 1.1?
Cherry-pick SGTM. Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.
Testing has demonstrated that hotspot workloads which write large amounts of data to a small range of keys can overload the `splitQueue` and create excessively large ranges. Previous efforts to parallelize replica splits have helped improve the queue's ability to keep up with these workloads, but speeding up splits alone will never be able to fully prevent unbounded range growth. Large ranges (those in the GB range) are problematic for multiple reasons, including that they slow down snapshots, consistency checks, and other processes that scan over entire ranges and haven't been built with ranges of this size in mind. In the past we've attempted to prevent these processes from running when ranges get too large, but this has created other issues like cockroachdb#20589. This change introduces a proactive backpressure mechanism to delay writes once a range gets too large while we wait for its split to succeed. This accomplishes the task of preventing ranges from getting too large because it stops the range from growing and prevents new writes from getting in the way of the split attempt. This has been shown to create an effective soft limit on the maximum size of the range. By default, this limit is set to twice the `range_max_bytes` setting, but this change also introduces an environment variable for configuring the limit. Release note (performance improvement): Writes will now be backpressured when ranges grow too large, until the range is successfully able to split. This prevents unbounded range growth and improves a cluster's ability to stay healthy under hotspot workloads.
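The soft limit described above reduces to a simple threshold check before admitting a write. A minimal sketch, assuming a configurable multiplier defaulting to 2x `range_max_bytes` (the function and constant names here are illustrative, not CockroachDB's actual identifiers):

```go
package main

import "fmt"

// defaultBackpressureMultiplier is the assumed default: backpressure
// kicks in once a range exceeds twice range_max_bytes. The commit
// message says this is overridable via an environment variable.
const defaultBackpressureMultiplier = 2.0

// shouldBackpressureWrites reports whether new writes to the range
// should be delayed until a split succeeds, creating an effective
// soft limit on range size.
func shouldBackpressureWrites(rangeSizeBytes, rangeMaxBytes int64, multiplier float64) bool {
	return float64(rangeSizeBytes) > multiplier*float64(rangeMaxBytes)
}

func main() {
	const rangeMaxBytes = 64 << 20 // assumed 64 MiB split threshold

	// Below the soft limit (128 MiB): writes proceed normally.
	fmt.Println(shouldBackpressureWrites(100<<20, rangeMaxBytes, defaultBackpressureMultiplier)) // false
	// Past the soft limit: writes are delayed until the split succeeds.
	fmt.Println(shouldBackpressureWrites(130<<20, rangeMaxBytes, defaultBackpressureMultiplier)) // true
}
```

Blocking new writes at this point both halts growth and keeps the write pipeline clear so the pending split can win the race.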
A simple data generator that makes a single large range. The created dataset can then be used for various tests, for example to exercise issues such as the ones that ultimately led to cockroachdb#20589, or to make sure [large snapshots] work (once implemented). This is a work-in-progress because I haven't reached clarity on what the best way to hook things up in these tests would be. Do we want to create the datasets and upload them somewhere? That has been fragile in the past, as the upload process is seldom exercised and thus rots. The alternative (which I'm leaning towards) is to bundle this binary with the test code (either explicitly or via use as a library) and create fresh test data every time (these tests would run as nightlies, so dataset generation speed isn't the top concern). In making these decisions, we should also take into account more involved datasets that can't as easily be generated from a running cluster, such as [gcpressurizer]. For those, my current take is that we'll just generate an initialized data dir, open the resulting RocksDB instance manually again, and write straight into it (via some facility that updates stats correctly, i.e. presumably `MVCCPut` and friends). Release note: None [large snapshots]: cockroachdb#16954 [gcpressurizer]: cockroachdb#18661
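The core of such a generator is just writing enough sequential keys into one key prefix to blow past the split threshold. A toy sketch of the sizing logic (the key format and sizes are made up for illustration; a real generator would write through `MVCCPut` or a running cluster as discussed above):

```go
package main

import "fmt"

// keyAt returns a key inside a single contiguous keyspace, so all
// writes land in one range. The "bank/" prefix is illustrative.
func keyAt(i int) string {
	return fmt.Sprintf("bank/%010d", i)
}

// generate computes the total logical bytes produced by numKeys
// fixed-size values, i.e. roughly how large the single range grows.
func generate(numKeys, valueBytes int) int {
	total := 0
	for i := 0; i < numKeys; i++ {
		total += len(keyAt(i)) + valueBytes
	}
	return total
}

func main() {
	// 1024 keys of ~1 KiB each comfortably exceeds a 1 MiB threshold.
	total := generate(1<<10, 1<<10)
	fmt.Println(total > 1<<20) // true
}
```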