storage: add permitLargeSnapshots flag to replica #20589
nvb merged 2 commits into cockroachdb:master
Conversation
Reviewed 5 of 5 files at r1. pkg/storage/replica_raftstorage.go, line 406 at r2 (raw file):
Another option would be to always allow Raft snapshots (
Reviewed 4 of 5 files at r1, 6 of 6 files at r2. pkg/storage/replica.go, line 1220 at r2 (raw file):
Remove this.
Reviewed 5 of 5 files at r1, 6 of 6 files at r2. pkg/storage/client_split_test.go, line 916 at r2 (raw file):
pkg/storage/client_split_test.go, line 970 at r2 (raw file):
At this point, you could've already experienced extra splits and so the check at the end of this method would always pass. Perhaps count the number of splits here after disabling the split queue and use that to compare later. Still racy (since the queue might still be processing after you've stopped it) but should be good enough. pkg/storage/id_alloc_test.go, line 213 at r1 (raw file):
Wanna move this up one line? Now you're doing 10ms back-to-back, which makes this longer than it needs to be.
`TxnCoordSender` made sure to track and eventually clean up intents for transactions that were put into Writing mode before seeing an error. However, it did not previously make an effort to track and clean up intents when it saw an `AmbiguousResultError` during the first request of a transaction. This change fixes that. This was giving me all kinds of trouble when trying to write a test for cockroachdb#20589. Split requests continued to leave lots of dangling intents around whenever they were canceled, which slowed down the test and required lots of intent pushing. This change seems to resolve those complications. Release note (bug fix): clean up dangling intents eagerly when `AmbiguousResultError`s are seen.
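The fix described above boils down to recording possibly-written keys even when the outcome of the first request is unknown. A minimal sketch of that idea (the `txnCoord` type and `send` signature here are illustrative, not the actual `TxnCoordSender` API):

```go
package main

import (
	"errors"
	"fmt"
)

// errAmbiguous stands in for an AmbiguousResultError: the write may or
// may not have been applied, so its intents may be dangling.
var errAmbiguous = errors.New("result is ambiguous")

// txnCoord is a toy coordinator that remembers which keys a
// transaction may have written intents to.
type txnCoord struct {
	intentKeys []string // keys to resolve when the txn is cleaned up
}

// send simulates dispatching a write batch. The keys are tracked
// unconditionally: on an ambiguous error the intents may exist, so
// they must be cleaned up eagerly rather than dropped on the floor
// (previously an ambiguous error on the first request skipped this).
func (tc *txnCoord) send(keys []string, sendErr error) error {
	tc.intentKeys = append(tc.intentKeys, keys...)
	return sendErr
}

func main() {
	tc := &txnCoord{}
	_ = tc.send([]string{"a", "b"}, errAmbiguous)
	// Both keys are tracked even though the result was ambiguous.
	fmt.Println(len(tc.intentKeys)) // 2
}
```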
Force-pushed 5c21fa6 to 69a8d21
Force-pushed aec9c98 to 5ba9cd5
Release note: None
Force-pushed 5ba9cd5 to 7ce4745
In a privately reported user issue, we've seen that [our attempts](cockroachdb#7788) at [preventing large snapshots](cockroachdb#7581) can result in replica unavailability. Our current approach to limiting large snapshots assumes that it's OK to block snapshots indefinitely while waiting for a range to first split. Unfortunately, this can create a dependency cycle where a range requires a snapshot to split (because it can't achieve an up-to-date quorum without it) but isn't allowed to perform a snapshot until its size is reduced below the threshold. This can result in unavailability even when a majority of replicas remain live. Currently, we still need this snapshot size limit because unbounded snapshots can result in OOM errors that crash entire nodes. However, once snapshots are streamed from disk to disk, never needing to buffer in-memory on the sending or receiving side, we should be able to remove any snapshot size limit (see cockroachdb#16954). As a holdover, this change introduces a `permitLargeSnapshots` flag on a replica which is set when the replica is too large to snapshot but observes splits failing. When set, the flag allows snapshots to ignore the size limit until the snapshot goes through and splits are able to succeed again. Release note (bug fix): Fixed a scenario where a range that is too big to snapshot can lose availability even with a majority of nodes alive.
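The escape hatch described in this commit message can be sketched as follows. This is a simplified illustration of the admission logic, assuming a hypothetical `maxSnapshotSize` cap; the field and method names mirror the commit message but are not the actual CockroachDB implementation:

```go
package main

import "fmt"

// maxSnapshotSize is an assumed cap for illustration only.
const maxSnapshotSize = 64 << 20 // 64 MiB

type replica struct {
	sizeBytes            int64
	permitLargeSnapshots bool
}

// canSnapshot reports whether a snapshot may be sent. Oversized ranges
// are normally refused, but once the flag is set the size limit is
// ignored so the range can regain an up-to-date quorum and split.
func (r *replica) canSnapshot() bool {
	return r.sizeBytes <= maxSnapshotSize || r.permitLargeSnapshots
}

// onSplitFailed is invoked when a split attempt fails because the
// range cannot reach quorum; it breaks the dependency cycle by
// permitting large snapshots.
func (r *replica) onSplitFailed() {
	r.permitLargeSnapshots = true
}

func main() {
	r := &replica{sizeBytes: 128 << 20} // twice the assumed cap
	fmt.Println(r.canSnapshot())        // false: too large to snapshot
	r.onSplitFailed()
	fmt.Println(r.canSnapshot())        // true: flag overrides the limit
}
```

Once the snapshot goes through and the split succeeds, the flag would be reset so the size limit applies again.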
Force-pushed 7ce4745 to 4b0232d
TFTRs. This test should no longer be flaky. Review status: 2 of 14 files reviewed at latest revision, 6 unresolved discussions, some commit checks pending. pkg/storage/client_split_test.go, line 916 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/client_split_test.go, line 970 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/client_test.go, line 887 at r5 (raw file):
@bdarnell this seems to address the in-flight message issue we discussed. The breakers are checked synchronously in pkg/storage/id_alloc_test.go, line 213 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/replica.go, line 1220 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done.
Reviewed 2 of 8 files at r3, 9 of 10 files at r4, 1 of 1 files at r5.
@bdarnell how do you feel about cherry-picking this into 1.1?
Cherry-pick SGTM. Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.
Testing has demonstrated that hotspot workloads which write large amounts of data to a small range of keys can overload the `splitQueue` and create excessively large ranges. Previous efforts to parallelize replica splits have helped improve the queue's ability to keep up with these workloads, but speeding up splits alone will never be able to fully prevent unbounded range growth. Large ranges (those in the GB range) are problematic for multiple reasons, including that they slow down snapshots, consistency checks, and other processes that scan over entire ranges and haven't been built with ranges of this size in mind. In the past we've attempted to prevent these processes from running when ranges get too large, but this has created other issues like cockroachdb#20589. This change introduces a proactive backpressure mechanism to delay writes once a range gets too large while we wait for its split to succeed. This accomplishes the task of preventing ranges from getting too large because it stops the range from growing and prevents new writes from getting in the way of the split attempt. This has been shown to create an effective soft limit on the maximum size of the range. By default, this limit is set to twice the `range_max_bytes` setting, but this change also introduces an environment variable for configuring the limit. Release note (performance improvement): Writes will now be backpressured when ranges grow too large, until the range is successfully able to split. This prevents unbounded range growth and improves a cluster's ability to stay healthy under hotspot workloads.
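The soft limit described above reduces to a simple threshold check before admitting a write. A minimal sketch, assuming a configurable multiplier defaulting to 2x `range_max_bytes` (the function and constant names here are illustrative, not CockroachDB's actual identifiers):

```go
package main

import "fmt"

// defaultBackpressureMultiplier is the assumed default: backpressure
// kicks in once a range exceeds twice range_max_bytes. The commit
// message says this is overridable via an environment variable.
const defaultBackpressureMultiplier = 2.0

// shouldBackpressureWrites reports whether new writes to the range
// should be delayed until a split succeeds, creating an effective
// soft limit on range size.
func shouldBackpressureWrites(rangeSizeBytes, rangeMaxBytes int64, multiplier float64) bool {
	return float64(rangeSizeBytes) > multiplier*float64(rangeMaxBytes)
}

func main() {
	const rangeMaxBytes = 64 << 20 // assumed 64 MiB split threshold

	// Below the soft limit (128 MiB): writes proceed normally.
	fmt.Println(shouldBackpressureWrites(100<<20, rangeMaxBytes, defaultBackpressureMultiplier)) // false
	// Past the soft limit: writes are delayed until the split succeeds.
	fmt.Println(shouldBackpressureWrites(130<<20, rangeMaxBytes, defaultBackpressureMultiplier)) // true
}
```

Blocking new writes at this point both halts growth and keeps the write pipeline clear so the pending split can win the race.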
A simple data generator that makes a single large range. The created dataset can then be used for various tests, for example to exercise issues such as the ones that ultimately led to cockroachdb#20589, or to make sure [large snapshots] work (once implemented). This is a work-in-progress because I haven't reached clarity on what the best way to hook things up in these tests would be. Do we want to create the datasets and upload them somewhere? That has been fragile in the past, as the upload process is seldom exercised and thus rots. The alternative (which I'm leaning towards) is to bundle this binary with the test code (either explicitly or via use as a library) and create fresh test data every time (these tests would run as nightlies, so dataset generation speed isn't the top concern). In making these decisions, we should also take into account more involved datasets that can't as easily be generated from a running cluster, such as [gcpressurizer]. For those, my current take is that we'll just generate an initialized data dir, open the resulting RocksDB instance manually again, and write straight into it (via some facility that updates stats correctly, i.e. presumably `MVCCPut` and friends). Release note: None [large snapshots]: cockroachdb#16954 [gcpressurizer]: cockroachdb#18661
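The core of such a generator is just writing enough sequential keys into one key prefix to blow past the split threshold. A toy sketch of the sizing logic (the key format and sizes are made up for illustration; a real generator would write through `MVCCPut` or a running cluster as discussed above):

```go
package main

import "fmt"

// keyAt returns a key inside a single contiguous keyspace, so all
// writes land in one range. The "bank/" prefix is illustrative.
func keyAt(i int) string {
	return fmt.Sprintf("bank/%010d", i)
}

// generate computes the total logical bytes produced by numKeys
// fixed-size values, i.e. roughly how large the single range grows.
func generate(numKeys, valueBytes int) int {
	total := 0
	for i := 0; i < numKeys; i++ {
		total += len(keyAt(i)) + valueBytes
	}
	return total
}

func main() {
	// 1024 keys of ~1 KiB each comfortably exceeds a 1 MiB threshold.
	total := generate(1<<10, 1<<10)
	fmt.Println(total > 1<<20) // true
}
```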