storage: add permitLargeSnapshots flag to replica#20589

Merged
nvb merged 2 commits into cockroachdb:master from nvb:nvanbenschoten/splitFail
Dec 17, 2017

Conversation

@nvb
Contributor

@nvb nvb commented Dec 9, 2017

In a privately reported user issue, we've seen that our attempts at preventing
large snapshots can result in replica unavailability. Our current approach
to limiting large snapshots assumes that it's OK to block snapshots indefinitely
while waiting for a range to first split. Unfortunately, this can create
a dependency cycle where a range requires a snapshot to split (because it
can't achieve an up-to-date quorum without it) but isn't allowed to perform
a snapshot until its size is reduced below the threshold. This can result
in unavailability even when a majority of replicas remain live.

We still need this snapshot size limit because unbounded snapshots
can result in OOM errors that crash entire nodes. However, once snapshots
are streamed from disk to disk, never needing to buffer in-memory on the
sending or receiving side, we should be able to remove any snapshot size
limit altogether (see #16954).

As a holdover, this change introduces a permitLargeSnapshots flag on a
replica which is set when the replica is too large to snapshot but observes
splits failing. When set, the flag allows snapshots to ignore the size
limit until the snapshot goes through and splits are able to succeed
again.

@nvb nvb requested a review from a team December 9, 2017 09:31
@cockroach-teamcity
Member

This change is Reviewable

@petermattis
Collaborator

Reviewed 5 of 5 files at r1.
Review status: 4 of 10 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/storage/replica_raftstorage.go, line 406 at r2 (raw file):

	// sending or receiving side, we should be able to remove any snapshot size
	// limit. See #16954 for more.
	if r.exceedsDoubleSplitSizeRLocked() && !r.mu.permitLargeSnapshots {

Another option would be to always allow Raft snapshots (snapType == snapTypeRaft). The downside to that approach is that it won't be possible to move an unsplittable range or to up-replicate it. I think the approach in this PR is better.


Comments from Reviewable

@bdarnell
Contributor

:lgtm:, but this should probably have a release note since it fixes/works around an externally-reported bug.


Reviewed 4 of 5 files at r1, 6 of 6 files at r2.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 1220 at r2 (raw file):

}

// func (r *Replica) DoubleMaxBytes() {

Remove this.


Comments from Reviewable

@tbg
Member

tbg commented Dec 11, 2017

:lgtm:


Reviewed 5 of 5 files at r1, 6 of 6 files at r2.
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


pkg/storage/client_split_test.go, line 916 at r2 (raw file):

}

// TestStoreRangeSplitAfterLargeSnapshot tests a scenerio where a range is too

scenario


pkg/storage/client_split_test.go, line 970 at r2 (raw file):

	header := roachpb.Header{RangeID: rangeID}

	// Deactivate split and snapshot queue.

At this point, you could've already experienced extra splits and so the check at the end of this method would always pass. Perhaps count the number of splits here after disabling the split queue and use that to compare later. Still racy (since the queue might still be processing after you've stopped it) but should be good enough.


pkg/storage/id_alloc_test.go, line 213 at r1 (raw file):

	// blocked. All attempts should hit a context deadline exceeded error.
	for i := 0; i < routines; i++ {
		ctx, _ := context.WithTimeout(context.Background(), 10*time.Millisecond)

Wanna move this up one line? Now you're doing 10ms back-to-back which makes this longer than it needs to be.


Comments from Reviewable

nvb added a commit to nvb/cockroach that referenced this pull request Dec 11, 2017
`TxnCoordSender` made sure to track and eventually clean up intents
for transactions that were put into Writing mode before seeing an
error. However, it did not previously make an effort to track and
clean up intents when it saw an `AmbiguousResultError` during the
first request of a transaction. This change fixes that.

This was giving me all kinds of trouble when trying to write a test
for cockroachdb#20589. Split requests continued to leave lots of dangling intents
around whenever they were canceled, which slowed down the test and
required lots of intent pushing. This change seems to resolve those
complications.

Release note (bug fix): clean up dangling intents eagerly when
AmbiguousResultErrors are seen.
@nvb nvb force-pushed the nvanbenschoten/splitFail branch from 5c21fa6 to 69a8d21 Compare December 12, 2017 06:07
nvb added a commit to nvb/cockroach that referenced this pull request Dec 12, 2017
`TxnCoordSender` made sure to track and eventually clean up intents
for transactions that were put into Writing mode before seeing an
error. However, it did not previously make an effort to track and
clean up intents when it saw an `AmbiguousResultError` during the
first request of a transaction. This change fixes that.

This was giving me all kinds of trouble when trying to write a test
for cockroachdb#20589. Split requests continued to leave lots of dangling intents
around whenever they were canceled, which slowed down the test and
required lots of intent pushing. This change seems to resolve those
complications.

Release note (bug fix): clean up dangling intents eagerly when
AmbiguousResultErrors are seen.
@nvb nvb force-pushed the nvanbenschoten/splitFail branch from aec9c98 to 5ba9cd5 Compare December 15, 2017 20:21
@nvb nvb force-pushed the nvanbenschoten/splitFail branch from 5ba9cd5 to 7ce4745 Compare December 15, 2017 20:24
In a privately reported user issue, we've seen that [our attempts](cockroachdb#7788)
at [preventing large snapshots](cockroachdb#7581)
can result in replica unavailability. Our current approach to limiting
large snapshots assumes that it's OK to block snapshots indefinitely
while waiting for a range to first split. Unfortunately, this can create
a dependency cycle where a range requires a snapshot to split (because it
can't achieve an up-to-date quorum without it) but isn't allowed to perform
a snapshot until its size is reduced below the threshold. This can result
in unavailability even when a majority of replicas remain live.

Currently, we still need this snapshot size limit because unbounded snapshots
can result in OOM errors that crash entire nodes. However, once snapshots
are streamed from disk to disk, never needing to buffer in-memory on the
sending or receiving side, we should be able to remove any snapshot size
limit (see cockroachdb#16954).

As a holdover, this change introduces a `permitLargeSnapshots` flag on a
replica which is set when the replica is too large to snapshot but observes
splits failing. When set, the flag allows snapshots to ignore the size
limit until the snapshot goes through and splits are able to succeed
again.

Release note (bug fix): Fixed a scenario where a range that is too big
to snapshot can lose availability even with a majority of nodes alive.
@nvb nvb force-pushed the nvanbenschoten/splitFail branch from 7ce4745 to 4b0232d Compare December 15, 2017 20:29
@nvb
Contributor Author

nvb commented Dec 15, 2017

TFTRs. This test should no longer be flaky.


Review status: 2 of 14 files reviewed at latest revision, 6 unresolved discussions, some commit checks pending.


pkg/storage/client_split_test.go, line 916 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

scenario

Done.


pkg/storage/client_split_test.go, line 970 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

At this point, you could've already experienced extra splits and so the check at the end of this method would always pass. Perhaps count the number of splits here after disabling the split queue and use that to compare later. Still racy (since the queue might still be processing after you've stopped it) but should be good enough.

Done.


pkg/storage/client_test.go, line 887 at r5 (raw file):

	// the transport and end up reaching the store. This has been the cause of
	// flakiness in the past.
	m.transport.GetCircuitBreaker(m.idents[i].NodeID).Break()

@bdarnell this seems to address the in-flight message issue we discussed. The breakers are checked synchronously in SendAsync, so all messages sent to the stopped node should be dropped before they end up in a buffered channel (at which point we would have a lot more trouble making sure they were dropped).


pkg/storage/id_alloc_test.go, line 213 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Wanna move this up one line? Now you're doing 10ms back-to-back which makes this longer than it needs to be.

Done.


pkg/storage/replica.go, line 1220 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Remove this.

Done.


Comments from Reviewable

@bdarnell
Contributor

:lgtm:


Reviewed 2 of 8 files at r3, 9 of 10 files at r4, 1 of 1 files at r5.
Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


Comments from Reviewable

@nvb nvb merged commit 102aea5 into cockroachdb:master Dec 17, 2017
@nvb nvb deleted the nvanbenschoten/splitFail branch December 17, 2017 06:12
@nvb
Contributor Author

nvb commented Dec 18, 2017

@bdarnell how do you feel about cherry-picking this into 1.1?

@bdarnell
Contributor

Cherry-pick SGTM


Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


Comments from Reviewable

nvb added a commit to nvb/cockroach that referenced this pull request Jan 25, 2018
Testing has demonstrated that hotspot workloads which write large amounts
of data to a small range of keys can overload the `splitQueue` and create
excessively large ranges. Previous efforts to parallelize replica splits
have helped improve the queue's ability to keep up with these workloads,
but speeding up splits alone will never be able to fully prevent unbounded
range growth. Large ranges (those in the GB range) are problematic for
multiple reasons including that they slow down snapshots, consistency checks,
and other processes that scan over entire ranges and haven't been built with
these size ranges in mind. In the past we've attempted to prevent these
processes from running when ranges get too large, but this has created
other issues like cockroachdb#20589.

This change introduces a proactive backpressure mechanism to delay writes
once a range gets too large while we wait for its split to succeed. This
accomplishes the task of preventing ranges from getting too large because
it stops the range from growing and prevents new writes from getting in
the way of the split attempt. This has been shown to create an effective
soft limit on the maximum size of the range. By default, this limit is set
to twice the `range_max_bytes` setting, but this change also introduces an
environment variable for configuring the limit.

Release note (performance improvement): Writes will now be backpressured
when ranges grow too large, until the range is successfully able to split.
This prevents unbounded range growth and improves a cluster's ability to
stay healthy under hotspot workloads.
nvb added a commit to nvb/cockroach that referenced this pull request Jan 30, 2018
Testing has demonstrated that hotspot workloads which write large amounts
of data to a small range of keys can overload the `splitQueue` and create
excessively large ranges. Previous efforts to parallelize replica splits
have helped improve the queue's ability to keep up with these workloads,
but speeding up splits alone will never be able to fully prevent unbounded
range growth. Large ranges (those in the GB range) are problematic for
multiple reasons including that they slow down snapshots, consistency checks,
and other processes that scan over entire ranges and haven't been built with
these size ranges in mind. In the past we've attempted to prevent these
processes from running when ranges get too large, but this has created
other issues like cockroachdb#20589.

This change introduces a proactive backpressure mechanism to delay writes
once a range gets too large while we wait for its split to succeed. This
accomplishes the task of preventing ranges from getting too large because
it stops the range from growing and prevents new writes from getting in
the way of the split attempt. This has been shown to create an effective
soft limit on the maximum size of the range. By default, this limit is set
to twice the `range_max_bytes` setting, but this change also introduces an
environment variable for configuring the limit.

Release note (performance improvement): Writes will now be backpressured
when ranges grow too large, until the range is successfully able to split.
This prevents unbounded range growth and improves a cluster's ability to
stay healthy under hotspot workloads.
tbg added a commit to tbg/cockroach that referenced this pull request Apr 16, 2018
A simple data generator that makes a single large range. The created dataset can
then be used for various tests, for example to exercise issues such as the ones
ultimately leading to cockroachdb#20589, or to
make sure [large snapshots] work (once implemented).

This is a work-in-progress because I haven't reached clarity on what the best
way to hook things up in these tests would be. Do we want to create the datasets
and upload them somewhere? That has been fragile in the past as the upload
process is seldom exercised and thus rots. The alternative (which I'm
leaning towards) is to bundle this binary with the test code (either explicitly
or via use as a library) and create fresh test data every time (these tests
would run as nightlies, so dataset generation speed isn't the top concern).

In making these decisions, we should also take into account more involved datasets
that can't as easily be generated from a running cluster, such as [gcpressurizer].

For those, my current take is that we'll just generate an initialized data dir,
open the resulting RocksDB instance manually again, and write straight into it
(via some facility that updates stats correctly, i.e. presumably `MVCCPut` and
friends).

Release note: None

[large snapshots]: cockroachdb#16954
[gcpressurizer]: cockroachdb#18661