
storage: eagerly replicate split ranges #7355

Merged
petermattis merged 2 commits into cockroachdb:master from petermattis:pmattis/eager-replication
Jun 22, 2016

Conversation

@petermattis (Collaborator) commented Jun 21, 2016

Eagerly add newly split ranges to the replicate queue. The replicate
queue only processes replicas that do not require splitting. This
reduces the time to replicate the initial ranges from ~30s to ~2s.

Fixes #7022.


This change is Reviewable
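The eager-enqueue behavior described above can be sketched as follows. This is a minimal, self-contained illustration with hypothetical type and function names (`replica`, `replicateQueue`, `onSplit`); the real logic lives in CockroachDB's storage package.

```go
package main

import "fmt"

// replica and replicateQueue are simplified stand-ins (hypothetical names);
// the real types live in the storage package.
type replica struct{ rangeID int }

type replicateQueue struct{ pending []int }

// maybeAdd offers a replica to the queue. The real queue also checks
// whether the replica actually needs replication before enqueueing it.
func (q *replicateQueue) maybeAdd(r *replica) {
	q.pending = append(q.pending, r.rangeID)
}

// onSplit models the change in this PR: after a split commits, both the
// updated left-hand side and the new right-hand side are offered to the
// replicate queue, instead of waiting for the periodic replica scanner.
func onSplit(q *replicateQueue, lhs, rhs *replica) {
	q.maybeAdd(lhs)
	q.maybeAdd(rhs)
}

func main() {
	q := &replicateQueue{}
	onSplit(q, &replica{rangeID: 1}, &replica{rangeID: 2})
	fmt.Println(len(q.pending)) // both halves are now pending
}
```

The point of the sketch is the trigger, not the queue: replication is driven by the split event itself rather than by the scanner's next pass.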

@petermattis (Collaborator, Author)

@tschottdorf, @bdarnell Any pointers on how to add a test for this? I suppose I could disable the replica scanner and still verify that the initial ranges are being replicated. Do we have any similar test?

cc @BramGruneir

@petermattis petermattis force-pushed the pmattis/eager-replication branch from b43b679 to 07e31ad Compare June 21, 2016 13:51
@petermattis (Collaborator, Author)

I've added a test which verifies that all of the initial ranges are in replicate queue purgatory after the initial splits occur and verified that this is not true if I disable the eager-replicate code path added in this PR.

@tamird (Contributor) commented Jun 21, 2016

LGTM

@BramGruneir (Member)

This will work when up-replicating a cluster at the start, but I don't think it will help in any way once a cluster is fully replicated; at that point these calls won't do anything helpful, and they'll make splits take a little bit longer.

I think there's a better solution out there, perhaps a specialized initial range list in the replicate queue that is looped over more aggressively. Or just enqueue the initial ranges more than once during bootstrapping (or some other part of the startup sequence). Or perhaps enqueue all ranges when a new node is gossiped (which would also be helpful for rebalances).

@BramGruneir (Member)

Or, maybe only enqueue after the split if the ranges are not fully replicated yet. That would avoid spamming the replicate queue when it is not needed (and I think we have that info already).
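That guard could look something like this; a hypothetical sketch where `needsUpreplication` and its arguments are illustrative, not the actual API.

```go
package main

import "fmt"

// needsUpreplication reports whether a range is below its configured
// replication target. Hypothetical helper: in the real code this
// information would come from the range descriptor and zone config.
func needsUpreplication(liveReplicas, targetReplicas int) bool {
	return liveReplicas < targetReplicas
}

func main() {
	// Only enqueue the split halves when the range is under-replicated,
	// avoiding queue traffic on a fully replicated cluster.
	fmt.Println(needsUpreplication(1, 3)) // true: fresh single-replica range
	fmt.Println(needsUpreplication(3, 3)) // false: fully replicated range
}
```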

@petermattis (Collaborator, Author)

@BramGruneir The impact on split performance will be minimal. We already do a ton of work at split time, this is a few extra checks. I'm not sure I understand your other concern. Once a cluster is fully replicated, a split range won't require additional replication. The alternative you outline seems quite a bit more work and less robust. There are two items blocking the initial ranges from being replicated: another node to replicate to and for the required splits to be complete. With this change, we trigger replication when both events occur.

@petermattis (Collaborator, Author)

@BramGruneir Is calling replicateQueue.MaybeAdd that expensive? When we have thousands of ranges on a node we'll be calling that method dozens of times per second just due to the replica scanner. I have trouble imagining that the additional calls after a split will be problematic. And if they are, we should make replicateQueue.MaybeAdd faster.

@cuongdo (Contributor) commented Jun 21, 2016

Yeah, just judging from past runs of the existing TestUpreplicate1To3Medium acceptance test, we consume quite a bit of CPU in the steady state (no replica movement between nodes). A good chunk of that is because of the replica scanner. So, I believe that moving to an event-based model should be a win.


@tbg (Member) commented Jun 21, 2016

:lgtm:


Reviewed 1 of 1 files at r1, 5 of 5 files at r2.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


storage/replica_command.go, line 2292 [r1] (raw file):
nit: was mildly confused by the second part. How about

// If the range was not properly replicated before the split, the replicate queue
// may not have picked it up (due to the need for a split). Enqueue both new halves
// to speed up a potentially necessary replication. See #7022.


storage/replicate_test.go, line 42 [r2] (raw file):

  // should be present in replicate queue purgatory.
  expected := server.ExpectedInitialRangeCount()
  if n := store.ReplicateQueuePurgatoryLength(); expected != n {

I'm too unfamiliar with the code to figure out why the ranges end up in purgatory. Could you add a comment as to why that is?


Comments from Reviewable

@BramGruneir (Member)

@petermattis I was just kind of spitballing a few alternate ideas that I had last night.

I just don't think adding the call on both sides of a split during normal operation will be useful. I can see us looking to speed up splits in the near future and adding unneeded checks won't help.

Is maybeAdd fast? It should be relatively quick, but we are possibly iterating over all stores. So it could be faster. We haven't worried about its performance before.

@petermattis petermattis force-pushed the pmattis/eager-replication branch from 07e31ad to 7063c9f Compare June 21, 2016 15:04
@petermattis (Collaborator, Author)

Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


storage/replica_command.go, line 2292 [r1] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit: was mildly confused by the second part. How about

// If the range was not properly replicated before the split, the replicate queue
// may not have picked it up (due to the need for a split). Enqueue both new halves
// to speed up a potentially necessary replication. See #7022.

Sure. Done.

storage/replicate_test.go, line 42 [r2] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'm too unfamiliar with the code to figure out why the ranges end up in purgatory. Could you add a comment as to why that is?

Done. Added `(because we only have a single store in the test and thus replication cannot succeed)`.

Comments from Reviewable

@BramGruneir (Member)

@petermattis Anyway, hard to argue with results, so this change works for me. I'm describing a potential problem that we haven't encountered yet. So let's go with this.

But also notice that this didn't solve the problem I'm investigating about the acceptance test failures, since 2 of them just failed.

I'll take a look at speeding up maybeAdd; I didn't know we were spending so much CPU when the cluster is idle. That seems like a problem on its own.

@petermattis (Collaborator, Author)

@cuongdo How many ranges did you have when you saw the replica scanner consuming CPU while idle?

@cuongdo (Contributor) commented Jun 21, 2016

2,265 per node (so multiply by 3)


@cuongdo (Contributor) commented Jun 21, 2016

108 GiB of data per node.


@petermattis (Collaborator, Author)

@cuongdo So we'd be scanning 3.7 replicas/second with the default scan time. Surprised that would show up as a CPU consumer, but perhaps the replica allocator has some low-hanging fruit. Let's file a separate issue about this.
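For reference, the back-of-envelope arithmetic behind the ~3.7 replicas/second figure, assuming the scanner visits every replica once per interval; the 10-minute interval used here is an assumption, not a value confirmed in this thread.

```go
package main

import "fmt"

// scanRate returns replicas visited per second when every replica is
// scanned once per interval.
func scanRate(ranges, intervalSeconds int) float64 {
	return float64(ranges) / float64(intervalSeconds)
}

func main() {
	// 2,265 ranges per node, assumed 10-minute scan interval.
	fmt.Println(scanRate(2265, 10*60)) // ≈ 3.775 replicas/second
}
```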

@bdarnell (Contributor)

:lgtm:

If we make enough changes like this to make things event-driven, we may be able to slow down the replica scanner without ill effect. We should aim to make MaybeAdd fast enough that we can call it whenever anything interesting happens on a range. In fact, a split affects most of the queues (all of the ones with acceptsUnsplitRanges: false, plus the split queue itself may be ready to split the range again if it is in a new zone config or if it had gotten very large), so maybe we just should add the new ranges to all the queues after a split?
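The "add to all affected queues" idea could be sketched as a filter over a store's queues. The types and field names below are hypothetical simplifications; `acceptsUnsplitRanges` mirrors the real queue option discussed above.

```go
package main

import "fmt"

// queue models the subset of the storage queue configuration relevant here
// (hypothetical simplification).
type queue struct {
	name                 string
	acceptsUnsplitRanges bool
}

// queuesAffectedBySplit returns the queues worth notifying after a split:
// those that skip unsplit ranges, plus the split queue itself, which may
// want to split a new half again (new zone config, or very large range).
func queuesAffectedBySplit(queues []queue) []string {
	var out []string
	for _, q := range queues {
		if !q.acceptsUnsplitRanges || q.name == "split" {
			out = append(out, q.name)
		}
	}
	return out
}

func main() {
	queues := []queue{
		{name: "replicate", acceptsUnsplitRanges: false},
		{name: "gc", acceptsUnsplitRanges: false},
		{name: "raftlog", acceptsUnsplitRanges: true},
		{name: "split", acceptsUnsplitRanges: true},
	}
	fmt.Println(queuesAffectedBySplit(queues))
}
```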


Review status: 4 of 6 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


Comments from Reviewable

@petermattis (Collaborator, Author)

Besides the replicateQueue, only gcQueue has acceptsUnsplitRanges: false (unless I grep'd incorrectly).


Review status: 4 of 6 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


Comments from Reviewable

Eagerly add newly split ranges to the replicate queue. The replicate
queue only processes replicas that do not require splitting. This
reduces the time to replicate the initial ranges from ~30s to ~2s.

Fixes cockroachdb#7022.
@tamird (Contributor) commented Jun 21, 2016

:lgtm:


Reviewed 4 of 5 files at r2, 2 of 2 files at r3.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


Comments from Reviewable

Temporarily adjust the value of defaultDeclinedReservationsTimeout to 0s
to avoid problems with initial range replication in 3 node clusters.

See cockroachdb#7332.
@petermattis petermattis force-pushed the pmattis/eager-replication branch from 7063c9f to 988628a Compare June 21, 2016 19:57
@petermattis (Collaborator, Author)

@BramGruneir I pushed another commit which sets defaultDeclinedReservationsTimeout to 0s as a temporary workaround to acceptance test flakiness. Before that change TestSingleKey was flaky about 1 out of every 5 runs. Afterwards, I've now seen 32 successful consecutive runs.

@tbg (Member) commented Jun 21, 2016

👍

@petermattis petermattis merged commit f447eb0 into cockroachdb:master Jun 22, 2016
@petermattis petermattis deleted the pmattis/eager-replication branch June 22, 2016 13:30
@mberhault mberhault mentioned this pull request Jun 22, 2016
petermattis added a commit to petermattis/cockroach that referenced this pull request Jun 22, 2016
Mark Replicas as destroyed in `Replica.Destroy` and do not perform any
further raft operations after destruction.

This addresses the underlying bug in cockroachdb#7386 that was exacerbated by cockroachdb#7355.

The specific sequence of events in cockroachdb#7386 was that a replica was queued
for replication and the test then removed the replica from the
store. Removal of a replica from a store removes it from the various
queues, though that is done asynchronously. The replica queue merrily
tries to process the now destroyed replica and as a first step tries to
acquire the leader lease. This leader lease acquisition is done via a
raft command, but the state underlying the replica has already been
deleted. Boom!