
storage: eagerly replicate split ranges #7355

Merged
petermattis merged 2 commits into cockroachdb:master from petermattis:pmattis/eager-replication
Jun 22, 2016

Conversation

@petermattis (Collaborator) commented Jun 21, 2016

Eagerly add newly split ranges to the replicate queue. The replicate
queue only processes replicas that do not require splitting. This
reduces the time to replicate the initial ranges from ~30s to ~2s.

Fixes #7022.


This change is Reviewable
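The eager-enqueue behavior described above can be sketched as follows. This is a minimal, self-contained illustration with hypothetical type and function names (`replica`, `replicateQueue`, `onSplit`); the real logic lives in CockroachDB's storage package.

```go
package main

import "fmt"

// replica and replicateQueue are simplified stand-ins (hypothetical names);
// the real types live in the storage package.
type replica struct{ rangeID int }

type replicateQueue struct{ pending []int }

// maybeAdd offers a replica to the queue. The real queue also checks
// whether the replica actually needs replication before enqueueing it.
func (q *replicateQueue) maybeAdd(r *replica) {
	q.pending = append(q.pending, r.rangeID)
}

// onSplit models the change in this PR: after a split commits, both the
// updated left-hand side and the new right-hand side are offered to the
// replicate queue, instead of waiting for the periodic replica scanner.
func onSplit(q *replicateQueue, lhs, rhs *replica) {
	q.maybeAdd(lhs)
	q.maybeAdd(rhs)
}

func main() {
	q := &replicateQueue{}
	onSplit(q, &replica{rangeID: 1}, &replica{rangeID: 2})
	fmt.Println(len(q.pending)) // both halves are now pending
}
```

The point of the sketch is the trigger, not the queue: replication is driven by the split event itself rather than by the scanner's next pass.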

@petermattis (Collaborator, Author)

@tschottdorf, @bdarnell Any pointers on how to add a test for this? I suppose I could disable the replica scanner and still verify that the initial ranges are being replicated. Do we have any similar test?

cc @BramGruneir

@petermattis petermattis force-pushed the pmattis/eager-replication branch from b43b679 to 07e31ad Compare June 21, 2016 13:51
@petermattis (Collaborator, Author)

I've added a test which verifies that all of the initial ranges are in replicate queue purgatory after the initial splits occur and verified that this is not true if I disable the eager-replicate code path added in this PR.

@tamird (Contributor) commented Jun 21, 2016

LGTM

@BramGruneir (Member)

This will work when up-replicating a cluster at the start, but I don't think it will help in any way once a cluster is fully replicated; at that point these calls won't do anything helpful, and they'll make splits take a little bit longer.

I think there's a better solution out there, perhaps a specialized initial range list in the replicate queue that is looped over more aggressively. Or just enqueue the initial ranges more than once during bootstrapping (or some other part of the startup sequence). Or perhaps enqueue all ranges when a new node is gossiped (which would also be helpful for rebalances).

@BramGruneir (Member)

Or, maybe only enqueue after the split if the ranges are not fully replicated yet. That would avoid spamming the replicate queue when it is not needed (and I think we have that info already).
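That guard could look something like this; a hypothetical sketch where `needsUpreplication` and its arguments are illustrative, not the actual API.

```go
package main

import "fmt"

// needsUpreplication reports whether a range is below its configured
// replication target. Hypothetical helper: in the real code this
// information would come from the range descriptor and zone config.
func needsUpreplication(liveReplicas, targetReplicas int) bool {
	return liveReplicas < targetReplicas
}

func main() {
	// Only enqueue the split halves when the range is under-replicated,
	// avoiding queue traffic on a fully replicated cluster.
	fmt.Println(needsUpreplication(1, 3)) // true: fresh single-replica range
	fmt.Println(needsUpreplication(3, 3)) // false: fully replicated range
}
```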

@petermattis (Collaborator, Author)

@BramGruneir The impact on split performance will be minimal. We already do a ton of work at split time, this is a few extra checks. I'm not sure I understand your other concern. Once a cluster is fully replicated, a split range won't require additional replication. The alternative you outline seems quite a bit more work and less robust. There are two items blocking the initial ranges from being replicated: another node to replicate to and for the required splits to be complete. With this change, we trigger replication when both events occur.

@petermattis (Collaborator, Author)

@BramGruneir Is calling replicateQueue.MaybeAdd that expensive? When we have thousands of ranges on a node we'll be calling that method dozens of times per second just due to the replica scanner. I have trouble imagining that the additional calls after a split will be problematic. And if they are, we should make replicateQueue.MaybeAdd faster.

@cuongdo (Contributor) commented Jun 21, 2016

Yeah, just judging from past runs of the existing TestUpreplicate1To3Medium acceptance test, we consume quite a bit of CPU in the steady state (no replica movement between nodes). A good chunk of that is because of the replica scanner. So, I believe that moving to an event-based model should be a win.


@tbg (Member) commented Jun 21, 2016

:lgtm:


Reviewed 1 of 1 files at r1, 5 of 5 files at r2.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


storage/replica_command.go, line 2292 [r1] (raw file):
nit: was mildly confused by the second part. How about

// If the range was not properly replicated before the split, the replicate queue
// may not have picked it up (due to the need for a split). Enqueue both new halves
// to speed up a potentially necessary replication. See #7022.


storage/replicate_test.go, line 42 [r2] (raw file):

  // should be present in replicate queue purgatory.
  expected := server.ExpectedInitialRangeCount()
  if n := store.ReplicateQueuePurgatoryLength(); expected != n {

I'm too unfamiliar with the code to figure out why the ranges end up in purgatory. Could you add a comment as to why that is?


Comments from Reviewable

@BramGruneir (Member)

@petermattis I was just kind of spitballing a few alternate ideas that I had last night.

I just don't think adding the call on both sides of a split during normal operation will be useful. I can see us looking to speed up splits in the near future and adding unneeded checks won't help.

Is maybeAdd fast? It should be relatively quick, but we are possibly iterating over all stores. So it could be faster. We haven't worried about its performance before.

@petermattis petermattis force-pushed the pmattis/eager-replication branch from 07e31ad to 7063c9f Compare June 21, 2016 15:04
@petermattis (Collaborator, Author)

Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


storage/replica_command.go, line 2292 [r1] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit: was mildly confused by the second part. How about

// If the range was not properly replicated before the split, the replicate queue
// may not have picked it up (due to the need for a split). Enqueue both new halves
// to speed up a potentially necessary replication. See #7022.

Sure. Done.

storage/replicate_test.go, line 42 [r2] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'm too unfamiliar with the code to figure out why the ranges end up in purgatory. Could you add a comment as to why that is?

Done. Added `(because we only have a single store in the test and thus replication cannot succeed)`.

Comments from Reviewable

@BramGruneir (Member)

@petermattis Anyway, hard to argue with results, so this change works for me. I'm describing a potential problem that we haven't encountered yet. So let's go with this.

But also notice that this didn't solve the problem I'm investigating about the acceptance test failures, since 2 of them just failed.

I'll take a look at speeding up maybeAdd; I didn't know we were spending so much CPU when the cluster is idle. That seems like a problem on its own.

@petermattis (Collaborator, Author)

@cuongdo How many ranges did you have when you saw the replica scanner consuming CPU while idle?

@cuongdo (Contributor) commented Jun 21, 2016

2,265 per node (so multiply by 3)


@cuongdo (Contributor) commented Jun 21, 2016

108 GiB of data per node.


@petermattis (Collaborator, Author)

@cuongdo So we'd be scanning 3.7 replicas/second with the default scan time. Surprised that would show up as a CPU consumer, but perhaps the replica allocator has some low-hanging fruit. Let's file a separate issue about this.
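For reference, the back-of-envelope arithmetic behind the ~3.7 replicas/second figure, assuming the scanner visits every replica once per interval; the 10-minute interval used here is an assumption, not a value confirmed in this thread.

```go
package main

import "fmt"

// scanRate returns replicas visited per second when every replica is
// scanned once per interval.
func scanRate(ranges, intervalSeconds int) float64 {
	return float64(ranges) / float64(intervalSeconds)
}

func main() {
	// 2,265 ranges per node, assumed 10-minute scan interval.
	fmt.Println(scanRate(2265, 10*60)) // ≈ 3.775 replicas/second
}
```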

@bdarnell (Contributor)

:lgtm:

If we make enough changes like this to make things event-driven, we may be able to slow down the replica scanner without ill effect. We should aim to make MaybeAdd fast enough that we can call it whenever anything interesting happens on a range. In fact, a split affects most of the queues (all of the ones with acceptsUnsplitRanges: false, plus the split queue itself may be ready to split the range again if it is in a new zone config or if it had gotten very large), so maybe we just should add the new ranges to all the queues after a split?
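The "add to all affected queues" idea could be sketched as a filter over a store's queues. The types and field names below are hypothetical simplifications; `acceptsUnsplitRanges` mirrors the real queue option discussed above.

```go
package main

import "fmt"

// queue models the subset of the storage queue configuration relevant here
// (hypothetical simplification).
type queue struct {
	name                 string
	acceptsUnsplitRanges bool
}

// queuesAffectedBySplit returns the queues worth notifying after a split:
// those that skip unsplit ranges, plus the split queue itself, which may
// want to split a new half again (new zone config, or very large range).
func queuesAffectedBySplit(queues []queue) []string {
	var out []string
	for _, q := range queues {
		if !q.acceptsUnsplitRanges || q.name == "split" {
			out = append(out, q.name)
		}
	}
	return out
}

func main() {
	queues := []queue{
		{name: "replicate", acceptsUnsplitRanges: false},
		{name: "gc", acceptsUnsplitRanges: false},
		{name: "raftlog", acceptsUnsplitRanges: true},
		{name: "split", acceptsUnsplitRanges: true},
	}
	fmt.Println(queuesAffectedBySplit(queues))
}
```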


Review status: 4 of 6 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


Comments from Reviewable

@petermattis (Collaborator, Author)

Besides the replicateQueue, only gcQueue has acceptsUnsplitRanges: false (unless I grep'd incorrectly).


Review status: 4 of 6 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


Comments from Reviewable

Eagerly add newly split ranges to the replicate queue. The replicate
queue only processes replicas that do not require splitting. This
reduces the time to replicate the initial ranges from ~30s to ~2s.

Fixes cockroachdb#7022.
@tamird (Contributor) commented Jun 21, 2016

:lgtm:


Reviewed 4 of 5 files at r2, 2 of 2 files at r3.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


Comments from Reviewable

Temporarily adjust the value of defaultDeclinedReservationsTimeout to 0s
to avoid problems with initial range replication in 3 node clusters.

See cockroachdb#7332.
@petermattis petermattis force-pushed the pmattis/eager-replication branch from 7063c9f to 988628a Compare June 21, 2016 19:57
@petermattis (Collaborator, Author)

@BramGruneir I pushed another commit which sets defaultDeclinedReservationsTimeout to 0s as a temporary workaround to acceptance test flakiness. Before that change TestSingleKey was flaky about 1 out of every 5 runs. Afterwards, I've now seen 32 successful consecutive runs.

@tbg (Member) commented Jun 21, 2016

👍

@petermattis petermattis merged commit f447eb0 into cockroachdb:master Jun 22, 2016
@petermattis petermattis deleted the pmattis/eager-replication branch June 22, 2016 13:30
@mberhault mberhault mentioned this pull request Jun 22, 2016
petermattis added a commit to petermattis/cockroach that referenced this pull request Jun 22, 2016
Mark Replicas as destroyed in `Replica.Destroy` and do not perform any
further raft operations after destruction.

This addresses the underlying bug in cockroachdb#7386 that was exacerbated by cockroachdb#7355.

The specific sequence of events in cockroachdb#7386 was that a replica was queued
for replication and the test then removed the replica from the
store. Removal of a replica from a store removes it from the various
queues, though that is done asynchronously. The replica queue merrily
tries to process the now destroyed replica and as a first step tries to
acquire the leader lease. This leader lease acquisition is done via a
raft command, but the state underlying the replica has already been
deleted. Boom!