storage: eagerly replicate split ranges (#7355)
Conversation
@tschottdorf, @bdarnell Any pointers on how to add a test for this? I suppose I could disable the replica scanner and still verify that the initial ranges are being replicated. Do we have any similar test? cc @BramGruneir
I've added a test which verifies that all of the initial ranges are in replicate queue purgatory after the initial splits occur, and verified that this is not true if I disable the eager-replicate code path added in this PR.
LGTM
This will work when up-replicating a cluster at the start, but I don't think it will help in any way once a cluster is fully replicated; at that point these calls won't do anything helpful, they'll just make splits take a little bit longer. I think there's a better solution out there: perhaps a specialized list of initial ranges in the replicate queue that is looped over more aggressively. Or just enqueue the initial ranges more than once during bootstrapping (or some other part of the startup sequence). Or perhaps enqueue all ranges when a new node is gossiped (which would also help with rebalances).
Or, maybe only enqueue after the split if the ranges are not fully replicated yet. That would avoid spamming the replicate queue when it is not needed (and I think we have that info already).
@BramGruneir The impact on split performance will be minimal. We already do a ton of work at split time; this is a few extra checks. I'm not sure I understand your other concern: once a cluster is fully replicated, a split range won't require additional replication. The alternative you outline seems like quite a bit more work and less robust. There are two events blocking the initial ranges from being replicated: the arrival of another node to replicate to, and completion of the required splits. With this change, we trigger replication when both events occur.
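The event-driven idea above can be sketched in Go. This is a simplified illustration, not CockroachDB's actual types: `rangeDesc`, `replicateQueue`, and `onSplit` are hypothetical stand-ins for the real split trigger and queue. The under-replication check also addresses the concern about fully replicated clusters, where the eager enqueue becomes a no-op.

```go
package main

import "fmt"

// rangeDesc is a hypothetical, minimal stand-in for a range descriptor.
type rangeDesc struct {
	id       int
	replicas int
}

// replicateQueue sketches the replicate queue; maybeAdd only enqueues a
// range when it is under-replicated relative to the target factor.
type replicateQueue struct {
	target  int // desired replication factor
	pending []int
}

func (q *replicateQueue) maybeAdd(r rangeDesc) {
	if r.replicas < q.target {
		q.pending = append(q.pending, r.id)
	}
}

// onSplit mirrors the idea in this PR: after a split, eagerly offer both
// resulting ranges to the replicate queue instead of waiting for the
// periodic replica scanner to find them.
func onSplit(q *replicateQueue, left, right rangeDesc) {
	q.maybeAdd(left)
	q.maybeAdd(right)
}

func main() {
	q := &replicateQueue{target: 3}
	// Under-replicated ranges are enqueued immediately.
	onSplit(q, rangeDesc{id: 1, replicas: 1}, rangeDesc{id: 2, replicas: 1})
	// Fully replicated ranges make the eager call a no-op.
	onSplit(q, rangeDesc{id: 3, replicas: 3}, rangeDesc{id: 4, replicas: 3})
	fmt.Println(q.pending) // prints [1 2]
}
```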
@BramGruneir Is calling `maybeAdd` fast?
Yeah, just judging from past runs of the existing …
Reviewed 1 of 1 files at r1, 5 of 5 files at r2.

storage/replica_command.go, line 2292 [r1] (raw file):
storage/replicate_test.go, line 42 [r2] (raw file):
I'm too unfamiliar with the code to figure out why the ranges end up in purgatory. Could you add a comment as to why that is?
@petermattis I was just kind of spitballing a few alternate ideas that I had last night. I just don't think adding the call on both sides of a split during normal operation will be useful. I can see us looking to speed up splits in the near future, and adding unneeded checks won't help.

> Is `maybeAdd` fast?

It should be relatively quick, but we are possibly iterating over all stores, so it could be faster. We haven't worried about its performance before.
@petermattis Anyway, it's hard to argue with results, so this change works for me; I'm describing a potential problem that we haven't encountered yet. So let's go with this. Note, though, that this didn't solve the problem I'm investigating with the acceptance test failures, since 2 of them just failed. I'll take a look at speeding up `maybeAdd`. I didn't know we were spending so much CPU when the cluster is idle; that seems like a problem on its own too.
@cuongdo How many ranges did you have when you saw the replica scanner consuming CPU while idle?
2,265 per node (so multiply by 3)
108 GiB of data per node.
@cuongdo So we'd be scanning 3.7 replicas/second with the default scan time. Surprised that would show up as a CPU consumer, but perhaps the replica allocator has some low-hanging fruit. Let's file a separate issue about this.
If we make enough changes like this to make things event-driven, we may be able to slow down the replica scanner without ill effect. We should aim to make …
Eagerly add newly split ranges to the replicate queue. The replicate queue only processes replicas that do not require splitting. This reduces the time to replicate the initial ranges from ~30s to ~2s. Fixes cockroachdb#7022.
Reviewed 4 of 5 files at r2, 2 of 2 files at r3.
Temporarily adjust the value of `defaultDeclinedReservationsTimeout` to 0s to avoid problems with initial range replication in 3 node clusters. See cockroachdb#7332.
@BramGruneir I pushed another commit which sets `defaultDeclinedReservationsTimeout` to 0s (see cockroachdb#7332).
👍
Mark Replicas as destroyed in `Replica.Destroy` and do not perform any further raft operations after destruction. This addresses the underlying bug in cockroachdb#7386 that was exacerbated by cockroachdb#7355. The specific sequence of events in cockroachdb#7386 was that a replica was queued for replication and the test then removed the replica from the store. Removal of a replica from a store removes it from the various queues, though that is done asynchronously. The replica queue merrily tries to process the now destroyed replica and as a first step tries to acquire the leader lease. This leader lease acquisition is done via a raft command, but the state underlying the replica has already been deleted. Boom!
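The guard this commit describes can be sketched as a destroyed flag checked before any raft work. The types below (`replica`, `proposeRaftCommand`) are simplified, hypothetical stand-ins for the real `Replica` implementation, not CockroachDB's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errDestroyed is returned for any raft operation attempted after Destroy,
// so callers fail fast instead of touching already-deleted state.
var errDestroyed = errors.New("replica destroyed")

type replica struct {
	mu        sync.Mutex
	destroyed bool
}

// Destroy marks the replica so that no further raft operations proceed.
func (r *replica) Destroy() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.destroyed = true
}

// proposeRaftCommand stands in for any raft operation, e.g. acquiring the
// leader lease while the replicate queue processes a stale entry.
func (r *replica) proposeRaftCommand() error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.destroyed {
		return errDestroyed
	}
	return nil
}

func main() {
	r := &replica{}
	fmt.Println(r.proposeRaftCommand()) // prints <nil>
	r.Destroy()
	fmt.Println(r.proposeRaftCommand()) // prints replica destroyed
}
```

With this guard, the sequence from cockroachdb#7386 (queue for replication, destroy the replica, then queue processing attempts a lease acquisition) returns an error instead of crashing.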
Eagerly add newly split ranges to the replicate queue. The replicate
queue only processes replicas that do not require splitting. This
reduces the time to replicate the initial ranges from ~30s to ~2s.
Fixes #7022.