storage: when stores are unavailable, don't send the replica to purgatory#7332
storage: when stores are unavailable, don't send the replica to purgatory#7332BramGruneir merged 2 commits intocockroachdb:masterfrom
Conversation
|
I'm going to spam circle tonight to see if I can get more failures. |
|
Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending. acceptance/replication_test.go, line 36 [r1] (raw file):
Not terribly thrilled about this. @WillHaack was investigating initial replication delays recently. Comments from Reviewable |
|
Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending. acceptance/replication_test.go, line 36 [r1] (raw file):
|
|
Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending. acceptance/replication_test.go, line 36 [r1] (raw file):
|
|
I've been spamming circle. There should are a number of examples of this failure still happening. https://circleci.com/gh/BramGruneir/cockroach/tree/7329 I'll be investigating them tomorrow. Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending. Comments from Reviewable |
|
From run #631: E160621 05:56:11.919336 storage/queue.go:449 [replicate] range=3 Not sure if it's relevant. Just happened to be one that I looked at. On Tue, Jun 21, 2016 at 1:44 AM Bram Gruneir notifications@github.com
-- Tobias |
|
@BramGruneir There might be other problems with initial replication, but see my latest comment in #7022. |
|
@tschottdorf That's range 3, not range 1. So it doesn't cause the slow down in the test. I'm going to dig through a bunch of the logs today. Will report back with more. @petermattis I'm going to see if that will help, I'll rebase off your change and see if we get any repeats of the failures.. |
|
Looking into acceptance test failures I'm seeing for #7355, I turned on some additional logging and saw: Note that this occurred after we had replicated ranges When a replica is placed in purgatory it isn't removed until something kicks the |
|
Sounds like we're causing reservation-declined replicas to go into purgatory, which should probably never happen (reservations have their own throttling). |
|
Yes, when a reservation is declined we set |
|
Hacky fix is to set |
|
I'm working on a real fix. Comments from Reviewable |
|
I'm curious to see what that real fix is. Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful. acceptance/replication_test.go, line 36 [r1] (raw file):
|
|
Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful. acceptance/replication_test.go, line 36 [r1] (raw file):
|
Temporarily adjust the value of defaultDeclinedReservationsTimeout to 0s to avoid problems with initial range replication in 3 node clusters. See cockroachdb#7332.
|
I've updated the PR with a new commit that should fix the problem. I'm testing it both locally and in circle to see if it works. Review status: 0 of 4 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending. Comments from Reviewable |
|
Review status: 0 of 4 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful. storage/allocator.go, line 213 [r4] (raw file):
This is vlogged by storage/store_pool.go, line 99 [r4] (raw file):
Comments from Reviewable |
|
Review status: 0 of 4 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending. storage/allocator.go, line 213 [r4] (raw file):
|
|
I don't see how this prevents stores from getting sent to purgatory, and I don't see a test that looks at that case. Can you add one, please? Reviewed 1 of 1 files at r2, 1 of 3 files at r4, 2 of 2 files at r5. storage/allocator.go, line 211 [r5] (raw file):
why fmt.Errorf rather than util.Errorf? storage/replicate_queue.go, line 138 [r5] (raw file):
throughout: why don't you just log the replica descriptor? storage/store_pool.go, line 99 [r4] (raw file):
|
|
Replicas are sent to purgatory when the allocator returns an error that implements the purgatoryError interface. Review status: 1 of 5 files reviewed at latest revision, 9 unresolved discussions, some commit checks pending. storage/allocator.go, line 211 [r5] (raw file):
|
|
Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 3 of 3 files at r6, 4 of 4 files at r7. storage/allocator_test.go, line 1021 [r7] (raw file):
throughout: plz just use storage/allocator_test.go, line 1027 [r7] (raw file):
you don't need this - the check on the next line will catch it. storage/allocator_test.go, line 1031 [r7] (raw file):
use %v so that nil prints nicely storage/allocator_test.go, line 1035 [r7] (raw file):
so this makes allocation succeed? fun. storage/allocator_test.go, line 1045 [r7] (raw file):
isn't this tested elsewhere? doesn't seem useful in this test. storage/allocator_test.go, line 1064 [r7] (raw file):
not needed storage/allocator_test.go, line 1068 [r7] (raw file):
%v storage/replicate_queue.go, line 138 [r5] (raw file):
|
…tory When a store becomes unavailable due to a declined reservation, we should not be sending any replicas to purgatory. Purgatory should only be used for when there are no matching and alive stores. Fixes cockroachdb#7329
|
Review status: all files reviewed at latest revision, 11 unresolved discussions, all commit checks successful. storage/allocator_test.go, line 1021 [r7] (raw file):
|
|
Reviewed 4 of 4 files at r8, 5 of 5 files at r9. storage/allocator_test.go, line 1035 [r7] (raw file):
|
When a store becomes unavailable due to a declined reservation, we should not
be sending any replicas to purgatory. Purgatory should only be used for when
there are no matching and alive stores.
Fixes #7329
This change is