storage: when stores are unavailable, don't send the replica to purgatory #7332

Merged
BramGruneir merged 2 commits into cockroachdb:master from BramGruneir:7329
Jun 22, 2016
Conversation

@BramGruneir
Member

@BramGruneir BramGruneir commented Jun 20, 2016

When a store becomes unavailable due to a declined reservation, we should not
send any replicas to purgatory. Purgatory should only be used when there are
no matching, alive stores.

Fixes #7329


This change is Reviewable

@BramGruneir
Member Author

I'm going to spam circle tonight to see if I can get more failures.

@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


acceptance/replication_test.go, line 36 [r1] (raw file):

// TODO(bram): Hoping that a longer timeout will fix #7329. Remove this when we
// find the root cause.
const waitForReplicationDuration = 30 * time.Second

Not terribly thrilled about this. @WillHaack was investigating initial replication delays recently.


Comments from Reviewable

@tbg
Member

tbg commented Jun 20, 2016

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


acceptance/replication_test.go, line 36 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Not terribly thrilled about this. @WillHaack was investigating initial replication delays recently.

Also, if the logs weren't helpful - why not add more logging? Are we running the containers at V(1) already? If not, that's a worthwhile change, too. There are some `ChangeReplicas` pushes. If we log enough to, for example, find out that one of the `ChangeReplicas` simply doesn't finish, that would be valuable piece of information.

Comments from Reviewable

@BramGruneir
Member Author

Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


acceptance/replication_test.go, line 36 [r1] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Also, if the logs weren't helpful - why not add more logging? Are we running the containers at V(1) already? If not, that's a worthwhile change, too. There are some ChangeReplicas pushes. If we log enough to, for example, find out that one of the ChangeReplicas simply doesn't finish, that would be valuable piece of information.

I'm not happy about it either.

The logs of the containers are now at v1; I made that change last week. There were no change-replica delays or failures from what I could tell from the logs, just that range 1 was only replicated once. All other ranges were fully replicated.

This doesn't happen all the time, but it seems to happen quite often. And I can't reproduce it locally, so I wanted to see if giving it a bit more time to complete would at least cut down on the number of failures.

But I take your point. I've added some logging to the replicate queue at v1 to gain more insight into the goings on there.


Comments from Reviewable

@BramGruneir
Member Author

I've been spamming circle. There are a number of examples of this failure still happening: https://circleci.com/gh/BramGruneir/cockroach/tree/7329

I'll be investigating them tomorrow.


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


Comments from Reviewable

@tbg
Member

tbg commented Jun 21, 2016

From run #631:

E160621 05:56:11.919336 storage/queue.go:449 [replicate] range=3
[/Table/12-/Table/13): error: storage/replica_command.go:2611: change
replicas of 3 failed: reservation declined:header:<node_id:3
store_id:3 > from_node_id:1 from_store_id:1 range_id:3
range_size:67108864

Not sure if it's relevant. Just happened to be one that I looked at.


@petermattis
Collaborator

@BramGruneir There might be other problems with initial replication, but see my latest comment in #7022.

@BramGruneir
Member Author

@tschottdorf That's range 3, not range 1. So it doesn't cause the slow down in the test.

I'm going to dig through a bunch of the logs today. Will report back with more.

@petermattis I'm going to see if that will help. I'll rebase off your change and see if we get any repeats of the failures.

@petermattis
Collaborator

Looking into acceptance test failures I'm seeing for #7355, I turned on some additional logging and saw:

I160621 16:10:40.588263 storage/queue.go:323  [replicate] range=1 [/Min-/Table/11): adding: priority=1000.000
I160621 16:10:40.589059 storage/queue.go:438  [replicate] range=1 [/Min-/Table/11): processing
I160621 16:10:40.589401 storage/queue.go:467  [replicate] range=1 [/Min-/Table/11) (purgatory): error: 0 of 3 stores with an attribute matching []; likely not enough nodes in cluster

Note that this occurred after we had replicated ranges 2-5. Slightly earlier in the log I see a reservation for the replication of range 1 was declined. And then range 1 gets placed in purgatory:

I160621 16:10:40.589401 storage/queue.go:467  [replicate] range=1 [/Min-/Table/11) (purgatory): error: 0 of 3 stores with an attribute matching []; likely not enough nodes in cluster

When a replica is placed in purgatory it isn't removed until something kicks the purgatoryChan(), but there is no outside event to do so in this instance.

@tamird
Contributor

tamird commented Jun 21, 2016

Sounds like we're causing reservation-declined replicas to go into purgatory, which should probably never happen (reservations have their own throttling).

@petermattis
Collaborator

Yes, when a reservation is declined we set storeDetail.unavailableUntil to a time in the future. We then refuse to match that store, so that StorePool.getStoreList returns only a subset of the stores in the cluster. This causes the replica that we're trying to rebalance to be placed in purgatory, from which it can't leave until we receive another gossiped store descriptor. By default, we only gossip store descriptors once per minute.

@petermattis
Collaborator

Hacky fix is to set defaultDeclinedReservationsTimeout to 0.

@BramGruneir
Member Author

I'm working on a real fix.


Comments from Reviewable

@petermattis
Collaborator

I'm curious to see what that real fix is.


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


acceptance/replication_test.go, line 36 [r1] (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

I'm not happy about it either.

The logs of the containers are now at v1; I made that change last week. There were no change-replica delays or failures from what I could tell from the logs, just that range 1 was only replicated once. All other ranges were fully replicated.

This doesn't happen all the time, but it seems to happen quite often. And I can't reproduce it locally, so I wanted to see if giving it a bit more time to complete would at least cut down on the number of failures.

But I take your point. I've added some logging to the replicate queue at v1 to gain more insight into the goings on there.

Btw, given what we know now, this change wouldn't have helped.

Comments from Reviewable

@BramGruneir
Member Author

Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


acceptance/replication_test.go, line 36 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Btw, given what we know now, this change wouldn't have helped.

Yeah, I'm going to revert this part, but leave in the extended logging. It's going to be useful in the future.

Comments from Reviewable

petermattis added a commit to petermattis/cockroach that referenced this pull request Jun 21, 2016
Temporarily adjust the value of defaultDeclinedReservationsTimeout to 0s
to avoid problems with initial range replication in 3 node clusters.

See cockroachdb#7332.
@BramGruneir BramGruneir changed the title from "acceptance: extended how long we wait for the first range to be fully replicated" to "storage: when stores are unavailable, don't send the replica to purgatory" Jun 21, 2016
@BramGruneir
Member Author

I've updated the PR with a new commit that should fix the problem. I'm testing it both locally and in circle to see if it works.


Review status: 0 of 4 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


Comments from Reviewable

@petermattis
Collaborator

:lgtm:


Review status: 0 of 4 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


storage/allocator.go, line 213 [r4] (raw file):

          if log.V(1) {
              log.Infof("all %d matching stores are currently unavailable", unavailableStoreCount)
          }

s/all //g

This is vlogged by queue.go as well. I'm not sure it is necessary to repeat that logging here.


storage/store_pool.go, line 99 [r4] (raw file):

  storeMatchDead        storeMatch = iota // The store is not yet available or has been timed out.
  storeMatchAlive                         // The store is alive, but its attributes didn't match the required ones.
  storeMatchUnavailable                   // The store is alive and its attributes matched, but it no available.

s/but it no available/but unavailable/g


Comments from Reviewable

@BramGruneir
Member Author

Review status: 0 of 4 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


storage/allocator.go, line 213 [r4] (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/all //g

This is vlogged by queue.go as well. I'm not sure it is necessary to repeat that logging here.

done. And good point.

storage/store_pool.go, line 99 [r4] (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/but it no available/but unavailable/g

Done.

Comments from Reviewable

@tamird
Contributor

tamird commented Jun 21, 2016

I don't see how this prevents stores from getting sent to purgatory, and I don't see a test that looks at that case. Can you add one, please?


Reviewed 1 of 1 files at r2, 1 of 3 files at r4, 2 of 2 files at r5.
Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


storage/allocator.go, line 211 [r5] (raw file):

      // the replica to purgatory or even consider relaxing the constraints.
      if unavailableStoreCount > 0 {
          return nil, fmt.Errorf("%d matching stores are currently unavailable", unavailableStoreCount)

why fmt.Errorf rather than util.Errorf?


storage/replicate_queue.go, line 138 [r5] (raw file):

      }
      if log.V(1) {
          log.Infof("adding replica for under-replicated range:%d to node:%d, store:%d",

throughout: why don't you just log the replica descriptor?


storage/store_pool.go, line 99 [r4] (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

Done.

more nit (sorry): "but it is unavailable.", otherwise it sounds like the attributes are unavailable.

storage/store_pool.go, line 100 [r5] (raw file):

  storeMatchAlive                         // The store is alive, but its attributes didn't match the required ones.
  storeMatchUnavailable                   // The store is alive and its attributes matched, but unavailable.
  storeMatchAvailable                     // The store is alive, available and its attributes matched.

oxford comma!


storage/store_pool.go, line 390 [r5] (raw file):

// GetStoreList returns a storeList that contains all active stores that
// contain the required attributes and their associated stats. It also returns
// total number of alive stores and the number of unavailable stores.

missing article


storage/store_pool.go, line 417 [r5] (raw file):

      case storeMatchAlive:
          aliveStoreCount++
      case storeMatchUnavailable:

"unavailable" is too overloaded for this use - even having read the comments on the enum i had to scratch my head to figure out why this case increments both alive and unavailable counts. perhaps "throttling" or some other term that implies that the node is there and responsive but has politely asked not to be a rebalance target.


storage/store_pool_test.go, line 222 [r5] (raw file):

  sl, aliveStoreCount, unavailableStoreCount := sp.getStoreList(roachpb.Attributes{Attrs: requiredAttrs}, false)
  if aliveStoreCount != expectedAliveStoreCount {
      return fmt.Errorf("expected AliveStoreCount %d does not match actual %d",

throughout: maybe these want to be util.Errorfs?


Comments from Reviewable

@BramGruneir
Member Author

Replicas are sent to purgatory when the allocator returns an error that implements the purgatoryError interface.
Interestingly, there was never a test checking whether something was destined for purgatory, so I added one.


Review status: 1 of 5 files reviewed at latest revision, 9 unresolved discussions, some commit checks pending.


storage/allocator.go, line 211 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

why fmt.Errorf rather than util.Errorf?

Done.

storage/replicate_queue.go, line 138 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

throughout: why don't you just log the replica descriptor?

There's no need for the extra info. We already know that the replica is on this store and node. So really we just need the replica ID and where it's going to (or being removed from).

storage/store_pool.go, line 99 [r4] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

more nit (sorry): "but it is unavailable.", otherwise it sounds like the attributes are unavailable.

Done.

storage/store_pool.go, line 100 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

oxford comma!

I am not a fan of the oxford comma. :) If there are ambiguities, one should use semi-colons.

storage/store_pool.go, line 390 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

missing article

Done.

storage/store_pool.go, line 417 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"unavailable" is too overloaded for this use - even having read the comments on the enum i had to scratch my head to figure out why this case increments both alive and unavailable counts. perhaps "throttling" or some other term that implies that the node is there and responsive but has politely asked not to be a rebalance target.

Replaced unavailable with throttled throughout. That's a much better term.

storage/store_pool_test.go, line 222 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

throughout: maybe these want to be util.Errorfs?

Indeed. I think this would be a great case for a new vet.

Comments from Reviewable

@tamird
Contributor

tamird commented Jun 22, 2016

Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 3 of 3 files at r6, 4 of 4 files at r7.
Review status: all files reviewed at latest revision, 11 unresolved discussions, all commit checks successful.


storage/allocator_test.go, line 1021 [r7] (raw file):

  // First test to make sure we would send the replica to purgatory.
  _, errPurgatory := a.AllocateTarget(

throughout: plz just use err. you never have more than one in scope.


storage/allocator_test.go, line 1027 [r7] (raw file):

      nil,
  )
  if errPurgatory == nil {

you don't need this - the check on the next line will catch it.


storage/allocator_test.go, line 1031 [r7] (raw file):

  }
  if _, ok := errPurgatory.(purgatoryError); !ok {
      t.Fatalf("expected a purgatory error, got: %s", errPurgatory)

use %v so that nil prints nicely


storage/allocator_test.go, line 1035 [r7] (raw file):

  // Second, test the normal case in which we can allocate to the store.
  gossiputil.NewStoreGossiper(g).GossipStores(singleStore, t)

so this makes allocation succeed? fun.


storage/allocator_test.go, line 1045 [r7] (raw file):

      t.Fatalf("unable to perform allocation: %s", errNone)
  }
  if result.Node.NodeID != 1 || result.StoreID != 1 {

isn't this tested elsewhere? doesn't seem useful in this test.


storage/allocator_test.go, line 1064 [r7] (raw file):

      nil,
  )
  if errNonPurgatory == nil {

not needed


storage/allocator_test.go, line 1068 [r7] (raw file):

  }
  if _, ok := errNonPurgatory.(purgatoryError); ok {
      t.Fatalf("expected a non purgatory error, got: %s", errNonPurgatory)

%v


storage/replicate_queue.go, line 138 [r5] (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

There's no need for the extra info. We already know that the replica is on this store and node. So really we just need the replica ID and where it's going to (or being removed from).

Huh? You're logging the node and store, and there is no replica ID. I'm suggesting that you log the replica ID so that if more information is added to it in the future, it is naturally logged here (and doesn't require a manual update).

Comments from Reviewable

storage: when stores are unavailable, don't send the replica to purgatory

When a store becomes unavailable due to a declined reservation, we should not
send any replicas to purgatory. Purgatory should only be used when there are
no matching, alive stores.

Fixes cockroachdb#7329
@BramGruneir
Member Author

Review status: all files reviewed at latest revision, 11 unresolved discussions, all commit checks successful.


storage/allocator_test.go, line 1021 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

throughout: plz just use err. you never have more than one in scope.

Done.

storage/allocator_test.go, line 1027 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

you don't need this - the check on the next line will catch it.

Done.

storage/allocator_test.go, line 1031 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

use %v so that nil prints nicely

Done.

storage/allocator_test.go, line 1035 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

so this makes allocation succeed? fun.

We need a store in the store pool that's alive and non-throttled.

storage/allocator_test.go, line 1045 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

isn't this tested elsewhere? doesn't seem useful in this test.

This is tested elsewhere, but it doesn't hurt to make sure that basic allocation is working. I'd at least like to keep the result != nil check, so I figured why not also check the node and store IDs.

storage/allocator_test.go, line 1064 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

not needed

Done.

storage/allocator_test.go, line 1068 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

%v

Done.

storage/replicate_queue.go, line 138 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Huh? You're logging the node and store, and there is no replica ID. I'm suggesting that you log the replica ID so that if more information is added to it in the future, it is naturally logged here (and doesn't require a manual update).

Ah, I see what you mean. I meant RangeID in my previous comment.

For add replica, the replica ID is always 0.

Anyway, added.


Comments from Reviewable

@tamird
Contributor

tamird commented Jun 22, 2016

:lgtm:


Reviewed 4 of 4 files at r8, 5 of 5 files at r9.
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


storage/allocator_test.go, line 1035 [r7] (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

We need a store in the store pool that's alive and non-throttled.

Sure, that's fine, just saying that it's opaque to the lay reader of this test (me).

Comments from Reviewable

@BramGruneir BramGruneir merged commit aaedd80 into cockroachdb:master Jun 22, 2016
@BramGruneir BramGruneir deleted the 7329 branch June 22, 2016 21:07