storage: when stores are unavailable, don't send the replica to purgatory #7332

Merged
BramGruneir merged 2 commits into cockroachdb:master from BramGruneir:7329
Jun 22, 2016
Conversation

@BramGruneir
Member

@BramGruneir BramGruneir commented Jun 20, 2016

When a store becomes unavailable due to a declined reservation, we should not
send any replicas to purgatory. Purgatory should only be used when there are
no matching, alive stores.

Fixes #7329


This change is Reviewable

@BramGruneir
Member Author

I'm going to spam circle tonight to see if I can get more failures.

@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


acceptance/replication_test.go, line 36 [r1] (raw file):

// TODO(bram): Hoping that a longer timeout will fix #7329. Remove this when we
// find the root cause.
const waitForReplicationDuration = 30 * time.Second

Not terribly thrilled about this. @WillHaack was investigating initial replication delays recently.


Comments from Reviewable

@tbg
Member

tbg commented Jun 20, 2016

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


acceptance/replication_test.go, line 36 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Not terribly thrilled about this. @WillHaack was investigating initial replication delays recently.

Also, if the logs weren't helpful - why not add more logging? Are we running the containers at V(1) already? If not, that's a worthwhile change, too. There are some `ChangeReplicas` pushes. If we log enough to, for example, find out that one of the `ChangeReplicas` simply doesn't finish, that would be valuable piece of information.

Comments from Reviewable

@BramGruneir
Member Author

Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


acceptance/replication_test.go, line 36 [r1] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Also, if the logs weren't helpful - why not add more logging? Are we running the containers at V(1) already? If not, that's a worthwhile change, too. There are some ChangeReplicas pushes. If we log enough to, for example, find out that one of the ChangeReplicas simply doesn't finish, that would be valuable piece of information.

I'm not happy about it either.

The logs of the containers are now at v1; I made that change last week. There were no change-replica delays or failures from what I could tell from the logs, just that range 1 was only replicated once. All other ranges were fully replicated.

This doesn't happen all the time, but it seems to happen quite often. And I can't reproduce it locally, so I wanted to see if giving it a bit more time to complete would at least cut down on the number of failures.

But I take your point. I've added some logging to the replicate queue at v1 to gain more insight into the goings on there.


Comments from Reviewable

@BramGruneir
Member Author

I've been spamming circle. There are a number of examples of this failure still happening: https://circleci.com/gh/BramGruneir/cockroach/tree/7329

I'll be investigating them tomorrow.


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


Comments from Reviewable

@tbg
Member

tbg commented Jun 21, 2016

From run #631:

E160621 05:56:11.919336 storage/queue.go:449 [replicate] range=3
[/Table/12-/Table/13): error: storage/replica_command.go:2611: change
replicas of 3 failed: reservation declined:header:<node_id:3
store_id:3 > from_node_id:1 from_store_id:1 range_id:3
range_size:67108864

Not sure if it's relevant. Just happened to be one that I looked at.


@petermattis
Collaborator

@BramGruneir There might be other problems with initial replication, but see my latest comment in #7022.

@BramGruneir
Member Author

@tschottdorf That's range 3, not range 1. So it doesn't cause the slow down in the test.

I'm going to dig through a bunch of the logs today. Will report back with more.

@petermattis I'm going to see if that will help. I'll rebase off your change and see if we get any repeats of the failures.

@petermattis
Collaborator

Looking into acceptance test failures I'm seeing for #7355, I turned on some additional logging and saw:

I160621 16:10:40.588263 storage/queue.go:323  [replicate] range=1 [/Min-/Table/11): adding: priority=1000.000
I160621 16:10:40.589059 storage/queue.go:438  [replicate] range=1 [/Min-/Table/11): processing
I160621 16:10:40.589401 storage/queue.go:467  [replicate] range=1 [/Min-/Table/11) (purgatory): error: 0 of 3 stores with an attribute matching []; likely not enough nodes in cluster

Note that this occurred after we had replicated ranges 2-5. Slightly earlier in the log I see a reservation for the replication of range 1 was declined. And then range 1 gets placed in purgatory:

I160621 16:10:40.589401 storage/queue.go:467  [replicate] range=1 [/Min-/Table/11) (purgatory): error: 0 of 3 stores with an attribute matching []; likely not enough nodes in cluster

When a replica is placed in purgatory it isn't removed until something kicks the purgatoryChan(), but there is no outside event to do so in this instance.

@tamird
Contributor

tamird commented Jun 21, 2016

Sounds like we're causing reservation-declined replicas to go into purgatory, which should probably never happen (reservations have their own throttling).

@petermattis
Collaborator

Yes, when a reservation is declined we set storeDetail.unavailableUntil to a time in the future. We then refuse to match that store, so that StorePool.getStoreList returns only a subset of the stores in the cluster. This causes the replica that we're trying to rebalance to be placed in purgatory, from which it can't leave until we receive another gossiped store descriptor. By default, we only gossip store descriptors once per minute.

@petermattis
Collaborator

Hacky fix is to set defaultDeclinedReservationsTimeout to 0.

@BramGruneir
Member Author

I'm working on a real fix.


Comments from Reviewable

@petermattis
Collaborator

I'm curious to see what that real fix is.


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


acceptance/replication_test.go, line 36 [r1] (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

I'm not happy about it either.

The logs of the containers are now at v1; I made that change last week. There were no change-replica delays or failures from what I could tell from the logs, just that range 1 was only replicated once. All other ranges were fully replicated.

This doesn't happen all the time, but it seems to happen quite often. And I can't reproduce it locally, so I wanted to see if giving it a bit more time to complete would at least cut down on the number of failures.

But I take your point. I've added some logging to the replicate queue at v1 to gain more insight into the goings on there.

Btw, given what we know now, this change wouldn't have helped.

Comments from Reviewable

@BramGruneir
Member Author

Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


acceptance/replication_test.go, line 36 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Btw, given what we know now, this change wouldn't have helped.

Yeah, I'm going to revert this part, but leave in the extended logging. It's going to be useful in the future.

Comments from Reviewable

petermattis added a commit to petermattis/cockroach that referenced this pull request Jun 21, 2016
Temporarily adjust the value of defaultDeclinedReservationsTimeout to 0s
to avoid problems with initial range replication in 3 node clusters.

See cockroachdb#7332.
@BramGruneir BramGruneir changed the title from "acceptance: extended how long we wait for the first range to be fully replicated" to "storage: when stores are unavailable, don't send the replica to purgatory" Jun 21, 2016
@BramGruneir
Member Author

I've updated the PR with a new commit that should fix the problem. I'm testing it both locally and in circle to see if it works.


Review status: 0 of 4 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


Comments from Reviewable

@petermattis
Collaborator

:lgtm:


Review status: 0 of 4 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


storage/allocator.go, line 213 [r4] (raw file):

          if log.V(1) {
              log.Infof("all %d matching stores are currently unavailable", unavailableStoreCount)
          }

s/all //g

This is vlogged by queue.go as well. I'm not sure it is necessary to repeat that logging here.


storage/store_pool.go, line 99 [r4] (raw file):

  storeMatchDead        storeMatch = iota // The store is not yet available or has been timed out.
  storeMatchAlive                         // The store is alive, but its attributes didn't match the required ones.
  storeMatchUnavailable                   // The store is alive and its attributes matched, but it no available.

s/but it no available/but unavailable/g


Comments from Reviewable

@BramGruneir
Member Author

Review status: 0 of 4 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


storage/allocator.go, line 213 [r4] (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/all //g

This is vlogged by queue.go as well. I'm not sure it is necessary to repeat that logging here.

done. And good point.

storage/store_pool.go, line 99 [r4] (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/but it no available/but unavailable/g

Done.

Comments from Reviewable

@tamird
Contributor

tamird commented Jun 21, 2016

I don't see how this prevents stores from getting sent to purgatory, and I don't see a test that looks at that case. Can you add one, please?


Reviewed 1 of 1 files at r2, 1 of 3 files at r4, 2 of 2 files at r5.
Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


storage/allocator.go, line 211 [r5] (raw file):

      // the replica to purgatory or even consider relaxing the constraints.
      if unavailableStoreCount > 0 {
          return nil, fmt.Errorf("%d matching stores are currently unavailable", unavailableStoreCount)

why fmt.Errorf rather than util.Errorf?


storage/replicate_queue.go, line 138 [r5] (raw file):

      }
      if log.V(1) {
          log.Infof("adding replica for under-replicated range:%d to node:%d, store:%d",

throughout: why don't you just log the replica descriptor?


storage/store_pool.go, line 99 [r4] (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

Done.

more nit (sorry): "but it is unavailable.", otherwise it sounds like the attributes are unavailable.

storage/store_pool.go, line 100 [r5] (raw file):

  storeMatchAlive                         // The store is alive, but its attributes didn't match the required ones.
  storeMatchUnavailable                   // The store is alive and its attributes matched, but unavailable.
  storeMatchAvailable                     // The store is alive, available and its attributes matched.

oxford comma!


storage/store_pool.go, line 390 [r5] (raw file):

// GetStoreList returns a storeList that contains all active stores that
// contain the required attributes and their associated stats. It also returns
// total number of alive stores and the number of unavailable stores.

missing article


storage/store_pool.go, line 417 [r5] (raw file):

      case storeMatchAlive:
          aliveStoreCount++
      case storeMatchUnavailable:

"unavailable" is too overloaded for this use - even having read the comments on the enum i had to scratch my head to figure out why this case increments both alive and unavailable counts. perhaps "throttling" or some other term that implies that the node is there and responsive but has politely asked not to be a rebalance target.


storage/store_pool_test.go, line 222 [r5] (raw file):

  sl, aliveStoreCount, unavailableStoreCount := sp.getStoreList(roachpb.Attributes{Attrs: requiredAttrs}, false)
  if aliveStoreCount != expectedAliveStoreCount {
      return fmt.Errorf("expected AliveStoreCount %d does not match actual %d",

throughout: maybe these want to be util.Errorfs?


Comments from Reviewable

@BramGruneir
Member Author

Replicas are sent to purgatory when the allocator returns an error that implements the purgatoryError interface.
Interestingly, there was never a test checking whether something was destined for purgatory, so I added one.


Review status: 1 of 5 files reviewed at latest revision, 9 unresolved discussions, some commit checks pending.


storage/allocator.go, line 211 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

why fmt.Errorf rather than util.Errorf?

Done.

storage/replicate_queue.go, line 138 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

throughout: why don't you just log the replica descriptor?

There's no need for the extra info. We already know that the replica is on this store and node. So really we just need the replica ID and where it's going to (or being removed from).

storage/store_pool.go, line 99 [r4] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

more nit (sorry): "but it is unavailable.", otherwise it sounds like the attributes are unavailable.

Done.

storage/store_pool.go, line 100 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

oxford comma!

I am not a fan of the oxford comma. :) If there are ambiguities, one should use semi-colons.

storage/store_pool.go, line 390 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

missing article

Done.

storage/store_pool.go, line 417 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"unavailable" is too overloaded for this use - even having read the comments on the enum i had to scratch my head to figure out why this case increments both alive and unavailable counts. perhaps "throttling" or some other term that implies that the node is there and responsive but has politely asked not to be a rebalance target.

Replaced unavailable with throttled throughout. That's a much better term.

storage/store_pool_test.go, line 222 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

throughout: maybe these want to be util.Errorfs?

Indeed. I think this would be a great case for a new vet.

Comments from Reviewable

@tamird
Contributor

tamird commented Jun 22, 2016

Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 3 of 3 files at r6, 4 of 4 files at r7.
Review status: all files reviewed at latest revision, 11 unresolved discussions, all commit checks successful.


storage/allocator_test.go, line 1021 [r7] (raw file):

  // First test to make sure we would send the replica to purgatory.
  _, errPurgatory := a.AllocateTarget(

throughout: plz just use err. you never have more than one in scope.


storage/allocator_test.go, line 1027 [r7] (raw file):

      nil,
  )
  if errPurgatory == nil {

you don't need this - the check on the next line will catch it.


storage/allocator_test.go, line 1031 [r7] (raw file):

  }
  if _, ok := errPurgatory.(purgatoryError); !ok {
      t.Fatalf("expected a purgatory error, got: %s", errPurgatory)

use %v so that nil prints nicely


storage/allocator_test.go, line 1035 [r7] (raw file):

  // Second, test the normal case in which we can allocate to the store.
  gossiputil.NewStoreGossiper(g).GossipStores(singleStore, t)

so this makes allocation succeed? fun.


storage/allocator_test.go, line 1045 [r7] (raw file):

      t.Fatalf("unable to perform allocation: %s", errNone)
  }
  if result.Node.NodeID != 1 || result.StoreID != 1 {

isn't this tested elsewhere? doesn't seem useful in this test.


storage/allocator_test.go, line 1064 [r7] (raw file):

      nil,
  )
  if errNonPurgatory == nil {

not needed


storage/allocator_test.go, line 1068 [r7] (raw file):

  }
  if _, ok := errNonPurgatory.(purgatoryError); ok {
      t.Fatalf("expected a non purgatory error, got: %s", errNonPurgatory)

%v


storage/replicate_queue.go, line 138 [r5] (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

There's no need for the extra info. We already know that the replica is on this store and node. So really we just need the replica ID and where it's going to (or being removed from).

Huh? You're logging the node and store, and there is no replica ID. I'm suggesting that you log the replica ID so that if more information is added to it in the future, it is naturally logged here (and doesn't require a manual update).

Comments from Reviewable

storage: when stores are unavailable, don't send the replica to purgatory

When a store becomes unavailable due to a declined reservation, we should not
send any replicas to purgatory. Purgatory should only be used when there are
no matching, alive stores.

Fixes cockroachdb#7329
@BramGruneir
Member Author

Review status: all files reviewed at latest revision, 11 unresolved discussions, all commit checks successful.


storage/allocator_test.go, line 1021 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

throughout: plz just use err. you never have more than one in scope.

Done.

storage/allocator_test.go, line 1027 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

you don't need this - the check on the next line will catch it.

Done.

storage/allocator_test.go, line 1031 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

use %v so that nil prints nicely

Done.

storage/allocator_test.go, line 1035 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

so this makes allocation succeed? fun.

We need a store in the store pool that's alive and non-throttled.

storage/allocator_test.go, line 1045 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

isn't this tested elsewhere? doesn't seem useful in this test.

This is tested elsewhere, but it doesn't hurt to make sure that basic allocation is working. I'd at least like to keep the result != nil check, so I figured why not also check the node and store IDs.

storage/allocator_test.go, line 1064 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

not needed

Done.

storage/allocator_test.go, line 1068 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

%v

Done.

storage/replicate_queue.go, line 138 [r5] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Huh? You're logging the node and store, and there is no replica ID. I'm suggesting that you log the replica ID so that if more information is added to it in the future, it is naturally logged here (and doesn't require a manual update).

Ah, I see what you mean. I meant RangeID in my previous comment.

For add replica, the replica ID is always 0.

Anyway, added.


Comments from Reviewable

@tamird
Contributor

tamird commented Jun 22, 2016

:lgtm:


Reviewed 4 of 4 files at r8, 5 of 5 files at r9.
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


storage/allocator_test.go, line 1035 [r7] (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

We need a store in the store pool that's alive and non-throttled.

Sure, that's fine, just saying that it's opaque to the lay reader of this test (me).

Comments from Reviewable

@BramGruneir BramGruneir merged commit aaedd80 into cockroachdb:master Jun 22, 2016
@BramGruneir BramGruneir deleted the 7329 branch June 22, 2016 21:07