
storage: check leader not leaseholder condition on all commands #12323

Merged
spencerkimball merged 1 commit into master from
spencerkimball/leaseholder-not-leader-check
Dec 20, 2016

Conversation

@spencerkimball
Member

spencerkimball commented Dec 13, 2016

Previously, we only checked for leader-not-leaseholder when completing a
lease operation. This relied on the constant renewals of expiration-based
leases to transfer Raft leadership reliably.
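The condition this change generalizes can be modeled with a minimal Go sketch (types and names are illustrative, not the actual `Replica` API): a replica that is raft leader while another replica holds a valid lease should hand raft leadership over.

```go
package main

import "fmt"

// replicaState is a hypothetical, stripped-down view of the state the real
// check consults; it is not the actual CockroachDB Replica struct.
type replicaState struct {
	isRaftLeader bool // this replica leads the raft group
	holdsLease   bool // this replica holds the range lease
	leaseValid   bool // the current lease has not expired
}

// shouldTransferLeadership reports whether raft leadership should move to
// the leaseholder: only when we lead and someone else holds a valid lease.
func shouldTransferLeadership(s replicaState) bool {
	return s.isRaftLeader && s.leaseValid && !s.holdsLease
}

func main() {
	fmt.Println(shouldTransferLeadership(replicaState{
		isRaftLeader: true, leaseValid: true, holdsLease: false,
	}))
}
```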


This change is Reviewable

@tamird
Contributor

tamird commented Dec 13, 2016

:lgtm:


Reviewed 4 of 4 files at r1.
Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


pkg/storage/client_raft_test.go, line 2917 at r1 (raw file):

	status := repl0.RaftStatus()
	if status != nil && status.Lead != 1 {

while you're here, nil should probably not be allowed


pkg/storage/client_raft_test.go, line 2921 at r1 (raw file):

	}

	// Force a read on Store 2 to request a new lease. Other moving parts in

this is no longer testing the mechanism by which raft leadership transfers happen, right? seems odd to see this code here now that transfers are triggered by all commands.

on second thought, maybe it is just that this comment is outdated.


pkg/storage/client_raft_test.go, line 2943 at r1 (raw file):

	// Wait for raft leadership transferring to be finished.
	testutils.SucceedsSoon(t, func() error {
		status = repl0.RaftStatus()

nit: :=


pkg/storage/client_raft_test.go, line 2963 at r1 (raw file):

			return err
		}
		status = repl1.RaftStatus()

nit: :=


pkg/storage/client_raft_test.go, line 2964 at r1 (raw file):

		}
		status = repl1.RaftStatus()
		if status.Lead != 1 {

nit: this 1 should probably be rd0.ReplicaID?


pkg/storage/client_raft_test.go, line 2982 at r1 (raw file):

			return pErr.GetDetail()
		}
		status = repl0.RaftStatus()

nit: :=


pkg/storage/helpers_test.go, line 264 at r1 (raw file):

// command will silently fail if the target is not an up-to-date
// replica.
func (r *Replica) RaftTransferLeader(target roachpb.ReplicaID) error {

this function is very similar to (*Replica).maybeTransferRaftLeadership - could you reduce the probability of rot by combining (or extracting a common method used by both)?


pkg/storage/replica.go, line 3473 at r1 (raw file):

		// quick command application (requests generally need to make it to both the
		// lease holder and the raft leader before being applied by other replicas).
		if raftCmd.BatchRequest != nil {

why the nil check?


pkg/storage/replica_proposal.go, line 430 at r1 (raw file):

			if l := r.mu.state.Lease; !l.OwnedBy(r.store.StoreID()) &&
				r.isLeaseValidLocked(l, r.store.Clock().Now()) {
				log.Infof(ctx, "transferring raft leadership to replica ID %v", l.Replica.ReplicaID)

while you're here, log the whole target replica? log.Infof(ctx, "transferring raft leadership to %+v", l.Replica)


Comments from Reviewable

@andreimatei
Contributor

In the past when this came up, there was concern about attempting too many of these raft transfers in case they're constantly failing. I don't understand how these transfers work and under which conditions they might fail, but maybe some throttling is in order?

Btw, @ben, @petermattis, is it time to give https://reviewable.io/reviews/cockroachdb/cockroach/8837 another look? It was trying to address #8816 by transferring raft leadership before transferring leases.


Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


Comments from Reviewable
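Throttling was ultimately not added in this PR; as a rough, hypothetical sketch of what per-range throttling of transfer attempts might look like (a simple minimum-interval policy, with times as plain unix seconds to keep the sketch self-contained):

```go
package main

import "fmt"

// transferThrottle is a hypothetical per-range limiter on raft leadership
// transfer attempts; the PR did not implement one, this only illustrates
// the idea being discussed.
type transferThrottle struct {
	lastAttempt int64 // unix seconds of the last allowed attempt (0 = never)
	minInterval int64 // minimum seconds between attempts
}

// allow reports whether a transfer may be attempted at time now, and if
// allowed, records the attempt.
func (t *transferThrottle) allow(now int64) bool {
	if t.lastAttempt != 0 && now-t.lastAttempt < t.minInterval {
		return false
	}
	t.lastAttempt = now
	return true
}

func main() {
	th := &transferThrottle{minInterval: 10}
	fmt.Println(th.allow(100), th.allow(105), th.allow(111))
}
```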

@petermattis
Collaborator

Review status: all files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


pkg/storage/client_raft_test.go, line 2917 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

while you're here nil should probably not be allowed

Right: status == nil || status.Lead != 1


pkg/storage/replica_proposal.go, line 429 at r1 (raw file):

		if raftGroup.Status().RaftState == raft.StateLeader {
			if l := r.mu.state.Lease; !l.OwnedBy(r.store.StoreID()) &&
				r.isLeaseValidLocked(l, r.store.Clock().Now()) {

Can we do the lease check first? raftGroup.Status() feels a bit more expensive and the order of the checks shouldn't matter.


Comments from Reviewable

@bdarnell
Contributor

:lgtm:

The concern about leadership transfers failing was partially FUD, and I'm less concerned about it now since this appears to have been working reliably so far. Transfers can fail if network messages get reordered, but that shouldn't happen with our raft transport. We should have a metric for how often these transfers are attempted, but I don't think we need to build throttling unless we see a problem. (if anything, throttling could mask a problem by allowing a range to limp along with limited availability; without throttling the problem will be more visible).

The combination of this change and #8837 would be bad: the node that receives the leadership transfer under #8837 would be likely to transfer it right back to the former leader under this change (and I think that's more of an argument against #8837 than against this change). I'll take a look at #8837 and #8816.

The commit message should be updated to say "on all write commands"; read-only traffic won't trigger this check. Should it? It doesn't matter in practice since a stable range with read-only traffic doesn't really care whether the lease holder and raft leader are the same, but it does mean that the leader-not-leaseholder metric wouldn't reliably fall to zero. (for that matter, the range could be completely idle. Should we have a scanner queue to perform this transfer as a time-based last resort?)


Reviewed 4 of 4 files at r1.
Review status: all files reviewed at latest revision, 11 unresolved discussions, all commit checks successful.


pkg/storage/replica_proposal.go, line 431 at r1 (raw file):

				r.isLeaseValidLocked(l, r.store.Clock().Now()) {
				log.Infof(ctx, "transferring raft leadership to replica ID %v", l.Replica.ReplicaID)
				raftGroup.TransferLeader(uint64(l.Replica.ReplicaID))

We should increment a metric here.


Comments from Reviewable

@spencerkimball
Member Author

I've changed this to instead tee these up during metrics computation when we realize the condition exists. Otherwise, I keep seeing small numbers of long-running replicas with the leader-not-leaseholder condition.


Review status: all files reviewed at latest revision, 11 unresolved discussions, all commit checks successful.


pkg/storage/client_raft_test.go, line 2917 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Right: status == nil || status.Lead != 1

Done.


pkg/storage/client_raft_test.go, line 2921 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this is no longer testing the mechanism by which raft leadership transfers happen, right? seems odd to see this code here now that transfers are triggered by all commands.

on second thought, maybe it is just that this comment is outdated.

I've mostly restored this unittest as I decided to make the transfers happen during metrics computation instead.


pkg/storage/client_raft_test.go, line 2943 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

nit: :=

Done.


pkg/storage/client_raft_test.go, line 2963 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

nit: :=

Done.


pkg/storage/client_raft_test.go, line 2964 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

nit: this 1 should probably be rd0.ReplicaID?

Done.


pkg/storage/client_raft_test.go, line 2982 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

nit: :=

Done.


pkg/storage/helpers_test.go, line 264 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this function is very similar to (*Replica).maybeTransferRaftLeadership - could you reduce the probability of rot by combining (or extracting a common method used by both)?

It's simplified now.


pkg/storage/replica.go, line 3473 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

why the nil check?

This is gone now. I was avoiding making this check after Raft leadership transfers, which always send a nil request, but that made testing impossible.


pkg/storage/replica_proposal.go, line 429 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Can we do the lease check first? raftGroup.Status() feels a bit more expensive and the order of the checks shouldn't matter.

Done.


pkg/storage/replica_proposal.go, line 430 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

while you're here, log the whole target replica? log.Infof(ctx, "transferring raft leadership to %+v", l.Replica)

I've reverted to old code here.


pkg/storage/replica_proposal.go, line 431 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We should increment a metric here.

We already have a raft.rcvd.transferleader which is incremented elsewhere. Do you think we need both? In any case, I've reverted the changes to this file.


Comments from Reviewable

@spencerkimball force-pushed the spencerkimball/leaseholder-not-leader-check branch from 55df017 to 854703f on December 14, 2016 15:20
@petermattis
Collaborator

Review status: 0 of 4 files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 4472 at r2 (raw file):

	// Try to correct leader-not-leaseholder condition, if encountered.
	if m.leaseValid && !m.leaseholder && m.leader {
		r.maybeTransferRaftLeadership(ctx, status.lease.Replica.ReplicaID)

It's very strange for a method named metrics to have the side effect of transferring the Raft leadership. Can we pull this up to the caller? Even doing this in the caller is a bit strange, though not quite so egregious.


Comments from Reviewable

@tamird
Contributor

tamird commented Dec 14, 2016

Reviewed 4 of 4 files at r2.
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


pkg/storage/client_raft_test.go, line 2966 at r2 (raw file):

		status := repl1.RaftStatus()
		if status.Lead != rd0.ReplicaID {
			return errors.Errorf("expected raft leader be 1; got %d", status.Lead)

this "1" is still hard coded


Comments from Reviewable

@spencerkimball
Member Author

Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


pkg/storage/client_raft_test.go, line 2966 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this "1" is still hard coded

Done.


pkg/storage/replica.go, line 4472 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

It's very strange for a method named metrics to have the side effect of transferring the Raft leadership. Can we pull this up to the caller? Even doing this in the caller is a bit strange, though not quite so egregious.

Done.


Comments from Reviewable

@spencerkimball force-pushed the spencerkimball/leaseholder-not-leader-check branch from 854703f to 9e4e348 on December 14, 2016 18:06
@spencerkimball
Member Author

This PR causes the cluster to lock up. Need to investigate why, but my suspicion is there's something happening in raft.RawNode.TransferLeader.

@petermattis
Collaborator

:lgtm:


Review status: 1 of 4 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


Comments from Reviewable

@tamird
Contributor

tamird commented Dec 14, 2016

:lgtm:


Reviewed 4 of 4 files at r3.
Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/storage/client_raft_test.go, line 2944 at r3 (raw file):

	testutils.SucceedsSoon(t, func() error {
		status := repl0.RaftStatus()
		if status.Lead != 2 {

bunch more hardcoded IDs here, but not a blocker.


Comments from Reviewable

@spencerkimball force-pushed the spencerkimball/leaseholder-not-leader-check branch 2 times, most recently from a8d9e31 to ca91356 on December 14, 2016 20:50
@spencerkimball
Member Author

Review status: 2 of 4 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


pkg/storage/client_raft_test.go, line 2944 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

bunch more hardcoded IDs here, but not a blocker.

OK


Comments from Reviewable

@tamird
Contributor

tamird commented Dec 14, 2016

Reviewed 2 of 2 files at r4.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


pkg/storage/client_raft_test.go, line 2930 at r4 (raw file):

	status := repl0.RaftStatus()
	if status == nil || status.Lead != 1 {

this "1" (and on the next line) is rd0.ReplicaID. should be easy now that you've moved the definition of rd0 to the top


pkg/storage/replica_proposal.go, line 439 at r4 (raw file):

		// Only the raft leader can attempt a leadership transfer.
		if status := raftGroup.Status(); status.RaftState == raft.StateLeader {
			// Only attempt this if the leaseholder has all the log entries.

did this solve the cluster lockup problem?


Comments from Reviewable

@petermattis
Collaborator

Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


pkg/storage/replica_proposal.go, line 440 at r4 (raw file):

		if status := raftGroup.Status(); status.RaftState == raft.StateLeader {
			// Only attempt this if the leaseholder has all the log entries.
			if pr, ok := status.Progress[uint64(target)]; ok && pr.Match == r.mu.lastIndex {

Unless I'm misreading, the check in the raft code is status.Progress[leader] == r.mu.lastIndex.


Comments from Reviewable

@tamird
Contributor

tamird commented Dec 14, 2016

Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


pkg/storage/replica_proposal.go, line 439 at r4 (raw file):

		// Only the raft leader can attempt a leadership transfer.
		if status := raftGroup.Status(); status.RaftState == raft.StateLeader {
			// Only attempt this if the leaseholder has all the log entries.

s/leaseholder/target/ - only the caller knows that the target is the leaseholder


Comments from Reviewable

@spencerkimball
Member Author

Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


pkg/storage/client_raft_test.go, line 2930 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this "1" (and on the next line) is rd0.ReplicaID. should be easy now that you've moved the definition of rd0 to the top

Done.


pkg/storage/replica_proposal.go, line 439 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

did this solve the cluster lockup problem?

No. Still looking. This PR will continue to change.


pkg/storage/replica_proposal.go, line 439 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

s/leaseholder/target/ - only the caller knows that the target is the leaseholder

Done.


pkg/storage/replica_proposal.go, line 440 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Unless I'm misreading, the check in the raft code is status.Progress[leader] == r.mu.lastIndex.

You're misreading. The msg.From field is set to the transferee, and that is the progress which is checked in raft.stepLeader.
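A self-contained sketch of the guard under discussion, with a plain map of per-replica Match indexes standing in for `raft.Status().Progress` (the actual check in the PR compares `pr.Match` against `r.mu.lastIndex` for the transfer target; names here are illustrative):

```go
package main

import "fmt"

// targetCaughtUp models the guard added in this PR: only attempt a raft
// leadership transfer when the target replica's Match index equals the
// leader's last log index, so the target can take over immediately instead
// of first being sent the tail of the log.
func targetCaughtUp(progress map[uint64]uint64, target, lastIndex uint64) bool {
	match, ok := progress[target]
	return ok && match == lastIndex
}

func main() {
	progress := map[uint64]uint64{1: 100, 2: 97}
	fmt.Println(targetCaughtUp(progress, 1, 100)) // replica 1 is caught up
	fmt.Println(targetCaughtUp(progress, 2, 100)) // replica 2 is behind
	fmt.Println(targetCaughtUp(progress, 3, 100)) // unknown replica
}
```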


Comments from Reviewable

@bdarnell
Contributor

My guess is that quiescence is preventing a leadership transfer from being canceled in https://github.com/coreos/etcd/blob/d8513adf1d2058047929b1ed7c038fbe023ebe3a/raft/raft.go#L548 (but heartbeats are still flowing and preventing anyone but our designated target from campaigning). This could occur whenever we transfer leadership to a node that is going away, but this was unlikely when we only transferred to a node that successfully proposed and acquired a lease. Here, it's possible that we're transferring leadership to a node just as its replica is being removed. I think we need to prevent quiescence when a leader transfer is in progress.
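The fix suggested here, declining to quiesce while a transfer is in flight, can be sketched as follows (field names are illustrative, not the real API; etcd/raft uses 0, i.e. raft.None, for "no transferee"):

```go
package main

import "fmt"

// canQuiesce models the proposed guard: a range must not quiesce while a
// raft leadership transfer is pending, otherwise the pending transfer is
// never cancelled and the range can wedge. A transferee ID of 0 means no
// transfer is in flight.
func canQuiesce(leadTransferee uint64, pendingProposals int) bool {
	return leadTransferee == 0 && pendingProposals == 0
}

func main() {
	fmt.Println(canQuiesce(0, 0)) // idle, no transfer pending: may quiesce
	fmt.Println(canQuiesce(3, 0)) // transfer to replica 3 pending: stay awake
}
```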


Reviewed 2 of 4 files at r3, 2 of 2 files at r4.
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


pkg/storage/replica_proposal.go, line 431 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

We already have a raft.rcvd.transferleader which is incremented elsewhere. Do you think we need both? In any case, I've reverted the changes to this file.

raft.rcvd.transferleader is always zero because MsgTransferLeader is used internally by raft and not sent on the wire. raft.rcvd.timeoutnow should work, though (but it's not quite the same, since some calls to TransferLeader may fail without sending MsgTimeoutNow).


Comments from Reviewable

@petermattis
Collaborator

@bdarnell I'm not following how your scenario could occur. The leaseholder is the one which performs replica rebalancing for a range and it will never remove itself from the range. Instead, it first transfers its lease somewhere else and the recipient will perform the removal. I suppose it is possible for this lease transfer and removal to happen fast enough that the raft leadership transfer overlaps, but that seems unlikely. That said, I'll play around with some scenarios here as clearly we're seeing something on blue which seems to disappear when quiescence is disabled.


Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


Comments from Reviewable

@bdarnell
Contributor

Maybe it's racing with an addition instead of a removal, then: We see that the local node is overloaded while another node is underloaded. We rebalance a replica onto it, then as soon as the ChangeReplicas commits and we've processed it locally (but the target node hasn't), we transfer the lease, and then the raft leadership, all before that node has applied the EndTransaction that adds it (so it drops the raft MsgTimeoutNow message). The timing on that is really tight, but it could be possible, and that would have the same quiescence problem. There might be some quirk of preemptive snapshot processing (and its aftermath) that makes the timing more likely.

@spencerkimball
Member Author

Damn, the problem disappeared after the run where I changed redirectOnOrAcquireLease to always load the Raft group. In other words, that change fixed the problem permanently. I'd also experimented with always incrementing the liveness epoch, which presumably should also have fixed it, but it didn't seem to; perhaps I just did something wrong. Unfortunately it's too late to go back and debug further. It seems as if some kind of shutdown scenario left one replica believing an old lease was active, and then that replica and the other two played ping-pong with lease expectations.

@spencerkimball
Member Author

Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


pkg/storage/replica_proposal.go, line 431 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

raft.rcvd.transferleader is always zero because MsgTransferLeader is used internally by raft and not sent on the wire. raft.rcvd.timeoutnow should work, though (but it's not quite the same, since some calls to TransferLeader may fail without sending MsgTimeoutNow).

OK, added a metric.


Comments from Reviewable

@spencerkimball force-pushed the spencerkimball/leaseholder-not-leader-check branch from ca91356 to e1c7af6 on December 17, 2016 20:40
@tamird
Contributor

tamird commented Dec 17, 2016

Reviewed 3 of 3 files at r5.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


pkg/storage/client_raft_test.go, line 2930 at r4 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done.

?


pkg/storage/replica_proposal.go, line 439 at r4 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done.

?


Comments from Reviewable

@spencerkimball
Member Author

Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


pkg/storage/client_raft_test.go, line 2930 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

?

Damn, my changes here got reverted accidentally.


pkg/storage/replica_proposal.go, line 439 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

?

Done.


Comments from Reviewable

@spencerkimball force-pushed the spencerkimball/leaseholder-not-leader-check branch 2 times, most recently from 58e9908 to 0f200dd on December 17, 2016 22:11
@tamird
Contributor

tamird commented Dec 18, 2016

Reviewed 2 of 2 files at r6.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


pkg/storage/client_raft_test.go, line 2930 at r4 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Damn, my changes here got reverted accidentally.

the error message below still says "1", but lg otherwise


Comments from Reviewable

@spencerkimball force-pushed the spencerkimball/leaseholder-not-leader-check branch from 0f200dd to 288b8a0 on December 19, 2016 22:46
@spencerkimball
Member Author

spencerkimball commented Dec 19, 2016

@petermattis take a look at the most recent change. I've decided to check for the condition at maybeQuiesce time because I realized during my testing this weekend that we should not quiesce ranges where the leaseholder and leader aren't the same replica.

@spencerkimball
Member Author

Review status: 1 of 5 files reviewed at latest revision, 2 unresolved discussions.


pkg/storage/client_raft_test.go, line 2930 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

the error message below still says "1", but lg otherwise

Done.


Comments from Reviewable

Previously, we only checked for leader-not-leaseholder when completing a
lease operation. This relied on the constant renewals of expiration-based
leases to transfer Raft leadership reliably. We now do this while checking
if a range can be quiesced. This is convenient because we want to also
avoid quiescing if we notice this problem.

Also, only attempt to transfer raft leader to the leaseholder if the
leaseholder (the transferee) has fully applied the raft log. Otherwise,
the leader will send the presumably unapplied log suffix to the transferee
again and again.
@spencerkimball force-pushed the spencerkimball/leaseholder-not-leader-check branch from 288b8a0 to 647008c on December 19, 2016 22:49
@bdarnell
Contributor

:lgtm:


Reviewed 1 of 3 files at r5, 5 of 5 files at r7.
Review status: all files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


Comments from Reviewable
