Skip to content

raft: transfer leader feature#4916

Merged
xiang90 merged 1 commit intoetcd-io:masterfrom
es-chow:transfer-leader
Apr 8, 2016
Merged

raft: transfer leader feature#4916
xiang90 merged 1 commit intoetcd-io:masterfrom
es-chow:transfer-leader

Conversation

@es-chow
Copy link
Copy Markdown
Contributor

@es-chow es-chow commented Mar 31, 2016

Try to implement transfer leader as per raft thesis 3.10 Leadership transfer extension.

@xiang90
Copy link
Copy Markdown
Contributor

xiang90 commented Mar 31, 2016

@es-chow Can you talk a little bit about your use case or potential use case? I know some one has already done this for etcd/raft. I can ask him to review the code too.

@es-chow
Copy link
Copy Markdown
Contributor Author

es-chow commented Mar 31, 2016

The background for this is that Cockroachdb has an idea to coincide raft leader with range leader, https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#relationship-to-raft-leadership.

This coincide may be achieved by leader transfer.

@xiang90
Copy link
Copy Markdown
Contributor

xiang90 commented Apr 1, 2016

@bdarnell Can you confirm cockroach need this feature?

raft/raft.go Outdated
r.Step(pb.Message{From: r.id, Type: pb.MsgBeat})
}

// If current leader cannot transfer leader in election timeout, become leader again.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be inside the r.electionElapsed >= r.electionTimeout block above at line 440?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, will fix it.

@bdarnell
Copy link
Copy Markdown
Contributor

bdarnell commented Apr 2, 2016

Yes, we could use some sort of orderly leadership handoff to allow us to choose the best leader.

I don't like doing this by introducing a new StateTransferLeader that is almost exactly like StateLeader, though. I think it would be better to stay in StateLeader throughout the process (until the transferee sends us a MsgVote and the old leader becomes a follower). In stepLeader's handling of MsgProp, we could drop proposals whenever transferee is non-zero.

@xiang90
Copy link
Copy Markdown
Contributor

xiang90 commented Apr 2, 2016

Yes, we could use some sort of orderly leadership handoff to allow us to choose the best leader.

OK. I will take a closer look at this pr.

raft/raft.go Outdated

logger Logger
// Leader transfer target.
transferee uint64
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to line 162?

@es-chow
Copy link
Copy Markdown
Contributor Author

es-chow commented Apr 4, 2016

@bdarnell, as transferee will be reset to zero after sending out a MsgTimeoutNow to prevent multiple MsgTimeoutNow to send out when pipelining MsgAppResp been received. So if we judge if a MsgProp can be honored by check the value of transferee, it may make a MsgProp being honored just after a MsgTimeoutNow been sent out. Or we have to introduce one more flag to distinguish this.

@xiang90
Copy link
Copy Markdown
Contributor

xiang90 commented Apr 4, 2016

I believe @dterei has implemented this feature for etcd/raft. So we would love to hear his opinion!

@xiang90
Copy link
Copy Markdown
Contributor

xiang90 commented Apr 4, 2016

@es-chow I agree with @bdarnell. If the current leader receives a transfer it can still be in leader state. But it starts to block all incoming proposals for some timeout and sends all pending entries to the transferee. Adding one or two flags is probably better than introducing a whole new state?

}
} else {
if msgt == raftpb.MsgBeat || msgt == raftpb.MsgHup || msgt == raftpb.MsgUnreachable || msgt == raftpb.MsgSnapStatus || msgt == raftpb.MsgCheckQuorum {
if msgt == raftpb.MsgBeat || msgt == raftpb.MsgHup || msgt == raftpb.MsgUnreachable || msgt == raftpb.MsgSnapStatus || msgt == raftpb.MsgCheckQuorum || msgt == raftpb.MsgTransferLeader {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably it is time for us to make this multiple lines.

@es-chow es-chow force-pushed the transfer-leader branch 4 times, most recently from e465718 to e17ed54 Compare April 5, 2016 07:12
@es-chow
Copy link
Copy Markdown
Contributor Author

es-chow commented Apr 5, 2016

@xiang90 , @bdarnell , @dterei, updated as per your comments. PTAL.

raft/raft.go Outdated
lead uint64

// Leader transfer target.
transferee uint64
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// leadTransfree is id of the leader transfer target when its value is not zero.
leadTransfree?

nt := newNetwork(nil, nil, nil)
nt.send(pb.Message{From: 1, To: 1, Type: pb.MsgHup})

for j := 0; j <= 5; j++ {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we send 5 proposals in this test case?

@xiang90
Copy link
Copy Markdown
Contributor

xiang90 commented Apr 5, 2016

@es-chow Looks good overall. Please fix the comments. Add LeaderTransfere prefix to all tests. Thanks!

@es-chow es-chow force-pushed the transfer-leader branch 3 times, most recently from 0768bc4 to 24e2dbf Compare April 6, 2016 08:44
@es-chow
Copy link
Copy Markdown
Contributor Author

es-chow commented Apr 6, 2016

@xiang90 , @bdarnell, updated as per the comments. PTAL.

raft/raft.go Outdated
case len(r.votes) - gr:
r.becomeFollower(r.Term, None)
}
case pb.MsgTransferLeader:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move these to line 555. (same as how we handle m.Type == pb.MsgHup )

@xiang90
Copy link
Copy Markdown
Contributor

xiang90 commented Apr 7, 2016

LGTM. Defer to @bdarnell or anyone from cockroach team.

@ngaut
Copy link
Copy Markdown
Contributor

ngaut commented Apr 7, 2016

@xiang90 I'd like to see this happen too. And i will port the feature to TiKV later.

}
// Transfer leadership is in progress.
if m.From == r.leadTransferee && pr.Match == r.raftLog.lastIndex() && !r.timeoutNowSent {
r.logger.Infof("%x sent MsgTimeoutNow to %x after received MsgAppResp", r.id, m.From)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we assume that the transport (usually) preserves order, we could send the MsgTimeoutNow as soon as we have sent a MsgApp containing the last index, without waiting for the MsgAppResp. This is similar to the way we optimistically increase Progress.Next when sending MsgApp.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bdarnell, @xiang90, I feel we cannot send MsgTimeoutNow as soon as we have sent a MsgApp containing the last index in all cases, if we add that code and remove the code to send MsgTimeoutNow when a MsgAppResp is received, the test case TestLeaderTransferToSlowFollower cannot be passed as node 3 will reject that MsgApp just before the MsgTimeoutNow triggered by pb.MsgTransferLeader as its log is not up-to-date, but the following MsgTimeoutNow will trigger node3 to start an election which also will fail.

  1. Maybe one way is we keep the code which send MsgTimeoutNow as receiving MsgAppResp, but add one optimistic sending MsgTimeoutNow only when we optimistically update the pr.Next, but this may also have the risk that peer may reject that MsgApp and MsgTimeoutNow will make the group unavailable for one election timeout. And also we have to keep the timeoutNowSent flag.
  2. We keep the current code as we can also remove the timeoutNowSent flag. This is the updated commit.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel this is an optimization. We can think this more and do it in another pr. @es-chow Can you set up another issue to track this potential optimization?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this optimization can wait. One thing we could do to avoid the scenario in which the follower rejects the MsgApp and then starts an election that fails is to include the leader's last log index in the MsgTimeoutNow. The follower would only start campaigning if it matches the index; if it doesn't then it knows it can't win.

@bdarnell
Copy link
Copy Markdown
Contributor

bdarnell commented Apr 7, 2016

LGTM. Thanks @es-chow!

@xiang90
Copy link
Copy Markdown
Contributor

xiang90 commented Apr 7, 2016

@es-chow Can you fix the last few comments? We are getting really close to get this merged! Thanks!

if lastLeadTransferee == None {
r.logger.Debugf("%x is already leader. Ignored transfer leadership to %x", r.id, r.id)
} else {
r.logger.Debugf("%x abort transfer leadership to %x, transfer to current leader %x.", r.id, lastLeadTransferee, r.id)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems we never run to this line, otherwise r.abortLeaderTransfer() should be add.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree /cc @es-chow can you also take a look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

5 participants