Skip to content

raftstore: Introduce force leader state#11932

Merged
ti-chi-bot merged 52 commits intotikv:masterfrom
Connor1996:force-leader
Apr 12, 2022
Merged

raftstore: Introduce force leader state#11932
ti-chi-bot merged 52 commits intotikv:masterfrom
Connor1996:force-leader

Conversation

@Connor1996
Copy link
Member

@Connor1996 Connor1996 commented Feb 7, 2022

What is changed and how it works?

Issue Number: Close #6107

What's Changed:

Part of #10483, Introduce force leader state to advance commit index even if the quorum is not formed. This is used for online unsafe recover.

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • PR to update pingcap/tidb-ansible:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test

Release note

None 

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
@ti-chi-bot
Copy link
Member

ti-chi-bot commented Feb 7, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • BusyJay
  • NingLin-P

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Details

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 7, 2022
@Connor1996 Connor1996 requested review from BusyJay and NingLin-P and removed request for NingLin-P February 7, 2022 09:32
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
@Connor1996
Copy link
Member Author

PTAL too @v01dstar

Copy link
Member

@v01dstar v01dstar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Losing leadership in election timeout while check_quorum = true should be taken into consideration?

@Connor1996
Copy link
Member Author

Losing leadership in election timeout while check_quorum = true should be taken into consideration?

Yes, do you want to add an API to change that? If not, I'll do it.

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 11, 2022
@NingLin-P
Copy link
Member

I think there is a remaining problem about the learner role with the force leader mechanism, see #6107 (comment)

@Connor1996
Copy link
Member Author

I think there is a remaining problem about the learner role with the force leader mechanism, see #6107 (comment)

Good point. Adding a test to see if the learner can call became_leader successfully.

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
@Connor1996
Copy link
Member Author

I think there is a remaining problem about the learner role with the force leader mechanism, see #6107 (comment)

Add a case to force the learner to be the leader. Seems no problem, as we can promote it to be a voter once in force leader state.

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
@ti-chi-bot ti-chi-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 15, 2022
@NingLin-P
Copy link
Member

Add a case to force the learner to be the leader. Seems no problem, as we can promote it to be a voter once in force leader state.

It may more than just whether a learner can become leader, I worry about what may happen after a learner became leader as we always assume that only voter can become leader. /cc @BusyJay

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Apr 11, 2022
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
.raft_group
.raft
.prs()
.has_quorum(&expected_alive_voter)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supposing (a, b, c, d, e) -> (a, b, c, d, f) -> (a, b, c, g, f), and a, b, g, f all fail. c is isolated during the whole time. It may think it still has quorum (c, d, e).

Copy link
Member Author

@Connor1996 Connor1996 Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emmm.... In this case, force leader can't be performed as well, as we add pre-force-leader check expecting all votes from expected alive voters

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Perhaps what we should do is check if there is rejection instead of expecting all approvals.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic of forwarding the commit index also has the same issue. I'd like to leverage the mechanism introduced by this PR #7636 to filter the peers.

It's an extreme corner case, I'd record an issue now and fix it after the whole procedure is workable.

Copy link
Member Author

@Connor1996 Connor1996 Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a TODO item in #10483

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
@Connor1996 Connor1996 requested a review from BusyJay April 12, 2022 09:32
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Apr 12, 2022
@Connor1996
Copy link
Member Author

/merge

@ti-chi-bot
Copy link
Member

@Connor1996: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

DetailsCommit hash: cd6f7f2

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Apr 12, 2022
@Connor1996
Copy link
Member Author

/test

@ti-chi-bot ti-chi-bot merged commit b602719 into tikv:master Apr 12, 2022
Connor1996 added a commit to Connor1996/tikv that referenced this pull request May 18, 2022
close tikv#6107, ref tikv#10483

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Connor1996 added a commit to Connor1996/tikv that referenced this pull request May 18, 2022
close tikv#6107, ref tikv#10483

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
CalvinNeo pushed a commit to pingcap/tidb-engine-ext that referenced this pull request May 18, 2022
* raftstore: Introduce force leader state (tikv#11932)

close tikv#6107, ref tikv#10483

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

* raftstore: Wait ticks for hibernated peer when doing force leader (tikv#12364)

ref tikv#10483

Force leader is rejected on a peer who is already a leader. For the hibernated leader,
it doesn't step down to follower when quorum is missing due to not ticking election. 
So wait ticks in force leader process for hibernated peers to make sure election ticking
is performed.

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

* raftstore: Make unsafe recovery wait apply cover snapshot apply cases ref tikv#10483 (tikv#12308)

ref tikv#10483

Signed-off-by: v01dstar <yang.zhang@pingcap.com>

* raftstore: Execute recovery plan via raft (tikv#12022)

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Co-authored-by: Yang Zhang <yang.zhang@pingcap.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tikv-ctl unsafe recover can lead to inconsistent replica list

7 participants