raftstore: fix missing pending peers in region heartbeat after split#18032
raftstore: fix missing pending peers in region heartbeat after split#18032ti-chi-bot[bot] merged 7 commits intotikv:masterfrom
Conversation
Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
I gave up on option 1 because blocking a split after it's committed is non-intuitive and prone to more race conditions. I'm now looking into option 2 to prevent the merge from happening in the first place. |
83f52df to
c105465
Compare
Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
| } | ||
|
|
||
| if status.progress.is_none() { | ||
| // If `progress` is None, the current node is not the leader. Since |
There was a problem hiding this comment.
NIT: it's better to add extra annotations here to specify that the calling of this part will only be executed by a new split peer with already campaigned, who will most likely be elected as the new leader of the region.
There was a problem hiding this comment.
Added more comment as suggested
Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
| } | ||
|
|
||
| if status.progress.is_none() { | ||
| // If `progress` is None, the current node is not the leader. |
There was a problem hiding this comment.
How about adding an assert here
There was a problem hiding this comment.
You meant asserting that the peer is not a leader? That should always be the case according to the logic in raft-rs. So yeah I can add an assert here.
There was a problem hiding this comment.
It might seem a bit unnecessary though because status() was just called a few lines before (on line 2102)
There was a problem hiding this comment.
Added the assertion as suggested
overvenus
left a comment
There was a problem hiding this comment.
Should we also reject PrepareMerge when there is a pending peer to further strengthen the region merge process?
I hoped to do that too but it seems more complex than expected. The PrepareMerge proposal is sent to the merge source, but we actually need to check the pending peer on the merge target. |
Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
|
@Connor1996 @overvenus @LykxSassinator PTAL again, thanks! |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Connor1996, LykxSassinator The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[LGTM Timeline notifier]Timeline:
|
What is changed and how it works?
Issue Number: Close #17992
Since #7672, TiKV has relied on the guarantee that all merge target peers must exist before PD initiates a merge. However, this guarantee was broken in the scenario in #17992, ultimately leading to a panic.
PD uses
pending_peersto verify whether all target peers exist before merging[1][2]. Investigations pointed to an issue in TiKV where it incorrectly reported missing pending peers after a split. More specifically, right after a split, TiKV triggers a new peer to upload a heartbeat[3], but at this point, the peer may not have been elected as the leader yet. Before this PR, a non-leader would always set thepending_peersto empty, causing PD to receive an incorrect region state.Since a non-leader has no knowledge of the progress of other peers, it should consevatively mark all other peers as pending. This fix ensures that PD receives correct
pending_peersinfo, which should prevent the immature merge and avoiding the potential panic.[1] components/raftstore/src/store/fsm/peer.rs#L4686
[2] schedule/checker/merge_checker.go#L219
[3] schedule/filter/healthy.go#L24
What's Changed:
Note: This issue is unlikely to occur in production since PD enforces a split-merge-interval (default: 1h) between split and merge operations. The issue only happens if a region is merged immediately after a split.
Related changes
pingcap/docs/pingcap/docs-cn:Check List
Tests
Side effects
Release note