raftstore: rely on the all-target-peer-exist guarantee during merging#7672
raftstore: rely on the all-target-peer-exist guarantee during merging#7672sre-bot merged 19 commits intotikv:masterfrom
Conversation
|
I suggest to keep those code for safe check. Otherwise, incorrect implement of PD can lead to disaster. |
Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
|
No release note, Please follow https://github.com/pingcap/community/blob/master/contributors/release-note-checker.md |
Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
|
/run-all-tests |
There was a problem hiding this comment.
I think this is useless. It's optimized out in release mode, and we don't test debug build with PD, so the check can't really detect anything about real PD, it just checks for nothing.
There was a problem hiding this comment.
Maybe using something else instead.
I think we need a method to assert in release mode in the development environment.
There are some cases that using assert! directly is a little bit overkill in the production environment but necessary in the development environment. Then if enough tests have been passed, maybe we can change it to assert!.
There was a problem hiding this comment.
I don't think it needs to be assert in any cases. Even this is not hold, it doesn't effect normal operations of TiKV, it just leads to some unremovable replicas. I think just an alert is enough and leave an API to allow user to clean up stale replicas.
There was a problem hiding this comment.
It's ok to not change to assert. but an assert method is needed to detect some bugs in our tests in the development environment.
There was a problem hiding this comment.
Yes, that's why I suggest an alert.
Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
caba904 to
3576c7f
Compare
| #[doc(hidden)] | ||
| #[serde(skip_serializing)] | ||
| #[config(skip)] | ||
| pub ensure_all_target_peer_exist: bool, |
There was a problem hiding this comment.
Should be a better name so that it can be reused in other cases besides merge.
| future_poll_size: 1, | ||
| hibernate_regions: true, | ||
| early_apply: true, | ||
| ensure_all_target_peer_exist: true, |
There was a problem hiding this comment.
It should be true. If so, an error log will be printed when something is wrong and breaks this guarantee.
There was a problem hiding this comment.
How? From what I can see, it just debug_assert.
There was a problem hiding this comment.
I misrepresented it. It should be false only in test just for passing some tests and not panic. debug_assert will be replaced to another one.
Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
8f14b07 to
e6dcdd9
Compare
4ff0a7c to
f8cfc9c
Compare
| } | ||
| } else { | ||
| if ctx.cfg.dev_assert { | ||
| assert!(false, "{} failed to get peer {} from cache", self.tag, id); |
components/test_raftstore/src/pd.rs
Outdated
| region_replication_status: HashMap<u64, RegionReplicationStatus>, | ||
|
|
||
| // for merging | ||
| pub merge_ensure_all_target_peer_exist: bool, |
There was a problem hiding this comment.
How about check_merge_target_integrity?
| } | ||
| drop(meta); | ||
|
|
||
| // Check whether target peer is set to tombstone already. |
There was a problem hiding this comment.
Can they be kept in case PD is not working as expected?
There was a problem hiding this comment.
I don't think so. The query to PD can only handle some cases not all of them. I think it has very little possibility that PD does not ensure all target peer exists because I have already checked them. Also, if PD really does not ensure it, the consequence is not serious and we can remove the region manually.
| /// It should be called when target region is not in region map in memory. | ||
| /// If everything is ok, the answer should always be true because PD should ensure all target peers exist. | ||
| /// So if not, error log will be printed and return false. | ||
| fn is_local_merge_target_region_fresher(&self, target_region: &metapb::Region) -> Result<bool> { |
There was a problem hiding this comment.
How about is_merge_target_outdated?
There was a problem hiding this comment.
I think the original name is fine 🤣 because I want to emphasize local_target_region
There was a problem hiding this comment.
The original name is too long, we usually use three to four words for naming. And using fresher here is very weird. A more up to date local target region means target_region to be checked is outdated.
| target_region.get_region_epoch(), | ||
| target_state.get_region().get_region_epoch(), | ||
| ) { | ||
| return Ok(true); |
There was a problem hiding this comment.
What if the source peer applies slowly and is still in an outdated merge?
There was a problem hiding this comment.
I don't get it. This function is used for the source peer to check whether its target peer is fresher. If so, the merge process may need to rollback or the merge succeeds and the source peer needs to destroy itself.
Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
BusyJay
left a comment
There was a problem hiding this comment.
fresher in comments need to be updated too. Rest LGTM.
| "target_region_id" => region_id, | ||
| ); | ||
| return Ok(false); | ||
| Ok(true) => Err(box_err!("region {} is destroyed", target_region_id)), |
There was a problem hiding this comment.
This is not correct. Target peer doesn't have to be destroyed to reach this branch.
There was a problem hiding this comment.
If the target peer is not destroyed, it should in region map, but it's not.
|
@Connor1996 @NingLin-P PTAL |
Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
|
/run-all-tests |
Signed-off-by: sre-bot <sre-bot@pingcap.com>
|
cherry pick to release-3.1 in PR #8003 |
Signed-off-by: sre-bot <sre-bot@pingcap.com>
|
cherry pick to release-4.0 in PR #8004 |
Signed-off-by: sre-bot <sre-bot@pingcap.com>
|
cherry pick to release-3.0 in PR #8005 |
Signed-off-by: Liqi Geng gengliqiii@gmail.com
What problem does this PR solve?
Problem Summary:
In the past, we assume some target peers may not exist during merging. It introduces many complexities that we need to think about how to handle it.
Also, there are some cases we can not handle.
For example, a target peer does not exist during merging. Its corresponding source peer is waiting for it to be created. Then other target peers do the conf change and remove it(though it never existed). Next, the target region merges to another target region.
In this case, the waiting source peer can not find out whether it should be removed. (It's very difficult, the PD knows nothing because the target region has merged to another one)
Luckily, in tikv/pd#839 (the earliest PR about merging in PD), it checks the target peer's pending peer count must equal to zero.
So we can rely on the all-target-peer-exist guarantee to make it easier and save our brain power.
What is changed and how it works?
What's Changed:
Rely on the all-target-peer-exist guarantee during merging.
Remove some code about handling some cases when a target peer may not exist during merging.
Related changes
Check List
Tests
Side effects
We should check it twice whether the earliest version can give us this guarantee.
Otherwise, it will break backward compatibility.
Release note