Skip to content

raftstore: rely on the all-target-peer-exist guarantee during merging#7672

Merged
sre-bot merged 19 commits intotikv:masterfrom
gengliqi:remove-some-code
Jun 1, 2020
Merged

raftstore: rely on the all-target-peer-exist guarantee during merging#7672
sre-bot merged 19 commits intotikv:masterfrom
gengliqi:remove-some-code

Conversation

@gengliqi
Copy link
Member

@gengliqi gengliqi commented Apr 27, 2020

Signed-off-by: Liqi Geng gengliqiii@gmail.com

What problem does this PR solve?

Problem Summary:

In the past, we assume some target peers may not exist during merging. It introduces many complexities that we need to think about how to handle it.

Also, there are some cases we can not handle.

For example, a target peer does not exist during merging. Its corresponding source peer is waiting for it to be created. Then other target peers do the conf change and remove it(though it never existed). Next, the target region merges to another target region.

In this case, the waiting source peer can not find out whether it should be removed. (It's very difficult, the PD knows nothing because the target region has merged to another one)

Luckily, in tikv/pd#839 (the earliest PR about merging in PD), it checks the target peer's pending peer count must equal to zero.
So we can rely on the all-target-peer-exist guarantee to make it easier and save our brain power.

What is changed and how it works?

What's Changed:

Rely on the all-target-peer-exist guarantee during merging.
Remove some code about handling some cases when a target peer may not exist during merging.

Related changes

  • Need to cherry-pick to the release branch

Check List

Tests

  • No code

Side effects

We should check it twice whether the earliest version can give us this guarantee.
Otherwise, it will break backward compatibility.

Release note

  • Fix a case that a peer can not be removed when its store is isolated during multiple merge process.

@gengliqi gengliqi requested review from BusyJay and Connor1996 April 27, 2020 17:43
@BusyJay
Copy link
Member

BusyJay commented Apr 27, 2020

I suggest to keep those code for safe check. Otherwise, incorrect implement of PD can lead to disaster.

Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
@gengliqi gengliqi force-pushed the remove-some-code branch from b9486bc to ba54942 Compare May 6, 2020 08:12
@sre-bot
Copy link
Contributor

sre-bot commented May 9, 2020

Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
@gengliqi gengliqi force-pushed the remove-some-code branch from 6a99f1c to 01cd206 Compare May 9, 2020 03:22
gengliqi added 2 commits May 9, 2020 14:55
Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
@gengliqi gengliqi requested a review from BusyJay May 9, 2020 07:00
Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
@gengliqi gengliqi force-pushed the remove-some-code branch from 455a4d3 to 9c605d4 Compare May 9, 2020 09:49
@gengliqi
Copy link
Member Author

gengliqi commented May 9, 2020

/run-all-tests

@gengliqi gengliqi added sig/raft Component: Raft, RaftStore, etc. sig/raft type/bugfix This PR fixes a bug. and removed sig/raft labels May 11, 2020
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is useless. It's optimized out in release mode, and we don't test debug build with PD, so the check can't really detect anything about real PD, it just checks for nothing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe using something else instead.
I think we need a method to assert in release mode in the development environment.
There are some cases that using assert! directly is a little bit overkill in the production environment but necessary in the development environment. Then if enough tests have been passed, maybe we can change it to assert!.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it needs to be assert in any cases. Even this is not hold, it doesn't effect normal operations of TiKV, it just leads to some unremovable replicas. I think just an alert is enough and leave an API to allow user to clean up stale replicas.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok to not change to assert. but an assert method is needed to detect some bugs in our tests in the development environment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's why I suggest an alert.

Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
@gengliqi gengliqi force-pushed the remove-some-code branch from caba904 to 3576c7f Compare May 19, 2020 05:12
#[doc(hidden)]
#[serde(skip_serializing)]
#[config(skip)]
pub ensure_all_target_peer_exist: bool,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be a better name so that it can be reused in other cases besides merge.

future_poll_size: 1,
hibernate_regions: true,
early_apply: true,
ensure_all_target_peer_exist: true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default should be false.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be true. If so, an error log will be printed when something is wrong and breaks this guarantee.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How? From what I can see, it just debug_assert.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misrepresented it. It should be false only in test just for passing some tests and not panic. debug_assert will be replaced to another one.

@gengliqi gengliqi requested a review from NingLin-P May 22, 2020 06:35
Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
@gengliqi gengliqi force-pushed the remove-some-code branch 3 times, most recently from 8f14b07 to e6dcdd9 Compare May 22, 2020 11:13
@gengliqi gengliqi force-pushed the remove-some-code branch from 4ff0a7c to f8cfc9c Compare May 27, 2020 11:43
}
} else {
if ctx.cfg.dev_assert {
assert!(false, "{} failed to get peer {} from cache", self.tag, id);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just panic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

region_replication_status: HashMap<u64, RegionReplicationStatus>,

// for merging
pub merge_ensure_all_target_peer_exist: bool,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about check_merge_target_integrity?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
drop(meta);

// Check whether target peer is set to tombstone already.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can they be kept in case PD is not working as expected?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. The query to PD can only handle some cases not all of them. I think it has very little possibility that PD does not ensure all target peer exists because I have already checked them. Also, if PD really does not ensure it, the consequence is not serious and we can remove the region manually.

/// It should be called when target region is not in region map in memory.
/// If everything is ok, the answer should always be true because PD should ensure all target peers exist.
/// So if not, error log will be printed and return false.
fn is_local_merge_target_region_fresher(&self, target_region: &metapb::Region) -> Result<bool> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about is_merge_target_outdated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the original name is fine 🤣 because I want to emphasize local_target_region

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original name is too long, we usually use three to four words for naming. And using fresher here is very weird. A more up to date local target region means target_region to be checked is outdated.

target_region.get_region_epoch(),
target_state.get_region().get_region_epoch(),
) {
return Ok(true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the source peer applies slowly and is still in an outdated merge?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get it. This function is used for the source peer to check whether its target peer is fresher. If so, the merge process may need to rollback or the merge succeeds and the source peer needs to destroy itself.

gengliqi added 3 commits May 28, 2020 15:50
Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
Copy link
Member

@BusyJay BusyJay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fresher in comments need to be updated too. Rest LGTM.

"target_region_id" => region_id,
);
return Ok(false);
Ok(true) => Err(box_err!("region {} is destroyed", target_region_id)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct. Target peer doesn't have to be destroyed to reach this branch.

Copy link
Member Author

@gengliqi gengliqi May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the target peer is not destroyed, it should in region map, but it's not.

@BusyJay
Copy link
Member

BusyJay commented May 29, 2020

@Connor1996 @NingLin-P PTAL

Signed-off-by: Liqi Geng <gengliqiii@gmail.com>
Copy link
Member

@Connor1996 Connor1996 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gengliqi gengliqi added the status/can-merge Indicates a PR has been approved by a committer. label Jun 1, 2020
@sre-bot
Copy link
Contributor

sre-bot commented Jun 1, 2020

/run-all-tests

@sre-bot sre-bot merged commit ed337a4 into tikv:master Jun 1, 2020
@gengliqi gengliqi added needs-cherry-pick-release-3.0 Type: Need cherry pick to release 3.0 needs-cherry-pick-release-3.1 Type: Need cherry pick to release 3.1 needs-cherry-pick-release-4.0 Type: Need cherry pick to release 4.0 labels Jun 3, 2020
sre-bot pushed a commit to sre-bot/tikv that referenced this pull request Jun 3, 2020
Signed-off-by: sre-bot <sre-bot@pingcap.com>
@sre-bot
Copy link
Contributor

sre-bot commented Jun 3, 2020

cherry pick to release-3.1 in PR #8003

sre-bot pushed a commit to sre-bot/tikv that referenced this pull request Jun 3, 2020
Signed-off-by: sre-bot <sre-bot@pingcap.com>
@sre-bot
Copy link
Contributor

sre-bot commented Jun 3, 2020

cherry pick to release-4.0 in PR #8004

sre-bot pushed a commit to sre-bot/tikv that referenced this pull request Jun 3, 2020
Signed-off-by: sre-bot <sre-bot@pingcap.com>
@sre-bot
Copy link
Contributor

sre-bot commented Jun 3, 2020

cherry pick to release-3.0 in PR #8005

@gengliqi gengliqi deleted the remove-some-code branch June 5, 2020 07:26
hicqu added a commit that referenced this pull request Jun 28, 2020
…#7672) (#8005)

Co-authored-by: Liqi Geng <gengliqiii@gmail.com>
Co-authored-by: qupeng <qupeng@pingcap.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-cherry-pick-release-3.0 Type: Need cherry pick to release 3.0 needs-cherry-pick-release-3.1 Type: Need cherry pick to release 3.1 needs-cherry-pick-release-4.0 Type: Need cherry pick to release 4.0 sig/raft Component: Raft, RaftStore, etc. status/can-merge Indicates a PR has been approved by a committer. type/bugfix This PR fixes a bug.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants