Skip to content

Fix deadlock on removeExpiredSnapshots (#2461)#2567

Merged
JaySon-Huang merged 3 commits intopingcap:release-4.0from
ti-srebot:release-4.0-548b3a150f04
Aug 10, 2021
Merged

Fix deadlock on removeExpiredSnapshots (#2461)#2567
JaySon-Huang merged 3 commits intopingcap:release-4.0from
ti-srebot:release-4.0-548b3a150f04

Conversation

@ti-srebot
Copy link
Collaborator

@ti-srebot ti-srebot commented Aug 4, 2021

cherry-pick #2461 to release-4.0
You can switch your code base to this Pull Request by using git-extras:

# In tics repo:
git pr https://github.com/pingcap/tics/pull/2567

After apply modifications, you can push your change to this PR via:

git push git@github.com:ti-srebot/tics.git pr/2567:release-4.0-548b3a150f04

Signed-off-by: JaySon-Huang jayson.hjs@gmail.com

What problem does this PR solve?

Issue Number: close #2456

Problem Summary:
Similar to issue: #2249 PR: #2277
In removeExpiredSnapshots, it may trigger the ~Snapshot under read_write_mutex being locked. It causes incursive deadlock.

And we will call removeExpiredSnapshots every minute after #2431 that reports the oldest snapshot to Prometheus.

https://github.com/pingcap/tics/blob/e9f28c717902a4b22855970752a08ceca214cd01/dbms/src/Storages/Page/mvcc/VersionSetWithDelta.h#L313-L340

What is changed and how it works?

  • Save the valid snapshots into a vector under a lock on read_write_mutex, and only release those snapshots after the lock gets released
  • Add some failpoint to test this case

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch:

Check List

Tests

  • Manual test (add detailed scripts or steps below)
    Tested by the stress testing in Fix broken page storage stress testing #2297 with 4 writers, 128 readers, and failpoint random_slow_page_storage_remove_expired_snapshots enabled

Side effects

Release note

  • No release note

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot ti-srebot added CHERRY-PICK cherry pick status/LGT1 Indicates that a PR has LGTM 1. labels Aug 4, 2021
@ti-srebot ti-srebot requested a review from flowbehappy August 4, 2021 08:38
@ti-srebot ti-srebot added this to the v4.0.15 milestone Aug 4, 2021
@JaySon-Huang JaySon-Huang added the type/bugfix This PR fixes a bug. label Aug 4, 2021
Signed-off-by: JaySon-Huang <jayson.hjs@gmail.com>
@JaySon-Huang JaySon-Huang force-pushed the release-4.0-548b3a150f04 branch from ee41c60 to ec7b043 Compare August 4, 2021 13:47
@JaySon-Huang
Copy link
Contributor

Also pick #2229 to release-4.0 to show warnings when there are snapshots being held for a long time.

Because there could be unknow bugs other that MPP that hold snapshots for a long time: #2529

@JaySon-Huang
Copy link
Contributor

/run-all-tests

@JaySon-Huang
Copy link
Contributor

/merge

@ti-chi-bot
Copy link
Member

@JaySon-Huang: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

DetailsCommit hash: 6cb5e7b

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Aug 10, 2021
@JaySon-Huang JaySon-Huang merged commit 20238d3 into pingcap:release-4.0 Aug 10, 2021
@JaySon-Huang JaySon-Huang deleted the release-4.0-548b3a150f04 branch August 10, 2021 10:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CHERRY-PICK cherry pick status/can-merge Indicates a PR has been approved by a committer. status/LGT1 Indicates that a PR has LGTM 1. type/bugfix This PR fixes a bug.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants