Fix deadlock on removeExpiredSnapshots #2461

Merged
ti-srebot merged 1 commit into pingcap:master from JaySon-Huang:fix_ps_snap_deadlock
Jul 22, 2021
@JaySon-Huang JaySon-Huang commented Jul 22, 2021

Signed-off-by: JaySon-Huang <jayson.hjs@gmail.com>

What problem does this PR solve?

Issue Number: close #2456

Problem Summary:
Similar to issue #2249 (fixed by PR #2277).
removeExpiredSnapshots may trigger ~Snapshot while read_write_mutex is still locked, which causes a recursive deadlock.

Since #2431, which reports the oldest snapshot to Prometheus, removeExpiredSnapshots is called every minute.

https://github.com/pingcap/tics/blob/e9f28c717902a4b22855970752a08ceca214cd01/dbms/src/Storages/Page/mvcc/VersionSetWithDelta.h#L313-L340

What is changed and how it works?

  • Save the valid snapshots into a vector while holding the lock on read_write_mutex, and release those snapshots only after the lock has been released
  • Add a failpoint to test this case
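
The first change can be illustrated with a minimal sketch. This is not the code in VersionSetWithDelta.h: the `VersionSet` class, the `num_released` counter, the accessors, and the `while_locked` hook (a stand-in for the failpoint) are all hypothetical names introduced for illustration, and a plain `std::mutex` is assumed in place of the real read_write_mutex. The point is only the ordering: the vector of strong references is declared before the lock scope, so any `~Snapshot` it triggers runs after the lock is gone.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical, simplified stand-in for the version set; not the TiFlash code.
class VersionSet
{
public:
    class Snapshot
    {
    public:
        explicit Snapshot(VersionSet * vset_) : vset(vset_) { assert(vset != nullptr); }
        Snapshot(const Snapshot &) = delete;
        Snapshot & operator=(const Snapshot &) = delete;

        // Destroying a snapshot takes the mutex, so it must never run while
        // the current thread already holds that mutex.
        ~Snapshot()
        {
            std::lock_guard<std::mutex> lock(vset->read_write_mutex);
            ++vset->num_released;
        }

    private:
        VersionSet * vset;
    };
    using SnapshotPtr = std::shared_ptr<Snapshot>;

    SnapshotPtr getSnapshot()
    {
        auto snap = std::make_shared<Snapshot>(this);
        std::lock_guard<std::mutex> lock(read_write_mutex);
        snapshots.emplace_back(snap); // tracked only through a weak_ptr
        return snap;
    }

    // Drops expired entries and returns the number of snapshots still alive.
    // `while_locked` is a hypothetical test hook, called with the lock held.
    std::size_t removeExpiredSnapshots(const std::function<void()> & while_locked = nullptr)
    {
        // Declared before the lock scope, so it outlives the lock.
        std::vector<SnapshotPtr> valid_snapshots;
        {
            std::lock_guard<std::mutex> lock(read_write_mutex);
            for (auto iter = snapshots.begin(); iter != snapshots.end(); /* advanced below */)
            {
                SnapshotPtr snap = iter->lock();
                if (snap == nullptr)
                {
                    iter = snapshots.erase(iter);
                }
                else
                {
                    // Keep the reference alive past the end of the lock scope.
                    valid_snapshots.emplace_back(std::move(snap));
                    ++iter;
                }
            }
            if (while_locked)
                while_locked();
        } // the mutex is released here

        const std::size_t num_valid = valid_snapshots.size();
        // If one of these is now the last reference, ~Snapshot runs here,
        // where it can take the mutex safely.
        valid_snapshots.clear();
        return num_valid;
    }

    std::size_t numTracked()
    {
        std::lock_guard<std::mutex> lock(read_write_mutex);
        return snapshots.size();
    }

    std::size_t numReleased()
    {
        std::lock_guard<std::mutex> lock(read_write_mutex);
        return num_released;
    }

private:
    std::mutex read_write_mutex; // assumption: the real code uses a reader/writer mutex
    std::list<std::weak_ptr<Snapshot>> snapshots;
    std::size_t num_released = 0;
};
```

In this sketch, a caller that drops its snapshot while `removeExpiredSnapshots` holds the lock (simulated by passing `[&] { held.reset(); }` as `while_locked`) leaves the vector holding the last reference. The snapshot is then destroyed by `valid_snapshots.clear()`, outside the lock, instead of inside the loop.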

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch:

Check List

Tests

  • Manual test (add detailed scripts or steps below)
    Tested with the stress test from #2297 (Fix broken page storage stress testing), using 4 writers and 128 readers, with the failpoint random_slow_page_storage_remove_expired_snapshots enabled

Side effects

Release note

  • No release note

Add failpoint for testing

Signed-off-by: JaySon-Huang <jayson.hjs@gmail.com>

@flowbehappy flowbehappy left a comment

LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Jul 22, 2021
@JaySon-Huang

/merge

@ti-srebot ti-srebot added the status/can-merge Indicates a PR has been approved by a committer. label Jul 22, 2021
@ti-srebot

/run-all-tests

@ti-srebot ti-srebot merged commit 548b3a1 into pingcap:master Jul 22, 2021
@JaySon-Huang JaySon-Huang deleted the fix_ps_snap_deadlock branch July 22, 2021 07:47
JaySon-Huang added a commit to JaySon-Huang/tiflash that referenced this pull request Aug 4, 2021
JaySon-Huang added a commit to JaySon-Huang/tiflash that referenced this pull request Aug 4, 2021
@JaySon-Huang JaySon-Huang added the needs-cherry-pick-release-4.0 PR which needs to be cherry-picked to release-4.0 label Aug 4, 2021
@JaySon-Huang

/run-cherry-pick

@JaySon-Huang

cherry-pick to release-5.1 in #2564

@ti-srebot

cherry pick to release-4.0 in PR #2567

JaySon-Huang added a commit to JaySon-Huang/tiflash that referenced this pull request Aug 4, 2021
@JaySon-Huang

cherry-pick to release-5.0 in #2568

flowbehappy pushed a commit that referenced this pull request Aug 4, 2021
* Ignore sequence hole among PageFile meta (#2312)

* Fix bug for GC may skip unexpected WriteBatches (#2356)

* Add length check while running PageStorage GC (#2394)

* PageStorage skip non continuous sequence safely (#2435)

* Fix PageStorage GC with high valid rate PageFile (#2436)

* More debug info for DeltaTree (query_id, snapshot lifetime) (#2431)

* Fix deadlock on `removeExpiredSnapshots` (#2461)

* Add grafana panels for write throughput per instance (#2524)
JaySon-Huang added a commit that referenced this pull request Aug 4, 2021
* More debug info for DeltaTree (query_id, snapshot lifetime) (#2431)
* Fix deadlock on `removeExpiredSnapshots` (#2461)
JaySon-Huang added a commit that referenced this pull request Aug 10, 2021
* cherry pick #2461 to release-4.0

Co-authored-by: JaySon <tshent@qq.com>
Co-authored-by: JaySon-Huang <jayson.hjs@gmail.com>
ti-chi-bot pushed a commit that referenced this pull request Sep 1, 2021
windtalker added a commit that referenced this pull request Sep 6, 2021
Co-authored-by: JaySon <tshent@qq.com>

Development

Successfully merging this pull request may close these issues.

coredump with the error of "std::system_error what(): Resource deadlock avoided"
