[fix][broker] Fix deadlock while skip non-recoverable ledgers. by hrzzzz · Pull Request #21915 · apache/pulsar

hrzzzz · 2024-01-18T03:31:58Z

Fixes #21914

Motivation

This PR resolves a deadlock that occurs when the org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#skipNonRecoverableLedger method and the org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#asyncDelete method are called concurrently.

Modifications

The deadlock arises because the org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#skipNonRecoverableLedger method acquires the write lock over a large scope, seemingly just to check !individualDeletedMessages.contains(ledgerId, i). However, the org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#asyncDelete method already performs this check internally, so that Positions which have already been deleted are not deleted again. We can therefore remove the write lock from skipNonRecoverableLedger, drop the !individualDeletedMessages.contains(ledgerId, i) check, and call asyncDelete directly.
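The shape of this change can be illustrated with a toy model. The sketch below is not the Pulsar code: `ToyCursor`, `delete`, `skipLedger`, `snapshot`, and the `Long` entry ids are hypothetical stand-ins, and the model assumes that the delete path releases the write lock before it takes the monitor. It only shows that the skip path needs neither its own lock nor its own duplicate check when the delete path already filters duplicates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ToyCursor {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Object pendingOps = new Object();        // stands in for pendingMarkDeleteOps
    private final TreeSet<Long> deleted = new TreeSet<>(); // stands in for individualDeletedMessages

    /** Marks entries as deleted, skipping ones already deleted; returns how many were new. */
    public int delete(Iterable<Long> entryIds) {
        int added = 0;
        lock.writeLock().lock();
        try {
            for (long id : entryIds) {
                if (deleted.add(id)) { // add() returns false for an already-deleted entry
                    added++;
                }
            }
        } finally {
            lock.writeLock().unlock(); // released before the monitor is taken
        }
        synchronized (pendingOps) {
            snapshot(); // monitor first, then read lock
        }
        return added;
    }

    /** Copies the deleted set under the read lock. */
    public List<Long> snapshot() {
        lock.readLock().lock();
        try {
            return new ArrayList<>(deleted);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Skips entries 0..lastEntryId without taking any lock itself. */
    public int skipLedger(long lastEntryId) {
        List<Long> ids = new ArrayList<>();
        for (long i = 0; i <= lastEntryId; i++) {
            ids.add(i); // no contains() pre-check; delete() filters duplicates
        }
        return delete(ids);
    }

    public static void main(String[] args) {
        ToyCursor cursor = new ToyCursor();
        cursor.delete(List.of(1L, 2L));
        System.out.println(cursor.skipLedger(3)); // prints 2 (entries 0 and 3 are new)
        System.out.println(cursor.snapshot());    // prints [0, 1, 2, 3]
    }
}
```

Because `skipLedger` never holds the write lock while calling `delete`, no thread in this model ever waits for the monitor while holding the write lock.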

Verifying this change

Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

Documentation

doc
doc-required
doc-not-needed
doc-complete

Matching PR in forked repository

PR in forked repository: hrzzzz#4

hrzzzz · 2024-01-18T08:07:35Z

@poorbarcode @codelipenghui PTAL, thanks

lhotari

It would be great to have the Iterable optimization proposed in a review comment.

codecov-commenter · 2024-01-21T16:48:55Z

Codecov Report

❌ Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 73.62%. Comparing base (17bb322) to head (13bb9bb).
⚠️ Report is 1386 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...che/bookkeeper/mledger/impl/ManagedCursorImpl.java | 80.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #21915      +/-   ##
============================================
+ Coverage     73.59%   73.62%   +0.03%     
- Complexity    32417    32427      +10     
============================================
  Files          1861     1861              
  Lines        138678   138675       -3     
  Branches      15188    15185       -3     
============================================
+ Hits         102060   102106      +46     
+ Misses        28715    28682      -33     
+ Partials       7903     7887      -16

| Flag | Coverage Δ | |
|---|---|---|
| inttests | `24.07% <0.00%> (-0.06%)` | ⬇️ |
| systests | `23.63% <0.00%> (-0.05%)` | ⬇️ |
| unittests | `72.92% <80.00%> (+0.05%)` | ⬆️ |


| Files with missing lines | Coverage Δ | |
|---|---|---|
| ...che/bookkeeper/mledger/impl/ManagedCursorImpl.java | `79.37% <80.00%> (-0.14%)` | ⬇️ |

... and 74 files with indirect coverage changes


### Motivation

The sequence of events leading to the deadlock when methods from org.apache.bookkeeper.mledger.impl.ManagedCursorImpl are invoked concurrently is as follows:

1. Thread A calls asyncDelete, which then goes on to internally call internalAsyncMarkDelete. This results in acquiring a lock on pendingMarkDeleteOps through synchronized (pendingMarkDeleteOps).
2. Inside internalAsyncMarkDelete, internalMarkDelete is called which subsequently calls persistPositionToLedger. At the start of persistPositionToLedger, buildIndividualDeletedMessageRanges is invoked, where it tries to acquire a read lock using lock.readLock().lock(). At this point, if the write lock is being held by another thread, Thread A will block waiting for the read lock.
3. Concurrently, Thread B executes skipNonRecoverableLedger which first obtains a write lock using lock.writeLock().lock() and then proceeds to call asyncDelete.
4. At this moment, Thread B already holds the write lock and is attempting to acquire the synchronized lock on pendingMarkDeleteOps that Thread A already holds, while Thread A is waiting for the read lock that Thread B needs to release.

In code, the deadlock appears as follows:

- Thread A: synchronized (pendingMarkDeleteOps) -> lock.readLock().lock() (waiting)
- Thread B: lock.writeLock().lock() -> synchronized (pendingMarkDeleteOps) (waiting)

### Modifications

Avoid using a long-range lock.

Co-authored-by: ruihongzhou <ruihongzhou@tencent.com>
Co-authored-by: Jiwe Guo <technoboy@apache.org>
Co-authored-by: Lari Hotari <lhotari@apache.org>
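The lock-order inversion described in this commit message can be reproduced in isolation. The following is a minimal sketch, not the Pulsar code: `LockOrderInversionDemo`, `runScenario`, and the `pendingOps` monitor are hypothetical names, and Thread A uses `tryLock` with a timeout instead of `lock()` so that the demo terminates rather than hanging. A timed-out `tryLock` here corresponds to the point where the real code would block forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockOrderInversionDemo {
    // Stand-ins for the pendingMarkDeleteOps monitor and the cursor's read/write lock.
    private static final Object pendingOps = new Object();
    private static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Returns true if thread A obtained the read lock within the timeout. */
    public static boolean runScenario(long timeoutMillis) {
        CountDownLatch aHoldsMonitor = new CountDownLatch(1);
        CountDownLatch bHoldsWriteLock = new CountDownLatch(1);
        AtomicBoolean readLockAcquired = new AtomicBoolean(false);

        // Thread A: monitor first, then read lock.
        Thread a = new Thread(() -> {
            synchronized (pendingOps) {
                aHoldsMonitor.countDown();
                try {
                    bHoldsWriteLock.await();
                    // With lock() this would block forever; tryLock gives up after the timeout.
                    if (lock.readLock().tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                        readLockAcquired.set(true);
                        lock.readLock().unlock();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        // Thread B: write lock first, then monitor (the opposite order).
        Thread b = new Thread(() -> {
            lock.writeLock().lock();
            try {
                bHoldsWriteLock.countDown();
                aHoldsMonitor.await();
                synchronized (pendingOps) {
                    // Reached only after A gives up and leaves its synchronized block.
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.writeLock().unlock();
            }
        });

        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return readLockAcquired.get();
    }

    public static void main(String[] args) {
        System.out.println("read lock acquired by A: " + runScenario(200)); // prints false
    }
}
```

Thread B cannot release the write lock until it enters the monitor, and Thread A holds the monitor for the whole duration of its `tryLock`, so the read lock is never granted within the timeout.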

…e#21915) (cherry picked from commit 37fc40c)

[fix][broker] Fix deadlock while skip non-recoverable ledgers.

fe6d9e9

github-actions Bot added the doc-not-needed Your PR changes do not impact docs label Jan 18, 2024

lhotari requested changes Jan 19, 2024


Comment thread managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java

lhotari approved these changes Jan 19, 2024


lhotari reviewed Jan 19, 2024


Comment thread managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java Outdated

Technoboy- assigned hrzzzz Jan 21, 2024

Technoboy- added this to the 3.2.0 milestone Jan 21, 2024

Technoboy- added release/3.0.3 release/3.1.3 release/blocker Indicate the PR or issue that should block the release until it gets resolved labels Jan 21, 2024

address comment

161c3d8

Technoboy- requested a review from lhotari January 21, 2024 15:37

lhotari requested changes Jan 21, 2024


lhotari added 2 commits January 21, 2024 20:14

Merge remote-tracking branch 'origin/master' into fix-deadlock

b171ef6

Pass plain Iterable without collecting a list

13bb9bb
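As a rough illustration of passing a plain Iterable instead of a collected list, the sketch below builds a lazy `Iterable<Long>` over a range of entry ids. The class and method names (`LazyEntryIds`, `entryIds`) are hypothetical, and plain `Long` ids stand in for positions; this is not the code from the commit.

```java
import java.util.stream.LongStream;

public class LazyEntryIds {
    /** Lazily yields 0..lastEntryId; each iterator() call starts a fresh stream. */
    public static Iterable<Long> entryIds(long lastEntryId) {
        return () -> LongStream.rangeClosed(0, lastEntryId).boxed().iterator();
    }

    public static void main(String[] args) {
        long sum = 0;
        for (long id : entryIds(4)) {
            sum += id; // ids are produced one at a time, never stored in a list
        }
        System.out.println(sum); // prints 10
    }
}
```

The lambda creates a new stream on every `iterator()` call, so the Iterable can be traversed more than once without materializing the ids.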

lhotari approved these changes Jan 21, 2024


lhotari added the ready-to-test label Jan 21, 2024

Technoboy- requested review from codelipenghui and poorbarcode January 22, 2024 03:31

poorbarcode approved these changes Jan 22, 2024


poorbarcode reviewed Jan 22, 2024


Comment thread managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java

poorbarcode approved these changes Jan 22, 2024


poorbarcode merged commit 5a65e98 into apache:master Jan 22, 2024

hrzzzz deleted the fix-deadlock branch January 25, 2024 10:17

Technoboy- added the cherry-picked/branch-3.1 label Jan 31, 2024

Technoboy- added the cherry-picked/branch-3.0 label Feb 5, 2024

poorbarcode merged 4 commits into apache:master from hrzzzz:fix-deadlock


5 participants
