Skip to content

[fix][broker] Fix deadlock while skip non-recoverable ledgers.#21915

Merged
poorbarcode merged 4 commits into
apache:masterfrom
hrzzzz:fix-deadlock
Jan 22, 2024
Merged

[fix][broker] Fix deadlock while skip non-recoverable ledgers.#21915
poorbarcode merged 4 commits into
apache:masterfrom
hrzzzz:fix-deadlock

Conversation

@hrzzzz

@hrzzzz hrzzzz commented Jan 18, 2024

Copy link
Copy Markdown
Contributor

Fixes #21914

Motivation

Resolved the deadlock issue that occurred when the org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#skipNonRecoverableLedger method and the org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#asyncDelete method were called concurrently.

Modifications

The reason for the deadlock is that the org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#skipNonRecoverableLedger method internally acquired a write lock with a large scope, seemingly just to check !individualDeletedMessages.contains(ledgerId, i). However, the org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#asyncDelete method actually performs this check again internally to ensure that already deleted Positions are not deleted repeatedly. Therefore, we can remove the write lock in the org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#skipNonRecoverableLedger method and omit the !individualDeletedMessages.contains(ledgerId, i) check, instead directly calling the org.apache.bookkeeper.mledger.impl.ManagedCursorImpl#asyncDelete method.

Verifying this change

  • Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: hrzzzz#4

@github-actions github-actions Bot added the doc-not-needed Your PR changes do not impact docs label Jan 18, 2024
@hrzzzz

hrzzzz commented Jan 18, 2024

Copy link
Copy Markdown
Contributor Author

@poorbarcode @codelipenghui PTAL, thanks

@Technoboy- Technoboy- added this to the 3.2.0 milestone Jan 21, 2024
@Technoboy- Technoboy- added release/3.0.3 release/3.1.3 release/blocker Indicate the PR or issue that should block the release until it gets resolved labels Jan 21, 2024
@Technoboy- Technoboy- requested a review from lhotari January 21, 2024 15:37

@lhotari lhotari left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to have the Iterable optimization proposed in a review comment.

@codecov-commenter

codecov-commenter commented Jan 21, 2024

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 73.62%. Comparing base (17bb322) to head (13bb9bb).
⚠️ Report is 1386 commits behind head on master.

Files with missing lines Patch % Lines
...che/bookkeeper/mledger/impl/ManagedCursorImpl.java 80.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #21915      +/-   ##
============================================
+ Coverage     73.59%   73.62%   +0.03%     
- Complexity    32417    32427      +10     
============================================
  Files          1861     1861              
  Lines        138678   138675       -3     
  Branches      15188    15185       -3     
============================================
+ Hits         102060   102106      +46     
+ Misses        28715    28682      -33     
+ Partials       7903     7887      -16     
Flag Coverage Δ
inttests 24.07% <0.00%> (-0.06%) ⬇️
systests 23.63% <0.00%> (-0.05%) ⬇️
unittests 72.92% <80.00%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...che/bookkeeper/mledger/impl/ManagedCursorImpl.java 79.37% <80.00%> (-0.14%) ⬇️

... and 74 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@poorbarcode poorbarcode merged commit 5a65e98 into apache:master Jan 22, 2024
Technoboy- added a commit that referenced this pull request Jan 22, 2024
### Motivation
The sequence of events leading to the deadlock when methods from org.apache.bookkeeper.mledger.impl.ManagedCursorImpl are invoked concurrently is as follows:

1. Thread A calls asyncDelete, which then goes on to internally call internalAsyncMarkDelete. This results in acquiring a lock on pendingMarkDeleteOps through synchronized (pendingMarkDeleteOps).

2. Inside internalAsyncMarkDelete, internalMarkDelete is called which subsequently calls persistPositionToLedger. At the start of persistPositionToLedger, buildIndividualDeletedMessageRanges is invoked, where it tries to acquire a read lock using lock.readLock().lock(). At this point, if the write lock is being held by another thread, Thread A will block waiting for the read lock.

3. Concurrently, Thread B executes skipNonRecoverableLedger which first obtains a write lock using lock.writeLock().lock() and then proceeds to call asyncDelete.

4. At this moment, Thread B already holds the write lock and is attempting to acquire the synchronized lock on pendingMarkDeleteOps that Thread A already holds, while Thread A is waiting for the read lock that Thread B needs to release.

In code, the deadlock appears as follows:

Thread A: synchronized (pendingMarkDeleteOps) -> lock.readLock().lock() (waiting)
Thread B: lock.writeLock().lock() -> synchronized (pendingMarkDeleteOps) (waiting)

### Modifications

Avoid using a long-range lock.

Co-authored-by: ruihongzhou <ruihongzhou@tencent.com>
Co-authored-by: Jiwe Guo <technoboy@apache.org>
Co-authored-by: Lari Hotari <lhotari@apache.org>
@hrzzzz hrzzzz deleted the fix-deadlock branch January 25, 2024 10:17
Technoboy- added a commit that referenced this pull request Jan 31, 2024
### Motivation
The sequence of events leading to the deadlock when methods from org.apache.bookkeeper.mledger.impl.ManagedCursorImpl are invoked concurrently is as follows:

1. Thread A calls asyncDelete, which then goes on to internally call internalAsyncMarkDelete. This results in acquiring a lock on pendingMarkDeleteOps through synchronized (pendingMarkDeleteOps).

2. Inside internalAsyncMarkDelete, internalMarkDelete is called which subsequently calls persistPositionToLedger. At the start of persistPositionToLedger, buildIndividualDeletedMessageRanges is invoked, where it tries to acquire a read lock using lock.readLock().lock(). At this point, if the write lock is being held by another thread, Thread A will block waiting for the read lock.

3. Concurrently, Thread B executes skipNonRecoverableLedger which first obtains a write lock using lock.writeLock().lock() and then proceeds to call asyncDelete.

4. At this moment, Thread B already holds the write lock and is attempting to acquire the synchronized lock on pendingMarkDeleteOps that Thread A already holds, while Thread A is waiting for the read lock that Thread B needs to release.

In code, the deadlock appears as follows:

Thread A: synchronized (pendingMarkDeleteOps) -> lock.readLock().lock() (waiting)
Thread B: lock.writeLock().lock() -> synchronized (pendingMarkDeleteOps) (waiting)

### Modifications

Avoid using a long-range lock.

Co-authored-by: ruihongzhou <ruihongzhou@tencent.com>
Co-authored-by: Jiwe Guo <technoboy@apache.org>
Co-authored-by: Lari Hotari <lhotari@apache.org>
Technoboy- added a commit that referenced this pull request Feb 5, 2024
### Motivation
The sequence of events leading to the deadlock when methods from org.apache.bookkeeper.mledger.impl.ManagedCursorImpl are invoked concurrently is as follows:

1. Thread A calls asyncDelete, which then goes on to internally call internalAsyncMarkDelete. This results in acquiring a lock on pendingMarkDeleteOps through synchronized (pendingMarkDeleteOps).

2. Inside internalAsyncMarkDelete, internalMarkDelete is called which subsequently calls persistPositionToLedger. At the start of persistPositionToLedger, buildIndividualDeletedMessageRanges is invoked, where it tries to acquire a read lock using lock.readLock().lock(). At this point, if the write lock is being held by another thread, Thread A will block waiting for the read lock.

3. Concurrently, Thread B executes skipNonRecoverableLedger which first obtains a write lock using lock.writeLock().lock() and then proceeds to call asyncDelete.

4. At this moment, Thread B already holds the write lock and is attempting to acquire the synchronized lock on pendingMarkDeleteOps that Thread A already holds, while Thread A is waiting for the read lock that Thread B needs to release.

In code, the deadlock appears as follows:

Thread A: synchronized (pendingMarkDeleteOps) -> lock.readLock().lock() (waiting)
Thread B: lock.writeLock().lock() -> synchronized (pendingMarkDeleteOps) (waiting)

### Modifications

Avoid using a long-range lock.

Co-authored-by: ruihongzhou <ruihongzhou@tencent.com>
Co-authored-by: Jiwe Guo <technoboy@apache.org>
Co-authored-by: Lari Hotari <lhotari@apache.org>
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Mar 1, 2024
…e#21915)

### Motivation
The sequence of events leading to the deadlock when methods from org.apache.bookkeeper.mledger.impl.ManagedCursorImpl are invoked concurrently is as follows:

1. Thread A calls asyncDelete, which then goes on to internally call internalAsyncMarkDelete. This results in acquiring a lock on pendingMarkDeleteOps through synchronized (pendingMarkDeleteOps).

2. Inside internalAsyncMarkDelete, internalMarkDelete is called which subsequently calls persistPositionToLedger. At the start of persistPositionToLedger, buildIndividualDeletedMessageRanges is invoked, where it tries to acquire a read lock using lock.readLock().lock(). At this point, if the write lock is being held by another thread, Thread A will block waiting for the read lock.

3. Concurrently, Thread B executes skipNonRecoverableLedger which first obtains a write lock using lock.writeLock().lock() and then proceeds to call asyncDelete.

4. At this moment, Thread B already holds the write lock and is attempting to acquire the synchronized lock on pendingMarkDeleteOps that Thread A already holds, while Thread A is waiting for the read lock that Thread B needs to release.

In code, the deadlock appears as follows:

Thread A: synchronized (pendingMarkDeleteOps) -> lock.readLock().lock() (waiting)
Thread B: lock.writeLock().lock() -> synchronized (pendingMarkDeleteOps) (waiting)

### Modifications

Avoid using a long-range lock.

Co-authored-by: ruihongzhou <ruihongzhou@tencent.com>
Co-authored-by: Jiwe Guo <technoboy@apache.org>
Co-authored-by: Lari Hotari <lhotari@apache.org>
(cherry picked from commit 37fc40c)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Mar 6, 2024
…e#21915)

### Motivation
The sequence of events leading to the deadlock when methods from org.apache.bookkeeper.mledger.impl.ManagedCursorImpl are invoked concurrently is as follows:

1. Thread A calls asyncDelete, which then goes on to internally call internalAsyncMarkDelete. This results in acquiring a lock on pendingMarkDeleteOps through synchronized (pendingMarkDeleteOps).

2. Inside internalAsyncMarkDelete, internalMarkDelete is called which subsequently calls persistPositionToLedger. At the start of persistPositionToLedger, buildIndividualDeletedMessageRanges is invoked, where it tries to acquire a read lock using lock.readLock().lock(). At this point, if the write lock is being held by another thread, Thread A will block waiting for the read lock.

3. Concurrently, Thread B executes skipNonRecoverableLedger which first obtains a write lock using lock.writeLock().lock() and then proceeds to call asyncDelete.

4. At this moment, Thread B already holds the write lock and is attempting to acquire the synchronized lock on pendingMarkDeleteOps that Thread A already holds, while Thread A is waiting for the read lock that Thread B needs to release.

In code, the deadlock appears as follows:

Thread A: synchronized (pendingMarkDeleteOps) -> lock.readLock().lock() (waiting)
Thread B: lock.writeLock().lock() -> synchronized (pendingMarkDeleteOps) (waiting)

### Modifications

Avoid using a long-range lock.

Co-authored-by: ruihongzhou <ruihongzhou@tencent.com>
Co-authored-by: Jiwe Guo <technoboy@apache.org>
Co-authored-by: Lari Hotari <lhotari@apache.org>
(cherry picked from commit 37fc40c)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-picked/branch-3.0 cherry-picked/branch-3.1 doc-not-needed Your PR changes do not impact docs ready-to-test release/blocker Indicate the PR or issue that should block the release until it gets resolved release/3.0.3 release/3.1.3

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Deadlock occurs when skipping non-recoverable ledger

5 participants