-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[fix][broker] Fix ManagedCursor state management race conditions and lifecycle issues #24569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix][broker] Fix ManagedCursor state management race conditions and lifecycle issues #24569
Conversation
…agement for pending reads
|
I learned through some failing tests that there are some really hairy details of the |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #24569 +/- ##
============================================
+ Coverage 73.57% 74.32% +0.74%
- Complexity 32624 32674 +50
============================================
Files 1877 1880 +3
Lines 139502 146446 +6944
Branches 15299 16791 +1492
============================================
+ Hits 102638 108844 +6206
- Misses 28908 28960 +52
- Partials 7956 8642 +686
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
BewareMyPower
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone through the whole PR, overall LGTM except for some coding style issues and some questions I've left. Please address these comments before merging.
P.S. The state machine is hard to read. It would be helpful to have a graph to show all possible state changes, but it's not required.
...broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentSubscription.java
Show resolved
Hide resolved
… PersistentSubscription.removeConsumer
Thank you @BewareMyPower. I've addressed your review comments. PTAL. |
...broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentSubscription.java
Show resolved
Hide resolved
managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java
Outdated
Show resolved
Hide resolved
managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java
Outdated
Show resolved
Hide resolved
BewareMyPower
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, left a comment about logging but it's not an important issue
managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java
Show resolved
Hide resolved
…Lock, registeredToWaitingCursors
@codelipenghui, your review comments have been address. I'm dismissing the "request changes" status so that we could proceed in unblocking Pulsar CI by merging this PR. Please review again.
…lifecycle issues (apache#24569) (cherry picked from commit c96f27a) (cherry picked from commit ec56ca5)
…lifecycle issues (apache#24569) (cherry picked from commit c96f27a) (cherry picked from commit c324228)
…lifecycle issues (apache#24569) (cherry picked from commit c96f27a) (cherry picked from commit ec56ca5)
…lifecycle issues (apache#24569) (cherry picked from commit c96f27a) (cherry picked from commit ec56ca5)
…lifecycle issues (apache#24569) (cherry picked from commit c96f27a)
…lifecycle issues (apache#24569) (cherry picked from commit c96f27a) (cherry picked from commit c324228)
Fixes #24566
Motivation
The test
ReplicatorTest.testCloseReplicatorStartProducerwas consistently failing due to critical issues introduced in PR #24551. Investigation (shared in #24566 (comment) by @3pacccccc ) revealed that the changes led to infinite recursive calls inManagedCursorImpl#cancelPendingReadRequest, causing broker instability and test failures.The root cause was improper state management for pending read operations, particularly when:
Modifications
This PR addresses the state management issues through several key changes:
Core Synchronization Fixes
Reverted problematic synchronization approach: Removed the complex
pendingReadOpMutexandDelayCheckForNewEntriesTaskimplementation added in [fix][broker] Fix Broker OOM due to too many waiting cursors and reuse a recycled OpReadEntry incorrectly #24551 that caused infinite recursion, returning to a more robust compare-and-swap based solution.Added operation ID tracking: Introduced an
idfield toOpReadEntryinstances to prevent recycled operations from being processed incorrectly. This ensures that only the correct operation instance is handled when checking for new entries.Improved atomic state transitions: Enhanced the CAS (Compare-And-Swap) operations using
getAndUpdate()with lambda functions that verify both the operation instance and its ID before making state changes:Enhanced State Management System
Comprehensive state model enhancement: Significantly expanded the cursor
Stateenum with:Deleting,Deleted,DeletingFailedfor better deletion lifecycle trackingclosedStateboolean field andisClosed()method for more robust state checkingisDeletingOrDeleted()method for deletion state validationThread-safe state transition methods: Implemented atomic state transition utilities:
changeStateIfNotClosed()- prevents state changes on closed cursorschangeStateIfNotDeletingOrDeleted()- prevents state changes on deleting/deleted cursorstrySetStateToClosing()with atomic operationsCursor Lifecycle Management
Proper cursor closure handling: Added comprehensive cursor closure logic:
closeWaitingCursor()- handles normal cursor closure scenarioscancelWaitingCursorsWhenDeactivated()- handles cursor deactivation scenariosinternalCloseWaitingCursor()- unified internal implementation with exception supplier patternWAITING_READ_OP_FOR_CLOSED_CURSOR) to handle closed cursor scenariosEnhanced cursor deactivation: Updated
setInactive()to properly cancel waiting cursors when deactivated, preventing orphaned read operations.Robust deletion lifecycle: Improved
asyncDeleteCursorLedger()with:Exception Handling Improvements
New exception types: Added
CursorDeactivatedWaitCallbackExceptionfor better error propagation when cursors are deactivated without properly cancelling pending requests.Enhanced error propagation: Improved error handling throughout the cursor lifecycle to ensure proper cleanup and user notification.
Managed Ledger Integration
Enhanced cursor deletion in ManagedLedgerImpl: Improved
asyncDeleteCursor()with:Enhanced waiting cursor registration: Implemented thread-safe registration/deregistration of waiting cursors to prevent race conditions when adding or removing cursors from the waiting list, with improved debug logging.
Specialized Cursor Implementation Updates
Non-durable cursor improvements: Updated
NonDurableCursorImpl.asyncClose()to properly close waiting cursors and set inactive state.Read-only cursor improvements: Updated
ReadOnlyCursorImpl.asyncClose()with the same proper cleanup logic.Misc refactoring in ManagedCursorImpl
statefield was sometimes read directly and sometimes withSTATE_UPDATER.get. Similarly, it was sometimes set directly and mostly withSTATE_UPDATER.set. There's no need to useSTATE_UPDATER.getandSTATE_UPDATER.set. The volatilestatefield can be read and set directly withoutSTATE_UPDATER.STATE_UPDATERis only needed for CAS operations on thestatefield.Key Technical Improvements
cancelPendingReadRequest(), which is now resolvedVerifying this change
This change is already covered by existing tests, specifically:
ReplicatorTest.testCloseReplicatorStartProducer- verifies proper cleanup when replicator cursors are closedFailoverSubscriptionTest.testWaitingCursorsCountAfterSwitchingActiveConsumers- verifies that each cursor doesn't get added more than once into ManagedLedgerImpl's waiting cursors listThe modifications also include timeout adjustments for the test to account for the improved error handling paths.
Impact Assessment
This change affects:
The changes are backward compatible and do not affect public APIs or configuration defaults.
Documentation
docdoc-requireddoc-not-neededdoc-complete