Skip to content

[fix][broker] Fix deadlock in Key_Shared PIP-379 implementation#23854

Merged
lhotari merged 9 commits into
apache:masterfrom
lhotari:lh-fix-key_shared-deadlock
Jan 16, 2025
Merged

[fix][broker] Fix deadlock in Key_Shared PIP-379 implementation#23854
lhotari merged 9 commits into
apache:masterfrom
lhotari:lh-fix-key_shared-deadlock

Conversation

@lhotari

@lhotari lhotari commented Jan 15, 2025

Copy link
Copy Markdown
Member

Fixes #23848

Motivation

The Pulsar 4.0 Key_Shared PIP-379 implementation could deadlock and this problem was first captured in test environment heapdump where the thread stack information included in the heapdump showed multiple threads in blocked state, hinting a deadlock.
With the help of @pdolif, the problem was reproduced in isolation by simply adding this type of test code in a KeySharedSubscriptionTest test case.

// test code that triggered the deadlock issue
new Thread(() -> {
    while(true) { pulsar.getBrokerService().updateRates(); }
}).start();

In the test case, this triggered a similar pattern of blocked threads and therefore could be considered a reproduction of the previously faced dead lock issue in a test environment.

threaddump with deadlock:

Modifications

  • add test case to reproduce the dead lock case which will prevent future regressions in the same area
  • fix the dead lock issue by replacing synchronization with read-write locks
    • locks are for minimum "scope"
    • callbacks are called outside of a lock scope to reduce the chances of potential other deadlocks
  • use volatile fields for counters in DrainingHashEntry so that entry methods can be used without separate locks

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@lhotari lhotari added this to the 4.1.0 milestone Jan 15, 2025
@lhotari lhotari self-assigned this Jan 15, 2025
@github-actions github-actions Bot added the doc-not-needed Your PR changes do not impact docs label Jan 15, 2025
@lhotari lhotari requested a review from heesung-sohn January 15, 2025 22:15
@lhotari lhotari requested a review from heesung-sohn January 16, 2025 07:14
@lhotari lhotari merged commit 8f04945 into apache:master Jan 16, 2025
lhotari added a commit that referenced this pull request Jan 16, 2025
poorbarcode pushed a commit to poorbarcode/pulsar that referenced this pull request Jan 23, 2025
singhvishalkr added a commit to singhvishalkr/pulsar that referenced this pull request Apr 21, 2026
…y to DEBUG

The `log.warn()` call in `DrainingHashesTracker.ConsumerDrainingHashesStats
#updateConsumerStats` fires whenever the hash-iterator observes an entry that
was already removed by a concurrent write path. In the original PIP-379
design the draining-hash stats were strongly consistent with the tracker
state, so a missing entry on the read path was genuinely unexpected.

That strong-consistency guarantee was intentionally dropped in apache#23854 to
resolve a deadlock: the stats path is now allowed to race with entry
removal. A missing entry is therefore a benign, expected outcome during
normal operation, not a signal worth alerting on.

Logging this at `WARN` floods broker logs without a corresponding
actionable condition. Drop it to `DEBUG` and add an inline comment
explaining the race so future readers understand why this is deliberately
quiet.

Fixes apache#25277
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Deadlock in DrainingHashesTracker due to synchronized methods

2 participants