Skip to content

Conversation

@poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented May 30, 2025

Motivation

Issue 1: deadlock when compaction and topic deletion execute concurrently

Reproduction steps of the deadlock( see also the new test testConcurrentCompactionAndTopicDelete )

  1. thread-compaction: create a raw reader.
  2. thread-topic-deleting: mark the topic as deleted and disconnect clients, including the compaction reader.
  3. thread-compaction: The raw reader attempts to reconnect due to the disconnection in step "2-1".
  4. thread-topic-deleting: unsubscribe all subscriptions that contain "__compaction".
  5. thread-topic-deleting: deleting "__compaction" cursor will wait for the in-progress compaction task.
  6. thread-compaction: the raw read can not connect successfully anymore because the topic was marked as "fenced && isDeleting".
  7. deadlock: "thread-topic-deleting" waiting for compaction task being finished, thread-compaction continuously reconnects.

Note:

  • If the thread-compaction is stuck when the first phase of compaction calling reader.readNextAsync() at step 6, the deadlock can be solved after {brokerServiceCompactionPhaseOneLoopTimeInSeconds) seconds
  • But regarding other scenarios, such as reader.getLastMessageIdAsync... the deadlock will persist forever.

You can reproduce the issue by the new test testConcurrentCompactionAndTopicDelete


Issue 2-1: The responded compostable future of Consumer.receiveAsync() never complete if the topic is deleted

Background:

setState(State.Failed);
closeConsumerTasks();
client.cleanupConsumer(this);
// Issue: it should call "failPendingReceive()" also

You can reproduce the issue by the new test testReceiveWillDoneAfterTopicDeleted

Issue 2-2: resource leak that contains a dead letter producer and a retry letter producer.

Regarding issue 2-1, the client also forgot to release the related dead letter producer and the retry letter producer, which also caused a resource leak. Since there are other issues that prevent the new test from passing, I will write a separate PR to add the test to cover this case, which also contains another fix.


Issue 3: Even though the client received a TopicNotFound error, it continuously retries.

Root cause: Pulsar client prevents retrying when it receives a TopicNotFound error from the Subscribe command, but the TopicNotFound can also be received from the process get broker connections of the target topic.

You can reproduce the issue by the new test testReceiveWillDoneAfterTopicDeleted


Modifications

  • Regarding issue 1 - a deadlock error, closes the RawReader if the compaction task received a ServiceNotReady error, which happens when a topic/subscription is fenced, or the topic/namespace is loading/unloading. Since the RawReader is an internal API, this change is safe.
  • Regarding issue 2 - consumer forgets to complete the pending read request when it stops reconnecting, calls consumer.close() instead of calling a part of the stats modifying, which is more complete.
  • Regarding issue 3 - continues to reconnect even though it received a TopicNotFound error, stops reconnecting after receiving an unrecovered error from grabbing new connections, and closes consumers/producers.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode added this to the 4.1.0 milestone May 30, 2025
@poorbarcode poorbarcode self-assigned this May 30, 2025
@poorbarcode poorbarcode added type/bug The PR fixed a bug or issue reported a bug release/3.0.13 release/4.0.6 release/3.3.8 labels May 30, 2025
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label May 30, 2025
@poorbarcode
Copy link
Contributor Author

/pulsarbot rerun-failure-checks

@codecov-commenter
Copy link

codecov-commenter commented May 30, 2025

Codecov Report

❌ Patch coverage is 81.53846% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.26%. Comparing base (bbc6224) to head (fa2f120).
⚠️ Report is 1256 commits behind head on master.

Files with missing lines Patch % Lines
...va/org/apache/pulsar/client/impl/ConsumerImpl.java 76.19% 3 Missing and 2 partials ⚠️
...va/org/apache/pulsar/client/impl/ProducerImpl.java 83.33% 1 Missing and 2 partials ⚠️
...a/org/apache/pulsar/client/impl/RawReaderImpl.java 88.88% 0 Missing and 1 partial ⚠️
...g/apache/pulsar/client/impl/ConnectionHandler.java 85.71% 1 Missing ⚠️
...rg/apache/pulsar/client/impl/TopicListWatcher.java 50.00% 1 Missing ⚠️
...ulsar/client/impl/TransactionMetaStoreHandler.java 50.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #24366      +/-   ##
============================================
+ Coverage     73.57%   74.26%   +0.68%     
- Complexity    32624    32692      +68     
============================================
  Files          1877     1867      -10     
  Lines        139502   145260    +5758     
  Branches      15299    16610    +1311     
============================================
+ Hits         102638   107876    +5238     
+ Misses        28908    28853      -55     
- Partials       7956     8531     +575     
Flag Coverage Δ
inttests 26.64% <44.61%> (+2.05%) ⬆️
systests 23.27% <9.23%> (-1.06%) ⬇️
unittests 73.76% <81.53%> (+0.91%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...rg/apache/pulsar/broker/service/AbstractTopic.java 88.04% <ø> (+0.06%) ⬆️
...sar/broker/service/persistent/PersistentTopic.java 80.24% <100.00%> (+1.78%) ⬆️
...n/java/org/apache/pulsar/client/api/RawReader.java 100.00% <100.00%> (ø)
...n/java/org/apache/pulsar/compaction/Compactor.java 92.30% <100.00%> (+11.53%) ⬆️
...pache/pulsar/client/api/PulsarClientException.java 64.40% <100.00%> (+1.35%) ⬆️
.../java/org/apache/pulsar/client/impl/ClientCnx.java 70.12% <100.00%> (-1.66%) ⬇️
...a/org/apache/pulsar/client/impl/RawReaderImpl.java 84.07% <88.88%> (-0.70%) ⬇️
...g/apache/pulsar/client/impl/ConnectionHandler.java 86.17% <85.71%> (-0.60%) ⬇️
...rg/apache/pulsar/client/impl/TopicListWatcher.java 65.69% <50.00%> (-2.17%) ⬇️
...ulsar/client/impl/TransactionMetaStoreHandler.java 67.66% <50.00%> (-0.24%) ⬇️
... and 2 more

... and 1076 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@lhotari
Copy link
Member

lhotari commented May 30, 2025

@poorbarcode Does this resolve #24148?

@lhotari
Copy link
Member

lhotari commented May 30, 2025

why does the current code (before this PR) handle the if (isClosingOrDeleting) { check inside the write lock?

lock.writeLock().lock();
try {
if (isClosingOrDeleting) {
log.warn("[{}] Topic is already being closed or deleted", topic);
return FutureUtil.failedFuture(new TopicFencedException("Topic is already fenced"));
}

@codelipenghui
Copy link
Contributor

  1. deadlock: "thread-topic-deleting" waiting for compaction task being finished, thread-compaction continuously reconnects.

Actually, we are safe to give up the reconnect from the compaction reader if the topic is deleting.

Now, the solution is to unfence the topic if there is ongoing compaction task. But it unfencing will also make producers/consumers from user side available to connect to this topic which should not be expected.

So, I think the correct fix should be find a way to expose the "Topic is deleting Exception" to Raw reader and the reader can optionally give up the retry.

@poorbarcode
Copy link
Contributor Author

@codelipenghui

Actually, we are safe to give up the reconnect from the compaction reader if the topic is deleting.
Now, the solution is to unfence the topic if there is ongoing compaction task. But it unfencing will also make producers/consumers from user side available to connect to this topic which should not be expected.
So, I think the correct fix should be find a way to expose the "Topic is deleting Exception" to Raw reader and the reader can optionally give up the retry.

The solution you mentioned is what PIP-387: Modify interface TopicCompactionService to support cancelling an in-progress compaction task, which has been canceled 😂

@gaoran10
Copy link
Contributor

gaoran10 commented Jun 3, 2025

We'd better make the compaction task aware of the topic deletion operation and cancel the task.

@poorbarcode poorbarcode requested a review from lhotari June 3, 2025 14:27
@poorbarcode
Copy link
Contributor Author

@codelipenghui @gaoran10 @lhotari I have used a new solution to solve the issue: cancel the pending compaction when deleting the topic, please review again

@poorbarcode poorbarcode requested a review from lhotari June 4, 2025 15:09
@poorbarcode poorbarcode requested a review from codelipenghui June 4, 2025 17:17
@poorbarcode poorbarcode merged commit 37e160f into apache:master Jun 5, 2025
53 checks passed
@lhotari
Copy link
Member

lhotari commented Jun 9, 2025

@poorbarcode Please handle cherry-picking to branch-3.0 and branch-3.3 since there are merge conflicts.

lhotari pushed a commit that referenced this pull request Jun 9, 2025
poorbarcode added a commit that referenced this pull request Jun 10, 2025
Date:   Thu Jun 5 10:02:53 2025 +0800

[fix][broker]Fix deadlock when compaction and topic deletion execute concurrently (#24366)

(cherry picked from commit 37e160f)
poorbarcode added a commit that referenced this pull request Jun 10, 2025
poorbarcode added a commit that referenced this pull request Jun 10, 2025
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 12, 2025
…concurrently (apache#24366)

(cherry picked from commit 37e160f)
(cherry picked from commit 92a1930)
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 12, 2025
…concurrently (apache#24366)

(cherry picked from commit 37e160f)
(cherry picked from commit cae803e)
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 12, 2025
…concurrently (apache#24366)

(cherry picked from commit 37e160f)
(cherry picked from commit 92a1930)
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 12, 2025
…concurrently (apache#24366)

(cherry picked from commit 37e160f)
(cherry picked from commit 92a1930)
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 13, 2025
…concurrently (apache#24366)

(cherry picked from commit 37e160f)
(cherry picked from commit 92a1930)
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 13, 2025
…concurrently (apache#24366)

(cherry picked from commit 37e160f)
(cherry picked from commit 92a1930)
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 13, 2025
…concurrently (apache#24366)

(cherry picked from commit 37e160f)
(cherry picked from commit cae803e)
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 14, 2025
…concurrently (apache#24366)

(cherry picked from commit 37e160f)
(cherry picked from commit cae803e)
(cherry picked from commit ec2037e)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 16, 2025
…concurrently (apache#24366)

(cherry picked from commit 37e160f)
(cherry picked from commit 92a1930)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 16, 2025
…concurrently (apache#24366)

(cherry picked from commit 37e160f)
(cherry picked from commit cae803e)
(cherry picked from commit ec2037e)
nodece pushed a commit to nodece/pulsar that referenced this pull request Jun 18, 2025
@lhotari
Copy link
Member

lhotari commented Jun 19, 2025

@poorbarcode Please handle cherry-picking to branch-3.0 and branch-3.3 since there are merge conflicts.

@poorbarcode branch-3.0 compilation fails in checkstyle.

[INFO] There are 5 errors reported by Checkstyle 10.14.2 with ../buildtools/src/main/resources/pulsar/checkstyle.xml ruleset.
[ERROR] src/test/java/org/apache/pulsar/client/impl/RawReaderTest.java:[53,8] (imports) UnusedImports: Unused import: org.apache.pulsar.client.api.SubscriptionInitialPosition.
[ERROR] src/test/java/org/apache/pulsar/client/impl/RawReaderTest.java:[54,8] (imports) UnusedImports: Unused import: org.apache.pulsar.client.api.SubscriptionType.
[ERROR] src/test/java/org/apache/pulsar/client/impl/RawReaderTest.java:[55,8] (imports) UnusedImports: Unused import: org.apache.pulsar.client.impl.conf.ConsumerConfigurationData.
[ERROR] src/test/java/org/apache/pulsar/client/impl/RawReaderTest.java:[56,8] (imports) UnusedImports: Unused import: org.apache.pulsar.common.api.proto.BrokerEntryMetadata.
[ERROR] src/test/java/org/apache/pulsar/client/impl/RawReaderTest.java:[70,15] (imports) UnusedImports: Unused import: org.apache.pulsar.client.impl.RawReaderImpl.DEFAULT_RECEIVER_QUEUE_SIZE.

After cherry-picking, please make sure that checkstyle passes and the project builds.
A quick way to validate is to run this command:

mvn -Pcore-modules,-main -T 1C clean install -DskipTests -Dspotbugs.skip=true -DnarPluginPhase=none

@lhotari
Copy link
Member

lhotari commented Jun 19, 2025

@poorbarcode Please handle cherry-picking to branch-3.0 and branch-3.3 since there are merge conflicts.

@poorbarcode branch-3.0 compilation fails in checkstyle.

[INFO] There are 5 errors reported by Checkstyle 10.14.2 with ../buildtools/src/main/resources/pulsar/checkstyle.xml ruleset.
[ERROR] src/test/java/org/apache/pulsar/client/impl/RawReaderTest.java:[53,8] (imports) UnusedImports: Unused import: org.apache.pulsar.client.api.SubscriptionInitialPosition.
[ERROR] src/test/java/org/apache/pulsar/client/impl/RawReaderTest.java:[54,8] (imports) UnusedImports: Unused import: org.apache.pulsar.client.api.SubscriptionType.
[ERROR] src/test/java/org/apache/pulsar/client/impl/RawReaderTest.java:[55,8] (imports) UnusedImports: Unused import: org.apache.pulsar.client.impl.conf.ConsumerConfigurationData.
[ERROR] src/test/java/org/apache/pulsar/client/impl/RawReaderTest.java:[56,8] (imports) UnusedImports: Unused import: org.apache.pulsar.common.api.proto.BrokerEntryMetadata.
[ERROR] src/test/java/org/apache/pulsar/client/impl/RawReaderTest.java:[70,15] (imports) UnusedImports: Unused import: org.apache.pulsar.client.impl.RawReaderImpl.DEFAULT_RECEIVER_QUEUE_SIZE.

After cherry-picking, please make sure that checkstyle passes and the project builds. A quick way to validate is to run this command:

mvn -Pcore-modules,-main -T 1C clean install -DskipTests -Dspotbugs.skip=true -DnarPluginPhase=none

@poorbarcode I fixed it in b92af789

KannarFr pushed a commit to CleverCloud/pulsar that referenced this pull request Sep 22, 2025
walkinggo pushed a commit to walkinggo/pulsar that referenced this pull request Oct 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants