Skip to content

Conversation

@BewareMyPower
Copy link
Contributor

Motivation

There are several methods that get a connection from the ConnectionPool.

  1. ConnectionPool#getConnection(ServiceNameResolver)
    It's only used in BinaryProtoLookupService. The callbacks are all executed in PulsarClientImpl#lookupExecutorProvider: a single thread executor whose thread name starts with pulsar-client-lookup.
  2. ConnectionPool#getConnection(InetSocketAddress)
    It's called by the 1st method directly. Besides, it's only called by BinaryProtoLookupService#findBroker, which also uses PulsarClientImpl#lookupExecutorProvider to execute the callback.
  3. ConnectionPool#getConnection(InetSocketAddress logicalAddress, InetSocketAddress physicalAddress, int randomKey)
    It's called by the 2nd method directly. Besides, it's only called by the 4th method.
  4. PulsarClientImpl#getConnection(InetSocketAddress, InetSocketAddress, int)
    It's called in ConnectionHandler#grabCnx to establish a connection between broker and client (producer, consumer or reader). The callback calls connectionOpened or handleConnectionError without switching to another executor.
  5. Other methods in PulsarClientImpl
    Including:
    • getConnectionToServiceUrl
    • getConnection(String, int)
    • getConnection(String, String)
      They all call the 4th method in the callback of LookupService#getBroker and only used in grabCnx.

To solve a race condition caused by the fact that socket is closed in Netty's I/O thread while connectionOpened that sends the command is executed in another thread, #23499 completes the future of ConnectionPool#getConnection in Netty's' I/O thread as well. However, this adds an additional thread switching for all usages in method 1 above, which is not necessary.

I found this issue when I found a producer creation was blocked forever due to a deadlock in sendAsync's callback, which is executed in Netty's I/O thread. When checking the heap dump, I found ClientCnx#pendingRequests was empty, which means client.getCnxPool().getConnection(socketAddress) never complete, see

client.getCnxPool().getConnection(socketAddress).thenAcceptAsync(clientCnx -> {

Though even if without the change, other response processing will still be blocked because the I/O thread is blocked. It's very confusing when reviewing the heap dump:

  • BinaryProtoLookupService#partitionedMetadataInProgress is not empty
  • ClientCnx#pendingRequests is empty (the only connection in the pool)
image image

Actually, the root cause of the issue described in #23499 is that the StacklessClosedChannelException is treated as an exception cannot be retried. However, all network exceptions should be retried. Hence, this PR proposed a different solution to retry for such errors. Technically, only a few known exceptions should be treated as not retriable, e.g. AuthorizationException. Other known or unknown exceptions should be retried.

It's not guaranteed that writeAndFlush will always succeed. For example, if the code reaches here:

future.completeExceptionally(writeFuture.cause());

The future of ClientCnx#sendRequestWithId could fail with an exception that is not PulsarClientException.

Modifications

Revert the change in #23499 and retry for retriable exception even if it's not PulsarClientException. Improve SimpleProduceConsumeIoTest to cover consumer creation as well. Since this test only covers a very limited case, add testUnknownRpcExceptionFor* tests that inject failure on the 1st writeAndFlush in connectionOpened.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Aug 4, 2025
@BewareMyPower BewareMyPower force-pushed the bewaremypower/get-connection-deadlock branch from 0b3ab88 to 904a71f Compare August 4, 2025 12:43
@BewareMyPower BewareMyPower self-assigned this Aug 4, 2025
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@codecov-commenter
Copy link

codecov-commenter commented Aug 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.32%. Comparing base (d272825) to head (904a71f).
⚠️ Report is 10 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #24599      +/-   ##
============================================
+ Coverage     74.21%   74.32%   +0.11%     
- Complexity    33142    33149       +7     
============================================
  Files          1881     1881              
  Lines        146770   146770              
  Branches      16859    16857       -2     
============================================
+ Hits         108922   109084     +162     
+ Misses        29181    29028     -153     
+ Partials       8667     8658       -9     
Flag Coverage Δ
inttests 26.63% <50.00%> (?)
systests 23.28% <50.00%> (-0.08%) ⬇️
unittests 73.80% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
.../org/apache/pulsar/client/impl/ConnectionPool.java 75.70% <100.00%> (-0.47%) ⬇️
...va/org/apache/pulsar/client/impl/ConsumerImpl.java 79.87% <100.00%> (+0.68%) ⬆️
...va/org/apache/pulsar/client/impl/ProducerImpl.java 84.16% <100.00%> (ø)
...g/apache/pulsar/common/protocol/PulsarHandler.java 73.01% <ø> (+6.34%) ⬆️

... and 100 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@BewareMyPower BewareMyPower merged commit 6f992bd into apache:master Aug 5, 2025
69 of 74 checks passed
@BewareMyPower BewareMyPower deleted the bewaremypower/get-connection-deadlock branch August 5, 2025 03:17
lhotari pushed a commit that referenced this pull request Aug 5, 2025
lhotari pushed a commit that referenced this pull request Aug 5, 2025
lhotari pushed a commit that referenced this pull request Aug 5, 2025
nodece pushed a commit to ascentstream/pulsar that referenced this pull request Aug 6, 2025
gaozhangmin pushed a commit to gaozhangmin/pulsar that referenced this pull request Aug 13, 2025
poorbarcode pushed a commit to poorbarcode/pulsar that referenced this pull request Aug 14, 2025
ganesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Aug 20, 2025
…r consumer (apache#24599)

(cherry picked from commit 6f992bd)
(cherry picked from commit c16eb2c)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Aug 20, 2025
…r consumer (apache#24599)

(cherry picked from commit 6f992bd)
(cherry picked from commit c16eb2c)
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Aug 20, 2025
…r consumer (apache#24599)

(cherry picked from commit 6f992bd)
(cherry picked from commit c3f4f6a)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Aug 26, 2025
…r consumer (apache#24599)

(cherry picked from commit 6f992bd)
(cherry picked from commit c3f4f6a)
@lhotari lhotari added this to the 4.1.0 milestone Sep 17, 2025
KannarFr pushed a commit to CleverCloud/pulsar that referenced this pull request Sep 22, 2025
walkinggo pushed a commit to walkinggo/pulsar that referenced this pull request Oct 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants