[fix][client] Retry for unknown exceptions when creating a producer or consumer #24599
Merged: BewareMyPower merged 1 commit into apache:master from BewareMyPower:bewaremypower/get-connection-deadlock on Aug 5, 2025
0b3ab88 to
904a71f
lhotari (Member) approved these changes on Aug 4, 2025:
LGTM
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@             Coverage Diff              @@
##             master   #24599      +/-   ##
============================================
+ Coverage     74.21%   74.32%   +0.11%
- Complexity    33142    33149       +7
============================================
  Files          1881     1881
  Lines        146770   146770
  Branches      16859    16857       -2
============================================
+ Hits         108922   109084     +162
+ Misses        29181    29028     -153
+ Partials       8667     8658       -9
Technoboy- approved these changes on Aug 5, 2025
nodece pushed a commit to ascentstream/pulsar that referenced this pull request on Aug 6, 2025: …r consumer (apache#24599) (cherry picked from commit 6f992bd)
gaozhangmin pushed a commit to gaozhangmin/pulsar that referenced this pull request on Aug 13, 2025
poorbarcode pushed a commit to poorbarcode/pulsar that referenced this pull request on Aug 14, 2025
ganesh-ctds pushed a commit to datastax/pulsar that referenced this pull request on Aug 20, 2025: …r consumer (apache#24599) (cherry picked from commit 6f992bd) (cherry picked from commit c16eb2c)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request on Aug 20, 2025: …r consumer (apache#24599) (cherry picked from commit 6f992bd) (cherry picked from commit c16eb2c)
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request on Aug 20, 2025: …r consumer (apache#24599) (cherry picked from commit 6f992bd) (cherry picked from commit c3f4f6a)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request on Aug 26, 2025: …r consumer (apache#24599) (cherry picked from commit 6f992bd) (cherry picked from commit c3f4f6a)
KannarFr pushed a commit to CleverCloud/pulsar that referenced this pull request on Sep 22, 2025
walkinggo pushed a commit to walkinggo/pulsar that referenced this pull request on Oct 8, 2025
Labels
area/client
cherry-picked/branch-3.0
cherry-picked/branch-3.3
cherry-picked/branch-4.0
doc-not-needed (Your PR changes do not impact docs)
ready-to-test
release/3.0.14
release/3.3.9
release/4.0.7
type/bug (The PR fixed a bug or issue reported a bug)
Motivation
There are several methods that get a connection from the ConnectionPool.
1. ConnectionPool#getConnection(ServiceNameResolver): It's only used in BinaryProtoLookupService. The callbacks are all executed in PulsarClientImpl#lookupExecutorProvider: a single thread executor whose thread name starts with pulsar-client-lookup.
2. ConnectionPool#getConnection(InetSocketAddress): It's called by the 1st method directly. Besides, it's only called by BinaryProtoLookupService#findBroker, which also uses PulsarClientImpl#lookupExecutorProvider to execute the callback.
3. ConnectionPool#getConnection(InetSocketAddress logicalAddress, InetSocketAddress physicalAddress, int randomKey): It's called by the 2nd method directly. Besides, it's only called by the 4th method.
4. PulsarClientImpl#getConnection(InetSocketAddress, InetSocketAddress, int): It's called in ConnectionHandler#grabCnx to establish a connection between broker and client (producer, consumer or reader). The callback calls connectionOpened or handleConnectionError without switching to another executor.
5. PulsarClientImpl, including:
   - getConnectionToServiceUrl
   - getConnection(String, int)
   - getConnection(String, String)
   They all call the 4th method in the callback of LookupService#getBroker and are only used in grabCnx.

To solve a race condition caused by the fact that the socket is closed in Netty's I/O thread while connectionOpened, which sends the command, is executed in another thread, #23499 completes the future of ConnectionPool#getConnection in Netty's I/O thread as well. However, this adds an additional thread switch for all usages in method 1 above, which is not necessary.

I found this issue when a producer creation was blocked forever due to a deadlock in sendAsync's callback, which is executed in Netty's I/O thread. When checking the heap dump, I found ClientCnx#pendingRequests was empty, which means client.getCnxPool().getConnection(socketAddress) never completes, see
pulsar/pulsar-client/src/main/java/org/apache/pulsar/client/impl/BinaryProtoLookupService.java
Line 218 in 1e57827
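
The blocking pattern described above can be reproduced in isolation. The following is a minimal sketch, not Pulsar code: a single-thread executor stands in for Netty's I/O thread, and the class and method names are hypothetical. One task waits on a future whose completion is queued on the same thread, so the wait cannot succeed. A timeout is used here only so the example terminates; the PR does not state which blocking call the real callback made.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of the deadlock; none of these names exist in Pulsar.
class SingleThreadDeadlockSketch {

    /**
     * Returns true if a task running on the single "I/O" thread timed out while
     * waiting for a future that only the same thread could complete.
     */
    static boolean blocksWhenWaitingOnOwnThread(long timeoutMillis) throws Exception {
        // Stand-in for a Netty event loop: exactly one thread runs every task.
        ExecutorService ioThread = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<String> connection = new CompletableFuture<>();
            CompletableFuture<Boolean> timedOut = new CompletableFuture<>();

            // Task 1 models a callback that blocks on the I/O thread.
            ioThread.execute(() -> {
                try {
                    connection.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    timedOut.complete(false);
                } catch (TimeoutException e) {
                    timedOut.complete(true);
                } catch (Exception e) {
                    timedOut.completeExceptionally(e);
                }
            });
            // Task 2 models completing the connection future on the same thread.
            // It is queued behind task 1, so it cannot run until task 1 gives up.
            ioThread.execute(() -> connection.complete("cnx"));

            return timedOut.get(5, TimeUnit.SECONDS);
        } finally {
            ioThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // prints "blocked: true"
        System.out.println("blocked: " + blocksWhenWaitingOnOwnThread(200));
    }
}
```

Without the timeout, task 1 would wait forever and task 2 would never run, which is consistent with the empty ClientCnx#pendingRequests seen in the heap dump.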
Though even without the change, other response processing would still be blocked because the I/O thread is blocked. It's very confusing when reviewing the heap dump:
- BinaryProtoLookupService#partitionedMetadataInProgress is not empty
- ClientCnx#pendingRequests is empty (the only connection in the pool)

Actually, the root cause of the issue described in #23499 is that the StacklessClosedChannelException is treated as an exception that cannot be retried. However, all network exceptions should be retried. Hence, this PR proposes a different solution: retry for such errors. Technically, only a few known exceptions should be treated as not retriable, e.g. AuthorizationException. Other known or unknown exceptions should be retried.

It's not guaranteed that writeAndFlush will always succeed. For example, if the code reaches here:
pulsar/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java
Line 1062 in d272825
The future of ClientCnx#sendRequestWithId could fail with an exception that is not PulsarClientException.

Modifications
Revert the change in #23499 and retry for retriable exceptions even if they are not PulsarClientException. Improve SimpleProduceConsumeIoTest to cover consumer creation as well. Since this test only covers a very limited case, add testUnknownRpcExceptionFor* tests that inject a failure on the 1st writeAndFlush in connectionOpened.

Documentation
doc
doc-required
doc-not-needed
doc-complete

Matching PR in forked repository
PR in forked repository:
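
The retry rule stated under Motivation (only a few known exceptions are non-retriable; everything else, known or unknown, is retried) could be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: ClientException and AuthorizationException here are hypothetical stand-ins defined inside the example, and the unwrap helper is likewise made up for the sketch.

```java
import java.io.IOException;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;

// Hypothetical sketch of a denylist-based retry policy; not the Pulsar implementation.
class RetryPolicySketch {

    // Stand-ins for client exceptions; these are NOT the real Pulsar classes.
    static class ClientException extends Exception {
        ClientException(String msg) { super(msg); }
    }

    static class AuthorizationException extends ClientException {
        AuthorizationException(String msg) { super(msg); }
    }

    // Denylist: only these known exception types are treated as non-retriable.
    private static final Class<?>[] NON_RETRIABLE = { AuthorizationException.class };

    // Strip the wrappers that future-based APIs add around the real cause.
    static Throwable unwrap(Throwable t) {
        while ((t instanceof CompletionException || t instanceof ExecutionException)
                && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    // Anything not on the denylist is retried, including unknown exception types.
    static boolean isRetriable(Throwable t) {
        Throwable cause = unwrap(t);
        for (Class<?> type : NON_RETRIABLE) {
            if (type.isInstance(cause)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // prints "false": a known fatal error is not retried
        System.out.println(isRetriable(new AuthorizationException("denied")));
        // prints "true": a wrapped network error is retried
        System.out.println(isRetriable(new CompletionException(new IOException("closed"))));
        // prints "true": an unknown exception type is retried as well
        System.out.println(isRetriable(new IllegalStateException("unknown")));
    }
}
```

The design choice is the direction of the default: an allowlist of retriable types silently turns every unanticipated failure into a permanent one, whereas a denylist keeps the client retrying unless the error is known to be fatal.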