Skip to content

Conversation

@AuroraTwinkle
Copy link
Contributor

@AuroraTwinkle AuroraTwinkle commented Jun 5, 2025

Fixes #22934 (comment)

Main Issue: #22934 (comment)

PIP: #24394

Motivation

As #22934 and #22933 mentioned, when most of the nodes in serviceurl are down (but there is at least one available node), creating consumers and producers through PulsarClient will most likely fail. I think this is not as expected. If the code is robust enough, as long as there is one available node, it should be accessible normally. Therefore, this PR is going to optimize the code logic, remove unavailable nodes through the feedback mechanism, and improve the success rate of PulsarClient requests.

By the way, #22935 removes faulty nodes through a regular health check mechanism, but this brings new problems (frequent creation of connections and increased system load), so this solution is abandoned. See #22934 (comment) for more details!

Modifications

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:
AuroraTwinkle#8

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jun 5, 2025
@AuroraTwinkle AuroraTwinkle force-pushed the fix/serviceNameResolver branch 2 times, most recently from fbf6613 to e510a8c Compare June 6, 2025 06:36
@AuroraTwinkle AuroraTwinkle changed the title [fix][client]:fix problem that consumer or producer create failed when build PulsarClient with many unavailable broker nodes [fix][client]:fix problem that consumer or producer create failed when many nodes in PulsarClient serviceUrl become unavailable Jun 7, 2025
@AuroraTwinkle AuroraTwinkle changed the title [fix][client]:fix problem that consumer or producer create failed when many nodes in PulsarClient serviceUrl become unavailable [improve][client]:fix problem that consumer or producer create failed when many nodes in PulsarClient serviceUrl become unavailable Jun 7, 2025
@AuroraTwinkle AuroraTwinkle marked this pull request as ready for review June 7, 2025 10:02
@AuroraTwinkle AuroraTwinkle changed the title [improve][client]:fix problem that consumer or producer create failed when many nodes in PulsarClient serviceUrl become unavailable [improve][client]PIP-425:fix problem that consumer or producer create failed when many nodes in PulsarClient serviceUrl become unavailable Jun 7, 2025
@AuroraTwinkle AuroraTwinkle changed the title [improve][client]PIP-425:fix problem that consumer or producer create failed when many nodes in PulsarClient serviceUrl become unavailable [improve][client]PIP-425:Support connecting with next available endpoint for multi-endpoint serviceUrls Jun 9, 2025
liangyepianzhou pushed a commit that referenced this pull request Jun 25, 2025
…nt for multi-endpoint serviceUrls (#24394)

Fixes #22934 (comment)

Main Issue: #22934 (comment)

Implementation: #24387

### Motivation
As #22934 and #22933 mentioned, when most of the nodes in serviceurl are down (but there is at least one available node), creating consumers and producers through PulsarClient will most likely fail. I think this is not as expected. If the code is robust enough, as long as there is one available node, it should be accessible normally. Therefore, this pip is going to optimize the logic, remove unavailable nodes through the feedback mechanism, and improve the success rate of PulsarClient requests.

By the way, #22935 removes faulty nodes through a regular health check mechanism, but this brings new problems (frequent creation of connections and increased system load), so this solution is abandoned. See #22934 (comment) for more details!
@AuroraTwinkle AuroraTwinkle force-pushed the fix/serviceNameResolver branch 5 times, most recently from 72f52aa to 5d408c4 Compare June 26, 2025 11:04
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves the Pulsar client’s handling of multi-endpoint service URLs by implementing endpoint quarantine and host availability tracking, ensuring higher connection success when some nodes are unavailable.

  • Introduces quarantine logic with configurable initial and maximum durations in the ServiceNameResolver.
  • Adjusts connection acquisition in multiple components (HTTP client, connection pool, lookup service) to mark host availability.
  • Updates tests and client builder APIs to verify and configure these new quarantine behaviors.

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.

Show a summary per file
File Description
pulsar-client/src/test/java/org/apache/pulsar/client/impl/PulsarServiceNameResolverTest.java Added tests to verify host quarantine and recovery behavior.
pulsar-client/src/test/java/org/apache/pulsar/client/impl/ClientTestFixtures.java Updated connection mocks to support the new resolver API.
pulsar-client/src/test/java/org/apache/pulsar/client/impl/BinaryProtoLookupServiceTest.java Adjusted tests for resolver changes and added additional connection pathways.
pulsar-client/src/main/java/org/apache/pulsar/client/impl/conf/ClientConfigurationData.java Introduced new configuration properties for quarantine durations.
pulsar-client/src/main/java/org/apache/pulsar/client/impl/ServiceNameResolver.java Added default method for marking host availability.
pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarServiceNameResolver.java Integrated quarantine logic, host tracking, and synchronized service URL updates.
pulsar-client/src/main/java/org/apache/pulsar/client/impl/HttpClient.java Integrated host availability notifications during HTTP failures.
pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConnectionPool.java Added a getConnection method that leverages the new resolver interface.
pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientBuilderImpl.java and pulsar-client-api/src/main/java/org/apache/pulsar/client/api/ClientBuilder.java Exposed client builder options for configuring quarantine durations.
pulsar-broker/src/test/java/org/apache/pulsar/client/impl/CreateConsumerProducerTest.java Added extensive integration tests to validate unhealthy endpoint removal and reconnection behavior.
Comments suppressed due to low confidence (1)

pulsar-broker/src/test/java/org/apache/pulsar/client/impl/CreateConsumerProducerTest.java:238

  • [nitpick] The method name 'creatConsumerAndProducers' appears to contain a typo. Consider renaming it to 'createConsumerAndProducers' for clarity.
    private int creatConsumerAndProducers(PulsarClientImpl pulsarClient, int createCount, String topic) {

@AuroraTwinkle
Copy link
Contributor Author

The CI tests have passed in the fork repository AuroraTwinkle#8

@codelipenghui codelipenghui added this to the 4.1.0 milestone Jun 28, 2025
@codelipenghui codelipenghui added the type/feature The PR added a new feature or issue requested a new feature label Jun 28, 2025
Copy link
Member

@nodece nodece left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I've completed a partial review of the code.

@AuroraTwinkle AuroraTwinkle force-pushed the fix/serviceNameResolver branch from ed18237 to 26ce95e Compare July 14, 2025 09:31
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 87.27273% with 14 lines in your changes missing coverage. Please review.

Project coverage is 74.34%. Comparing base (bbc6224) to head (26ce95e).
Report is 1194 commits behind head on master.

Files with missing lines Patch % Lines
.../pulsar/client/impl/PulsarServiceNameResolver.java 90.00% 4 Missing and 4 partials ⚠️
...g/apache/pulsar/client/impl/ClientBuilderImpl.java 66.66% 0 Missing and 2 partials ⚠️
...java/org/apache/pulsar/client/impl/HttpClient.java 75.00% 2 Missing ⚠️
...e/pulsar/client/impl/BinaryProtoLookupService.java 90.00% 1 Missing ⚠️
...apache/pulsar/client/impl/ServiceNameResolver.java 0.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #24387      +/-   ##
============================================
+ Coverage     73.57%   74.34%   +0.77%     
+ Complexity    32624    32545      -79     
============================================
  Files          1877     1869       -8     
  Lines        139502   146034    +6532     
  Branches      15299    16747    +1448     
============================================
+ Hits         102638   108569    +5931     
+ Misses        28908    28880      -28     
- Partials       7956     8585     +629     
Flag Coverage Δ
inttests 26.72% <57.27%> (+2.14%) ⬆️
systests 23.33% <40.00%> (-0.99%) ⬇️
unittests 73.83% <87.27%> (+0.99%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
.../org/apache/pulsar/client/impl/ConnectionPool.java 76.63% <100.00%> (+2.11%) ⬆️
...lsar/client/impl/conf/ClientConfigurationData.java 93.33% <ø> (-3.37%) ⬇️
...e/pulsar/client/impl/BinaryProtoLookupService.java 81.25% <90.00%> (-1.29%) ⬇️
...apache/pulsar/client/impl/ServiceNameResolver.java 0.00% <0.00%> (ø)
...g/apache/pulsar/client/impl/ClientBuilderImpl.java 85.63% <66.66%> (-0.36%) ⬇️
...java/org/apache/pulsar/client/impl/HttpClient.java 81.15% <75.00%> (+2.96%) ⬆️
.../pulsar/client/impl/PulsarServiceNameResolver.java 89.65% <90.00%> (-0.09%) ⬇️

... and 1094 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

@liangyepianzhou liangyepianzhou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nodece nodece merged commit d6071c7 into apache:master Jul 17, 2025
96 of 98 checks passed
KannarFr pushed a commit to CleverCloud/pulsar that referenced this pull request Sep 22, 2025
…nt for multi-endpoint serviceUrls (apache#24394)

Fixes apache#22934 (comment)

Main Issue: apache#22934 (comment)

Implementation: apache#24387

### Motivation
As apache#22934 and apache#22933 mentioned, when most of the nodes in serviceurl are down (but there is at least one available node), creating consumers and producers through PulsarClient will most likely fail. I think this is not as expected. If the code is robust enough, as long as there is one available node, it should be accessible normally. Therefore, this pip is going to optimize the logic, remove unavailable nodes through the feedback mechanism, and improve the success rate of PulsarClient requests.

By the way, apache#22935 removes faulty nodes through a regular health check mechanism, but this brings new problems (frequent creation of connections and increased system load), so this solution is abandoned. See apache#22934 (comment) for more details!
KannarFr pushed a commit to CleverCloud/pulsar that referenced this pull request Sep 22, 2025
…int for multi-endpoint serviceUrls (apache#24387)

Co-authored-by: duanlinlin <duanlinllin@xiaohongshu.com>
Co-authored-by: Zixuan Liu <nodeces@gmail.com>
walkinggo pushed a commit to walkinggo/pulsar that referenced this pull request Oct 8, 2025
…nt for multi-endpoint serviceUrls (apache#24394)

Fixes apache#22934 (comment)

Main Issue: apache#22934 (comment)

Implementation: apache#24387

### Motivation
As apache#22934 and apache#22933 mentioned, when most of the nodes in serviceurl are down (but there is at least one available node), creating consumers and producers through PulsarClient will most likely fail. I think this is not as expected. If the code is robust enough, as long as there is one available node, it should be accessible normally. Therefore, this pip is going to optimize the logic, remove unavailable nodes through the feedback mechanism, and improve the success rate of PulsarClient requests.

By the way, apache#22935 removes faulty nodes through a regular health check mechanism, but this brings new problems (frequent creation of connections and increased system load), so this solution is abandoned. See apache#22934 (comment) for more details!
walkinggo pushed a commit to walkinggo/pulsar that referenced this pull request Oct 8, 2025
…int for multi-endpoint serviceUrls (apache#24387)

Co-authored-by: duanlinlin <duanlinllin@xiaohongshu.com>
Co-authored-by: Zixuan Liu <nodeces@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/client doc-not-needed Your PR changes do not impact docs type/feature The PR added a new feature or issue requested a new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] consumer or producer will create failed frequently when build PulsarClient with many unavailable broker nodes

6 participants