[improve][client]:Perform health checks on the endpoints that passed in by serviceUrl of PulsarClient #22935

AuroraTwinkle · 2024-06-18T09:16:39Z

Main Issue: #22934

Motivation

Refer to issue: #22934

Modifications

Verifying this change

Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end deployment with large payloads (10MB)
Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

Documentation

doc
doc-required
doc-not-needed
doc-complete

Matching PR in forked repository

PR in forked repository: AuroraTwinkle#4

github-actions · 2024-06-18T09:17:07Z

@AuroraTwinkle Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarServiceNameResolver.java

pulsar-client/src/test/java/org/apache/pulsar/client/impl/PulsarServiceNameResolverTest.java

… serviceUrl

AuroraTwinkle · 2025-06-05T03:43:29Z

@liangyepianzhou I have addressed all your comments. PTAL, thanks.

Copilot

Pull Request Overview

This PR implements a health check mechanism for endpoints passed through the serviceUrl in PulsarClient. Key changes include:

- Adding periodic health checks and removal of unhealthy endpoints in PulsarServiceNameResolver.
- Updating tests and client modules (HttpClient, BinaryProtoLookupService, AutoClusterFailover) to integrate and validate the new health check behavior.
- Introducing a Caffeine eviction listener to properly close resolvers when they expire.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| pulsar-client/src/test/java/org/apache/pulsar/client/impl/PulsarServiceNameResolverTest.java | Added a local server socket in tests to simulate endpoint health and verify removal of unreachable hosts. |
| pulsar-client/src/main/java/org/apache/pulsar/client/impl/ServiceNameResolver.java | Extended ServiceNameResolver with the Closeable interface and a default close method. |
| pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarServiceNameResolver.java | Introduced health check logic with scheduled periodic checks and updated endpoint lists. |
| pulsar-client/src/main/java/org/apache/pulsar/client/impl/HttpClient.java | Ensured the resolver is closed when the HttpClient is shut down. |
| pulsar-client/src/main/java/org/apache/pulsar/client/impl/BinaryProtoLookupService.java | Closed the resolver during cleanup. |
| pulsar-client/src/main/java/org/apache/pulsar/client/impl/AutoClusterFailover.java | Closed the resolver during failover cleanup. |
| pulsar-broker/src/test/java/org/apache/pulsar/client/api/CreateConsumerProducerTest.java | Added a new test to validate client creation in scenarios with unavailable broker nodes. |
| pulsar-broker/src/main/java/org/apache/pulsar/broker/web/PulsarWebResource.java | Integrated a Caffeine eviction listener to close the resolver on cache eviction. |
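
The "Closeable interface and a default close method" change listed for ServiceNameResolver can be illustrated with a small sketch. This is not the PR's code: `Resolver` and `CheckingResolver` are hypothetical, trimmed-down stand-ins, and the host string is made up. The point is only that a no-op default `close()` lets existing implementations compile unchanged while resource-owning ones override it.

```java
import java.io.Closeable;

final class CloseableResolverSketch {

    // Hypothetical, trimmed-down stand-in for a resolver interface.
    interface Resolver extends Closeable {
        String resolveHost();

        // Default no-op: implementations that hold no resources need no changes.
        // Dropping "throws IOException" is a legal narrowing of Closeable.close().
        @Override
        default void close() {
        }
    }

    // Hypothetical implementation that owns a background resource and must release it.
    static final class CheckingResolver implements Resolver {
        private volatile boolean closed;

        @Override
        public String resolveHost() {
            return "localhost:6650"; // illustrative value only
        }

        @Override
        public void close() {
            closed = true; // a real implementation would cancel its scheduled task here
        }

        boolean isClosed() {
            return closed;
        }
    }

    public static void main(String[] args) {
        CheckingResolver resolver = new CheckingResolver();
        try (Resolver r = resolver) {
            System.out.println(r.resolveHost());
        }
        System.out.println("closed: " + resolver.isClosed());
    }
}
```

Callers such as the ones listed in the table would then only need to invoke `close()` during their own shutdown, regardless of which implementation they hold.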

Comments suppressed due to low confidence (1)

pulsar-client/src/test/java/org/apache/pulsar/client/impl/PulsarServiceNameResolverTest.java:59

The accept loop in the anonymously spawned thread does not check whether the serverSocket has been closed, which can lead to continuous exception logging. Consider adding a condition to exit the loop when the serverSocket is closed.

    new Thread(() -> {
        while (true) {
            try {
                serverSocket.accept();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }).start();
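
One possible way to follow the suggestion above is to tie the loop condition to the socket state. The sketch below is an assumption about how the test helper could look, not code from the PR; the class and method names are illustrative. Unlike the snippet above, it also closes each accepted connection.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

final class AcceptLoopSketch {

    // Starts a daemon thread that accepts connections until the server socket is closed.
    static Thread startAcceptLoop(ServerSocket serverSocket) {
        Thread thread = new Thread(() -> {
            while (!serverSocket.isClosed()) {
                try (Socket ignored = serverSocket.accept()) {
                    // The accepted connection is closed right away; the test only
                    // needs the port to be reachable.
                } catch (IOException e) {
                    // accept() throws once the socket is closed; only log unexpected errors.
                    if (!serverSocket.isClosed()) {
                        e.printStackTrace();
                    }
                }
            }
        });
        thread.setDaemon(true);
        thread.start();
        return thread;
    }

    public static void main(String[] args) throws Exception {
        ServerSocket serverSocket = new ServerSocket(0); // port 0: let the OS pick a free port
        Thread acceptor = startAcceptLoop(serverSocket);
        serverSocket.close();
        acceptor.join(5000);
        System.out.println("acceptor alive after close: " + acceptor.isAlive());
    }
}
```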

pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarServiceNameResolver.java

…arServiceNameResolver.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

lhotari · 2025-06-05T06:57:58Z

pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarServiceNameResolver.java

 @Slf4j
 public class PulsarServiceNameResolver implements ServiceNameResolver {
-
+    private static final int HEALTH_CHECK_TIMEOUT_MS = 5000;


If this feature gets added, the timeout should be configurable.

lhotari · 2025-06-05T06:58:34Z

pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarServiceNameResolver.java

+                if (!healthCheckScheduled.get()) {
+                    ScheduledFuture<?> future =
+                            ((ScheduledExecutorService) executorProvider.getExecutor()).scheduleWithFixedDelay(
+                                    this::doHealthCheck, 0, 5, TimeUnit.SECONDS);


The interval should be configurable. If the interval is 0, there should be no health checks at all.

OK, I will write a new PIP to add the health check interval and timeout parameters.
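
The request in this thread (a configurable interval, with 0 meaning "no health checks") could look roughly like the following. This is a minimal sketch under assumptions: `scheduleHealthCheck` and the `intervalMs` parameter are hypothetical names, and nothing here reflects an actual Pulsar configuration option. A configurable timeout would be passed through in the same way.

```java
import java.util.Optional;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class HealthCheckSchedulingSketch {

    // Schedules the check only when the interval is positive; 0 (or less) disables it.
    static Optional<ScheduledFuture<?>> scheduleHealthCheck(
            ScheduledExecutorService executor, Runnable healthCheck, long intervalMs) {
        if (intervalMs <= 0) {
            return Optional.empty();
        }
        return Optional.of(
                executor.scheduleWithFixedDelay(healthCheck, 0, intervalMs, TimeUnit.MILLISECONDS));
    }

    public static void main(String[] args) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        try {
            System.out.println("interval 0 scheduled: "
                    + scheduleHealthCheck(executor, runs::incrementAndGet, 0).isPresent());
            System.out.println("interval 5000 scheduled: "
                    + scheduleHealthCheck(executor, runs::incrementAndGet, 5000).isPresent());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Returning the future (when there is one) lets the owner cancel the task when the resolver is closed.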

lhotari · 2025-06-05T07:02:19Z

pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarServiceNameResolver.java

+    private static boolean checkAddress(InetSocketAddress address) {
+        try (Socket socket = new Socket()) {
+            socket.connect(new InetSocketAddress(address.getHostName(), address.getPort()), HEALTH_CHECK_TIMEOUT_MS);
+            return true;
+        } catch (Exception e) {
+            log.error("Health check error, failed to connect to {}, error:{}", address, e.getMessage());
+            return false;
+        }
+    }


In the Pulsar code base, Java's Socket API is avoided; the Netty API is used instead. Another detail is that Netty's DNS resolver is used, and it has its own DNS cache. Making a test connection where the address is resolved using the same DNS resolver cache would be more useful.
In this case, the Netty API could be used synchronously, so there wouldn't be a need to switch the health checks to a fully async style.

The Netty DNS resolver (AddressResolver<InetSocketAddress>) for the client is created in the ConnectionPool class. It would be necessary to use the same instance so that the client doesn't end up with two separate DNS caches and two different DNS configurations.

Ok, I will fix it.

lhotari

The general design is problematic since the health checking will keep running and creating TCP/IP connections that are immediately closed. This will cause additional load in the overall system, including the endpoints (proxies / brokers). Additionally, opening and closing a TCP/IP connection keeps the local port occupied in the TIME_WAIT state for some time (2*MSL, 60s-240s depending on the OS and its configuration). SO_REUSEADDR/SO_REUSEPORT doesn't prevent port occupation since it doesn't help bypass TIME_WAIT restrictions for outbound client connections to the same 4-tuple (local ip, local port, remote ip, remote port).

Before actually implementing this health check feature, it would be necessary to describe the issue that is currently caused by the lack of health checks, and to primarily address that issue instead of implementing the solution in this PR.
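
To put the TIME_WAIT concern in rough numbers, here is a back-of-the-envelope estimate. It assumes each check opens and closes exactly one connection per endpoint, that the client closes first (so the client side holds the TIME_WAIT entry), and a hypothetical serviceUrl with three hosts. The 5-second interval and the 60s-240s range come from the discussion above; everything else is an assumption.

```java
final class TimeWaitEstimate {

    // Approximate number of client sockets lingering in TIME_WAIT at steady state:
    // every interval adds one socket per endpoint, and each lingers for timeWaitSeconds.
    static long socketsInTimeWait(int endpoints, long intervalSeconds, long timeWaitSeconds) {
        if (intervalSeconds <= 0) {
            return 0; // health checks disabled, no extra sockets
        }
        return endpoints * (timeWaitSeconds / intervalSeconds);
    }

    public static void main(String[] args) {
        int endpoints = 3; // hypothetical serviceUrl with three hosts
        System.out.println(socketsInTimeWait(endpoints, 5, 60));  // 3 * 12 = 36
        System.out.println(socketsInTimeWait(endpoints, 5, 240)); // 3 * 48 = 144
    }
}
```

The figures are per client instance, so they scale with the number of clients running on the same machine.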

lhotari · 2025-06-05T10:46:35Z

Replied in #22934 (comment) about a better way to solve the actual problem.

lhotari

Please read #22934 (comment). A different type of solution would solve the problem in a better way, since this health checking solution causes its own problems and overhead, as explained in previous comments.

AuroraTwinkle · 2025-06-05T14:28:55Z

The general design is problematic since the health checking will keep running and creating TCP/IP connections that are immediately closed. This will cause additional load in the overall system, including the endpoints (proxies / brokers). Additionally, opening and closing a TCP/IP connection keeps the local port occupied in the TIME_WAIT state for some time (2*MSL, 60s-240s depending on the OS and its configuration). SO_REUSEADDR/SO_REUSEPORT doesn't prevent port occupation since it doesn't help bypass TIME_WAIT restrictions for outbound client connections to the same 4-tuple (local ip, local port, remote ip, remote port).

Before actually implementing this health check feature, it would be necessary to describe the issue that is currently caused by the lack of health checks, and to primarily address that issue instead of implementing the solution in this PR.

OK, I will start a new PR for the better solution mentioned in #22934 (comment), and I will close the current PR.

…nt for multi-endpoint serviceUrls (#24394)

Fixes #22934 (comment)
Main Issue: #22934 (comment)
Implementation: #24387

### Motivation

As #22934 and #22933 mentioned, when most of the nodes in the serviceUrl are down (but at least one node is still available), creating consumers and producers through PulsarClient will most likely fail. I think this is not the expected behavior. If the code is robust enough, it should be accessible normally as long as one node is available. Therefore, this PIP optimizes the logic: it removes unavailable nodes through a feedback mechanism and improves the success rate of PulsarClient requests.

By the way, #22935 removes faulty nodes through a regular health check mechanism, but this brings new problems (frequent creation of connections and increased system load), so that solution was abandoned. See #22934 (comment) for more details!
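
The feedback idea described in the Motivation above, as opposed to periodic probing, can be sketched as follows. This is an illustration only and not the PIP-425 implementation: `FeedbackResolverSketch`, `reportResult`, and the host strings are hypothetical. Connection attempts that the client makes anyway report their outcome, and the resolver skips hosts that recently failed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class FeedbackResolverSketch {
    private final List<String> allHosts;
    private final Set<String> unavailable = new HashSet<>();
    private int next = 0;

    FeedbackResolverSketch(List<String> hosts) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("at least one host is required");
        }
        this.allHosts = new ArrayList<>(hosts);
    }

    // Feedback signal: called by the connection layer after each real connection attempt.
    synchronized void reportResult(String host, boolean success) {
        if (success) {
            unavailable.remove(host);
        } else {
            unavailable.add(host);
        }
    }

    // Round-robin over hosts that are not marked unavailable.
    // If every host is marked, the marks are cleared so that all hosts are retried.
    synchronized String resolveHost() {
        if (unavailable.containsAll(allHosts)) {
            unavailable.clear();
        }
        for (int i = 0; i < allHosts.size(); i++) {
            String candidate = allHosts.get(next);
            next = (next + 1) % allHosts.size();
            if (!unavailable.contains(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("unreachable: at least one host is always eligible");
    }

    public static void main(String[] args) {
        FeedbackResolverSketch resolver =
                new FeedbackResolverSketch(Arrays.asList("a:6650", "b:6650", "c:6650"));
        resolver.reportResult("a:6650", false);
        resolver.reportResult("b:6650", false);
        System.out.println(resolver.resolveHost()); // only c:6650 is still eligible
    }
}
```

Because the signal comes from connections the client needs anyway, this approach opens no extra probe connections, which is the overhead the review comments object to.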


github-actions bot added the doc-label-missing label Jun 18, 2024

github-actions bot added doc-not-needed Your PR changes do not impact docs and removed doc-label-missing labels Jun 18, 2024

AuroraTwinkle changed the title ~~[improve][client]:Perform health checks on the endpoints passed in by serviceUrl when building PulsarClient~~ [improve][client]:Perform health checks on the endpoints that passed in by serviceUrl of PulsarClient Jun 18, 2024

AuroraTwinkle force-pushed the improve/checkEndpointsInServiceNameResolver branch 3 times, most recently from 8ec287e to 45ffddf Compare June 18, 2024 10:40

liangyepianzhou assigned AuroraTwinkle Jun 19, 2024

Coselding reviewed Jul 1, 2024

pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarServiceNameResolver.java (outdated, resolved)

AuroraTwinkle requested a review from Coselding July 5, 2024 11:49

AuroraTwinkle marked this pull request as ready for review July 5, 2024 15:05

liangyepianzhou reviewed Jul 7, 2024

pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarServiceNameResolver.java (outdated, resolved)

pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarServiceNameResolver.java (outdated, resolved)

liangyepianzhou reviewed Jul 8, 2024

pulsar-client/src/test/java/org/apache/pulsar/client/impl/PulsarServiceNameResolverTest.java (outdated, resolved)

duanlinlin added 3 commits June 4, 2025 16:38

[improve][client]:Perform health checks on the endpoints passed in by…

85219c2

… serviceUrl

improve and add unit test

f6d56a4

close resolver in unit test

78e5c9b

AuroraTwinkle force-pushed the improve/checkEndpointsInServiceNameResolver branch from ec0f81d to 78e5c9b Compare June 4, 2025 08:38

fix review comments

befac76

AuroraTwinkle force-pushed the improve/checkEndpointsInServiceNameResolver branch 9 times, most recently from e801c1f to e66426e Compare June 4, 2025 12:42

add test

c1d73d1

AuroraTwinkle force-pushed the improve/checkEndpointsInServiceNameResolver branch from e66426e to c1d73d1 Compare June 4, 2025 12:44

add test

88249ef

liangyepianzhou requested review from BewareMyPower, Demogorgon314, dao-jun, lhotari and nodece and removed request for Coselding June 5, 2025 03:54

liangyepianzhou added this to the 4.1.0 milestone Jun 5, 2025

liangyepianzhou requested a review from Copilot June 5, 2025 04:03

Copilot AI reviewed Jun 5, 2025

pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarServiceNameResolver.java (outdated, resolved)

AuroraTwinkle and others added 2 commits June 5, 2025 12:05

Update pulsar-client/src/main/java/org/apache/pulsar/client/impl/Puls…

00262c9

…arServiceNameResolver.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Fix

37de78a

AuroraTwinkle force-pushed the improve/checkEndpointsInServiceNameResolver branch from 034148e to 37de78a Compare June 5, 2025 04:10

lhotari requested changes Jun 5, 2025

fix test

5096e8b

lhotari requested changes Jun 5, 2025

lhotari mentioned this pull request Jun 5, 2025

[BUG] consumer or producer will create failed frequently when build PulsarClient with many unavailable broker nodes #22934

Closed

2 tasks

fix review comments

482ab5b

lhotari requested changes Jun 5, 2025

AuroraTwinkle closed this Jun 5, 2025

This was referenced Jun 7, 2025

[improve][client]PIP-425:Support connecting with next available endpoint for multi-endpoint serviceUrls #24387

Merged

[improve][pip] PIP-425: Support connecting with next available endpoint for multi-endpoint serviceUrls #24394

Merged
