
Conversation

@AuroraTwinkle
Contributor

@AuroraTwinkle AuroraTwinkle commented Jun 7, 2025

Fixes #22934 (comment)

Main Issue: #22934 (comment)

Implementation: #24387

Motivation

As #22934 and #22933 describe, when most of the nodes in the serviceUrl are down (but at least one node is still available), creating consumers and producers through PulsarClient is likely to fail. This is not the expected behavior: as long as one node is available, the client should be able to connect normally. This PIP therefore proposes to improve the logic by removing unavailable nodes through a feedback mechanism, which should raise the success rate of PulsarClient requests.

For context, #22935 removes faulty nodes through a periodic health check mechanism, but that approach introduces new problems (frequent connection creation and increased system load), so it was abandoned. See #22934 (comment) for more details.
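To make the feedback idea concrete, here is a minimal sketch of how a client-side selector could skip endpoints that recent connection attempts reported as failed. It is only an illustration under stated assumptions: the class `FeedbackEndpointSelector` and its methods `resolve`, `markFailed`, and `markRecovered` are hypothetical names and are not the API introduced by this PIP or by #24387. The fallback of clearing the failed set when every endpoint is marked unavailable is also an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of feedback-based endpoint selection; not the PIP-425 implementation. */
final class FeedbackEndpointSelector {
    // Every endpoint parsed from a multi-endpoint serviceUrl, in declaration order.
    private final List<String> all;
    // Endpoints that callers reported as failed.
    private final Set<String> unavailable = new LinkedHashSet<>();
    // Round-robin cursor into `all`.
    private int next = 0;

    FeedbackEndpointSelector(List<String> endpoints) {
        if (endpoints.isEmpty()) {
            throw new IllegalArgumentException("at least one endpoint is required");
        }
        this.all = new ArrayList<>(endpoints);
    }

    /** Returns the next endpoint in round-robin order that is not marked unavailable. */
    synchronized String resolve() {
        // Assumption: if everything is marked failed, forget the marks and retry all endpoints.
        if (unavailable.size() == all.size()) {
            unavailable.clear();
        }
        for (int i = 0; i < all.size(); i++) {
            String candidate = all.get(next);
            next = (next + 1) % all.size();
            if (!unavailable.contains(candidate)) {
                return candidate;
            }
        }
        // Not reachable: at least one endpoint is available after the check above.
        throw new IllegalStateException("no endpoint available");
    }

    /** Feedback from a failed connection attempt. */
    synchronized void markFailed(String endpoint) {
        if (all.contains(endpoint)) {
            unavailable.add(endpoint);
        }
    }

    /** Feedback from a successful connection: the endpoint is eligible again. */
    synchronized void markRecovered(String endpoint) {
        unavailable.remove(endpoint);
    }
}
```

With three endpoints where two have been marked failed, every `resolve()` call returns the remaining healthy one, and no background probing is needed. That is the difference from the periodic health check in #22935: state changes only when a real connection attempt reports its outcome.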

Modifications

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

…failed when many nodes in PulsarClient serviceUrl become unavailable
@github-actions

github-actions bot commented Jun 7, 2025

@AuroraTwinkle Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

Member

@lhotari lhotari left a comment


Some initial feedback

@AuroraTwinkle
Contributor Author

AuroraTwinkle commented Jun 9, 2025

> Some initial feedback

@lhotari All good suggestions, I will fix them and update later. Thanks for your help!

@AuroraTwinkle AuroraTwinkle changed the title [improve][pip] PIP-425: fix problem that consumer or producer create failed when many nodes in PulsarClient serviceUrl become unavailable [improve][pip] PIP-425: Support connecting with next available endpoint for multi-endpoint serviceUrls Jun 9, 2025
@AuroraTwinkle
Contributor Author

> Some initial feedback

Updated!

Member

@lhotari lhotari left a comment


good progress! added some follow up comments

@AuroraTwinkle AuroraTwinkle force-pushed the PIP-425 branch 2 times, most recently from 7f314b9 to 817a231 Compare June 9, 2025 14:53
@AuroraTwinkle
Contributor Author

> good progress! added some follow up comments

@lhotari Modified and updated as suggested. Thanks!

Contributor

@codelipenghui codelipenghui left a comment


Hi @AuroraTwinkle

The proposal looks good to me.
I’ve added a few comments to help clarify the problem and the proposed solution, making it easier to understand.

@codelipenghui codelipenghui added this to the 4.1.0 milestone Jun 16, 2025
@AuroraTwinkle
Contributor Author

> Hi @AuroraTwinkle
>
> The proposal looks good to me. I’ve added a few comments to help clarify the problem and the proposed solution, making it easier to understand.

OK, I will fix them later. Thanks!

AuroraTwinkle and others added 3 commits June 17, 2025 11:08
Co-authored-by: Penghui Li <penghui@apache.org>
Co-authored-by: Penghui Li <penghui@apache.org>
Co-authored-by: Penghui Li <penghui@apache.org>
AuroraTwinkle and others added 5 commits June 17, 2025 11:10
Co-authored-by: Penghui Li <penghui@apache.org>
Co-authored-by: Penghui Li <penghui@apache.org>
Co-authored-by: Penghui Li <penghui@apache.org>
@AuroraTwinkle
Contributor Author

> Hi @AuroraTwinkle
>
> The proposal looks good to me. I’ve added a few comments to help clarify the problem and the proposed solution, making it easier to understand.

@codelipenghui Thanks for the interesting and detailed suggestions. I have addressed them. Thank you again!

Member

@lhotari lhotari left a comment


LGTM, good work @AuroraTwinkle

@AuroraTwinkle AuroraTwinkle requested a review from 315157973 June 24, 2025 09:26
@liangyepianzhou
Contributor

liangyepianzhou commented Jun 25, 2025

@lhotari @codelipenghui @315157973 We need more votes on the mailing list to close this PIP. Could you please help vote when you have a moment?
https://lists.apache.org/thread/c2zvjwf7bqp8nc2rpzbxd4kdtztk23xp

@liangyepianzhou liangyepianzhou merged commit be385c4 into apache:master Jun 25, 2025
20 checks passed
KannarFr pushed a commit to CleverCloud/pulsar that referenced this pull request Sep 22, 2025
…nt for multi-endpoint serviceUrls (apache#24394)

walkinggo pushed a commit to walkinggo/pulsar that referenced this pull request Oct 8, 2025
…nt for multi-endpoint serviceUrls (apache#24394)


Labels

doc-not-needed Your PR changes do not impact docs PIP

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] consumer or producer will create failed frequently when build PulsarClient with many unavailable broker nodes

6 participants