Skip to content

Conversation

@lhotari
Copy link
Member

@lhotari lhotari commented Apr 6, 2023

Motivation

In Kubernetes upgrades, it's possible to get into a state where a TCP/IP connection seems to be alive on the broker. The connection is orphaned and therefore consuming resources and causing resource conflicts with producers and consumers.
This PR intends to target the issue with consumers that are using the Key shared subscription type with sticky hash ranges. The prevents the issue of "Range conflict with consumer ..." which happens when the previous connection from the client is still active on the broker while the client is reconnecting on a new connection.

Modifications

  • make Dispatch.addConsumer and StickyKeyConsumerSelector.addConsumer asynchronous
  • add connectionLivenessCheckTimeoutMillis setting that defaults to 5000ms
  • when a conflicting consumer is found, first do an active check with Pulsar's Ping command to see if the connection is alive.
  • resume the attempt to add the consumer after the check completes

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

…umers when TCP/IP connections hang

- make Dispatch.addConsumer and StickyKeyConsumerSelector.addConsumer asynchronous
- add connectionLivenessCheckTimeoutMillis setting that defaults to 5000ms
- when a conflicting consumer is found, first do an active check with Pulsar's Ping command
  to see if the connection is alive.
- resume the attempt to add the consumer after the check completes
@lhotari lhotari force-pushed the lh-broker-dead-connection-detection branch from 4203566 to ef9698e Compare April 6, 2023 15:21
Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great work @lhotari!

public void addConsumer(Consumer consumer) throws BrokerServiceException.ConsumerAssignException {
validateKeySharedMeta(consumer);
public synchronized CompletableFuture<Void> addConsumer(Consumer consumer) {
return validateKeySharedMeta(consumer).thenRun(() -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: a minor optimization could assign validateKeySharedMeta to a local variable, and then check if that future is already completed. When it is, we know we had the synchronized lock when the validation was done, which would allow us to skip the secondary call to findConflictingConsumer. I am not sure how expensive that call is, but I assume it has some cost that adds up with many key shared consumers.

Since it is an optimization, I don't think we should hold up this PR for that.

@lhotari lhotari merged commit 08b28f5 into apache:master Apr 10, 2023
@lhotari lhotari added this to the 3.0.0 milestone Apr 10, 2023
Demogorgon314 pushed a commit to Demogorgon314/pulsar that referenced this pull request Apr 11, 2023
lhotari added a commit to datastax/pulsar that referenced this pull request Apr 13, 2023
…umers when TCP/IP connections get orphaned (#174)

upstream PR apache#20026
@poorbarcode
Copy link
Contributor

poorbarcode commented Sep 14, 2023

@lhotari

The PR #21155 fixes an issue in which the producer sends messages timeout due to the inability to reconnect successfully. The root cause is the client created a new connection to reregister the producer after it assumed the old client was invalidated, but at the same time, the broker assumed the old connection was still validated, so the client got an error "Producer with name 'st-0-5' is already connected to topic".

The fix in the PR #21155 tries to start a new heartbeat after the broker receives different connections for the same producer registration.

In this fix, the PR #21155 uses a tool method Servercnx.checkConnectionLiveness( introduced in the current PR).
I want to cherry-pick the tool method Servercnx.checkConnectionLiveness into branch-2.10 and branch-2.11 along with the PR #21155

I also send a discuss to do this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/broker doc-not-needed Your PR changes do not impact docs ready-to-test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants