Skip to content

Conversation

@merlimat
Copy link
Contributor

@merlimat merlimat commented May 29, 2020

Note: this is based on top of #6791 & #7104. Once these are merged, I'll rebase here. For the sake of this review, check commit 23d6dcb

Motivation

If one consumer is slowly processing messages, this can prevent other consumers from making progress on the topic. Instead we're in a loop of keep trying to replay messages without being able to dispatch any message.

The basic idea here is that we can make progress by keep going through the topic and dispatch these messages to the consumers that are free, at least the keys that belong to them.

@merlimat merlimat added the type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages label May 29, 2020
@merlimat merlimat added this to the 2.6.0 milestone May 29, 2020
@merlimat merlimat self-assigned this May 29, 2020
@jiazhai
Copy link
Member

jiazhai commented Jun 4, 2020

@merlimat Would you please help rebase this change?

@codelipenghui
Copy link
Contributor

@merlimat The broker unit test is so hard to get passed. I'm creating a PR to split the whole broker unit test into multiple parts and fixed some flaky tests. Please take a look at #7173.

@codelipenghui codelipenghui merged commit 2f75d02 into apache:master Jun 5, 2020
@merlimat merlimat deleted the pr-a89554b4af branch June 5, 2020 00:26
zeo1995 pushed a commit to zeo1995/pulsar that referenced this pull request Jun 5, 2020
…te-update

* 'website-update' of github.com:zeo1995/pulsar: (432 commits)
  Fixed ordering issue in KeyShared dispatcher when adding consumer (apache#7106)
  Fix Duplicated messages are sent to dead letter topic apache#6960 (apache#7021)
  [Issue 2793][Doc]--Update the TLS hostname verification for CPP and Python clients (apache#7162)
  [Doc]--set netty mex frame size (apache#7174)
  [Doc] Update for the maximum message size (apache#7171)
  Fixed KeyShared consumers getting stuck on delivery (apache#7105)
  [apache#6003][pulsar-functions] Possibility to add builtin Functions (apache#6895)
  [Issue 6921][pulsar-broker-common] Replaced "Paths.get(...).getParent()", because it's system dependent and uses '\' as path separator on Windows (apache#6992)
  Improve broker unit test CI (apache#7173)
  Fix typo in exception message (apache#7027)
  Support KeyValue Schema Use Null Key And Null Value (apache#7139)
  [Doc]--Update documents for support consumer priority level in failover mode (apache#7136)
  Add schema config to cpp and cgo docs. (apache#7137)
  [Doc]--Update for the maximum message size (apache#7160)
  [C++] Expose ZSTD and Snappy compression to C API (apache#7014)
  [pulsar-proxy] add proxyLogLevel into config file (apache#6948)
  Add multi-hosts example for bookkeeperMetadataServiceUri (apache#6998)
  support for termination of partitioned topic (apache#6126)
  Use pure-java Air-Compressor instead of JNI based libraries (apache#5390)
  [Issues 5709]remove the namespace checking (apache#5716)
  ...

# Conflicts:
#	site2/website/scripts/split-swagger-by-version.js
cdbartholomew pushed a commit to kafkaesque-io/pulsar that referenced this pull request Jul 24, 2020
Motivation
If one consumer is slowly processing messages, this can prevent other consumers from making progress on the topic. Instead we're in a loop of keep trying to replay messages without being able to dispatch any message.

The basic idea here is that we can make progress by keep going through the topic and dispatch these messages to the consumers that are free, at least the keys that belong to them.
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020
Motivation
If one consumer is slowly processing messages, this can prevent other consumers from making progress on the topic. Instead we're in a loop of keep trying to replay messages without being able to dispatch any message.

The basic idea here is that we can make progress by keep going through the topic and dispatch these messages to the consumers that are free, at least the keys that belong to them.
codelipenghui pushed a commit that referenced this pull request Sep 2, 2020
…ey_Shared consumer stuck on delivery (#7553)

### Motivation
In some case of Key_Shared consumer, messages ordering was broken.
Here is how to reproduce(I think it is one of case to reproduce this issue).

1. Connect Consumer1 to Key_Shared subscription `sub` and stop to receive
   - receiverQueueSize: 500
2. Connect Producer and publish 500 messages with key `(i % 10)`
3. Connect Consumer2 to same subscription and start to receive
   - receiverQueueSize: 1
   - since #7106 , Consumer2 can't receive (expected)
4. Producer publish more 500 messages with same key generation algorithm
5. After that, Consumer1 start to receive
6. Check Consumer2 message ordering
   - sometimes message ordering was broken in same key

Consumer1:
```
Connected: Tue Jul 14 09:36:39 JST 2020
[pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider
[pulsar-timer-4-1] INFO org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - [persistent://public/default/key-shared-test] [sub0] [820f0] Prefetched messages: 499 --- Consume throughput received: 0.02 msgs/s --- 0.00 Mbit/s --- Ack sent rate: 0.00 ack/s --- Failed messages: 0 --- batch messages: 0 ---Failed acks: 0
Received: my-message-0 PublishTime: 1594687006203 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-1 PublishTime: 1594687006243 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-2 PublishTime: 1594687006247 Date: Tue Jul 14 09:37:46 JST 2020
...
Received: my-message-498 PublishTime: 1594687008727 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-499 PublishTime: 1594687008731 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-500 PublishTime: 1594687038742 Date: Tue Jul 14 09:37:46 JST 2020
...
Received: my-message-990 PublishTime: 1594687040094 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-994 PublishTime: 1594687040103 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-995 PublishTime: 1594687040105 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-997 PublishTime: 1594687040113 Date: Tue Jul 14 09:37:46 JST 2020
```

Consumer2:
```
Connected: Tue Jul 14 09:37:03 JST 2020
[pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider
Received: my-message-501 MessageId: 4:1501:-1 PublishTime: 1594687038753 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-502 MessageId: 4:1502:-1 PublishTime: 1594687038755 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-503 MessageId: 4:1503:-1 PublishTime: 1594687038759 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-506 MessageId: 4:1506:-1 PublishTime: 1594687038785 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-508 MessageId: 4:1508:-1 PublishTime: 1594687038812 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-901 MessageId: 4:1901:-1 PublishTime: 1594687039871 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-509 MessageId: 4:1509:-1 PublishTime: 1594687038815 Date: Tue Jul 14 09:37:46 JST 2020
ordering was broken, key: 1 oldNum: 901 newNum: 511
Received: my-message-511 MessageId: 4:1511:-1 PublishTime: 1594687038826 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-512 MessageId: 4:1512:-1 PublishTime: 1594687038830 Date: Tue Jul 14 09:37:46 JST 2020
...
```

I think this issue is caused by #7105.
Here is an example.
1. dispatch messages
2. Consumer2 was stuck and `totalMessagesSent=0`
   - Consumer2 availablePermits was 0
3. skip redeliver messages temporally
   - Consumer2 availablePermits was back to 1
4. dispatch new messages
   - new message was dispatched to Consumer2
5. back to redeliver messages
4. dispatch messages
   - ordering was broken

### Modifications
Stop to dispatch when skip message temporally since Key_Shared consumer stuck on delivery.
lbenc135 pushed a commit to lbenc135/pulsar that referenced this pull request Sep 5, 2020
…ey_Shared consumer stuck on delivery (apache#7553)

### Motivation
In some case of Key_Shared consumer, messages ordering was broken.
Here is how to reproduce(I think it is one of case to reproduce this issue).

1. Connect Consumer1 to Key_Shared subscription `sub` and stop to receive
   - receiverQueueSize: 500
2. Connect Producer and publish 500 messages with key `(i % 10)`
3. Connect Consumer2 to same subscription and start to receive
   - receiverQueueSize: 1
   - since apache#7106 , Consumer2 can't receive (expected)
4. Producer publish more 500 messages with same key generation algorithm
5. After that, Consumer1 start to receive
6. Check Consumer2 message ordering
   - sometimes message ordering was broken in same key

Consumer1:
```
Connected: Tue Jul 14 09:36:39 JST 2020
[pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider
[pulsar-timer-4-1] INFO org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - [persistent://public/default/key-shared-test] [sub0] [820f0] Prefetched messages: 499 --- Consume throughput received: 0.02 msgs/s --- 0.00 Mbit/s --- Ack sent rate: 0.00 ack/s --- Failed messages: 0 --- batch messages: 0 ---Failed acks: 0
Received: my-message-0 PublishTime: 1594687006203 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-1 PublishTime: 1594687006243 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-2 PublishTime: 1594687006247 Date: Tue Jul 14 09:37:46 JST 2020
...
Received: my-message-498 PublishTime: 1594687008727 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-499 PublishTime: 1594687008731 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-500 PublishTime: 1594687038742 Date: Tue Jul 14 09:37:46 JST 2020
...
Received: my-message-990 PublishTime: 1594687040094 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-994 PublishTime: 1594687040103 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-995 PublishTime: 1594687040105 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-997 PublishTime: 1594687040113 Date: Tue Jul 14 09:37:46 JST 2020
```

Consumer2:
```
Connected: Tue Jul 14 09:37:03 JST 2020
[pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider
Received: my-message-501 MessageId: 4:1501:-1 PublishTime: 1594687038753 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-502 MessageId: 4:1502:-1 PublishTime: 1594687038755 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-503 MessageId: 4:1503:-1 PublishTime: 1594687038759 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-506 MessageId: 4:1506:-1 PublishTime: 1594687038785 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-508 MessageId: 4:1508:-1 PublishTime: 1594687038812 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-901 MessageId: 4:1901:-1 PublishTime: 1594687039871 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-509 MessageId: 4:1509:-1 PublishTime: 1594687038815 Date: Tue Jul 14 09:37:46 JST 2020
ordering was broken, key: 1 oldNum: 901 newNum: 511
Received: my-message-511 MessageId: 4:1511:-1 PublishTime: 1594687038826 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-512 MessageId: 4:1512:-1 PublishTime: 1594687038830 Date: Tue Jul 14 09:37:46 JST 2020
...
```

I think this issue is caused by apache#7105.
Here is an example.
1. dispatch messages
2. Consumer2 was stuck and `totalMessagesSent=0`
   - Consumer2 availablePermits was 0
3. skip redeliver messages temporally
   - Consumer2 availablePermits was back to 1
4. dispatch new messages
   - new message was dispatched to Consumer2
5. back to redeliver messages
4. dispatch messages
   - ordering was broken

### Modifications
Stop to dispatch when skip message temporally since Key_Shared consumer stuck on delivery.
lbenc135 pushed a commit to lbenc135/pulsar that referenced this pull request Sep 5, 2020
…ey_Shared consumer stuck on delivery (apache#7553)

### Motivation
In some case of Key_Shared consumer, messages ordering was broken.
Here is how to reproduce(I think it is one of case to reproduce this issue).

1. Connect Consumer1 to Key_Shared subscription `sub` and stop to receive
   - receiverQueueSize: 500
2. Connect Producer and publish 500 messages with key `(i % 10)`
3. Connect Consumer2 to same subscription and start to receive
   - receiverQueueSize: 1
   - since apache#7106 , Consumer2 can't receive (expected)
4. Producer publish more 500 messages with same key generation algorithm
5. After that, Consumer1 start to receive
6. Check Consumer2 message ordering
   - sometimes message ordering was broken in same key

Consumer1:
```
Connected: Tue Jul 14 09:36:39 JST 2020
[pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider
[pulsar-timer-4-1] INFO org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - [persistent://public/default/key-shared-test] [sub0] [820f0] Prefetched messages: 499 --- Consume throughput received: 0.02 msgs/s --- 0.00 Mbit/s --- Ack sent rate: 0.00 ack/s --- Failed messages: 0 --- batch messages: 0 ---Failed acks: 0
Received: my-message-0 PublishTime: 1594687006203 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-1 PublishTime: 1594687006243 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-2 PublishTime: 1594687006247 Date: Tue Jul 14 09:37:46 JST 2020
...
Received: my-message-498 PublishTime: 1594687008727 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-499 PublishTime: 1594687008731 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-500 PublishTime: 1594687038742 Date: Tue Jul 14 09:37:46 JST 2020
...
Received: my-message-990 PublishTime: 1594687040094 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-994 PublishTime: 1594687040103 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-995 PublishTime: 1594687040105 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-997 PublishTime: 1594687040113 Date: Tue Jul 14 09:37:46 JST 2020
```

Consumer2:
```
Connected: Tue Jul 14 09:37:03 JST 2020
[pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider
Received: my-message-501 MessageId: 4:1501:-1 PublishTime: 1594687038753 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-502 MessageId: 4:1502:-1 PublishTime: 1594687038755 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-503 MessageId: 4:1503:-1 PublishTime: 1594687038759 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-506 MessageId: 4:1506:-1 PublishTime: 1594687038785 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-508 MessageId: 4:1508:-1 PublishTime: 1594687038812 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-901 MessageId: 4:1901:-1 PublishTime: 1594687039871 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-509 MessageId: 4:1509:-1 PublishTime: 1594687038815 Date: Tue Jul 14 09:37:46 JST 2020
ordering was broken, key: 1 oldNum: 901 newNum: 511
Received: my-message-511 MessageId: 4:1511:-1 PublishTime: 1594687038826 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-512 MessageId: 4:1512:-1 PublishTime: 1594687038830 Date: Tue Jul 14 09:37:46 JST 2020
...
```

I think this issue is caused by apache#7105.
Here is an example.
1. dispatch messages
2. Consumer2 was stuck and `totalMessagesSent=0`
   - Consumer2 availablePermits was 0
3. skip redeliver messages temporally
   - Consumer2 availablePermits was back to 1
4. dispatch new messages
   - new message was dispatched to Consumer2
5. back to redeliver messages
4. dispatch messages
   - ordering was broken

### Modifications
Stop to dispatch when skip message temporally since Key_Shared consumer stuck on delivery.
lbenc135 pushed a commit to lbenc135/pulsar that referenced this pull request Sep 5, 2020
…ey_Shared consumer stuck on delivery (apache#7553)

### Motivation
In some case of Key_Shared consumer, messages ordering was broken.
Here is how to reproduce(I think it is one of case to reproduce this issue).

1. Connect Consumer1 to Key_Shared subscription `sub` and stop to receive
   - receiverQueueSize: 500
2. Connect Producer and publish 500 messages with key `(i % 10)`
3. Connect Consumer2 to same subscription and start to receive
   - receiverQueueSize: 1
   - since apache#7106 , Consumer2 can't receive (expected)
4. Producer publish more 500 messages with same key generation algorithm
5. After that, Consumer1 start to receive
6. Check Consumer2 message ordering
   - sometimes message ordering was broken in same key

Consumer1:
```
Connected: Tue Jul 14 09:36:39 JST 2020
[pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider
[pulsar-timer-4-1] INFO org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - [persistent://public/default/key-shared-test] [sub0] [820f0] Prefetched messages: 499 --- Consume throughput received: 0.02 msgs/s --- 0.00 Mbit/s --- Ack sent rate: 0.00 ack/s --- Failed messages: 0 --- batch messages: 0 ---Failed acks: 0
Received: my-message-0 PublishTime: 1594687006203 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-1 PublishTime: 1594687006243 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-2 PublishTime: 1594687006247 Date: Tue Jul 14 09:37:46 JST 2020
...
Received: my-message-498 PublishTime: 1594687008727 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-499 PublishTime: 1594687008731 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-500 PublishTime: 1594687038742 Date: Tue Jul 14 09:37:46 JST 2020
...
Received: my-message-990 PublishTime: 1594687040094 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-994 PublishTime: 1594687040103 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-995 PublishTime: 1594687040105 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-997 PublishTime: 1594687040113 Date: Tue Jul 14 09:37:46 JST 2020
```

Consumer2:
```
Connected: Tue Jul 14 09:37:03 JST 2020
[pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider
Received: my-message-501 MessageId: 4:1501:-1 PublishTime: 1594687038753 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-502 MessageId: 4:1502:-1 PublishTime: 1594687038755 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-503 MessageId: 4:1503:-1 PublishTime: 1594687038759 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-506 MessageId: 4:1506:-1 PublishTime: 1594687038785 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-508 MessageId: 4:1508:-1 PublishTime: 1594687038812 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-901 MessageId: 4:1901:-1 PublishTime: 1594687039871 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-509 MessageId: 4:1509:-1 PublishTime: 1594687038815 Date: Tue Jul 14 09:37:46 JST 2020
ordering was broken, key: 1 oldNum: 901 newNum: 511
Received: my-message-511 MessageId: 4:1511:-1 PublishTime: 1594687038826 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-512 MessageId: 4:1512:-1 PublishTime: 1594687038830 Date: Tue Jul 14 09:37:46 JST 2020
...
```

I think this issue is caused by apache#7105.
Here is an example.
1. dispatch messages
2. Consumer2 was stuck and `totalMessagesSent=0`
   - Consumer2 availablePermits was 0
3. skip redeliver messages temporally
   - Consumer2 availablePermits was back to 1
4. dispatch new messages
   - new message was dispatched to Consumer2
5. back to redeliver messages
4. dispatch messages
   - ordering was broken

### Modifications
Stop to dispatch when skip message temporally since Key_Shared consumer stuck on delivery.
wolfstudy pushed a commit that referenced this pull request Oct 30, 2020
…ey_Shared consumer stuck on delivery (#7553)

### Motivation
In some case of Key_Shared consumer, messages ordering was broken.
Here is how to reproduce(I think it is one of case to reproduce this issue).

1. Connect Consumer1 to Key_Shared subscription `sub` and stop to receive
   - receiverQueueSize: 500
2. Connect Producer and publish 500 messages with key `(i % 10)`
3. Connect Consumer2 to same subscription and start to receive
   - receiverQueueSize: 1
   - since #7106 , Consumer2 can't receive (expected)
4. Producer publish more 500 messages with same key generation algorithm
5. After that, Consumer1 start to receive
6. Check Consumer2 message ordering
   - sometimes message ordering was broken in same key

Consumer1:
```
Connected: Tue Jul 14 09:36:39 JST 2020
[pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider
[pulsar-timer-4-1] INFO org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - [persistent://public/default/key-shared-test] [sub0] [820f0] Prefetched messages: 499 --- Consume throughput received: 0.02 msgs/s --- 0.00 Mbit/s --- Ack sent rate: 0.00 ack/s --- Failed messages: 0 --- batch messages: 0 ---Failed acks: 0
Received: my-message-0 PublishTime: 1594687006203 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-1 PublishTime: 1594687006243 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-2 PublishTime: 1594687006247 Date: Tue Jul 14 09:37:46 JST 2020
...
Received: my-message-498 PublishTime: 1594687008727 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-499 PublishTime: 1594687008731 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-500 PublishTime: 1594687038742 Date: Tue Jul 14 09:37:46 JST 2020
...
Received: my-message-990 PublishTime: 1594687040094 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-994 PublishTime: 1594687040103 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-995 PublishTime: 1594687040105 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-997 PublishTime: 1594687040113 Date: Tue Jul 14 09:37:46 JST 2020
```

Consumer2:
```
Connected: Tue Jul 14 09:37:03 JST 2020
[pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider
Received: my-message-501 MessageId: 4:1501:-1 PublishTime: 1594687038753 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-502 MessageId: 4:1502:-1 PublishTime: 1594687038755 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-503 MessageId: 4:1503:-1 PublishTime: 1594687038759 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-506 MessageId: 4:1506:-1 PublishTime: 1594687038785 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-508 MessageId: 4:1508:-1 PublishTime: 1594687038812 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-901 MessageId: 4:1901:-1 PublishTime: 1594687039871 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-509 MessageId: 4:1509:-1 PublishTime: 1594687038815 Date: Tue Jul 14 09:37:46 JST 2020
ordering was broken, key: 1 oldNum: 901 newNum: 511
Received: my-message-511 MessageId: 4:1511:-1 PublishTime: 1594687038826 Date: Tue Jul 14 09:37:46 JST 2020
Received: my-message-512 MessageId: 4:1512:-1 PublishTime: 1594687038830 Date: Tue Jul 14 09:37:46 JST 2020
...
```

I think this issue is caused by #7105.
Here is an example.
1. dispatch messages
2. Consumer2 was stuck and `totalMessagesSent=0`
   - Consumer2 availablePermits was 0
3. skip redeliver messages temporally
   - Consumer2 availablePermits was back to 1
4. dispatch new messages
   - new message was dispatched to Consumer2
5. back to redeliver messages
4. dispatch messages
   - ordering was broken

### Modifications
Stop to dispatch when skip message temporally since Key_Shared consumer stuck on delivery.

(cherry picked from commit c7ac08b)
poorbarcode added a commit that referenced this pull request May 19, 2023
…essage skip to avoid unnecessary consumption stuck (#20335)

### Motivation
- #7105 provide a mechanism to avoid a stuck consumer affecting the consumption of other consumers: 
  - if all consumers can not accept more messages, stop delivering messages to the client.
  - if one consumer can not accept more messages, just read new messages and deliver them to other consumers.
- #7553 provide a mechanism to fix the issue of lost order of consumption: If the consumer cannot accept any more messages, skip the consumer for the next round of message delivery because there may be messages with the same key in the replay queue.
- #10762 provide a mechanism to fix the issue of lost order of consumption: If there have any messages with the same key in the replay queue, do not deliver the new messages to this consumer.

#10762 and #7553 do the same thing and #10762 is better than #7553 , so #7553 is unnecessary. 

### Modifications
remove the mechanism provided by #7553 to avoid unnecessary consumption stuck.
poorbarcode added a commit that referenced this pull request May 19, 2023
…essage skip to avoid unnecessary consumption stuck (#20335)

- #7105 provide a mechanism to avoid a stuck consumer affecting the consumption of other consumers:
  - if all consumers can not accept more messages, stop delivering messages to the client.
  - if one consumer can not accept more messages, just read new messages and deliver them to other consumers.
- #7553 provide a mechanism to fix the issue of lost order of consumption: If the consumer cannot accept any more messages, skip the consumer for the next round of message delivery because there may be messages with the same key in the replay queue.
- #10762 provide a mechanism to fix the issue of lost order of consumption: If there have any messages with the same key in the replay queue, do not deliver the new messages to this consumer.

#10762 and #7553 do the same thing and #10762 is better than #7553 , so #7553 is unnecessary.

remove the mechanism provided by #7553 to avoid unnecessary consumption stuck.

(cherry picked from commit 1e664b7)
Technoboy- pushed a commit that referenced this pull request May 24, 2023
…essage skip to avoid unnecessary consumption stuck (#20335)

### Motivation
- #7105 provide a mechanism to avoid a stuck consumer affecting the consumption of other consumers: 
  - if all consumers can not accept more messages, stop delivering messages to the client.
  - if one consumer can not accept more messages, just read new messages and deliver them to other consumers.
- #7553 provide a mechanism to fix the issue of lost order of consumption: If the consumer cannot accept any more messages, skip the consumer for the next round of message delivery because there may be messages with the same key in the replay queue.
- #10762 provide a mechanism to fix the issue of lost order of consumption: If there have any messages with the same key in the replay queue, do not deliver the new messages to this consumer.

#10762 and #7553 do the same thing and #10762 is better than #7553 , so #7553 is unnecessary. 

### Modifications
remove the mechanism provided by #7553 to avoid unnecessary consumption stuck.
lhotari pushed a commit to datastax/pulsar that referenced this pull request May 29, 2023
…essage skip to avoid unnecessary consumption stuck (apache#20335)

- apache#7105 provide a mechanism to avoid a stuck consumer affecting the consumption of other consumers:
  - if all consumers can not accept more messages, stop delivering messages to the client.
  - if one consumer can not accept more messages, just read new messages and deliver them to other consumers.
- apache#7553 provide a mechanism to fix the issue of lost order of consumption: If the consumer cannot accept any more messages, skip the consumer for the next round of message delivery because there may be messages with the same key in the replay queue.
- apache#10762 provide a mechanism to fix the issue of lost order of consumption: If there have any messages with the same key in the replay queue, do not deliver the new messages to this consumer.

apache#10762 and apache#7553 do the same thing and apache#10762 is better than apache#7553 , so apache#7553 is unnecessary.

remove the mechanism provided by apache#7553 to avoid unnecessary consumption stuck.

(cherry picked from commit 1e664b7)
(cherry picked from commit c973603)
poorbarcode added a commit that referenced this pull request May 30, 2023
…essage skip to avoid unnecessary consumption stuck (#20335)

### Motivation
- #7105 provide a mechanism to avoid a stuck consumer affecting the consumption of other consumers:
  - if all consumers can not accept more messages, stop delivering messages to the client.
  - if one consumer can not accept more messages, just read new messages and deliver them to other consumers.
- #7553 provide a mechanism to fix the issue of lost order of consumption: If the consumer cannot accept any more messages, skip the consumer for the next round of message delivery because there may be messages with the same key in the replay queue.
- #10762 provide a mechanism to fix the issue of lost order of consumption: If there have any messages with the same key in the replay queue, do not deliver the new messages to this consumer.

#10762 and #7553 do the same thing and #10762 is better than #7553 , so #7553 is unnecessary.

### Modifications
remove the mechanism provided by #7553 to avoid unnecessary consumption stuck.

(cherry picked from commit 1e664b7)
poorbarcode added a commit to poorbarcode/pulsar that referenced this pull request Mar 16, 2024
poorbarcode added a commit to poorbarcode/pulsar that referenced this pull request Mar 25, 2024
poorbarcode added a commit to poorbarcode/pulsar that referenced this pull request Mar 28, 2024
poorbarcode added a commit to poorbarcode/pulsar that referenced this pull request Mar 28, 2024
lhotari added a commit to lhotari/pulsar that referenced this pull request Aug 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants