-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Fixed KeyShared consumers getting stuck on delivery #7105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
+101
−1
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This was referenced May 29, 2020
Member
|
@merlimat Would you please help rebase this change? |
Contributor
codelipenghui
approved these changes
Jun 5, 2020
zeo1995
pushed a commit
to zeo1995/pulsar
that referenced
this pull request
Jun 5, 2020
…te-update * 'website-update' of github.com:zeo1995/pulsar: (432 commits) Fixed ordering issue in KeyShared dispatcher when adding consumer (apache#7106) Fix Duplicated messages are sent to dead letter topic apache#6960 (apache#7021) [Issue 2793][Doc]--Update the TLS hostname verification for CPP and Python clients (apache#7162) [Doc]--set netty mex frame size (apache#7174) [Doc] Update for the maximum message size (apache#7171) Fixed KeyShared consumers getting stuck on delivery (apache#7105) [apache#6003][pulsar-functions] Possibility to add builtin Functions (apache#6895) [Issue 6921][pulsar-broker-common] Replaced "Paths.get(...).getParent()", because it's system dependent and uses '\' as path separator on Windows (apache#6992) Improve broker unit test CI (apache#7173) Fix typo in exception message (apache#7027) Support KeyValue Schema Use Null Key And Null Value (apache#7139) [Doc]--Update documents for support consumer priority level in failover mode (apache#7136) Add schema config to cpp and cgo docs. (apache#7137) [Doc]--Update for the maximum message size (apache#7160) [C++] Expose ZSTD and Snappy compression to C API (apache#7014) [pulsar-proxy] add proxyLogLevel into config file (apache#6948) Add multi-hosts example for bookkeeperMetadataServiceUri (apache#6998) support for termination of partitioned topic (apache#6126) Use pure-java Air-Compressor instead of JNI based libraries (apache#5390) [Issues 5709]remove the namespace checking (apache#5716) ... # Conflicts: # site2/website/scripts/split-swagger-by-version.js
1 task
cdbartholomew
pushed a commit
to kafkaesque-io/pulsar
that referenced
this pull request
Jul 24, 2020
Motivation If one consumer is slowly processing messages, this can prevent other consumers from making progress on the topic. Instead we're in a loop of keep trying to replay messages without being able to dispatch any message. The basic idea here is that we can make progress by keep going through the topic and dispatch these messages to the consumers that are free, at least the keys that belong to them.
huangdx0726
pushed a commit
to huangdx0726/pulsar
that referenced
this pull request
Aug 24, 2020
Motivation If one consumer is slowly processing messages, this can prevent other consumers from making progress on the topic. Instead we're in a loop of keep trying to replay messages without being able to dispatch any message. The basic idea here is that we can make progress by keep going through the topic and dispatch these messages to the consumers that are free, at least the keys that belong to them.
codelipenghui
pushed a commit
that referenced
this pull request
Sep 2, 2020
…ey_Shared consumer stuck on delivery (#7553) ### Motivation In some case of Key_Shared consumer, messages ordering was broken. Here is how to reproduce(I think it is one of case to reproduce this issue). 1. Connect Consumer1 to Key_Shared subscription `sub` and stop to receive - receiverQueueSize: 500 2. Connect Producer and publish 500 messages with key `(i % 10)` 3. Connect Consumer2 to same subscription and start to receive - receiverQueueSize: 1 - since #7106 , Consumer2 can't receive (expected) 4. Producer publish more 500 messages with same key generation algorithm 5. After that, Consumer1 start to receive 6. Check Consumer2 message ordering - sometimes message ordering was broken in same key Consumer1: ``` Connected: Tue Jul 14 09:36:39 JST 2020 [pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider [pulsar-timer-4-1] INFO org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - [persistent://public/default/key-shared-test] [sub0] [820f0] Prefetched messages: 499 --- Consume throughput received: 0.02 msgs/s --- 0.00 Mbit/s --- Ack sent rate: 0.00 ack/s --- Failed messages: 0 --- batch messages: 0 ---Failed acks: 0 Received: my-message-0 PublishTime: 1594687006203 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-1 PublishTime: 1594687006243 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-2 PublishTime: 1594687006247 Date: Tue Jul 14 09:37:46 JST 2020 ... Received: my-message-498 PublishTime: 1594687008727 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-499 PublishTime: 1594687008731 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-500 PublishTime: 1594687038742 Date: Tue Jul 14 09:37:46 JST 2020 ... Received: my-message-990 PublishTime: 1594687040094 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-994 PublishTime: 1594687040103 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-995 PublishTime: 1594687040105 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-997 PublishTime: 1594687040113 Date: Tue Jul 14 09:37:46 JST 2020 ``` Consumer2: ``` Connected: Tue Jul 14 09:37:03 JST 2020 [pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider Received: my-message-501 MessageId: 4:1501:-1 PublishTime: 1594687038753 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-502 MessageId: 4:1502:-1 PublishTime: 1594687038755 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-503 MessageId: 4:1503:-1 PublishTime: 1594687038759 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-506 MessageId: 4:1506:-1 PublishTime: 1594687038785 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-508 MessageId: 4:1508:-1 PublishTime: 1594687038812 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-901 MessageId: 4:1901:-1 PublishTime: 1594687039871 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-509 MessageId: 4:1509:-1 PublishTime: 1594687038815 Date: Tue Jul 14 09:37:46 JST 2020 ordering was broken, key: 1 oldNum: 901 newNum: 511 Received: my-message-511 MessageId: 4:1511:-1 PublishTime: 1594687038826 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-512 MessageId: 4:1512:-1 PublishTime: 1594687038830 Date: Tue Jul 14 09:37:46 JST 2020 ... ``` I think this issue is caused by #7105. Here is an example. 1. dispatch messages 2. Consumer2 was stuck and `totalMessagesSent=0` - Consumer2 availablePermits was 0 3. skip redeliver messages temporally - Consumer2 availablePermits was back to 1 4. dispatch new messages - new message was dispatched to Consumer2 5. back to redeliver messages 4. dispatch messages - ordering was broken ### Modifications Stop to dispatch when skip message temporally since Key_Shared consumer stuck on delivery.
lbenc135
pushed a commit
to lbenc135/pulsar
that referenced
this pull request
Sep 5, 2020
…ey_Shared consumer stuck on delivery (apache#7553) ### Motivation In some case of Key_Shared consumer, messages ordering was broken. Here is how to reproduce(I think it is one of case to reproduce this issue). 1. Connect Consumer1 to Key_Shared subscription `sub` and stop to receive - receiverQueueSize: 500 2. Connect Producer and publish 500 messages with key `(i % 10)` 3. Connect Consumer2 to same subscription and start to receive - receiverQueueSize: 1 - since apache#7106 , Consumer2 can't receive (expected) 4. Producer publish more 500 messages with same key generation algorithm 5. After that, Consumer1 start to receive 6. Check Consumer2 message ordering - sometimes message ordering was broken in same key Consumer1: ``` Connected: Tue Jul 14 09:36:39 JST 2020 [pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider [pulsar-timer-4-1] INFO org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - [persistent://public/default/key-shared-test] [sub0] [820f0] Prefetched messages: 499 --- Consume throughput received: 0.02 msgs/s --- 0.00 Mbit/s --- Ack sent rate: 0.00 ack/s --- Failed messages: 0 --- batch messages: 0 ---Failed acks: 0 Received: my-message-0 PublishTime: 1594687006203 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-1 PublishTime: 1594687006243 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-2 PublishTime: 1594687006247 Date: Tue Jul 14 09:37:46 JST 2020 ... Received: my-message-498 PublishTime: 1594687008727 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-499 PublishTime: 1594687008731 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-500 PublishTime: 1594687038742 Date: Tue Jul 14 09:37:46 JST 2020 ... Received: my-message-990 PublishTime: 1594687040094 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-994 PublishTime: 1594687040103 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-995 PublishTime: 1594687040105 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-997 PublishTime: 1594687040113 Date: Tue Jul 14 09:37:46 JST 2020 ``` Consumer2: ``` Connected: Tue Jul 14 09:37:03 JST 2020 [pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider Received: my-message-501 MessageId: 4:1501:-1 PublishTime: 1594687038753 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-502 MessageId: 4:1502:-1 PublishTime: 1594687038755 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-503 MessageId: 4:1503:-1 PublishTime: 1594687038759 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-506 MessageId: 4:1506:-1 PublishTime: 1594687038785 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-508 MessageId: 4:1508:-1 PublishTime: 1594687038812 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-901 MessageId: 4:1901:-1 PublishTime: 1594687039871 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-509 MessageId: 4:1509:-1 PublishTime: 1594687038815 Date: Tue Jul 14 09:37:46 JST 2020 ordering was broken, key: 1 oldNum: 901 newNum: 511 Received: my-message-511 MessageId: 4:1511:-1 PublishTime: 1594687038826 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-512 MessageId: 4:1512:-1 PublishTime: 1594687038830 Date: Tue Jul 14 09:37:46 JST 2020 ... ``` I think this issue is caused by apache#7105. Here is an example. 1. dispatch messages 2. Consumer2 was stuck and `totalMessagesSent=0` - Consumer2 availablePermits was 0 3. skip redeliver messages temporally - Consumer2 availablePermits was back to 1 4. dispatch new messages - new message was dispatched to Consumer2 5. back to redeliver messages 4. dispatch messages - ordering was broken ### Modifications Stop to dispatch when skip message temporally since Key_Shared consumer stuck on delivery.
lbenc135
pushed a commit
to lbenc135/pulsar
that referenced
this pull request
Sep 5, 2020
…ey_Shared consumer stuck on delivery (apache#7553) ### Motivation In some case of Key_Shared consumer, messages ordering was broken. Here is how to reproduce(I think it is one of case to reproduce this issue). 1. Connect Consumer1 to Key_Shared subscription `sub` and stop to receive - receiverQueueSize: 500 2. Connect Producer and publish 500 messages with key `(i % 10)` 3. Connect Consumer2 to same subscription and start to receive - receiverQueueSize: 1 - since apache#7106 , Consumer2 can't receive (expected) 4. Producer publish more 500 messages with same key generation algorithm 5. After that, Consumer1 start to receive 6. Check Consumer2 message ordering - sometimes message ordering was broken in same key Consumer1: ``` Connected: Tue Jul 14 09:36:39 JST 2020 [pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider [pulsar-timer-4-1] INFO org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - [persistent://public/default/key-shared-test] [sub0] [820f0] Prefetched messages: 499 --- Consume throughput received: 0.02 msgs/s --- 0.00 Mbit/s --- Ack sent rate: 0.00 ack/s --- Failed messages: 0 --- batch messages: 0 ---Failed acks: 0 Received: my-message-0 PublishTime: 1594687006203 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-1 PublishTime: 1594687006243 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-2 PublishTime: 1594687006247 Date: Tue Jul 14 09:37:46 JST 2020 ... Received: my-message-498 PublishTime: 1594687008727 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-499 PublishTime: 1594687008731 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-500 PublishTime: 1594687038742 Date: Tue Jul 14 09:37:46 JST 2020 ... Received: my-message-990 PublishTime: 1594687040094 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-994 PublishTime: 1594687040103 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-995 PublishTime: 1594687040105 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-997 PublishTime: 1594687040113 Date: Tue Jul 14 09:37:46 JST 2020 ``` Consumer2: ``` Connected: Tue Jul 14 09:37:03 JST 2020 [pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider Received: my-message-501 MessageId: 4:1501:-1 PublishTime: 1594687038753 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-502 MessageId: 4:1502:-1 PublishTime: 1594687038755 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-503 MessageId: 4:1503:-1 PublishTime: 1594687038759 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-506 MessageId: 4:1506:-1 PublishTime: 1594687038785 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-508 MessageId: 4:1508:-1 PublishTime: 1594687038812 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-901 MessageId: 4:1901:-1 PublishTime: 1594687039871 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-509 MessageId: 4:1509:-1 PublishTime: 1594687038815 Date: Tue Jul 14 09:37:46 JST 2020 ordering was broken, key: 1 oldNum: 901 newNum: 511 Received: my-message-511 MessageId: 4:1511:-1 PublishTime: 1594687038826 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-512 MessageId: 4:1512:-1 PublishTime: 1594687038830 Date: Tue Jul 14 09:37:46 JST 2020 ... ``` I think this issue is caused by apache#7105. Here is an example. 1. dispatch messages 2. Consumer2 was stuck and `totalMessagesSent=0` - Consumer2 availablePermits was 0 3. skip redeliver messages temporally - Consumer2 availablePermits was back to 1 4. dispatch new messages - new message was dispatched to Consumer2 5. back to redeliver messages 4. dispatch messages - ordering was broken ### Modifications Stop to dispatch when skip message temporally since Key_Shared consumer stuck on delivery.
lbenc135
pushed a commit
to lbenc135/pulsar
that referenced
this pull request
Sep 5, 2020
…ey_Shared consumer stuck on delivery (apache#7553) ### Motivation In some case of Key_Shared consumer, messages ordering was broken. Here is how to reproduce(I think it is one of case to reproduce this issue). 1. Connect Consumer1 to Key_Shared subscription `sub` and stop to receive - receiverQueueSize: 500 2. Connect Producer and publish 500 messages with key `(i % 10)` 3. Connect Consumer2 to same subscription and start to receive - receiverQueueSize: 1 - since apache#7106 , Consumer2 can't receive (expected) 4. Producer publish more 500 messages with same key generation algorithm 5. After that, Consumer1 start to receive 6. Check Consumer2 message ordering - sometimes message ordering was broken in same key Consumer1: ``` Connected: Tue Jul 14 09:36:39 JST 2020 [pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider [pulsar-timer-4-1] INFO org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - [persistent://public/default/key-shared-test] [sub0] [820f0] Prefetched messages: 499 --- Consume throughput received: 0.02 msgs/s --- 0.00 Mbit/s --- Ack sent rate: 0.00 ack/s --- Failed messages: 0 --- batch messages: 0 ---Failed acks: 0 Received: my-message-0 PublishTime: 1594687006203 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-1 PublishTime: 1594687006243 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-2 PublishTime: 1594687006247 Date: Tue Jul 14 09:37:46 JST 2020 ... Received: my-message-498 PublishTime: 1594687008727 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-499 PublishTime: 1594687008731 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-500 PublishTime: 1594687038742 Date: Tue Jul 14 09:37:46 JST 2020 ... Received: my-message-990 PublishTime: 1594687040094 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-994 PublishTime: 1594687040103 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-995 PublishTime: 1594687040105 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-997 PublishTime: 1594687040113 Date: Tue Jul 14 09:37:46 JST 2020 ``` Consumer2: ``` Connected: Tue Jul 14 09:37:03 JST 2020 [pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider Received: my-message-501 MessageId: 4:1501:-1 PublishTime: 1594687038753 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-502 MessageId: 4:1502:-1 PublishTime: 1594687038755 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-503 MessageId: 4:1503:-1 PublishTime: 1594687038759 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-506 MessageId: 4:1506:-1 PublishTime: 1594687038785 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-508 MessageId: 4:1508:-1 PublishTime: 1594687038812 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-901 MessageId: 4:1901:-1 PublishTime: 1594687039871 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-509 MessageId: 4:1509:-1 PublishTime: 1594687038815 Date: Tue Jul 14 09:37:46 JST 2020 ordering was broken, key: 1 oldNum: 901 newNum: 511 Received: my-message-511 MessageId: 4:1511:-1 PublishTime: 1594687038826 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-512 MessageId: 4:1512:-1 PublishTime: 1594687038830 Date: Tue Jul 14 09:37:46 JST 2020 ... ``` I think this issue is caused by apache#7105. Here is an example. 1. dispatch messages 2. Consumer2 was stuck and `totalMessagesSent=0` - Consumer2 availablePermits was 0 3. skip redeliver messages temporally - Consumer2 availablePermits was back to 1 4. dispatch new messages - new message was dispatched to Consumer2 5. back to redeliver messages 4. dispatch messages - ordering was broken ### Modifications Stop to dispatch when skip message temporally since Key_Shared consumer stuck on delivery.
wolfstudy
pushed a commit
that referenced
this pull request
Oct 30, 2020
…ey_Shared consumer stuck on delivery (#7553) ### Motivation In some case of Key_Shared consumer, messages ordering was broken. Here is how to reproduce(I think it is one of case to reproduce this issue). 1. Connect Consumer1 to Key_Shared subscription `sub` and stop to receive - receiverQueueSize: 500 2. Connect Producer and publish 500 messages with key `(i % 10)` 3. Connect Consumer2 to same subscription and start to receive - receiverQueueSize: 1 - since #7106 , Consumer2 can't receive (expected) 4. Producer publish more 500 messages with same key generation algorithm 5. After that, Consumer1 start to receive 6. Check Consumer2 message ordering - sometimes message ordering was broken in same key Consumer1: ``` Connected: Tue Jul 14 09:36:39 JST 2020 [pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider [pulsar-timer-4-1] INFO org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - [persistent://public/default/key-shared-test] [sub0] [820f0] Prefetched messages: 499 --- Consume throughput received: 0.02 msgs/s --- 0.00 Mbit/s --- Ack sent rate: 0.00 ack/s --- Failed messages: 0 --- batch messages: 0 ---Failed acks: 0 Received: my-message-0 PublishTime: 1594687006203 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-1 PublishTime: 1594687006243 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-2 PublishTime: 1594687006247 Date: Tue Jul 14 09:37:46 JST 2020 ... Received: my-message-498 PublishTime: 1594687008727 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-499 PublishTime: 1594687008731 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-500 PublishTime: 1594687038742 Date: Tue Jul 14 09:37:46 JST 2020 ... Received: my-message-990 PublishTime: 1594687040094 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-994 PublishTime: 1594687040103 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-995 PublishTime: 1594687040105 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-997 PublishTime: 1594687040113 Date: Tue Jul 14 09:37:46 JST 2020 ``` Consumer2: ``` Connected: Tue Jul 14 09:37:03 JST 2020 [pulsar-client-io-1-1] WARN com.scurrilous.circe.checksum.Crc32cIntChecksum - Failed to load Circe JNI library. Falling back to Java based CRC32c provider Received: my-message-501 MessageId: 4:1501:-1 PublishTime: 1594687038753 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-502 MessageId: 4:1502:-1 PublishTime: 1594687038755 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-503 MessageId: 4:1503:-1 PublishTime: 1594687038759 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-506 MessageId: 4:1506:-1 PublishTime: 1594687038785 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-508 MessageId: 4:1508:-1 PublishTime: 1594687038812 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-901 MessageId: 4:1901:-1 PublishTime: 1594687039871 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-509 MessageId: 4:1509:-1 PublishTime: 1594687038815 Date: Tue Jul 14 09:37:46 JST 2020 ordering was broken, key: 1 oldNum: 901 newNum: 511 Received: my-message-511 MessageId: 4:1511:-1 PublishTime: 1594687038826 Date: Tue Jul 14 09:37:46 JST 2020 Received: my-message-512 MessageId: 4:1512:-1 PublishTime: 1594687038830 Date: Tue Jul 14 09:37:46 JST 2020 ... ``` I think this issue is caused by #7105. Here is an example. 1. dispatch messages 2. Consumer2 was stuck and `totalMessagesSent=0` - Consumer2 availablePermits was 0 3. skip redeliver messages temporally - Consumer2 availablePermits was back to 1 4. dispatch new messages - new message was dispatched to Consumer2 5. back to redeliver messages 4. dispatch messages - ordering was broken ### Modifications Stop to dispatch when skip message temporally since Key_Shared consumer stuck on delivery. (cherry picked from commit c7ac08b)
4 tasks
poorbarcode
added a commit
that referenced
this pull request
May 19, 2023
…essage skip to avoid unnecessary consumption stuck (#20335) ### Motivation - #7105 provide a mechanism to avoid a stuck consumer affecting the consumption of other consumers: - if all consumers can not accept more messages, stop delivering messages to the client. - if one consumer can not accept more messages, just read new messages and deliver them to other consumers. - #7553 provide a mechanism to fix the issue of lost order of consumption: If the consumer cannot accept any more messages, skip the consumer for the next round of message delivery because there may be messages with the same key in the replay queue. - #10762 provide a mechanism to fix the issue of lost order of consumption: If there have any messages with the same key in the replay queue, do not deliver the new messages to this consumer. #10762 and #7553 do the same thing and #10762 is better than #7553 , so #7553 is unnecessary. ### Modifications remove the mechanism provided by #7553 to avoid unnecessary consumption stuck.
poorbarcode
added a commit
that referenced
this pull request
May 19, 2023
…essage skip to avoid unnecessary consumption stuck (#20335) - #7105 provide a mechanism to avoid a stuck consumer affecting the consumption of other consumers: - if all consumers can not accept more messages, stop delivering messages to the client. - if one consumer can not accept more messages, just read new messages and deliver them to other consumers. - #7553 provide a mechanism to fix the issue of lost order of consumption: If the consumer cannot accept any more messages, skip the consumer for the next round of message delivery because there may be messages with the same key in the replay queue. - #10762 provide a mechanism to fix the issue of lost order of consumption: If there have any messages with the same key in the replay queue, do not deliver the new messages to this consumer. #10762 and #7553 do the same thing and #10762 is better than #7553 , so #7553 is unnecessary. remove the mechanism provided by #7553 to avoid unnecessary consumption stuck. (cherry picked from commit 1e664b7)
Technoboy-
pushed a commit
that referenced
this pull request
May 24, 2023
…essage skip to avoid unnecessary consumption stuck (#20335) ### Motivation - #7105 provide a mechanism to avoid a stuck consumer affecting the consumption of other consumers: - if all consumers can not accept more messages, stop delivering messages to the client. - if one consumer can not accept more messages, just read new messages and deliver them to other consumers. - #7553 provide a mechanism to fix the issue of lost order of consumption: If the consumer cannot accept any more messages, skip the consumer for the next round of message delivery because there may be messages with the same key in the replay queue. - #10762 provide a mechanism to fix the issue of lost order of consumption: If there have any messages with the same key in the replay queue, do not deliver the new messages to this consumer. #10762 and #7553 do the same thing and #10762 is better than #7553 , so #7553 is unnecessary. ### Modifications remove the mechanism provided by #7553 to avoid unnecessary consumption stuck.
lhotari
pushed a commit
to datastax/pulsar
that referenced
this pull request
May 29, 2023
…essage skip to avoid unnecessary consumption stuck (apache#20335) - apache#7105 provide a mechanism to avoid a stuck consumer affecting the consumption of other consumers: - if all consumers can not accept more messages, stop delivering messages to the client. - if one consumer can not accept more messages, just read new messages and deliver them to other consumers. - apache#7553 provide a mechanism to fix the issue of lost order of consumption: If the consumer cannot accept any more messages, skip the consumer for the next round of message delivery because there may be messages with the same key in the replay queue. - apache#10762 provide a mechanism to fix the issue of lost order of consumption: If there have any messages with the same key in the replay queue, do not deliver the new messages to this consumer. apache#10762 and apache#7553 do the same thing and apache#10762 is better than apache#7553 , so apache#7553 is unnecessary. remove the mechanism provided by apache#7553 to avoid unnecessary consumption stuck. (cherry picked from commit 1e664b7) (cherry picked from commit c973603)
poorbarcode
added a commit
that referenced
this pull request
May 30, 2023
…essage skip to avoid unnecessary consumption stuck (#20335) ### Motivation - #7105 provide a mechanism to avoid a stuck consumer affecting the consumption of other consumers: - if all consumers can not accept more messages, stop delivering messages to the client. - if one consumer can not accept more messages, just read new messages and deliver them to other consumers. - #7553 provide a mechanism to fix the issue of lost order of consumption: If the consumer cannot accept any more messages, skip the consumer for the next round of message delivery because there may be messages with the same key in the replay queue. - #10762 provide a mechanism to fix the issue of lost order of consumption: If there have any messages with the same key in the replay queue, do not deliver the new messages to this consumer. #10762 and #7553 do the same thing and #10762 is better than #7553 , so #7553 is unnecessary. ### Modifications remove the mechanism provided by #7553 to avoid unnecessary consumption stuck. (cherry picked from commit 1e664b7)
poorbarcode
added a commit
to poorbarcode/pulsar
that referenced
this pull request
Mar 16, 2024
poorbarcode
added a commit
to poorbarcode/pulsar
that referenced
this pull request
Mar 25, 2024
poorbarcode
added a commit
to poorbarcode/pulsar
that referenced
this pull request
Mar 28, 2024
poorbarcode
added a commit
to poorbarcode/pulsar
that referenced
this pull request
Mar 28, 2024
lhotari
added a commit
to lhotari/pulsar
that referenced
this pull request
Aug 16, 2024
This was referenced Aug 19, 2024
This was referenced Aug 26, 2024
Merged
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
type/enhancement
The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Note: this is based on top of #6791 & #7104. Once these are merged, I'll rebase here. For the sake of this review, check commit 23d6dcbMotivation
If one consumer is slowly processing messages, this can prevent other consumers from making progress on the topic. Instead we're in a loop of keep trying to replay messages without being able to dispatch any message.
The basic idea here is that we can make progress by keep going through the topic and dispatch these messages to the consumers that are free, at least the keys that belong to them.