Description
Search before asking
- I searched in the issues and found nothing similar.
Read release policy
- I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.
Version
- 3.2.2
- 3.1.2
Minimal reproduce step
publish events at about 6k QPS and ~100 Mbit/s
with message metadata (properties)
using BatcherBuilder.KEY_BASED batching mode on the producer
send messages from a highly concurrent / parallel producer process (a minimal producer sketch is shown below)
it happens only with near-real-time consumers (almost zero backlog)
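A minimal sketch of how the producer is configured and driven. The service URL, topic, key space, and payload size are placeholders rather than our real values; ~2 KB per message at ~6k msg/s roughly matches the ~100 Mbit/s rate:

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.BatcherBuilder;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class KeyBasedBatchLoad {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://broker:6650")            // placeholder service URL
                .build();

        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://tenant/ns/load-test")     // placeholder topic
                .enableBatching(true)
                .batcherBuilder(BatcherBuilder.KEY_BASED)      // key-based batching, as in this report
                .batchingMaxPublishDelay(1, TimeUnit.MILLISECONDS)
                .create();

        byte[] payload = new byte[2048];                       // ~2 KB/msg * ~6k msg/s ~= 100 Mbit/s
        for (int i = 0; i < 1_000_000; i++) {
            producer.newMessage()
                    .key("key-" + (i % 64))                    // spread messages over a set of keys
                    .property("source", "load-test")           // message metadata (properties)
                    .value(payload)
                    .sendAsync();                              // async publish to keep concurrency high
        }
        producer.flush();
        producer.close();
        client.close();
    }
}
```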
What did you expect to see?
no lost event
What did you see instead?
I could see this error logged in the broker:
Failed to peek sticky key from the message metadata
It looks like a thread-safety issue, because it happens randomly: out of 1M events it only happens a few times, but the consumer loses those few events.
Anything else?
The error is similar to
#10967 but I think it is a different issue.
The data in BookKeeper is correct: I can download the event from BookKeeper and parse it successfully, or consume the same event a few minutes later and it is consumed without error.
However, all subscriptions hit the same error on the same event when consuming in real time (zero backlog).
I have traced the source code. It happens in
PersistentDispatcherMultipleConsumers.readEntriesComplete -> AbstractBaseDispatcher.filterEntriesForConsumer -> Commands.peekAndCopyMessageMetadata
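For context, my understanding of the entry framing the broker expects at that point (a simplified sketch with my own helper names, not the actual Pulsar implementation; see Commands.peekAndCopyMessageMetadata for the real logic):

```java
import io.netty.buffer.ByteBuf;

/**
 * Simplified sketch of the entry framing the dispatcher relies on when it
 * peeks the message metadata. Helper names are mine, not Pulsar classes.
 */
final class EntryFramingSketch {

    // Value of Commands.magicCrc32c; written big-endian, so the leading
    // bytes of a healthy entry are 0x0e 0x01.
    private static final short MAGIC_CRC32C = 0x0e01;

    /**
     * Returns the reader index where the serialized MessageMetadata should
     * start, or -1 if the buffer does not look like a valid entry.
     */
    static int metadataStartIndex(ByteBuf entry) {
        int idx = entry.readerIndex();
        if (entry.readableBytes() < 2 + 4 + 4) {
            return -1;
        }
        if (entry.getShort(idx) != MAGIC_CRC32C) {
            // This is the situation in the corrupted dumps below: the first
            // bytes are garbage, so the metadata peek fails immediately.
            return -1;
        }
        // magic (2 bytes) + CRC32C checksum (4 bytes) + metadata size (4 bytes)
        return idx + 2 + 4 + 4;
    }
}
```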
I also printed the ByteBuf contents, and I can clearly see that the data is not the same as what is stored in BookKeeper.
For a normal event, the hex dump usually starts with 010e (magicCrc32c):
0000000 010e 9529 5fbc 0000 0a03 3a0a 6e69 7267
In one of the error events, the ByteBuf contains about 48 bytes of garbage, then continues with the normal data:
0000000 0000 a610 0000 0000 0200 7239 0000 0000 <== from here
0000020 0200 1339 0000 0000 ea17 a8b0 8b8e fa5e
0000040 2af0 2675 f645 1623 d17e dc34 526d ef44 <=== until here is garbage
0000060 010e 9529 5fbc 0000 0a03 3a0a 6e69 7267 <== from here is normal data
This is just one example; sometimes the first few bytes are correct and the corruption appears a few bytes later.
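For reference, this is roughly how I capture the dumps above (a minimal sketch; the helper name is mine, not Pulsar code). Logging the ref count and the reader/writer indexes together with the bytes may also help show whether the buffer was released or recycled concurrently:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Hypothetical debugging helper (not existing Pulsar code): dump the raw
 * entry bytes plus ref count and indexes whenever the metadata peek fails,
 * so the broker-side view can be compared with the bytes in BookKeeper.
 */
public final class EntryDumpDebug {
    private static final Logger log = LoggerFactory.getLogger(EntryDumpDebug.class);

    public static void dumpSuspectEntry(long ledgerId, long entryId, ByteBuf data) {
        log.warn("Suspect entry {}:{} refCnt={} readerIndex={} writerIndex={}\n{}",
                ledgerId, entryId, data.refCnt(), data.readerIndex(), data.writerIndex(),
                ByteBufUtil.prettyHexDump(data, data.readerIndex(), data.readableBytes()));
    }
}
```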
I am still trying to debug when and how the ByteBuf returns incorrect data, and why it only happens during stress testing. It is still not easy to reproduce using the perf tool, but we can 100% reproduce it in our producer code.
Does anyone have any idea what could be causing this issue, and any suggestions on which library or class may have potential issues? Additionally, any suggestions on how to debug this issue or if I need to print any specific information to help identify the root cause would be appreciated. Thank you.
Are you willing to submit a PR?
- I'm willing to submit a PR!