
[Bug] parseMessageMetadata error when broker entry metadata enable with high loading #22601

@semistone

Description


Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

  • 3.2.2
  • 3.1.2

Minimal reproduce step

Publish events at about 6k QPS and 100 Mbit/s, with message metadata, using BatcherBuilder.KEY_BASED batching, and send messages from highly concurrent/parallel producer processes.
The problem only occurs for near-real-time consumers (almost zero backlog).
A sketch of the producer setup is shown below.
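
For reference, here is a minimal sketch of the kind of producer setup we use, not our actual code. The service URL, topic name, and key distribution are placeholders, and it assumes broker entry metadata is enabled on the broker side (brokerEntryMetadataInterceptors in broker.conf). In reality many producer processes run this loop in parallel.

import org.apache.pulsar.client.api.BatcherBuilder;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class KeyBasedProducerSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder service URL; the real deployment publishes ~6k QPS / 100 Mbit/s in total.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Key-based batching, as in the report.
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/stress-topic")
                .enableBatching(true)
                .batcherBuilder(BatcherBuilder.KEY_BASED)
                .create();

        for (int i = 0; i < 1_000_000; i++) {
            producer.newMessage()
                    .key("key-" + (i % 1024))            // placeholder key distribution
                    .value(("event-" + i).getBytes())
                    .sendAsync();                        // async sends for high concurrency
        }

        producer.flush();
        producer.close();
        client.close();
    }
}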

What did you expect to see?

No lost events.

What did you see instead?

The broker logs the error
Failed to peek sticky key from the message metadata

It looks like a thread-safety issue, because it happens randomly: out of 1M events it only happens a few times, but the consumer loses those few events.

Anything else?

The error is similar to
#10967, but I think it is a different issue.

The data in BookKeeper is correct:
I can download the entry from BookKeeper and parse it successfully,
or consume the same event a few minutes later and it is consumed successfully.
However, all subscriptions get the same error on the same event when consuming in real time (zero backlog).

I have traced the source code. The failure happens in
PersistentDispatcherMultipleConsumers.readEntriesComplete -> AbstractBaseDispatcher.filterEntriesForConsumer
-> Commands.peekAndCopyMessageMetadata
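
Commands.peekAndCopyMessageMetadata has to recognize the entry layout by a 2-byte magic at the current readerIndex (optionally the broker entry metadata header first, then the checksum and message metadata). While debugging I check the magic before peeking; the helper below is only my own sketch, and the constant values reflect my reading of Commands.java, so treat them as assumptions.

import io.netty.buffer.ByteBuf;

// Hypothetical debug helper: peek the 2-byte magic at the current readerIndex
// without consuming it. 0x0e01 is magicCrc32c (checksum + message metadata),
// 0x0e02 is magicBrokerEntryMetadata (values taken from Commands.java).
static boolean startsWithKnownMagic(ByteBuf metadataAndPayload) {
    if (metadataAndPayload.readableBytes() < 2) {
        return false;
    }
    short magic = metadataAndPayload.getShort(metadataAndPayload.readerIndex());
    return magic == (short) 0x0e01 || magic == (short) 0x0e02;
}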

I also printed the ByteBuf contents, and the data clearly differs from what is stored in BookKeeper.

In a normal event, the hex dump usually starts with 010e (magicCrc32c):

0000000      010e    9529    5fbc    0000    0a03    3a0a    6e69    7267

In one of the broken events, the ByteBuf starts with about 48 bytes of garbage, then continues with the normal data:

0000000      0000    a610    0000    0000    0200    7239    0000    0000 <== garbage starts here
0000020      0200    1339    0000    0000    ea17    a8b0    8b8e    fa5e
0000040      2af0    2675    f645    1623    d17e    dc34    526d    ef44 <== garbage ends here
0000060      010e    9529    5fbc    0000    0a03    3a0a    6e69    7267 <== normal data resumes here

This is just one example; sometimes the first few bytes are correct and the corruption starts a few bytes later.

I am still trying to debug when and how the ByteBuf returns incorrect data, and why it only happens during stress testing. It is still not easy to reproduce using the perf tool, but we can 100% reproduce it in our producer code.
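
As a debugging aid, I plan to dump the buffer state whenever the peek fails, so the reference count and raw bytes can be compared with the entry read back from BookKeeper. This is only a sketch of instrumentation I would add around the dispatcher, not existing Pulsar code.

import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch of debug instrumentation: log the suspect entry's reference count,
// reader index, and contents for comparison with what is stored in BookKeeper.
class EntryDebug {
    private static final Logger log = LoggerFactory.getLogger(EntryDebug.class);

    static void dumpSuspectEntry(long ledgerId, long entryId, ByteBuf metadataAndPayload) {
        log.warn("Suspect entry {}:{} refCnt={} readerIndex={} readableBytes={}\n{}",
                ledgerId, entryId,
                metadataAndPayload.refCnt(),
                metadataAndPayload.readerIndex(),
                metadataAndPayload.readableBytes(),
                ByteBufUtil.prettyHexDump(metadataAndPayload));
    }
}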

Does anyone have an idea what could be causing this issue, or which library or class might have a problem? Any suggestions on how to debug it, or what specific information I should print to help identify the root cause, would also be appreciated. Thank you.

Are you willing to submit a PR?

  • I'm willing to submit a PR!
