Skip to content

[Bug] [broker] empty entry writed into the compacted topic #21916

@thetumbled

Description

@thetumbled

Search before asking

  • I searched in the issues and found nothing similar.

Version

2.9

Minimal reproduce step

stack:

18:00:50.888 [pulsar-transaction-executor-17-1] ERROR org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer - [persistent://public/default/g-partition-0] Transaction buffer recover fail by read transactionBufferSnapshot timeout!
java.util.concurrent.TimeoutException: null
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
        at org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer$TopicTransactionBufferRecover.lambda$run$0(TopicTransactionBuffer.java:692) ~[org.apache.pulsar-pulsar-broker-2.9.5.4.jar:2.9.5.4]
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:714) ~[?:?]
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.66.Final.jar:4.1.66.Final]
        at java.lang.Thread.run(Thread.java:834) ~[?:?]
18:00:50.888 [pulsar-transaction-executor-17-1] WARN  org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer - Closing topic persistent://public/default/g-partition-0 due to read transaction buffer snapshot while recovering the transaction buffer throw exception
org.apache.pulsar.broker.service.BrokerServiceException$ServiceUnitNotReadyException: [persistent://public/default/g-partition-0] Transaction buffer recover fail by read transactionBufferSnapshot timeout!
        at org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer$TopicTransactionBufferRecover.lambda$run$0(TopicTransactionBuffer.java:713) ~[org.apache.pulsar-pulsar-broker-2.9.5.4.jar:2.9.5.4]
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:714) ~[?:?]
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.66.Final.jar:4.1.66.Final]
        at java.lang.Thread.run(Thread.java:834) ~[?:?]
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
        at org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer$TopicTransactionBufferRecover.lambda$run$0(TopicTransactionBuffer.java:692) ~[org.apache.pulsar-pulsar-broker-2.9.5.4.jar:2.9.5.4]
        ... 6 more

18:00:51.189 [pulsar-io-32-32] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://public/default/__transaction_buffer_snapshot-partition-2] [multiTopicsReader-ef4f7acfd4] Closed consumer

topic snapshot recover fail and can't server for client because of the timeout exception for read. But why the read operation will time out? I find that the method reader.hasMessageAvailable() return true, but reader.readNext can't return any messages.

Further, i notice that the reader try to read messages from compacted topic because there is no content in the normal topic backlog. so i find out the compacted ledger id as follows:

image

I read out the entry of compacted ledger from bookie and deserialize the data, find out that all messages in these 6 entry are compacted out, which means there is no payload in it and will be dropped by client automatically.
image

So the reason of the problem is clear: reader.hasMessageAvailable() get the last message id in the compacted topic, which may be a completely compacted out message, the code pass through the reader.hasMessageAvailable() check, but it will not receive any valid data, so the topic recovery code stuck.

But all messages in one entry is compacted out is beyond expect, because we have some filtering logic in org.apache.pulsar.client.impl.RawBatchConverter#rebatchMessage:

            if (messagesRetained > 0) {
                ...
                return result;
            } else {
                return Optional.empty();
            }

If all messages in one entry is compacted out, we the variable messagesRetained should 0, and we should return an empty message back to write.
But it did exist some entries with all compacted messages! So there must be a bug in it.

I find out a bug that can result into this situations:

  • At the first time of compaction, we write a entry with some valid messages and some compacted messages, such as
Entry1{ msg1{payload}, msg2{empty}}
  • At the second time of compaction, we read Entry1 and find that the msg1 need to be compacted, so we execute
Commands.serializeSingleMessageInBatchWithPayload(emptyMetadata,
                                                                      Unpooled.EMPTY_BUFFER, batchBuffer);

and keep messagesRetained be 0. Then we meet msg2{empty}, it do not have the partition key as well! For example:
image
So we enter following logic:

                if (!singleMessageMetadata.hasPartitionKey()) {
                    messagesRetained++;
                    Commands.serializeSingleMessageInBatchWithPayload(singleMessageMetadata,
                                                                      singleMessagePayload, batchBuffer);
                }

We increment the messagesRetained but we write a empty payload! As the messagesRetained is greater than 0, we pass through the check of messagesRetained, we write such a empty entry into the compacted topic.

What did you expect to see?

~

What did you see instead?

~

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Metadata

Metadata

Assignees

Labels

type/bugThe PR fixed a bug or issue reported a bug

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions