Search before asking
Version
2.9
Minimal reproduce step
stack:
18:00:50.888 [pulsar-transaction-executor-17-1] ERROR org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer - [persistent://public/default/g-partition-0] Transaction buffer recover fail by read transactionBufferSnapshot timeout!
java.util.concurrent.TimeoutException: null
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
at org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer$TopicTransactionBufferRecover.lambda$run$0(TopicTransactionBuffer.java:692) ~[org.apache.pulsar-pulsar-broker-2.9.5.4.jar:2.9.5.4]
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:714) ~[?:?]
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.66.Final.jar:4.1.66.Final]
at java.lang.Thread.run(Thread.java:834) ~[?:?]
18:00:50.888 [pulsar-transaction-executor-17-1] WARN org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer - Closing topic persistent://public/default/g-partition-0 due to read transaction buffer snapshot while recovering the transaction buffer throw exception
org.apache.pulsar.broker.service.BrokerServiceException$ServiceUnitNotReadyException: [persistent://public/default/g-partition-0] Transaction buffer recover fail by read transactionBufferSnapshot timeout!
at org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer$TopicTransactionBufferRecover.lambda$run$0(TopicTransactionBuffer.java:713) ~[org.apache.pulsar-pulsar-broker-2.9.5.4.jar:2.9.5.4]
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:714) ~[?:?]
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.66.Final.jar:4.1.66.Final]
at java.lang.Thread.run(Thread.java:834) ~[?:?]
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
at org.apache.pulsar.broker.transaction.buffer.impl.TopicTransactionBuffer$TopicTransactionBufferRecover.lambda$run$0(TopicTransactionBuffer.java:692) ~[org.apache.pulsar-pulsar-broker-2.9.5.4.jar:2.9.5.4]
... 6 more
18:00:51.189 [pulsar-io-32-32] INFO org.apache.pulsar.client.impl.ConsumerImpl - [persistent://public/default/__transaction_buffer_snapshot-partition-2] [multiTopicsReader-ef4f7acfd4] Closed consumer
topic snapshot recover fail and can't server for client because of the timeout exception for read. But why the read operation will time out? I find that the method reader.hasMessageAvailable() return true, but reader.readNext can't return any messages.
Further, i notice that the reader try to read messages from compacted topic because there is no content in the normal topic backlog. so i find out the compacted ledger id as follows:

I read out the entry of compacted ledger from bookie and deserialize the data, find out that all messages in these 6 entry are compacted out, which means there is no payload in it and will be dropped by client automatically.

So the reason of the problem is clear: reader.hasMessageAvailable() get the last message id in the compacted topic, which may be a completely compacted out message, the code pass through the reader.hasMessageAvailable() check, but it will not receive any valid data, so the topic recovery code stuck.
But all messages in one entry is compacted out is beyond expect, because we have some filtering logic in org.apache.pulsar.client.impl.RawBatchConverter#rebatchMessage:
if (messagesRetained > 0) {
...
return result;
} else {
return Optional.empty();
}
If all messages in one entry is compacted out, we the variable messagesRetained should 0, and we should return an empty message back to write.
But it did exist some entries with all compacted messages! So there must be a bug in it.
I find out a bug that can result into this situations:
- At the first time of compaction, we write a entry with some valid messages and some compacted messages, such as
Entry1{ msg1{payload}, msg2{empty}}
- At the second time of compaction, we read
Entry1 and find that the msg1 need to be compacted, so we execute
Commands.serializeSingleMessageInBatchWithPayload(emptyMetadata,
Unpooled.EMPTY_BUFFER, batchBuffer);
and keep messagesRetained be 0. Then we meet msg2{empty}, it do not have the partition key as well! For example:

So we enter following logic:
if (!singleMessageMetadata.hasPartitionKey()) {
messagesRetained++;
Commands.serializeSingleMessageInBatchWithPayload(singleMessageMetadata,
singleMessagePayload, batchBuffer);
}
We increment the messagesRetained but we write a empty payload! As the messagesRetained is greater than 0, we pass through the check of messagesRetained, we write such a empty entry into the compacted topic.
What did you expect to see?
~
What did you see instead?
~
Anything else?
No response
Are you willing to submit a PR?
Search before asking
Version
2.9
Minimal reproduce step
stack:
topic snapshot recover fail and can't server for client because of the timeout exception for read. But why the read operation will time out? I find that the method
reader.hasMessageAvailable()return true, butreader.readNextcan't return any messages.Further, i notice that the reader try to read messages from compacted topic because there is no content in the normal topic backlog. so i find out the compacted ledger id as follows:
I read out the entry of compacted ledger from bookie and deserialize the data, find out that all messages in these 6 entry are

compacted out, which means there is no payload in it and will be dropped by client automatically.So the reason of the problem is clear:
reader.hasMessageAvailable()get the last message id in the compacted topic, which may be a completelycompacted outmessage, the code pass through thereader.hasMessageAvailable()check, but it will not receive any valid data, so the topic recovery code stuck.But all messages in one entry is compacted out is beyond expect, because we have some filtering logic in
org.apache.pulsar.client.impl.RawBatchConverter#rebatchMessage:If all messages in one entry is compacted out, we the variable
messagesRetainedshould 0, and we should return an empty message back to write.But it did exist some entries with all compacted messages! So there must be a bug in it.
I find out a bug that can result into this situations:
Entry1and find that themsg1need to be compacted, so we executeand keep

messagesRetainedbe 0. Then we meetmsg2{empty}, it do not have the partition key as well! For example:So we enter following logic:
We increment the
messagesRetainedbut we write a empty payload! As themessagesRetainedis greater than 0, we pass through the check ofmessagesRetained, we write such a empty entry into the compacted topic.What did you expect to see?
~
What did you see instead?
~
Anything else?
No response
Are you willing to submit a PR?