-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[fix][broker][branch-4.0] Fix failed testFinishTakeSnapshotWhenTopicLoading due to topic future cache conflicts #24947
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… topic future cache conflicts
lhotari
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
/pulsarbot rerun-failure-checks |
Just wondering if there should be code to handle this. Similar as in pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java Lines 1389 to 1395 in 74931c9
|
|
Well, there has been a debate before in https://github.com/apache/pulsar/pull/24500/files#r2199584548 The timeout failure usually represents a unknown state. Removing it from the cache cannot cancel the ongoing topic loading process in background. There could be two concurrent topic loading tasks if we added the logic, some tests might fail (you can see the changes in #24500). We have to take care of this case. Topic loading for non-persistent topics is much more simple than persistent topics, so actually the timeout exception hardly happen. |
@BewareMyPower Yes, I agree. I came to the same conclusion when looking into the code. However, there's also problems in the current CompletableFuture & timeout based solution. For example, if the topic has already opened the ledger when the timeout occurs, it won't be able to complete the future successfully. The topic needs to be closed after it just got successfully opened: pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java Lines 1896 to 1914 in 39bb675
This doesn't make much sense. Instead of using CompletableFuture values in maps, a better approach would be to have some Entry class that holds more related state. There could be a CompletableFuture inside the Entry class. |
|
IMO, we should not apply a timeout on topic loading. When the timeout happens, there must be a bug or the CPU resource is exhausted. You can see In production env, the timeout is usually caused by topic replay because the metadata store operation should not be such slow:
The 1st case is usually caused by a bug with system topic based topic policies service. It is hard to say whether retrying helps. The 2nd case is usually caused by the snapshot was not taken in time. Retry the topic replay never helps for the 2nd case. That's also the motivation of #23004, which introduces I've been working on this part before but I didn't have time recently. Here are some local notes I wrote before. https://gist.github.com/BewareMyPower/c7f26eb2b5741c94d8cf771c4dc35a97 |
…oading due to topic future cache conflicts (apache#24947) (cherry picked from commit 6e3d5d8)
…oading due to topic future cache conflicts (apache#24947) (cherry picked from commit 6e3d5d8)
Motivation
The cherry-picking on a399b5c inserts the topic future cache duplicatedly.
1st time:
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java
Line 1129 in 74931c9
or
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java
Line 1141 in 74931c9
2nd time:
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java
Line 1146 in 74931c9
In this case, topic future might fail with timeout. This error was exposed after dd149d4 because the future completed with
LowOverheadTimeoutExceptionwon't trigger the cache invalidation.Modifications
Move the
fetchPartitionedTopicMetadataAsynccheck toloadOrCreatePersistentTopic.Documentation
docdoc-requireddoc-not-neededdoc-completeMatching PR in forked repository
PR in forked repository: