[fix][broker] Avoid being stuck in 30+ seconds when closing the BrokerService#31
Closed
BewareMyPower wants to merge 2 commits into
Closed
[fix][broker] Avoid being stuck in 30+ seconds when closing the BrokerService#31BewareMyPower wants to merge 2 commits into
BewareMyPower wants to merge 2 commits into
Conversation
…rService Fixes apache#22569 ### Motivation `BrokerService#closeAsync` calls `unloadNamespaceBundlesGracefully` to unload namespaces gracefully. With extensible load manager, it eventually calls `TableViewLoadDataStoreImpl#validateProducer`: ``` BrokerService#unloadNamespaceBundlesGracefully ExtensibleLoadManagerWrapper#disableBroker ExtensibleLoadManagerImpl#disableBroker ServiceUnitStateChannelImpl#cleanOwnerships ServiceUnitStateChannelImpl#doCleanup TableViewLoadDataStoreImpl#removeAsync TableViewLoadDataStoreImpl#validateProducer ``` In `validateProducer`, if the producer is not connected, it will recreate the producer synchronously. However, since the state of `PulsarService` has already been changed to `Closing`, all connect or lookup requests will fail with `ServiceNotReady`. Then the client will retry until timeout. Besides, the unload operation could also trigger the reconnection because the extensible load manager sends the unload event to the `loadbalancer-service-unit-state` topic. ### Modifications The major fix: Before changing PulsarService's state to `Closing`, call `BrokerService#unloadNamespaceBundlesGracefully` first to make the load manager complete the unload operations first. Minor fixes: - Record the time when `LoadManager#disableBroker` is done. - Don't check if producer is disconnected because the producer could retry if it's disconnected. ### Verifications Add `ExtensibleLoadManagerCloseTest` to verify closing `PulsarService` won't take too much time. Here are some test results locally: ``` 2024-04-24T19:43:38,851 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3342, 3276, 3310] 2024-04-24T19:44:26,711 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3357, 3258, 3298] 2024-04-24T19:46:16,791 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3313, 3257, 3263] 2024-04-24T20:13:05,763 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3304, 3279, 3299] 2024-04-24T20:13:43,979 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3343, 3308, 3310] ``` As you can see, each broker takes only about 3 seconds to close due to `OWNERSHIP_CLEAN_UP_CONVERGENCE_DELAY_IN_MILLIS` value added in apache#20315
4 tasks
Owner
Author
|
Failed tests: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes apache#22569
Motivation
BrokerService#closeAsynccallsunloadNamespaceBundlesGracefullyto unload namespaces gracefully. With extensible load manager, it eventually callsTableViewLoadDataStoreImpl#validateProducer:In
validateProducer, if the producer is not connected, it will recreate the producer synchronously. However, since the state ofPulsarServicehas already been changed toClosing, all connect or lookup requests will fail withServiceNotReady. Then the client will retry until timeout.Besides, the unload operation could also trigger the reconnection because the extensible load manager sends the unload event to the
loadbalancer-service-unit-statetopic.Modifications
The major fix:
Before changing PulsarService's state to
Closing, callBrokerService#unloadNamespaceBundlesGracefullyfirst to make the load manager complete the unload operations first.Minor fixes:
LoadManager#disableBrokeris done.Verifications
Add
ExtensibleLoadManagerCloseTestto verify closingPulsarServicewon't take too much time. Here are some test results locally:As you can see, each broker takes only about 3 seconds to close due to
OWNERSHIP_CLEAN_UP_CONVERGENCE_DELAY_IN_MILLISvalue added in apache#20315