[fix][broker] Avoid being stuck when closing the broker with extensible load manager#22573
Merged
lhotari merged 2 commits intoApr 26, 2024
Conversation
…rService Fixes apache#22569 ### Motivation `BrokerService#closeAsync` calls `unloadNamespaceBundlesGracefully` to unload namespaces gracefully. With extensible load manager, it eventually calls `TableViewLoadDataStoreImpl#validateProducer`: ``` BrokerService#unloadNamespaceBundlesGracefully ExtensibleLoadManagerWrapper#disableBroker ExtensibleLoadManagerImpl#disableBroker ServiceUnitStateChannelImpl#cleanOwnerships ServiceUnitStateChannelImpl#doCleanup TableViewLoadDataStoreImpl#removeAsync TableViewLoadDataStoreImpl#validateProducer ``` In `validateProducer`, if the producer is not connected, it will recreate the producer synchronously. However, since the state of `PulsarService` has already been changed to `Closing`, all connect or lookup requests will fail with `ServiceNotReady`. Then the client will retry until timeout. Besides, the unload operation could also trigger the reconnection because the extensible load manager sends the unload event to the `loadbalancer-service-unit-state` topic. ### Modifications The major fix: Before changing PulsarService's state to `Closing`, call `BrokerService#unloadNamespaceBundlesGracefully` first to make the load manager complete the unload operations first. Minor fixes: - Record the time when `LoadManager#disableBroker` is done. - Don't check if producer is disconnected because the producer could retry if it's disconnected. ### Verifications Add `ExtensibleLoadManagerCloseTest` to verify closing `PulsarService` won't take too much time. Here are some test results locally: ``` 2024-04-24T19:43:38,851 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3342, 3276, 3310] 2024-04-24T19:44:26,711 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3357, 3258, 3298] 2024-04-24T19:46:16,791 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3313, 3257, 3263] 2024-04-24T20:13:05,763 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3304, 3279, 3299] 2024-04-24T20:13:43,979 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3343, 3308, 3310] ``` As you can see, each broker takes only about 3 seconds to close due to `OWNERSHIP_CLEAN_UP_CONVERGENCE_DELAY_IN_MILLIS` value added in apache#20315
cbf5ac0 to
a29d3b9
Compare
Contributor
|
OWNERSHIP_CLEAN_UP_CONVERGENCE_DELAY_IN_MILLIS I think we can actually remove this. This was added to wait for some time after bundles are unloaded, but I don't think it is necessary. |
Contributor
Author
Agreed. We can remove it in another PR. |
heesung-sohn
approved these changes
Apr 25, 2024
lhotari
reviewed
Apr 26, 2024
lhotari
approved these changes
Apr 26, 2024
RobertIndie
approved these changes
Apr 26, 2024
poorbarcode
approved these changes
Apr 26, 2024
nikhil-ctds
pushed a commit
to datastax/pulsar
that referenced
this pull request
May 15, 2024
…le load manager (apache#22573) (cherry picked from commit f411e3c) (cherry picked from commit db43414)
srinath-ctds
pushed a commit
to datastax/pulsar
that referenced
this pull request
May 16, 2024
…le load manager (apache#22573) (cherry picked from commit f411e3c) (cherry picked from commit db43414)
4 tasks
2 tasks
hanmz
pushed a commit
to hanmz/pulsar
that referenced
this pull request
Feb 12, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #22569
Motivation
BrokerService#closeAsynccallsunloadNamespaceBundlesGracefullyto unload namespaces gracefully. With extensible load manager, it eventually callsTableViewLoadDataStoreImpl#validateProducer:In
validateProducer, if the producer is not connected, it will recreate the producer synchronously. However, since the state ofPulsarServicehas already been changed toClosing, all connect or lookup requests will fail withServiceNotReady. Then the client will retry until timeout.Besides, the unload operation could also trigger the reconnection because the extensible load manager sends the unload event to the
loadbalancer-service-unit-statetopic.Modifications
The major fix:
Before changing PulsarService's state to
Closing, callBrokerService#unloadNamespaceBundlesGracefullyfirst to make the load manager complete the unload operations first.Minor fixes:
LoadManager#disableBrokeris done.Verifications
Add
ExtensibleLoadManagerCloseTestto verify closingPulsarServicewon't take too much time. Here are some test results locally:As you can see, each broker takes only about 3 seconds to close due to
OWNERSHIP_CLEAN_UP_CONVERGENCE_DELAY_IN_MILLISvalue added in #20315Documentation
docdoc-requireddoc-not-neededdoc-completeMatching PR in forked repository
PR in forked repository: BewareMyPower#31