[improve][broker] Gracefully shut down load balancer extension#20315
Merged
Demogorgon314 merged 4 commits intoMay 22, 2023
Merged
Conversation
ed773f4 to
4ee4b99
Compare
heesung-sohn
commented
May 13, 2023
| */ | ||
| public LeaderElectionService getLeaderElectionService() { | ||
| return this.leaderElectionService; | ||
| if (ExtensibleLoadManagerImpl.isLoadManagerExtensionEnabled(config)) { |
| } | ||
|
|
||
| public CompletableFuture<Optional<String>> selectAsync(ServiceUnitId bundle, | ||
| Optional<Set<String>> excludeBrokerSet) { |
Contributor
There was a problem hiding this comment.
maybe we can just use empty set like Collections.emptySet() to replace Optional
Demogorgon314
approved these changes
May 13, 2023
lifepuzzlefun
approved these changes
May 13, 2023
f72b25f to
2e8eb09
Compare
c9cac97 to
a3f7b26
Compare
a3f7b26 to
0426d30
Compare
gaoran10
approved these changes
May 20, 2023
Technoboy-
pushed a commit
that referenced
this pull request
May 30, 2023
### Motivation Load balancer extension needs to shut down gracefully, especially when shutting down the leader broker. When the leader broker closes the leader election service too late, service unit lookups to the leader broker could fail during the shutdown. This could delay client reconnection time. Lookup failure logs ``` (shutdown starts) pulsar-broker-8 pulsar-broker 2023-04-22T00:19:52,630+0000 [pulsar-service-shutdown] INFO org.apache.pulsar.broker.PulsarService - Closing PulsarService pulsar-broker-3 pulsar-broker 2023-04-22T00:19:52,690+0000 [pulsar-io-4-5] INFO org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/system/loadbalancer-service-unit-state] [pulsar-18-9] Closed connection [id: 0x907e74b0, L:/10.0.13.6:35838 ! R:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local/10.0.18.18:6650] -- Will try again in 0.1 s pulsar-broker-3 pulsar-broker 2023-04-22T00:19:52,691+0000 [pulsar-io-4-5] INFO org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/system/loadbalancer-service-unit-state] [reader-30e6b40dd9] Closed connection [id: 0x907e74b0, L:/10.0.13.6:35838 ! R:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local/10.0.18.18:6650] -- Will try again in 0.1 s (znode is gone) pulsar-broker-8 pulsar-broker 2023-04-22T00:19:52,652+0000 [pulsar-load-manager-1-1] INFO org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateChannelImpl - BrokerRegistry detected the broker:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local:8080 registry has been deleted. (lookup failure) org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/system/loadbalancer-service-unit-state] [reader-30e6b40dd9] Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException$ConnectException: Disconnected from server at pulsar-broker-3.pulsar-broker.pulsar.svc.cluster.local/10.0.13.6:6650 -- Will try again in 0.193 s pulsar-broker-3 pulsar-broker 2023-04-22T00:19:52,827+0000 [main-EventThread] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup null for topic persistent://pulsar/system/loadbalancer-service-unit-state with error Failed to look up a broker registry:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local:8080 for bundle:pulsar/system/0x40000000_0x50000000 pulsar-broker-3 pulsar-broker 2023-04-22T00:19:52,827+0000 [main-EventThread] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup null for topic persistent://pulsar/system/loadbalancer-service-unit-state with error Failed to look up a broker registry:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local:8080 for bundle:pulsar/system/0x40000000_0x50000000 (leader election service has been closed) pulsar-broker-3 pulsar-broker 2023-04-22T00:19:55,569+0000 [pulsar-load-manager-1-1] INFO org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl - This broker:pulsar-broker-3.pulsar-broker.pulsar.svc.cluster.local:8080 plays the leader now. ``` ### Modifications During the shutdown flow, when calling loadbalancer.disableBroker(), the load balancer extension gracefully gives up the owned bundles to other brokers. After that, it closes the leader election service and removes its register in the metadata store.
BewareMyPower
added a commit
to BewareMyPower/pulsar
that referenced
this pull request
Apr 24, 2024
…rService Fixes apache#22569 ### Motivation `BrokerService#closeAsync` calls `unloadNamespaceBundlesGracefully` to unload namespaces gracefully. With extensible load manager, it eventually calls `TableViewLoadDataStoreImpl#validateProducer`: ``` BrokerService#unloadNamespaceBundlesGracefully ExtensibleLoadManagerWrapper#disableBroker ExtensibleLoadManagerImpl#disableBroker ServiceUnitStateChannelImpl#cleanOwnerships ServiceUnitStateChannelImpl#doCleanup TableViewLoadDataStoreImpl#removeAsync TableViewLoadDataStoreImpl#validateProducer ``` In `validateProducer`, if the producer is not connected, it will recreate the producer synchronously. However, since the state of `PulsarService` has already been changed to `Closing`, all connect or lookup requests will fail with `ServiceNotReady`. Then the client will retry until timeout. Besides, the unload operation could also trigger the reconnection because the extensible load manager sends the unload event to the `loadbalancer-service-unit-state` topic. ### Modifications The major fix: Before changing PulsarService's state to `Closing`, call `BrokerService#unloadNamespaceBundlesGracefully` first to make the load manager complete the unload operations first. Minor fixes: - Record the time when `LoadManager#disableBroker` is done. - Don't check if producer is disconnected because the producer could retry if it's disconnected. ### Verifications Add `ExtensibleLoadManagerCloseTest` to verify closing `PulsarService` won't take too much time. Here are some test results locally: ``` 2024-04-24T19:43:38,851 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3342, 3276, 3310] 2024-04-24T19:44:26,711 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3357, 3258, 3298] 2024-04-24T19:46:16,791 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3313, 3257, 3263] 2024-04-24T20:13:05,763 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3304, 3279, 3299] 2024-04-24T20:13:43,979 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3343, 3308, 3310] ``` As you can see, each broker takes only about 3 seconds to close due to `OWNERSHIP_CLEAN_UP_CONVERGENCE_DELAY_IN_MILLIS` value added in apache#20315
4 tasks
BewareMyPower
added a commit
to BewareMyPower/pulsar
that referenced
this pull request
Apr 24, 2024
…rService Fixes apache#22569 ### Motivation `BrokerService#closeAsync` calls `unloadNamespaceBundlesGracefully` to unload namespaces gracefully. With extensible load manager, it eventually calls `TableViewLoadDataStoreImpl#validateProducer`: ``` BrokerService#unloadNamespaceBundlesGracefully ExtensibleLoadManagerWrapper#disableBroker ExtensibleLoadManagerImpl#disableBroker ServiceUnitStateChannelImpl#cleanOwnerships ServiceUnitStateChannelImpl#doCleanup TableViewLoadDataStoreImpl#removeAsync TableViewLoadDataStoreImpl#validateProducer ``` In `validateProducer`, if the producer is not connected, it will recreate the producer synchronously. However, since the state of `PulsarService` has already been changed to `Closing`, all connect or lookup requests will fail with `ServiceNotReady`. Then the client will retry until timeout. Besides, the unload operation could also trigger the reconnection because the extensible load manager sends the unload event to the `loadbalancer-service-unit-state` topic. ### Modifications The major fix: Before changing PulsarService's state to `Closing`, call `BrokerService#unloadNamespaceBundlesGracefully` first to make the load manager complete the unload operations first. Minor fixes: - Record the time when `LoadManager#disableBroker` is done. - Don't check if producer is disconnected because the producer could retry if it's disconnected. ### Verifications Add `ExtensibleLoadManagerCloseTest` to verify closing `PulsarService` won't take too much time. Here are some test results locally: ``` 2024-04-24T19:43:38,851 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3342, 3276, 3310] 2024-04-24T19:44:26,711 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3357, 3258, 3298] 2024-04-24T19:46:16,791 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3313, 3257, 3263] 2024-04-24T20:13:05,763 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3304, 3279, 3299] 2024-04-24T20:13:43,979 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3343, 3308, 3310] ``` As you can see, each broker takes only about 3 seconds to close due to `OWNERSHIP_CLEAN_UP_CONVERGENCE_DELAY_IN_MILLIS` value added in apache#20315
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PIP: #16691
Motivation
Load balancer extension needs to shut down gracefully, especially when shutting down the leader broker. When the leader broker closes the leader election service too late, service unit lookups to the leader broker could fail during the shutdown. This could delay client reconnection time.
Lookup failure logs
Modifications
During the shutdown flow, when calling loadbalancer.disableBroker(), the load balancer extension gracefully gives up the owned bundles to other brokers. After that, it closes the leader election service and removes its register in the metadata store.
Verifying this change
This change is already covered by existing tests, and I added a new test case to cover this diableBroker logic.
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
Documentation
docdoc-requireddoc-not-neededdoc-completeMatching PR in forked repository
PR in forked repository: heesung-sohn#47