-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[proxy] Fix deadlock in pulsar proxy #7690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
/pulsarbot run-failure-checks |
| // Run in a separate thread to avoid deadlocks | ||
| brokerListUpdater.execute(() -> { | ||
| try { | ||
| updateBrokerList(brokerNodes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a blocking call here? Why can't we just make this method async? So we don't need to introduce another thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need a blocking call here. Made the updateBrokerList() method asynchronous.
|
There was similar code in |
|
@sijie PTAL |
|
@sijie Could you review this again? |
### Motivation
When a broker server is restarted, the Pulsar proxy outputs the following warning:
```
15:09:01.486 [main-EventThread] INFO o.a.p.z.ZooKeeperChildrenCache - reloadCache called in zookeeperChildrenCache for path /loadbalance/brokers
15:09:31.487 [main-EventThread] WARN o.a.p.p.s.util.ZookeeperCacheLoader - Error updating broker info after broker list changed.
java.util.concurrent.TimeoutException: null
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:118)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:82)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```
This is due to a deadlock occurring in the proxy. While this deadlock occurs, `main-EventThread` is stopped in the following states:
```
"main-EventThread" apache#26 daemon prio=5 os_prio=0 tid=0x00007fa9b55ff000 nid=0x3d98 waiting on condition [0x00007fa923dfc000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000eef2ad80> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:123)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:84)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader$$Lambda$40/1450652220.onUpdate(Unknown Source)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache$$Lambda$365/191748085.accept(Unknown Source)
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
at org.apache.pulsar.zookeeper.ZooKeeperCache$$Lambda$46/1819738265.processResult(Unknown Source)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```
`ZookeeperCacheLoader` will try to get load information for all available brokers from ZK or its cache if there is a change in the znode `/loadbalance/brokers`.
The problem is that it is `main-EventThread` that performs the above process, and it is also `main-EventThread` that gets the data from ZK.
`main-EventThread` is blocked at the following line, so deadlock occurs.
https://github.com/apache/pulsar/blob/9d3ab60efddd3f4064789293fd8e9c4d4388b6c7/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperDataCache.java#L97
### Modifications
~Make the method `ZookeeperCacheLoader#updateBrokerList` run on a thread separate from `main-EventThread`.~
Made `updateBrokerList()` method asynchronous so that `main-EventThread` is not blocked.
### Motivation
When a broker server is restarted, the Pulsar proxy outputs the following warning:
```
15:09:01.486 [main-EventThread] INFO o.a.p.z.ZooKeeperChildrenCache - reloadCache called in zookeeperChildrenCache for path /loadbalance/brokers
15:09:31.487 [main-EventThread] WARN o.a.p.p.s.util.ZookeeperCacheLoader - Error updating broker info after broker list changed.
java.util.concurrent.TimeoutException: null
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:118)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:82)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```
This is due to a deadlock occurring in the proxy. While this deadlock occurs, `main-EventThread` is stopped in the following states:
```
"main-EventThread" apache#26 daemon prio=5 os_prio=0 tid=0x00007fa9b55ff000 nid=0x3d98 waiting on condition [0x00007fa923dfc000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000eef2ad80> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:123)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:84)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader$$Lambda$40/1450652220.onUpdate(Unknown Source)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache$$Lambda$365/191748085.accept(Unknown Source)
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
at org.apache.pulsar.zookeeper.ZooKeeperCache$$Lambda$46/1819738265.processResult(Unknown Source)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```
`ZookeeperCacheLoader` will try to get load information for all available brokers from ZK or its cache if there is a change in the znode `/loadbalance/brokers`.
The problem is that it is `main-EventThread` that performs the above process, and it is also `main-EventThread` that gets the data from ZK.
`main-EventThread` is blocked at the following line, so deadlock occurs.
https://github.com/apache/pulsar/blob/9d3ab60efddd3f4064789293fd8e9c4d4388b6c7/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperDataCache.java#L97
### Modifications
~Make the method `ZookeeperCacheLoader#updateBrokerList` run on a thread separate from `main-EventThread`.~
Made `updateBrokerList()` method asynchronous so that `main-EventThread` is not blocked.
### Motivation
When a broker server is restarted, the Pulsar proxy outputs the following warning:
```
15:09:01.486 [main-EventThread] INFO o.a.p.z.ZooKeeperChildrenCache - reloadCache called in zookeeperChildrenCache for path /loadbalance/brokers
15:09:31.487 [main-EventThread] WARN o.a.p.p.s.util.ZookeeperCacheLoader - Error updating broker info after broker list changed.
java.util.concurrent.TimeoutException: null
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:118)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:82)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```
This is due to a deadlock occurring in the proxy. While this deadlock occurs, `main-EventThread` is stopped in the following states:
```
"main-EventThread" apache#26 daemon prio=5 os_prio=0 tid=0x00007fa9b55ff000 nid=0x3d98 waiting on condition [0x00007fa923dfc000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000eef2ad80> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:123)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:84)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader$$Lambda$40/1450652220.onUpdate(Unknown Source)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache$$Lambda$365/191748085.accept(Unknown Source)
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
at org.apache.pulsar.zookeeper.ZooKeeperCache$$Lambda$46/1819738265.processResult(Unknown Source)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```
`ZookeeperCacheLoader` will try to get load information for all available brokers from ZK or its cache if there is a change in the znode `/loadbalance/brokers`.
The problem is that it is `main-EventThread` that performs the above process, and it is also `main-EventThread` that gets the data from ZK.
`main-EventThread` is blocked at the following line, so deadlock occurs.
https://github.com/apache/pulsar/blob/9d3ab60efddd3f4064789293fd8e9c4d4388b6c7/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperDataCache.java#L97
### Modifications
~Make the method `ZookeeperCacheLoader#updateBrokerList` run on a thread separate from `main-EventThread`.~
Made `updateBrokerList()` method asynchronous so that `main-EventThread` is not blocked.
### Motivation
When a broker server is restarted, the Pulsar proxy outputs the following warning:
```
15:09:01.486 [main-EventThread] INFO o.a.p.z.ZooKeeperChildrenCache - reloadCache called in zookeeperChildrenCache for path /loadbalance/brokers
15:09:31.487 [main-EventThread] WARN o.a.p.p.s.util.ZookeeperCacheLoader - Error updating broker info after broker list changed.
java.util.concurrent.TimeoutException: null
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:118)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:82)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```
This is due to a deadlock occurring in the proxy. While this deadlock occurs, `main-EventThread` is stopped in the following states:
```
"main-EventThread" apache#26 daemon prio=5 os_prio=0 tid=0x00007fa9b55ff000 nid=0x3d98 waiting on condition [0x00007fa923dfc000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000eef2ad80> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:123)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:84)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader$$Lambda$40/1450652220.onUpdate(Unknown Source)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache$$Lambda$365/191748085.accept(Unknown Source)
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
at org.apache.pulsar.zookeeper.ZooKeeperCache$$Lambda$46/1819738265.processResult(Unknown Source)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```
`ZookeeperCacheLoader` will try to get load information for all available brokers from ZK or its cache if there is a change in the znode `/loadbalance/brokers`.
The problem is that it is `main-EventThread` that performs the above process, and it is also `main-EventThread` that gets the data from ZK.
`main-EventThread` is blocked at the following line, so deadlock occurs.
https://github.com/apache/pulsar/blob/9d3ab60efddd3f4064789293fd8e9c4d4388b6c7/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperDataCache.java#L97
### Modifications
~Make the method `ZookeeperCacheLoader#updateBrokerList` run on a thread separate from `main-EventThread`.~
Made `updateBrokerList()` method asynchronous so that `main-EventThread` is not blocked.
### Motivation
When a broker server is restarted, the Pulsar proxy outputs the following warning:
```
15:09:01.486 [main-EventThread] INFO o.a.p.z.ZooKeeperChildrenCache - reloadCache called in zookeeperChildrenCache for path /loadbalance/brokers
15:09:31.487 [main-EventThread] WARN o.a.p.p.s.util.ZookeeperCacheLoader - Error updating broker info after broker list changed.
java.util.concurrent.TimeoutException: null
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:118)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:82)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```
This is due to a deadlock occurring in the proxy. While this deadlock occurs, `main-EventThread` is stopped in the following states:
```
"main-EventThread" apache#26 daemon prio=5 os_prio=0 tid=0x00007fa9b55ff000 nid=0x3d98 waiting on condition [0x00007fa923dfc000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000eef2ad80> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:123)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:84)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader$$Lambda$40/1450652220.onUpdate(Unknown Source)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache$$Lambda$365/191748085.accept(Unknown Source)
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
at org.apache.pulsar.zookeeper.ZooKeeperCache$$Lambda$46/1819738265.processResult(Unknown Source)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```
`ZookeeperCacheLoader` will try to get load information for all available brokers from ZK or its cache if there is a change in the znode `/loadbalance/brokers`.
The problem is that it is `main-EventThread` that performs the above process, and it is also `main-EventThread` that gets the data from ZK.
`main-EventThread` is blocked at the following line, so deadlock occurs.
https://github.com/apache/pulsar/blob/9d3ab60efddd3f4064789293fd8e9c4d4388b6c7/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperDataCache.java#L97
### Modifications
~Make the method `ZookeeperCacheLoader#updateBrokerList` run on a thread separate from `main-EventThread`.~
Made `updateBrokerList()` method asynchronous so that `main-EventThread` is not blocked.
### Motivation
When a broker server is restarted, the Pulsar proxy outputs the following warning:
```
15:09:01.486 [main-EventThread] INFO o.a.p.z.ZooKeeperChildrenCache - reloadCache called in zookeeperChildrenCache for path /loadbalance/brokers
15:09:31.487 [main-EventThread] WARN o.a.p.p.s.util.ZookeeperCacheLoader - Error updating broker info after broker list changed.
java.util.concurrent.TimeoutException: null
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:118)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:82)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```
This is due to a deadlock occurring in the proxy. While this deadlock occurs, `main-EventThread` is stopped in the following states:
```
"main-EventThread" #26 daemon prio=5 os_prio=0 tid=0x00007fa9b55ff000 nid=0x3d98 waiting on condition [0x00007fa923dfc000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000eef2ad80> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:123)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:84)
at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader$$Lambda$40/1450652220.onUpdate(Unknown Source)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache$$Lambda$365/191748085.accept(Unknown Source)
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
at org.apache.pulsar.zookeeper.ZooKeeperCache$$Lambda$46/1819738265.processResult(Unknown Source)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```
`ZookeeperCacheLoader` will try to get load information for all available brokers from ZK or its cache if there is a change in the znode `/loadbalance/brokers`.
The problem is that it is `main-EventThread` that performs the above process, and it is also `main-EventThread` that gets the data from ZK.
`main-EventThread` is blocked at the following line, so deadlock occurs.
https://github.com/apache/pulsar/blob/9d3ab60efddd3f4064789293fd8e9c4d4388b6c7/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperDataCache.java#L97
### Modifications
~Make the method `ZookeeperCacheLoader#updateBrokerList` run on a thread separate from `main-EventThread`.~
Made `updateBrokerList()` method asynchronous so that `main-EventThread` is not blocked.
(cherry picked from commit d7c1cfa)
Motivation
When a broker server is restarted, the Pulsar proxy outputs the following warning:
This is due to a deadlock occurring in the proxy. While this deadlock occurs,
main-EventThreadis stopped in the following states:ZookeeperCacheLoaderwill try to get load information for all available brokers from ZK or its cache if there is a change in the znode/loadbalance/brokers.The problem is that it is
main-EventThreadthat performs the above process, and it is alsomain-EventThreadthat gets the data from ZK.main-EventThreadis blocked at the following line, so deadlock occurs.pulsar/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperDataCache.java
Line 97 in 9d3ab60
Modifications
Make the methodZookeeperCacheLoader#updateBrokerListrun on a thread separate frommain-EventThread.Made
updateBrokerList()method asynchronous so thatmain-EventThreadis not blocked.