-
Notifications
You must be signed in to change notification settings - Fork 5k
[Bug] [dolphinscheduler-server] MasterServer not die as expected #7361
Description
Search before asking
- I had searched in the issues and found no similar issues.
What happened
the process of MasterServer is still active, but it should be stopped.
now it need manual restart.
logs:
[INFO] 2021-12-13 12:14:51.254 org.apache.dolphinscheduler.server.master.MasterServer:[196] - master server is stopping ..., cause : i was judged to death, release resources and stop myself
/opt/apache-dolphinscheduler-2.0.1-bin/logs # grep -A 10000 'master server is stopping' dolphinscheduler-master.log
[INFO] 2021-12-13 12:14:51.254 org.apache.dolphinscheduler.server.master.MasterServer:[196] - master server is stopping ..., cause : i was judged to death, release resources and stop myself
[WARN] 2021-12-13 12:14:51.570 org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager:[161] - worker honghuo-worker-1.honghuo-worker-headless:1234 current cpu load average 11.18 is too high or available memory 8.08G is too low
[WARN] 2021-12-13 12:14:51.571 org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager:[161] - worker honghuo-worker-0.honghuo-worker-headless:1234 current cpu load average 9.54 is too high or available memory 6.27G is too low
[WARN] 2021-12-13 12:14:52.571 org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager:[161] - worker honghuo-worker-1.honghuo-worker-headless:1234 current cpu load average 10.77 is too high or available memory 8.17G is too low
[WARN] 2021-12-13 12:14:52.572 org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager:[161] - worker honghuo-worker-0.honghuo-worker-headless:1234 current cpu load average 9.54 is too high or available memory 6.27G is too low
[WARN] 2021-12-13 12:14:53.572 org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager:[161] - worker honghuo-worker-1.honghuo-worker-headless:1234 current cpu load average 10.77 is too high or available memory 8.17G is too low
[WARN] 2021-12-13 12:14:53.573 org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager:[161] - worker honghuo-worker-0.honghuo-worker-headless:1234 current cpu load average 9.54 is too high or available memory 6.27G is too low
[INFO] 2021-12-13 12:14:54.260 org.apache.dolphinscheduler.remote.NettyRemotingClient:[390] - netty client closed
[INFO] 2021-12-13 12:14:54.263 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[159] - master schedule service stopped...
[INFO] 2021-12-13 12:14:54.265 org.apache.dolphinscheduler.remote.NettyRemotingServer:[243] - netty server closed
[INFO] 2021-12-13 12:14:54.270 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[437] - master node : honghuo-master-1.honghuo-master-headless:5678 unRegistry to register center.
[INFO] 2021-12-13 12:14:54.270 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[286] - master node : /nodes/master/honghuo-master-1.honghuo-master-headless:5678 down.
[INFO] 2021-12-13 12:14:54.271 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[439] - heartbeat executor shutdown
[INFO] 2021-12-13 12:14:54.274 org.apache.curator.framework.imps.CuratorFrameworkImpl:[955] - backgroundOperationsLoop exiting
[ERROR] 2021-12-13 12:14:54.274 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[307] - update master nodes error
org.apache.dolphinscheduler.registry.api.RegistryException: zookeeper release lock error
at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.acquireLock(ZookeeperRegistry.java:214)
at org.apache.dolphinscheduler.service.registry.RegistryClient.getLock(RegistryClient.java:237)
at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager.updateMasterNodes(ServerNodeManager.java:302)
at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager.access$700(ServerNodeManager.java:67)
at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager$MasterDataListener.notify(ServerNodeManager.java:287)
at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.lambda$subscribe$1(ZookeeperRegistry.java:127)
at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:760)
at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:754)
at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:753)
at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:75)
at org.apache.curator.framework.recipes.cache.TreeCache$4.run(TreeCache.java:865)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Lost connection while trying to acquire lock: /lock/masters
at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:91)
at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.acquireLock(ZookeeperRegistry.java:203)
... 18 common frames omitted
[ERROR] 2021-12-13 12:14:54.275 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[291] - MasterNodeListener capture data change and get data failed.
java.lang.NullPointerException: null
at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.releaseLock(ZookeeperRegistry.java:221)
at org.apache.dolphinscheduler.service.registry.RegistryClient.releaseLock(RegistryClient.java:241)
at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager.updateMasterNodes(ServerNodeManager.java:309)
at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager.access$700(ServerNodeManager.java:67)
at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager$MasterDataListener.notify(ServerNodeManager.java:287)
at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.lambda$subscribe$1(ZookeeperRegistry.java:127)
at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:760)
at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:754)
at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:753)
at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:75)
at org.apache.curator.framework.recipes.cache.TreeCache$4.run(TreeCache.java:865)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[INFO] 2021-12-13 12:14:54.279 org.apache.zookeeper.ZooKeeper:[693] - Session: 0x20001523448001d closed
[INFO] 2021-12-13 12:14:54.280 org.apache.zookeeper.ClientCnxn:[522] - EventThread shut down for session: 0x20001523448001d
[INFO] 2021-12-13 12:14:54.280 org.quartz.core.QuartzScheduler:[666] - Scheduler DolphinScheduler_$honghuo-master-1.honghuo-master-headless.honghuo.svc.cluster.local1639368841996 shutting down.
[INFO] 2021-12-13 12:14:54.281 org.quartz.core.QuartzScheduler:[585] - Scheduler DolphinScheduler$honghuo-master-1.honghuo-master-headless.honghuo.svc.cluster.local1639368841996 paused.
[INFO] 2021-12-13 12:14:54.283 com.zaxxer.hikari.HikariDataSource:[350] - DolphinScheduler - Shutdown initiated...
[INFO] 2021-12-13 12:14:54.294 com.zaxxer.hikari.HikariDataSource:[352] - DolphinScheduler - Shutdown completed.
[INFO] 2021-12-13 12:14:54.296 org.quartz.core.QuartzScheduler:[740] - Scheduler DolphinScheduler$honghuo-master-1.honghuo-master-headless.honghuo.svc.cluster.local1639368841996 shutdown complete.
[INFO] 2021-12-13 12:14:54.296 org.apache.dolphinscheduler.service.quartz.QuartzExecutors:[210] - Quartz service stopped, and halt all tasks
[INFO] 2021-12-13 12:14:54.297 org.apache.dolphinscheduler.server.master.MasterServer:[214] - Quartz service stopped
[INFO] 2021-12-13 12:14:54.302 org.quartz.core.QuartzScheduler:[585] - Scheduler quartzScheduler$NON_CLUSTERED paused.
[INFO] 2021-12-13 12:14:54.307 org.apache.dolphinscheduler.remote.NettyRemotingClient:[390] - netty client closed
[INFO] 2021-12-13 12:14:54.307 org.apache.dolphinscheduler.service.log.LogClientService:[74] - logger client closed
[INFO] 2021-12-13 12:14:54.307 org.springframework.scheduling.quartz.SchedulerFactoryBean:[845] - Shutting down Quartz Scheduler
[INFO] 2021-12-13 12:14:54.308 org.quartz.core.QuartzScheduler:[666] - Scheduler quartzScheduler$NON_CLUSTERED shutting down.
[INFO] 2021-12-13 12:14:54.308 org.quartz.core.QuartzScheduler:[585] - Scheduler quartzScheduler$NON_CLUSTERED paused.
[INFO] 2021-12-13 12:14:54.309 org.quartz.core.QuartzScheduler:[740] - Scheduler quartzScheduler$_NON_CLUSTERED shutdown complete.
[INFO] 2021-12-13 12:14:54.314 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[139] - StateEventResponseWorker stopped
[WARN] 2021-12-13 12:14:54.314 org.apache.dolphinscheduler.server.master.processor.queue.StateEventResponseService:[115] - persist task error
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.dolphinscheduler.server.master.processor.queue.StateEventResponseService$StateEventResponseWorker.run(StateEventResponseService.java:112)
[INFO] 2021-12-13 12:14:54.315 org.apache.dolphinscheduler.server.master.processor.queue.StateEventResponseService:[120] - StateEventResponseWorker stopped
What you expected to happen
but it should by stopped and process killed when it judged to death
How to reproduce
kill master when it has processes to schedule
Anything else
No response
Version
dev
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct