Skip to content

[Bug] [dolphinscheduler-server] MasterServer not die as expected  #7361

@zwZjut

Description

@zwZjut

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

the process of MasterServer is still active, but it should be stopped.
now it need manual restart.

logs:
[INFO] 2021-12-13 12:14:51.254 org.apache.dolphinscheduler.server.master.MasterServer:[196] - master server is stopping ..., cause : i was judged to death, release resources and stop myself
/opt/apache-dolphinscheduler-2.0.1-bin/logs # grep -A 10000 'master server is stopping' dolphinscheduler-master.log
[INFO] 2021-12-13 12:14:51.254 org.apache.dolphinscheduler.server.master.MasterServer:[196] - master server is stopping ..., cause : i was judged to death, release resources and stop myself
[WARN] 2021-12-13 12:14:51.570 org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager:[161] - worker honghuo-worker-1.honghuo-worker-headless:1234 current cpu load average 11.18 is too high or available memory 8.08G is too low
[WARN] 2021-12-13 12:14:51.571 org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager:[161] - worker honghuo-worker-0.honghuo-worker-headless:1234 current cpu load average 9.54 is too high or available memory 6.27G is too low
[WARN] 2021-12-13 12:14:52.571 org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager:[161] - worker honghuo-worker-1.honghuo-worker-headless:1234 current cpu load average 10.77 is too high or available memory 8.17G is too low
[WARN] 2021-12-13 12:14:52.572 org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager:[161] - worker honghuo-worker-0.honghuo-worker-headless:1234 current cpu load average 9.54 is too high or available memory 6.27G is too low
[WARN] 2021-12-13 12:14:53.572 org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager:[161] - worker honghuo-worker-1.honghuo-worker-headless:1234 current cpu load average 10.77 is too high or available memory 8.17G is too low
[WARN] 2021-12-13 12:14:53.573 org.apache.dolphinscheduler.server.master.dispatch.host.LowerWeightHostManager:[161] - worker honghuo-worker-0.honghuo-worker-headless:1234 current cpu load average 9.54 is too high or available memory 6.27G is too low
[INFO] 2021-12-13 12:14:54.260 org.apache.dolphinscheduler.remote.NettyRemotingClient:[390] - netty client closed
[INFO] 2021-12-13 12:14:54.263 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[159] - master schedule service stopped...
[INFO] 2021-12-13 12:14:54.265 org.apache.dolphinscheduler.remote.NettyRemotingServer:[243] - netty server closed
[INFO] 2021-12-13 12:14:54.270 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[437] - master node : honghuo-master-1.honghuo-master-headless:5678 unRegistry to register center.
[INFO] 2021-12-13 12:14:54.270 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[286] - master node : /nodes/master/honghuo-master-1.honghuo-master-headless:5678 down.
[INFO] 2021-12-13 12:14:54.271 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[439] - heartbeat executor shutdown
[INFO] 2021-12-13 12:14:54.274 org.apache.curator.framework.imps.CuratorFrameworkImpl:[955] - backgroundOperationsLoop exiting
[ERROR] 2021-12-13 12:14:54.274 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[307] - update master nodes error
org.apache.dolphinscheduler.registry.api.RegistryException: zookeeper release lock error
at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.acquireLock(ZookeeperRegistry.java:214)
at org.apache.dolphinscheduler.service.registry.RegistryClient.getLock(RegistryClient.java:237)
at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager.updateMasterNodes(ServerNodeManager.java:302)
at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager.access$700(ServerNodeManager.java:67)
at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager$MasterDataListener.notify(ServerNodeManager.java:287)
at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.lambda$subscribe$1(ZookeeperRegistry.java:127)
at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:760)
at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:754)
at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:753)
at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:75)
at org.apache.curator.framework.recipes.cache.TreeCache$4.run(TreeCache.java:865)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Lost connection while trying to acquire lock: /lock/masters
at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:91)
at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.acquireLock(ZookeeperRegistry.java:203)
... 18 common frames omitted
[ERROR] 2021-12-13 12:14:54.275 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[291] - MasterNodeListener capture data change and get data failed.
java.lang.NullPointerException: null
at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.releaseLock(ZookeeperRegistry.java:221)
at org.apache.dolphinscheduler.service.registry.RegistryClient.releaseLock(RegistryClient.java:241)
at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager.updateMasterNodes(ServerNodeManager.java:309)
at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager.access$700(ServerNodeManager.java:67)
at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager$MasterDataListener.notify(ServerNodeManager.java:287)
at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.lambda$subscribe$1(ZookeeperRegistry.java:127)
at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:760)
at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:754)
at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:753)
at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:75)
at org.apache.curator.framework.recipes.cache.TreeCache$4.run(TreeCache.java:865)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[INFO] 2021-12-13 12:14:54.279 org.apache.zookeeper.ZooKeeper:[693] - Session: 0x20001523448001d closed
[INFO] 2021-12-13 12:14:54.280 org.apache.zookeeper.ClientCnxn:[522] - EventThread shut down for session: 0x20001523448001d
[INFO] 2021-12-13 12:14:54.280 org.quartz.core.QuartzScheduler:[666] - Scheduler DolphinScheduler_$honghuo-master-1.honghuo-master-headless.honghuo.svc.cluster.local1639368841996 shutting down.
[INFO] 2021-12-13 12:14:54.281 org.quartz.core.QuartzScheduler:[585] - Scheduler DolphinScheduler
$honghuo-master-1.honghuo-master-headless.honghuo.svc.cluster.local1639368841996 paused.
[INFO] 2021-12-13 12:14:54.283 com.zaxxer.hikari.HikariDataSource:[350] - DolphinScheduler - Shutdown initiated...
[INFO] 2021-12-13 12:14:54.294 com.zaxxer.hikari.HikariDataSource:[352] - DolphinScheduler - Shutdown completed.
[INFO] 2021-12-13 12:14:54.296 org.quartz.core.QuartzScheduler:[740] - Scheduler DolphinScheduler
$honghuo-master-1.honghuo-master-headless.honghuo.svc.cluster.local1639368841996 shutdown complete.
[INFO] 2021-12-13 12:14:54.296 org.apache.dolphinscheduler.service.quartz.QuartzExecutors:[210] - Quartz service stopped, and halt all tasks
[INFO] 2021-12-13 12:14:54.297 org.apache.dolphinscheduler.server.master.MasterServer:[214] - Quartz service stopped
[INFO] 2021-12-13 12:14:54.302 org.quartz.core.QuartzScheduler:[585] - Scheduler quartzScheduler
$NON_CLUSTERED paused.
[INFO] 2021-12-13 12:14:54.307 org.apache.dolphinscheduler.remote.NettyRemotingClient:[390] - netty client closed
[INFO] 2021-12-13 12:14:54.307 org.apache.dolphinscheduler.service.log.LogClientService:[74] - logger client closed
[INFO] 2021-12-13 12:14:54.307 org.springframework.scheduling.quartz.SchedulerFactoryBean:[845] - Shutting down Quartz Scheduler
[INFO] 2021-12-13 12:14:54.308 org.quartz.core.QuartzScheduler:[666] - Scheduler quartzScheduler
$NON_CLUSTERED shutting down.
[INFO] 2021-12-13 12:14:54.308 org.quartz.core.QuartzScheduler:[585] - Scheduler quartzScheduler
$NON_CLUSTERED paused.
[INFO] 2021-12-13 12:14:54.309 org.quartz.core.QuartzScheduler:[740] - Scheduler quartzScheduler
$_NON_CLUSTERED shutdown complete.
[INFO] 2021-12-13 12:14:54.314 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[139] - StateEventResponseWorker stopped
[WARN] 2021-12-13 12:14:54.314 org.apache.dolphinscheduler.server.master.processor.queue.StateEventResponseService:[115] - persist task error
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.dolphinscheduler.server.master.processor.queue.StateEventResponseService$StateEventResponseWorker.run(StateEventResponseService.java:112)
[INFO] 2021-12-13 12:14:54.315 org.apache.dolphinscheduler.server.master.processor.queue.StateEventResponseService:[120] - StateEventResponseWorker stopped

What you expected to happen

but it should by stopped and process killed when it judged to death

How to reproduce

kill master when it has processes to schedule

Anything else

No response

Version

dev

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    Waiting for replyWaiting for replybugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions