[Bug] [Worker] Worker fakes death when it fails to stop itself. #6616
Closed
Labels
bug (Something isn't working)
Description
Search before asking
- I have searched the issues and found no similar ones.
What happened
During a stress test, I found that the worker fakes death and prints nothing to the log file. At the same time, the worker no longer exists in the ZooKeeper node path, so the Master cannot dispatch tasks because no worker is available.
This is the log from just before the worker stopped:
[ERROR] 2021-10-15 15:10:57.590 org.apache.dolphinscheduler.server.worker.WorkerServer:[223] - worker server stop exception
org.apache.dolphinscheduler.spi.register.RegistryException: zookeeper delete key error
at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.delete(ZookeeperRegistry.java:272)
at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.remove(ZookeeperRegistry.java:199)
at org.apache.dolphinscheduler.service.registry.RegistryCenter.remove(RegistryCenter.java:157)
at org.apache.dolphinscheduler.server.worker.registry.WorkerRegistryClient.unRegistry(WorkerRegistryClient.java:128)
at org.apache.dolphinscheduler.server.worker.WorkerServer.close(WorkerServer.java:219)
at org.apache.dolphinscheduler.server.worker.WorkerServer.stop(WorkerServer.java:229)
at org.apache.dolphinscheduler.server.registry.HeartBeatTask.run(HeartBeatTask.java:81)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /dolphinscheduler/nodes/worker/default/172.28.132.15:1234
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:882)
at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:274)
at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:268)
at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:67)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:81)
at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:265)
at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:249)
at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:34)
at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.delete(ZookeeperRegistry.java:267)
... 13 common frames omitted
It may be better to set the stop signal to true only after the ZooKeeper and Netty connections have been closed successfully.
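The idea above can be sketched roughly as follows. This is a minimal illustration, not the DolphinScheduler source: `StoppableWorker`, `Closer`, and the `stopped` flag are hypothetical stand-ins for the worker server, its registry and Netty resources, and the stop signal. It assumes that each resource's close operation is safe to call more than once.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the worker's shutdown path (not the real WorkerServer). */
final class StoppableWorker {

    /** Hypothetical resource that may fail to close, e.g. a registry client or Netty server. */
    public interface Closer {
        void close() throws Exception;
    }

    // Stop signal: only becomes true once every resource has closed cleanly.
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final Closer registry;
    private final Closer nettyServer;

    StoppableWorker(Closer registry, Closer nettyServer) {
        this.registry = registry;
        this.nettyServer = nettyServer;
    }

    boolean isStopped() {
        return stopped.get();
    }

    /**
     * Tries to shut the worker down.
     *
     * @return true if the worker is stopped after this call, false if a
     *         resource failed to close and the call should be retried
     */
    synchronized boolean close(String cause) {
        if (stopped.get()) {
            return true; // already stopped, nothing to do
        }
        try {
            // Assumed to be idempotent, since a retry calls both again.
            nettyServer.close();
            registry.close();
        } catch (Exception e) {
            // Leave the stop signal false so a later attempt can retry.
            System.err.println("worker stop failed (" + cause + "): " + e.getMessage());
            return false;
        }
        stopped.set(true); // set the signal only after a successful close
        return true;
    }
}
```

With this ordering, a `ConnectionLoss` during deregistration leaves the stop signal unset, so a later heartbeat can call `close` again instead of leaving the worker neither registered nor stopped.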
What you expected to happen
The worker should be able to stop itself successfully when it is judged to be a dead server.
How to reproduce
Run a stress test with many tasks running.
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct