Skip to content

[Bug] [Worker] Worker fakes death when it stop itself fail. #6616

@caishunfeng

Description

@caishunfeng

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

When I try a stress test, I found that worker fakes death and print nothing to log file. At the same time, worker is not exist in zk node path and Master can't dispatch task because no worker.

That's the log before worker stop:

[ERROR] 2021-10-15 15:10:57.590 org.apache.dolphinscheduler.server.worker.WorkerServer:[223] - worker server stop exception 
org.apache.dolphinscheduler.spi.register.RegistryException: zookeeper delete key error
        at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.delete(ZookeeperRegistry.java:272)
        at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.remove(ZookeeperRegistry.java:199)
        at org.apache.dolphinscheduler.service.registry.RegistryCenter.remove(RegistryCenter.java:157)
        at org.apache.dolphinscheduler.server.worker.registry.WorkerRegistryClient.unRegistry(WorkerRegistryClient.java:128)
        at org.apache.dolphinscheduler.server.worker.WorkerServer.close(WorkerServer.java:219)
        at org.apache.dolphinscheduler.server.worker.WorkerServer.stop(WorkerServer.java:229)
        at org.apache.dolphinscheduler.server.registry.HeartBeatTask.run(HeartBeatTask.java:81)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /dolphinscheduler/nodes/worker/default/172.28.132.15:1234
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
        at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:882)
        at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:274)
        at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:268)
        at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:67)
        at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:81)
        at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:265)
        at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:249)
        at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:34)
        at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.delete(ZookeeperRegistry.java:267)
        ... 13 common frames omitted

Maybe set stop single to true after close zk and netty successfully is better.

What you expected to happen

worker can stop itself successfully when it is judged dead server.

How to reproduce

do some stress test with many task running.

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions