Skip to content

[Bug] [Master] zookeeper failover error #7534

@SbloodyS

Description

@SbloodyS

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

[INFO] 2021-12-22 07:02:38.909 org.apache.zookeeper.ClientCnxn:[1158] - Unable to read additional data from server sessionid 0x200000aec62004e, likely server has closed socket, closing socket connection and attempting reconnect
[INFO] 2021-12-22 07:02:39.010 org.apache.curator.framework.state.ConnectionStateManager:[251] - State change: SUSPENDED
[WARN] 2021-12-22 07:02:39.011 org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener:[50] - Registry suspended
[WARN] 2021-12-22 07:02:39.011 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[417] - registry connection state is SUSPENDED, ready to stop myself
[INFO] 2021-12-22 07:02:39.011 org.apache.dolphinscheduler.server.master.MasterServer:[196] - master server is stopping ..., cause : registry connection state is SUSPENDED, stop myself
[INFO] 2021-12-22 07:02:39.681 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server z-3.bi-kafka.d797mt.c8.kafka.us-west-2.amazonaws.com/192.168.22.252:2181. Will not attempt to authenticate using SASL (unknown error)
[INFO] 2021-12-22 07:02:39.684 org.apache.zookeeper.ClientCnxn:[879] - Socket connection established to z-3.bi-kafka.d797mt.c8.kafka.us-west-2.amazonaws.com/192.168.22.252:2181, initiating session
[INFO] 2021-12-22 07:02:39.686 org.apache.zookeeper.ClientCnxn:[1158] - Unable to read additional data from server sessionid 0x200000aec62004e, likely server has closed socket, closing socket connection and attempting reconnect
[INFO] 2021-12-22 07:02:39.811 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server z-1.bi-kafka.d797mt.c8.kafka.us-west-2.amazonaws.com/192.168.25.7:2181. Will not attempt to authenticate using SASL (unknown error)
[INFO] 2021-12-22 07:02:39.811 org.apache.zookeeper.ClientCnxn:[879] - Socket connection established to z-1.bi-kafka.d797mt.c8.kafka.us-west-2.amazonaws.com/192.168.25.7:2181, initiating session
[INFO] 2021-12-22 07:02:39.812 org.apache.zookeeper.ClientCnxn:[1158] - Unable to read additional data from server sessionid 0x200000aec62004e, likely server has closed socket, closing socket connection and attempting reconnect
[INFO] 2021-12-22 07:02:41.740 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server z-2.bi-kafka.d797mt.c8.kafka.us-west-2.amazonaws.com/192.168.20.121:2181. Will not attempt to authenticate using SASL (unknown error)
[INFO] 2021-12-22 07:02:41.741 org.apache.zookeeper.ClientCnxn:[1162] - Socket error occurred: z-2.bi-kafka.d797mt.c8.kafka.us-west-2.amazonaws.com/192.168.20.121:2181: Connection refused
[INFO] 2021-12-22 07:02:42.098 org.apache.dolphinscheduler.remote.NettyRemotingClient:[390] - netty client closed
[INFO] 2021-12-22 07:02:42.098 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[160] - master schedule service stopped...
[INFO] 2021-12-22 07:02:42.104 org.apache.dolphinscheduler.remote.NettyRemotingServer:[243] - netty server closed
[INFO] 2021-12-22 07:02:42.492 org.apache.zookeeper.ClientCnxn:[1025] - Opening socket connection to server z-3.bi-kafka.d797mt.c8.kafka.us-west-2.amazonaws.com/192.168.22.252:2181. Will not attempt to authenticate using SASL (unknown error)
[INFO] 2021-12-22 07:02:42.493 org.apache.zookeeper.ClientCnxn:[879] - Socket connection established to z-3.bi-kafka.d797mt.c8.kafka.us-west-2.amazonaws.com/192.168.22.252:2181, initiating session
[INFO] 2021-12-22 07:02:42.494 org.apache.zookeeper.ClientCnxn:[1299] - Session establishment complete on server z-3.bi-kafka.d797mt.c8.kafka.us-west-2.amazonaws.com/192.168.22.252:2181, sessionid = 0x200000aec62004e, negotiated timeout = 30000
[INFO] 2021-12-22 07:02:42.494 org.apache.curator.framework.state.ConnectionStateManager:[251] - State change: RECONNECTED
[INFO] 2021-12-22 07:02:42.502 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[437] - master node : 192.168.25.200:15678 unRegistry to register center.
[INFO] 2021-12-22 07:02:42.502 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[286] - master node : /nodes/master/192.168.25.200:15678 down.
[INFO] 2021-12-22 07:02:42.502 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[160] - MASTER node deleted : /nodes/master/192.168.25.200:15678
[INFO] 2021-12-22 07:02:42.502 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[439] - heartbeat executor shutdown
[INFO] 2021-12-22 07:02:42.505 org.apache.curator.framework.imps.CuratorFrameworkImpl:[955] - backgroundOperationsLoop exiting
[INFO] 2021-12-22 07:02:42.505 org.apache.zookeeper.ClientCnxn:[522] - EventThread shut down for session: 0x200000aec62004e
[INFO] 2021-12-22 07:02:42.505 org.apache.zookeeper.ZooKeeper:[693] - Session: 0x200000aec62004e closed
[INFO] 2021-12-22 07:02:42.506 org.quartz.core.QuartzScheduler:[666] - Scheduler DolphinScheduler_$_ip-192-168-25-2001639723938667 shutting down.
[INFO] 2021-12-22 07:02:42.506 org.quartz.core.QuartzScheduler:[585] - Scheduler DolphinScheduler_$_ip-192-168-25-2001639723938667 paused.
[INFO] 2021-12-22 07:02:42.509 com.zaxxer.hikari.HikariDataSource:[350] - DolphinScheduler - Shutdown initiated...
[INFO] 2021-12-22 07:02:42.512 com.zaxxer.hikari.HikariDataSource:[352] - DolphinScheduler - Shutdown completed.
[INFO] 2021-12-22 07:02:42.513 org.quartz.core.QuartzScheduler:[740] - Scheduler DolphinScheduler_$_ip-192-168-25-2001639723938667 shutdown complete.
[INFO] 2021-12-22 07:02:42.513 org.apache.dolphinscheduler.service.quartz.QuartzExecutors:[210] - Quartz service stopped, and halt all tasks
[INFO] 2021-12-22 07:02:42.513 org.apache.dolphinscheduler.server.master.MasterServer:[214] - Quartz service stopped
[INFO] 2021-12-22 07:02:42.515 org.quartz.core.QuartzScheduler:[585] - Scheduler quartzScheduler_$_NON_CLUSTERED paused.
[INFO] 2021-12-22 07:02:42.517 org.apache.dolphinscheduler.remote.NettyRemotingClient:[390] - netty client closed
[INFO] 2021-12-22 07:02:42.517 org.apache.dolphinscheduler.service.log.LogClientService:[74] - logger client closed
[INFO] 2021-12-22 07:02:42.517 org.springframework.scheduling.quartz.SchedulerFactoryBean:[845] - Shutting down Quartz Scheduler
[INFO] 2021-12-22 07:02:42.517 org.quartz.core.QuartzScheduler:[666] - Scheduler quartzScheduler_$_NON_CLUSTERED shutting down.
[INFO] 2021-12-22 07:02:42.517 org.quartz.core.QuartzScheduler:[585] - Scheduler quartzScheduler_$_NON_CLUSTERED paused.
[INFO] 2021-12-22 07:02:42.518 org.quartz.core.QuartzScheduler:[740] - Scheduler quartzScheduler_$_NON_CLUSTERED shutdown complete.
[INFO] 2021-12-22 07:02:42.520 org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[139] - StateEventResponseWorker stopped
[WARN] 2021-12-22 07:02:42.522 org.apache.dolphinscheduler.server.master.processor.queue.StateEventResponseService:[115] - persist task error
java.lang.InterruptedException: null
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at org.apache.dolphinscheduler.server.master.processor.queue.StateEventResponseService$StateEventResponseWorker.run(StateEventResponseService.java:112)
[INFO] 2021-12-22 07:02:42.522 org.apache.dolphinscheduler.server.master.processor.queue.StateEventResponseService:[120] - StateEventResponseWorker stopped
[ERROR] 2021-12-22 07:02:50.169 org.apache.curator.framework.imps.CuratorFrameworkImpl:[703] - Background exception was not retry-able or retry gave up
java.lang.IllegalStateException: Client is not started
        at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:507)
        at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:162)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:969)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:638)
        at org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152)
        at org.apache.curator.framework.imps.FindAndDeleteProtectedNodeInBackground.execute(FindAndDeleteProtectedNodeInBackground.java:60)
        at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:619)
        at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:597)
        at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:575)
        at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:51)
        at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54)
        at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:225)
        at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:237)
        at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:89)
        at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.acquireLock(ZookeeperRegistry.java:203)
        at org.apache.dolphinscheduler.service.registry.RegistryClient.getLock(RegistryClient.java:237)
        at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.removeNodePath(MasterRegistryClient.java:163)
        at org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.handleMasterEvent(MasterRegistryDataListener.java:66)
        at org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.notify(MasterRegistryDataListener.java:52)
        at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.lambda$subscribe$1(ZookeeperRegistry.java:127)
        at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:760)
        at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:754)
        at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
        at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
        at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
        at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:753)
        at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:75)
        at org.apache.curator.framework.recipes.cache.TreeCache$4.run(TreeCache.java:865)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

What you expected to happen

failover success.

How to reproduce

zookeeper failover

Anything else

No response

Version

2.0.1

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    Waiting for replyWaiting for replybugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions