[DSIP-82][Master/Worker] Use FAILOVER_FINISH_NODES to avoid duplicate workflow/task when failover #16825

@ruanwenjun

Search before asking

  • I had searched in the DSIP and found no similar DSIP.

Motivation

When a master or worker disconnects from the registry, it may reconnect later.
For example, we use Curator to connect to ZooKeeper. If the session timeout is 120s, the server enters the suspended state once heartbeats have been failing for 80s, and then tries to reconnect to another ZooKeeper node. If the reconnect succeeds, the server continues working. But in this case, other servers may sometimes receive a disconnect event for the reconnecting server and fail over its workflows/tasks, which then run in duplicate.

We need to make sure that once any server has failed over a node, that node is guaranteed to shut down.

Design Detail

We introduce a FAILOVER_FINISH_NODES path in the registry. Each server uses its address plus its startup time as its identity. Once a server has been failed over, its identity is put under FAILOVER_FINISH_NODES, so any server that finds its own identity under FAILOVER_FINISH_NODES should shut down.
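
The idea can be sketched roughly as follows. This is only an illustration, not the proposed implementation: the class `FailoverFinishSketch`, its method names, the `address-startupTime` identity format, and the in-memory set standing in for the registry path are all assumptions made for the example.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class FailoverFinishSketch {

    // Hypothetical in-memory stand-in for the FAILOVER_FINISH_NODES path in the registry.
    private final Set<String> failoverFinishNodes = ConcurrentHashMap.newKeySet();

    // Identity = address + startup time (the separator is an assumption).
    public static String identity(String address, long startupTimeMillis) {
        return address + "-" + startupTimeMillis;
    }

    // Called by the server that performed the failover, after the failover is done.
    public void markFailoverFinished(String address, long startupTimeMillis) {
        failoverFinishNodes.add(identity(address, startupTimeMillis));
    }

    // Called by a server (e.g. after reconnecting) to check whether it has been failed over.
    public boolean mustShutDown(String address, long startupTimeMillis) {
        return failoverFinishNodes.contains(identity(address, startupTimeMillis));
    }

    public static void main(String[] args) {
        FailoverFinishSketch registry = new FailoverFinishSketch();
        String address = "192.168.1.10:5678";
        long firstStartup = 1_000L;

        // Another server fails over the disconnected instance.
        registry.markFailoverFinished(address, firstStartup);

        // The old instance reconnects, finds its identity, and must shut down.
        System.out.println(registry.mustShutDown(address, firstStartup)); // true

        // A restarted instance on the same address has a new startup time, so it is unaffected.
        long secondStartup = 2_000L;
        System.out.println(registry.mustShutDown(address, secondStartup)); // false
    }
}
```

Including the startup time in the identity means that a fresh process on the same address is not mistaken for the instance that was failed over.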

Compatibility, Deprecation, and Migration Plan

No response

Test Plan

No response
