
[Bug] [Master] Network exception occurred between Master and ZooKeeper, triggering failover mechanism, which caused duplicate task execution on the next node #16759

@1105560808

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

A network problem caused the Master to lose its connection to ZooKeeper, which triggered the failover mechanism. The original Master was still running, however, with tasks in execution and their successor nodes waiting in memory. Meanwhile, the other Master nodes detected the failure and regenerated the task DAG. When the running node completed, both Masters executed the next node at the same time, so multiple Worker nodes processed the same task. This can leave the state of subsequent tasks inconsistent.

What you expected to happen

After a Master loses its connection to ZooKeeper because of a network problem, the same task should not be executed concurrently.

How to reproduce

Steps:

  1. Identify a workflow with a long-running node
  2. While the node is executing:
    • Disconnect the Master from ZooKeeper
    • Use the pause strategy (not stop)
    • Trigger Master failover
  3. Wait for the current node to complete
  4. Verify:
    • Check whether subsequent nodes are executed more than once
    • Monitor task state consistency

Anything else

Proposed solution:
Before submitting the next node's task, the Master should:

  1. Read the host recorded in processInstance
  2. Compare it with the current Master's host
  3. Exit without submitting if the two do not match
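
The check above could look roughly like the sketch below. This is only an illustration under assumptions: `OwnershipGuard`, the nested `ProcessInstance` stand-in, and `submitIfOwner` are hypothetical names, not classes or methods from the DolphinScheduler code base. It also assumes the caller reloads the workflow instance from the database just before calling, so that the host reflects any takeover by another Master.

```java
import java.util.Objects;

/** Hypothetical sketch; these names are illustrative, not DolphinScheduler APIs. */
public class OwnershipGuard {

    /** Minimal stand-in for the persisted workflow instance record. */
    public static final class ProcessInstance {
        private final int id;
        private final String host;

        public ProcessInstance(int id, String host) {
            this.id = id;
            this.host = host;
        }

        public int getId() { return id; }

        public String getHost() { return host; }
    }

    private final String localMasterHost;

    public OwnershipGuard(String localMasterHost) {
        this.localMasterHost = Objects.requireNonNull(localMasterHost);
    }

    /** True only if the freshly loaded record still names this Master as the owner. */
    public boolean ownsInstance(ProcessInstance latest) {
        return latest != null && localMasterHost.equals(latest.getHost());
    }

    /**
     * Runs the submit action only when ownership is confirmed.
     * Returns false, without submitting, when another Master has taken over.
     */
    public boolean submitIfOwner(ProcessInstance latest, Runnable submitNextNode) {
        if (!ownsInstance(latest)) {
            return false;
        }
        submitNextNode.run();
        return true;
    }
}
```

A check-then-submit like this narrows the window but is not atomic: a takeover can still happen between the comparison and the submit. A conditional update in the database, keyed on the host, would probably be needed to close that gap completely.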

Version

3.2.x

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Labels

Stale, bug
