What happened
Due to network issues, the Master lost its connection to ZooKeeper, which triggered the failover mechanism. The original Master kept running, however, with tasks still executing and their successor nodes waiting in memory. Meanwhile, another Master detected the failure and regenerated the task DAG. When the running node completed, both Masters submitted the next node, so multiple Workers processed the same task. This can leave the states of subsequent tasks inconsistent.
What you expected to happen
After a Master loses its connection to ZooKeeper because of network issues, the same task should never be executed concurrently by more than one Master.
How to reproduce
Steps:
- Identify a workflow with a long-running node
- While that node is executing:
  - Disconnect the Master from ZooKeeper
  - Use the pause strategy (not stop)
  - Trigger Master failover
- Wait for the current node to complete
- Verify:
  - Check for duplicate execution of subsequent nodes
  - Monitor task state consistency
Anything else
Proposed Solution:
Before submitting the next node's task, the Master should:
- Read the host recorded in the processInstance
- Compare it with the current Master's own host
- Abort the submission if the two do not match
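
The check proposed above could look roughly like the following. This is a minimal sketch, not the actual DolphinScheduler code: `MasterOwnershipGuard`, `hostLookup`, and `submitNextNode` are hypothetical names, and the lookup stands in for whatever query returns the host currently stored for a process instance.

```java
import java.util.Objects;
import java.util.function.IntFunction;

// Hypothetical guard: not part of DolphinScheduler, names are illustrative only.
class MasterOwnershipGuard {
    // Address of the Master running this code, e.g. "master-a:5678" (assumed format).
    private final String localHost;
    // Assumed lookup: process instance id -> host currently recorded for that instance.
    private final IntFunction<String> hostLookup;

    MasterOwnershipGuard(String localHost, IntFunction<String> hostLookup) {
        this.localHost = Objects.requireNonNull(localHost);
        this.hostLookup = Objects.requireNonNull(hostLookup);
    }

    // True only if the recorded host matches this Master's host.
    boolean ownsInstance(int processInstanceId) {
        String recordedHost = hostLookup.apply(processInstanceId);
        return localHost.equals(recordedHost);
    }

    // Runs the submit action only when this Master still owns the instance.
    // Returns false (and submits nothing) on a host mismatch.
    boolean submitNextNode(int processInstanceId, Runnable submitAction) {
        if (!ownsInstance(processInstanceId)) {
            return false;
        }
        submitAction.run();
        return true;
    }
}
```

Note that a read followed by a submit is not atomic: failover could still happen between the check and the submission. The check narrows the window for duplicate execution, but closing it completely would likely need a conditional update or a lock at the storage layer.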
Version
3.2.x