
[DSIP-69] Fix master dispatch task timeout might cause task duplicate running in worker #16481

@ruanwenjun

Description

Search before asking

  • I have searched the existing DSIPs and found no similar DSIP.

Motivation

Currently, there are cases in which a task can be dispatched more than once. For example:

image

The master first dispatches task a to worker A but receives a timeout response, which can happen when the worker's RPC is busy. The master then selects a new worker B and retries the dispatch.

Two situations are then possible:

  1. Worker A did receive the task. The task then exists on both worker A and worker B, and both workers execute it. In a worse case, repeated timeouts can duplicate the task across even more workers.
  2. Worker A did not receive the task. The task then runs only on worker B and is not executed twice.

The first situation is unacceptable.
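
The two situations above can be reproduced with a small simulation. This is only an illustrative sketch: the `Worker` class, `dispatch`, and `simulate` are hypothetical names invented for this example and are not DolphinScheduler APIs. It assumes the master always sees a timeout from worker A and, as in the current behavior described above, retries on worker B.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the scenario; none of these names come from DolphinScheduler.
final class DuplicateDispatchDemo {

    /** A worker that records every task it has accepted for execution. */
    static final class Worker {
        final String name;
        final List<String> acceptedTasks = new ArrayList<>();

        Worker(String name) {
            this.name = name;
        }
    }

    /**
     * Simulates one dispatch RPC.
     *
     * @param requestDelivered whether the request reached the worker
     * @param ackDelivered     whether the ack reached the master before the timeout
     * @return true if the master sees an ack, false if it sees a timeout
     */
    static boolean dispatch(Worker worker, String taskId,
                            boolean requestDelivered, boolean ackDelivered) {
        if (requestDelivered) {
            // The worker accepts the task even if its ack is later lost or late.
            worker.acceptedTasks.add(taskId);
        }
        return requestDelivered && ackDelivered;
    }

    /** Returns the names of the workers that end up holding the task. */
    static List<String> simulate(boolean workerAReceivedTask) {
        Worker a = new Worker("A");
        Worker b = new Worker("B");
        String taskId = "task-a";

        // First attempt: the master sees a timeout from worker A either way.
        boolean acked = dispatch(a, taskId, workerAReceivedTask, false);
        if (!acked) {
            // Current behavior: pick a new worker and retry the dispatch.
            dispatch(b, taskId, true, true);
        }

        List<String> holders = new ArrayList<>();
        for (Worker w : List.of(a, b)) {
            if (w.acceptedTasks.contains(taskId)) {
                holders.add(w.name);
            }
        }
        return holders;
    }

    public static void main(String[] args) {
        System.out.println("A received the task:  " + simulate(true));  // [A, B]
        System.out.println("A missed the task:    " + simulate(false)); // [B]
    }
}
```

The sketch shows why the master cannot tell the two situations apart: in both, the only signal it has is the timeout, yet in the first one the task is already held by worker A.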

Design Detail

In order to solve this, we should change the dispatch logic.

image

Compatibility, Deprecation, and Migration Plan

No response

Test Plan

No response

Code of Conduct
