Improve handling of tasks failing/succeeding on unexpected workers

This issue is a follow-up on an offline discussion centering around https://github.com/dask/distributed/pull/6884#issuecomment-1224775494 and #6939 with @crusaderky and @gjoseph92 where we encountered some issues with the way we currently handle tasks failing/succeeding on unexpected workers.

In `transition_processing_memory`, when an unexpected worker reports that it has successfully completed a task, we send a `cancel-compute` message to the worker that was supposed to process the task, assuming that the unexpected worker will hold on to the result for us.

https://github.com/dask/distributed/blob/c15a10e87ca5d03e62f0ad4f38adb63163522979/distributed/scheduler.py#L1987-L2001

However, this _should_ only happen in a chain of events in which a `free-keys` message for that task is already on its way to the unexpected worker, causing it to remove the result. This leaves us with no worker holding on to the task/result.

While we do suspect this chain of events to unfold, we need a test verifying it. If this is indeed the case, we should not send the `cancel-compute` message to `ts.processing_on`, but rather make sure that we clean up anything related to the work happening on the unexpected worker.
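To make the suspected chain of events concrete, here is a toy model of the race. It is only a sketch and does not use `distributed` at all: `ToyWorker`, `deliver`, and `replicas` are hypothetical names, and the message handling is deliberately simplified (the assumption is that `free-keys` carries a `keys` list and that both messages make the worker forget the key).

```python
from dataclasses import dataclass, field


@dataclass
class ToyWorker:
    """Hypothetical stand-in for a worker: stored results plus an inbox."""

    address: str
    data: dict = field(default_factory=dict)
    inbox: list = field(default_factory=list)

    def deliver(self):
        # Apply queued scheduler messages in arrival order.
        for msg in self.inbox:
            if msg["op"] == "free-keys":
                for k in msg["keys"]:
                    self.data.pop(k, None)
            elif msg["op"] == "cancel-compute":
                # Toy simplification: cancelling just forgets the key.
                self.data.pop(msg["key"], None)
        self.inbox.clear()


def replicas(key, workers):
    """Addresses of all toy workers that currently hold `key`."""
    return [w.address for w in workers if key in w.data]


expected = ToyWorker("tcp://expected")
unexpected = ToyWorker("tcp://unexpected")
key = "x"

# 1. The scheduler no longer wants `key` on `unexpected`; free-keys is in flight.
unexpected.inbox.append({"op": "free-keys", "keys": [key]})
# 2. Before that message arrives, `unexpected` finishes the task and reports success.
unexpected.data[key] = 42
# 3. The scheduler reacts as it does today: cancel-compute goes to the expected worker.
expected.inbox.append({"op": "cancel-compute", "key": key})
# 4. Both in-flight messages are delivered.
for worker in (expected, unexpected):
    worker.deliver()

holders = replicas(key, [expected, unexpected])
```

In this model `holders` ends up empty, which is the situation we suspect: neither worker holds the result. A real test would need to reproduce the same ordering against an actual scheduler and workers.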

In `stimulus_task_erred`, we don't check at all whether the task `erred` on an unexpected worker; we simply run through the retry logic. In https://github.com/dask/distributed/pull/6884#issuecomment-1217824082, @fjetter suggested instead ignoring that the task `erred` on an unexpected worker and continuing as if nothing had happened. This general approach is reasonable, since an unexpected worker suggests that something faulty is going on. We might need to do additional work to ensure that the unexpected worker is reset, e.g., by sending it a `free-keys` message, and we want to log a warning. To ensure everything is reset correctly, we need a test that verifies the intended behavior.
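A rough sketch of the suggested policy for `erred` reports, written as a standalone function rather than against the real scheduler. `handle_task_erred` and its return shape are hypothetical, and the `free-keys` message layout is an assumption; the point is only to show the decision: ignore the report, warn, and reset the unexpected worker.

```python
import logging

logger = logging.getLogger("toy.scheduler")


def handle_task_erred(key, reporting_worker, processing_on, stimulus_id):
    """Decide how to react to an `erred` report (hypothetical helper).

    Returns (run_retry_logic, worker_msgs).
    """
    if reporting_worker != processing_on:
        # Report came from a worker we did not expect: skip the retry logic,
        # log a warning, and ask that worker to forget the key.
        logger.warning(
            "Unexpected worker reported erred task. Expected: %s, Got: %s, Key: %s",
            processing_on,
            reporting_worker,
            key,
        )
        msgs = {
            reporting_worker: [
                {"op": "free-keys", "keys": [key], "stimulus_id": stimulus_id}
            ]
        }
        return False, msgs
    # Report came from the expected worker: proceed with the normal retry logic.
    return True, {}


ignored = handle_task_erred("x", "tcp://unexpected", "tcp://expected", "s-1")
normal = handle_task_erred("x", "tcp://expected", "tcp://expected", "s-2")
```

Here `ignored` carries a `free-keys` message for the unexpected worker and no retry, while `normal` falls through to the existing behavior. Whether `free-keys` alone is enough to reset the worker is exactly what the requested test should establish.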

Referenced snippet (`distributed/scheduler.py#L1987-L2001`):

	if ws != ts.processing_on:  # someone else has this task
	    logger.info(
	        "Unexpected worker completed task. Expected: %s, Got: %s, Key: %s",
	        ts.processing_on,
	        ws,
	        key,
	    )
	    assert ts.processing_on
	    worker_msgs[ts.processing_on.address] = [
	        {
	            "op": "cancel-compute",
	            "key": key,
	            "stimulus_id": stimulus_id,
	        }
	    ]
