Paused workers should not be able to fetch any more data

The current implementation of the worker `ensure_communicating` will continue to fetch dependencies for as long as there are dependencies to fetch. This can push an already overloaded worker over the edge and cause it to fail.

https://github.com/dask/distributed/blob/8734c9db78a68f6bea5d2a1deb39412a9e7b739a/distributed/worker.py#L2684-L2687

A paused worker should not be allowed to fetch more data.

There are two possible ways to achieve this

1. Add another guard to `ensure_communicating` to stop scheduling additional `gather_dep` coroutines
2. Remove all tasks from a paused worker that aren't in memory. This would indirectly empty the `data_needed` heap and cause a worker to stabilize. This could be achieved by either aggressively stealing or by implementing a custom scheduler handler.

I think both options have a certain appeal. I'm wondering which one is the best to choose, specifically in context of the latest changes to AMM / retirement / pause. 

cc @crusaderky 

Note: right now, network traffic is only restricted for egress, i.e. incoming `get_data` requests from other workers, see https://github.com/dask/distributed/blob/8734c9db78a68f6bea5d2a1deb39412a9e7b739a/distributed/worker.py#L1712-L1716 

	while self.data_needed and (
	len(self.in_flight_workers) < self.total_out_connections
	or self.comm_nbytes < self.comm_threshold_bytes
	):

	if self.status == Status.paused:
	max_connections = 1
	throttle_msg = " Throttling outgoing connections because worker is paused."
	else:
	throttle_msg = ""

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Paused workers should not be able to fetch any more data #5702

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Paused workers should not be able to fetch any more data #5702

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions