There is an issue with how the scheduler assigns tasks from the unrunnable queue to workers that meet the resource requirements as they join the scheduler.
The use case is a long-running, complex computation where some tasks require an expensive resource, say GPUs, but those resources are only provisioned (through Adaptive) once the tasks requiring them are ready to run. Say we reach a point in the computation where 5 tasks could run, if GPUs were available. Managing the scale-up behaviour through Adaptive is fairly straightforward and allows adding new compute nodes (for instance on AWS) with the required resources. The problem appears when the first of the new workers connects.
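The matching between a task's resource restrictions and a worker's advertised resources can be pictured with a small sketch (hypothetical helper name, for illustration only; the scheduler's actual logic lives in `valid_workers`):

```python
def meets_requirements(worker_resources, task_resources):
    # A worker is eligible for a task only if it offers at least the
    # required amount of every named resource. Hypothetical helper,
    # not the scheduler's real code.
    return all(worker_resources.get(name, 0) >= amount
               for name, amount in task_resources.items())

# A GPU node qualifies for a GPU task; a plain node does not.
print(meets_requirements({"GPU": 1, "MEMORY": 8e9}, {"GPU": 1}))  # True
print(meets_requirements({"MEMORY": 8e9}, {"GPU": 1}))            # False
```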
Scheduler.add_worker goes through the list of unrunnable tasks and checks whether there are workers that meet their requirements. Since only the first worker has connected so far, only one worker meets the requirements (some possibly positive number of additional workers may be booting up and about to join, but that hasn't happened yet).
```python
for ts in list(self.unrunnable):
    valid = self.valid_workers(ts)
    if valid is True or ws in valid:
        recommendations[ts.key] = 'waiting'
```
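A toy model of this situation (plain Python, not the scheduler's real data structures) shows why every unrunnable task becomes a candidate the moment the first matching worker connects:

```python
# Only the first of the new GPU workers has connected so far.
connected = {"worker-1": {"GPU": 1}}
unrunnable = [f"gpu-task-{i}" for i in range(5)]   # each needs one GPU

def valid_workers(task, workers):
    # Simplified stand-in for Scheduler.valid_workers.
    return {w for w, res in workers.items() if res.get("GPU", 0) >= 1}

recommendations = {}
for ts in list(unrunnable):
    valid = valid_workers(ts, connected)
    if valid:                       # worker-1 is the only member
        recommendations[ts] = "waiting"

# All five tasks are recommended while worker-1 is the only candidate.
print(recommendations)
```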
The task goes through released -> waiting and then waiting -> processing; transition_waiting_processing again calls valid_workers to get the set of workers where the task can run (this set still contains just the single worker, because the others haven't connected yet).
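With only one member in the valid set, whatever tie-breaking the scheduler applies is moot. A simplified sketch of the selection step (assumed occupancy-based choice, not the exact implementation):

```python
def decide_worker(valid, occupancy):
    # Pick the least-occupied worker among the valid candidates
    # (simplified; the real scheduler weighs more factors).
    return min(valid, key=lambda w: occupancy[w])

occupancy = {"worker-1": 0.0}
valid = {"worker-1"}     # the other GPU workers haven't connected yet

# Every waiting -> processing transition lands on the same worker,
# no matter how loaded it already is.
for _ in range(5):
    chosen = decide_worker(valid, occupancy)
    occupancy[chosen] += 1.0
print(occupancy)
```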
The end result of all of this is that the worker that happens to connect first and has the resource required by the tasks gets all of them dumped onto it, while the other workers, which potentially connect just seconds later, get nothing and are shut down by the scheduler because they are idle.
In short, it appears that the purpose of the resource requirements is to act as a hint about the required peak capability (memory, GPU, whatever) of a worker, and not as a dynamically changing resource allocation. Is this the case, and is there any interest in changing it? The resources available and the resources consumed are taken into account in transition_waiting_processing, but only on the worker state, not for the scheduler as a whole.
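That per-worker bookkeeping can be sketched as follows (assumed attribute names modelled on the worker state; illustration only). Consumed resources are checked against a single worker's capacity at assignment time, so nothing reserves capacity across workers that haven't connected yet:

```python
class WorkerStateSketch:
    # Minimal model of per-worker resource accounting (assumed names).
    def __init__(self, resources):
        self.resources = dict(resources)               # total offered
        self.used_resources = {r: 0 for r in resources}

    def has_capacity(self, required):
        # Checked only against this one worker, not the cluster.
        return all(self.resources[r] - self.used_resources[r] >= amount
                   for r, amount in required.items())

ws = WorkerStateSketch({"GPU": 1})
print(ws.has_capacity({"GPU": 1}))   # True
ws.used_resources["GPU"] = 1
print(ws.has_capacity({"GPU": 1}))   # False: this worker is saturated,
                                     # but there is no cluster-wide view
```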
If this is not the intended behaviour and should be fixed, I'm more than happy to work on this.