Description
First of all, I apologize because this is more a question than an issue or feature request (though it might turn into one of those in the end).
I am having a really hard time figuring out how to influence task scheduling to get it the way I want.
Here is my setup. I have two sets of tasks, say A and B. Each set contains a few thousand tasks. A tasks depend on nothing. Each B task depends on the output of a few A tasks (5 on average).
In my code I delay all A tasks, then I delay the B tasks using the delayed A objects to model the dependencies.
Then I call client.compute(B_delayed) and retrieve the B results on the master node as they stream in, using as_completed (client is from dask.distributed).
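A minimal sketch of the setup described above; make_a and make_b are hypothetical placeholders for the real task functions, and the collection sizes are shrunk for illustration:

```python
import dask
from dask.distributed import Client, as_completed

# In-process cluster just for this sketch; the real setup uses remote workers.
client = Client(processes=False)

def make_a(i):
    # Placeholder for an A task (no dependencies).
    return i * 2

def make_b(deps):
    # Placeholder for a B task depending on a handful of A outputs.
    return sum(deps)

# A tasks depend on nothing.
A = [dask.delayed(make_a)(i) for i in range(20)]

# Each B task depends on 5 A tasks.
B = [dask.delayed(make_b)(A[i:i + 5]) for i in range(0, 20, 5)]

# Submit the B graph and stream results back as they complete.
futures = client.compute(B)
results = [f.result() for f in as_completed(futures)]
client.close()
```

as_completed yields futures in completion order, so `results` is unordered relative to `B`.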
I would expect the scheduler to notice that each B task depends on only a few A tasks, and therefore to try to complete B tasks as early as possible so it can free the resources held by A results that are no longer needed.
What happens is quite the opposite: the scheduler prioritizes A tasks, trying to complete almost all of them while only rarely computing B tasks (even when many of them become ready to run).
This is problematic for us:
- Streaming retrieval is less efficient if all the B results arrive at the end (retrieval involves I/O operations).
- In some cases, the workers' memory is insufficient to hold all the A results. That memory could easily be freed, if only the scheduler decided to complete the corresponding B tasks.
What I tried so far:
- Also submitting the A tasks with client.compute(), with a lower priority (did not change the behavior).
- Submitting the B tasks in batches with decreasing priority (it helps a bit, but the scheduler still wants to complete as many A tasks as possible).
- A mix of both: no noticeable improvement.
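For reference, this is roughly how I pass priorities; the shapes and placeholder functions (make_a, make_b) are illustrative, not my real workload. Client.compute accepts a priority keyword where larger values are scheduled earlier:

```python
import dask
from dask.distributed import Client

client = Client(processes=False)

def make_a(i):
    # Placeholder A task.
    return i

def make_b(deps):
    # Placeholder B task.
    return sum(deps)

A = [dask.delayed(make_a)(i) for i in range(10)]
B = [dask.delayed(make_b)(A[i:i + 5]) for i in range(0, 10, 5)]

# Submit both collections: A de-prioritized, B prioritized.
# (In my experiments this did not change the A-first behavior much.)
A_futures = client.compute(A, priority=-10)
B_futures = client.compute(B, priority=10)

results = client.gather(B_futures)
client.close()
```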