[REVIEW] Prioritize getitem tasks in shuffle#7826
Conversation
|
Is there something we can do here that would be a good test? The implementation here with Maybe we look at the Scheduler.transition_log and see that some set of tasks occurs before some other set of tasks? |
|
Also, I think that doing this in unpack is fine, but I'm also curious if it would make sense to put these annotations on the Layer itself and then trust the general machinery to move those annotations up to the scheduler. This feels somewhat less efficient, but also somewhat more general. |
I have been thinking of a more general solution but that will require some API redesign, I think. |
Notice, before layer unpack we don't know the task keys. We could do this in |
|
There is a middle ground. We can put annotations on the layer instance. This would be equivalent to what happens when we call with dask.annotate(priority=...):
... create some layers ...We can just populate the layer instance with an |
We only want to prioritize |
Agreed. I believe that you can put the information you placed in the unpack method directly on the Layer and that the Client will send that information up. |
I am not sure I follow, do you suggest that we set annotations when creating In that case we need to use a |
|
Yes there, or even within the layer's constructor
My hope is that we could figure out the relevant keys without generating the entire graph. What is in this PR is fine too. I just think that if we can keep this logic on the client side then it's less likely to need maintenance going forward. |
I will have to think about this :) |
|
@rjzamora I'm not sure if @madsbk mentioned this to you or not, but if it was easy to generate a set of keys for shuffle layers without generating the full graph then that would make it easier for us to specify nicer priorities at the layer level to shuffles. I would love to get this PR in, one way or another, by next release. My hope is that you understand the shuffle system well enough to be able to quickly generate a If we can't do that this week then let's punt and find a way to merge this in. |
Makes sense - I’ll look through this PR and give you my thoughts tomorrow. There is no doubt that we can generate a set of keys without “materializing” the graph, but the algorithm will have the same complexity as _construct_graph. |
|
Oh interesting. I would have expected it to be fairly cheap to generate just the keys. |
I should say may have - I'll need to think about it :). |
|
Also, we only need to generate the keys that need to have a set priority here. |
This PR implements custom task prioritization of
getitem()tasks in shuffle operations. See #6051 for a detailed discussion of the performance benefits.black dask/flake8 dask/isort dask