Description
What is the problem?
Discovered from the shuffling workload: https://gist.github.com/ericl/d419c6373928b4b7c8739738262da287
When there are many tasks, each with a large number of dependencies, the tasks are not scheduled because they wait for all of their dependencies to be local, but some of the dependencies are evicted (to make space) before all of them become local. The evicted dependencies are restored, but are then evicted again. As a result, all of the tasks stay in the waiting state and are never executed, which causes the hang.
There may be many solutions, but one of them is to temporarily pin restored objects until they are used by tasks. This sort of thrashing can also happen in scenarios without object spilling (pull -> evicted -> pull -> evicted), so this could help optimize object pulling in general.
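
To make the thrashing pattern concrete, here is a minimal toy simulation in plain Python. It is not Ray code and does not use any Ray API: `simulate`, `pin`, `admitted`, and `reserved` are hypothetical names, the store evicts the oldest pulled object, each task re-pulls its first missing dependency once per round, and tasks are assumed to have disjoint dependencies. One more assumption that goes beyond the proposal above: when pinning is on, the sketch only starts pulling for a task if its whole dependency set fits next to what is already reserved, because handing out pins to more tasks than fit would simply trade thrashing for a deadlock.

```python
from collections import OrderedDict


def simulate(tasks, capacity, pin, max_rounds=100):
    """Toy model of dependency pulling; returns how many tasks ran.

    tasks: list of lists of object ids (disjoint dependencies per task).
    capacity: number of objects the local store can hold.
    pin: if True, pulled objects stay pinned until their task has run.
    """
    store = OrderedDict()   # object id -> pinned flag, oldest pull first
    pending = dict(enumerate(tasks))
    admitted = set()        # tasks currently allowed to pull (pin=True only)
    reserved = 0            # store slots reserved for admitted tasks
    ran = 0

    for _ in range(max_rounds):
        if not pending:
            break
        # Pull phase: every eligible task pulls its first missing dependency.
        for tid, deps in pending.items():
            if pin and tid not in admitted:
                # Sketch assumption: admit a task only if all of its
                # dependencies fit beside the already reserved slots.
                if reserved + len(deps) > capacity:
                    continue
                admitted.add(tid)
                reserved += len(deps)
            missing = [d for d in deps if d not in store]
            if not missing:
                continue
            if len(store) >= capacity:
                # Evict the oldest unpinned object to make space.
                victim = next((o for o, p in store.items() if not p), None)
                if victim is None:
                    continue
                del store[victim]
            store[missing[0]] = pin
        # Dispatch phase: run tasks whose dependencies are all local.
        for tid, deps in list(pending.items()):
            if all(d in store for d in deps):
                for d in deps:
                    store[d] = False  # unpin: evictable again
                if tid in admitted:
                    admitted.discard(tid)
                    reserved -= len(deps)
                del pending[tid]
                ran += 1
    return ran


# Three tasks with three dependencies each, and room for five objects.
TASKS = [["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]]

print(simulate(TASKS, capacity=5, pin=False))  # 0: pull -> evicted -> pull -> evicted
print(simulate(TASKS, capacity=5, pin=True))   # 3: each task's inputs survive until it runs
```

Without pinning, every pull evicts an object that another waiting task still needs, so no task ever sees all of its dependencies local at the same time. With pinning, one task's inputs are held until it runs, then released for the next task.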
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.