[Object Spilling] Thrashing when there are large number of dependencies for many tasks #12663

@rkooo567

What is the problem?

Discovered from the shuffling workload: https://gist.github.com/ericl/d419c6373928b4b7c8739738262da287

When many tasks each have a large number of dependencies, the tasks are never scheduled. Each task waits for all of its dependencies to become local, but some dependencies are evicted (to make space) before the full set is local. The evicted dependencies are restored, only to be evicted again. As a result, every task stays in the waiting state indefinitely and is never executed, which causes the hang.
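
To make the cycle concrete, here is a minimal toy model of it. This is not Ray's scheduler or object store: the function name `simulate_thrashing`, the LRU eviction policy, and the round-robin pull order are all assumptions made for illustration.

```python
from collections import OrderedDict


def simulate_thrashing(capacity, task_deps, rounds):
    """Toy model (hypothetical, not Ray code).

    Dependencies are restored round-robin across tasks into an LRU store
    that holds at most `capacity` objects. A task runs only if all of its
    dependencies are local at the same time.
    Returns (set of tasks that ran, number of restores performed).
    """
    store = OrderedDict()  # object id -> None, least recently used first
    ran = set()
    restores = 0
    max_deps = max(len(deps) for deps in task_deps.values())

    for _ in range(rounds):
        # Interleave pulls: dependency i of every waiting task, then i + 1, ...
        for i in range(max_deps):
            for task, deps in task_deps.items():
                if task in ran or i >= len(deps):
                    continue
                obj = deps[i]
                if obj in store:
                    store.move_to_end(obj)  # mark as recently used
                else:
                    if len(store) >= capacity:
                        store.popitem(last=False)  # evict the LRU object
                    store[obj] = None
                    restores += 1
                if all(d in store for d in deps):
                    ran.add(task)
    return ran, restores


if __name__ == "__main__":
    deps = {"A": ["a0", "a1", "a2"], "B": ["b0", "b1", "b2"]}
    print(simulate_thrashing(capacity=4, task_deps=deps, rounds=5))
```

In this model, a store of 4 objects is large enough for either task alone (3 dependencies each), yet with interleaved pulls neither task ever runs and every pull is a restore. With a capacity of 6, both tasks run after 6 restores.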

There are several possible solutions; one is to temporarily pin restored objects until they are used by tasks. This kind of thrashing can also happen without object spilling (pull->evicted->pull->evicted), so the fix could improve object pulling in general.
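
A rough sketch of the pinning idea, under stated assumptions: `PinningStore` and `run_with_pinning` are hypothetical names, not Ray APIs, and eviction is assumed to be LRU. The sketch also admits one task at a time, which goes beyond what is proposed above; it is included because pinning partial dependency sets of many tasks at once could fill the store with pinned objects and leave nothing evictable.

```python
from collections import OrderedDict


class PinningStore:
    """Toy object store (hypothetical): LRU eviction that skips pinned objects."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.objects = OrderedDict()  # object id -> pin count, LRU first

    def _evict_one(self):
        # Pick the least recently used object that is not pinned, if any.
        victim = next((o for o, pins in self.objects.items() if pins == 0), None)
        if victim is None:
            return False
        del self.objects[victim]
        return True

    def restore(self, obj, pin=False):
        """Make obj local; return False if no space can be freed."""
        if obj not in self.objects:
            if len(self.objects) >= self.capacity and not self._evict_one():
                return False
            self.objects[obj] = 0
        self.objects.move_to_end(obj)
        if pin:
            self.objects[obj] += 1
        return True

    def unpin(self, obj):
        self.objects[obj] -= 1


def run_with_pinning(capacity, task_deps):
    """Admit one task at a time and pin its dependencies until it has run."""
    store = PinningStore(capacity)
    pending = list(task_deps)
    finished = []
    while pending:
        task = pending.pop(0)
        for i, obj in enumerate(task_deps[task]):
            if not store.restore(obj, pin=True):
                raise MemoryError(f"{task}: dependencies do not fit in the store")
            # Competing, unpinned pulls for tasks that are still waiting.
            # These may fail or be evicted later; that is acceptable here.
            for other in pending:
                if i < len(task_deps[other]):
                    store.restore(task_deps[other][i])
        # Every dependency is still local because pinned objects are not evicted.
        assert all(d in store.objects for d in task_deps[task])
        finished.append(task)
        for obj in task_deps[task]:
            store.unpin(obj)  # evictable again once the task has used it
    return finished


if __name__ == "__main__":
    deps = {"A": ["a0", "a1", "a2"], "B": ["b0", "b1", "b2"]}
    print(run_with_pinning(capacity=4, task_deps=deps))
```

With a capacity of 4 and two tasks of three dependencies each, both tasks complete even though pulls for the waiting task compete for space, because only unpinned objects are chosen for eviction.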

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Metadata

Labels

P1: Issue that should be fixed within a few weeks
