Tasks are not stolen if worker runs out of memory #3761

@crusaderky

Description

Executive summary

The original intent behind the design of worker pause was to deal with a spill disk that is much slower than the rate at which tasks produce managed memory.
However, if a worker reaches the paused state either because of unmanaged memory alone or because it runs out of spill disk space, it will never get out of it.
Any tasks that are queued on the worker will remain there waiting forever, causing whole computations to hang.

This issue requires that pausing a worker trigger all tasks queued on that worker (as opposed to currently running ones) to be immediately sent back to the scheduler so that they can be computed somewhere else.


Original ticket

distributed git tip
Python 3.8
Linux x64
16 cores, 32 threads CPU

I'm using a dask.distributed cluster running on localhost to perform a leaf-to-root aggregation of a large hierarchical tree with ~420k nodes, each node being a dask.delayed function that partially holds the GIL for 1-100 ms. The client retrieves the very small output of the root node.
My dask workers are running with 1 thread per worker.
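For readers unfamiliar with the workload shape, here is a minimal pure-Python sketch of the kind of leaf-to-root tree aggregation described above. In the real application each node would be a dask.delayed call submitted to the cluster; the delayed wrapping is omitted here so the sketch is self-contained, and all names (`build_tree`, `reduce_tree`, `aggregate`) are illustrative, not taken from the original code.

```python
def aggregate(node_value, children_results):
    # Stand-in for the per-node work that, in the real workload,
    # partially holds the GIL for 1-100 ms.
    return node_value + sum(children_results)

def build_tree(depth, fanout):
    """Build a tree as nested (value, children) tuples; every node has value 1."""
    if depth == 0:
        return (1, [])
    return (1, [build_tree(depth - 1, fanout) for _ in range(fanout)])

def reduce_tree(tree):
    """Leaf-to-root aggregation; with dask, each call would be one task."""
    value, children = tree
    return aggregate(value, [reduce_tree(c) for c in children])

tree = build_tree(depth=3, fanout=4)  # 1 + 4 + 16 + 64 = 85 nodes
print(reduce_tree(tree))              # -> 85, the node count, since each node is 1
```

The reported graph has ~420k such nodes; the deadlock frequency depends on graph size, which is why fusing nodes together (see below) changes how often it reproduces.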

My computation is getting randomly stuck. Some things I noticed:

  • the issue gets substantially less frequent if I optimize the graph with fuse using an extremely aggressive ave_width=50, thus reducing the number of nodes to ~27k
  • it gets even less frequent, but doesn't disappear completely, if I pass fuse_subgraphs=True to fuse. This leads me to strongly suspect that the time it takes to pickle, unpickle, or traverse each node makes a difference.
  • it gets substantially worse if I feed multiple copies of the problem (with different, randomized keys) to the scheduler in parallel. E.g. if I feed 10 copies of the tree in parallel I'm pretty much guaranteed I'll experience a deadlock.
  • if I send SIGTERM to one worker at a time, it eventually gets unstuck.
  • using another client to push a task to an explicitly singled-out worker gets stuck forever.

I have a high degree of confidence that there are no race conditions in my application code.

I have spent a considerable amount of effort trying to reproduce the issue in a POC and failed.
I have no idea how to debug it and I'm open to suggestions...
