Executive summary
The original design intent of worker pause was to cope with a spill disk that is much slower than the rate at which tasks produce managed memory.
However, if a worker reaches the paused state either because of unmanaged memory alone or because it runs out of spill disk space, it will never get out of it.
Any tasks that are queued on the worker will remain there waiting forever, causing whole computations to hang.
This issue proposes that pausing a worker should cause all tasks queued on that worker (as opposed to those currently running) to be immediately returned to the scheduler, so that they can be computed somewhere else.
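For reference, pause is governed by the worker memory thresholds in the distributed configuration. A sketch of the relevant keys, shown with their documented default fractions purely as an illustration:

```yaml
distributed:
  worker:
    memory:
      target: 0.60     # start spilling managed memory to disk
      spill: 0.70      # spill more aggressively
      pause: 0.80      # pause task execution on the worker
      terminate: 0.95  # nanny kills the worker
```

A worker that crosses the `pause` fraction because of unmanaged memory, or that cannot spill because the disk is full, currently has no path back below the threshold.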
Original ticket
- distributed: git tip
- Python 3.8
- Linux x64
- CPU: 16 cores / 32 threads
I'm using a dask distributed cluster running on localhost to perform a leaf-to-root aggregation of a large hierarchical tree with ~420k nodes, each node being a dask.delayed function that partially blocks the GIL for 1-100ms. The client retrieves the very small output of the root node.
My dask workers are running with 1 thread per worker.
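The workload can be sketched roughly as follows. These are plain-Python stand-ins (`node_work` and `aggregate` are hypothetical names, not from the real application); in the actual run each node function is wrapped in `dask.delayed` and submitted to the cluster, and only the tiny root result is retrieved by the client.

```python
import time

def node_work(child_results):
    # Stand-in for a node that partially holds the GIL for 1-100 ms.
    time.sleep(0.001)
    return sum(child_results) + 1

def aggregate(depth, fanout=2):
    # Leaf-to-root aggregation: each node combines its children's results.
    if depth == 0:
        return node_work([])  # leaf node
    children = [aggregate(depth - 1, fanout) for _ in range(fanout)]
    return node_work(children)

# A full binary tree of depth 3 has 2**4 - 1 = 15 nodes, so the root
# result counts every node exactly once.
```

In the real workload the tree has ~420k nodes, so the scheduler is handling hundreds of thousands of small, GIL-holding tasks.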
My computation is getting randomly stuck. Some things I noticed:
- the issue gets substantially less frequent if I optimize the graph with fuse using an extremely aggressive ave_width=50, thus reducing the number of nodes to ~27k
- it gets even less frequent, but doesn't disappear completely, if I pass fuse_subgraphs=True to fuse. This leads me to strongly suspect that the time it takes to pickle, unpickle, or traverse each node makes a difference.
- it gets substantially worse if I feed multiple copies of the problem (with different, randomized keys) to the scheduler in parallel. E.g. if I feed 10 copies of the tree in parallel I'm pretty much guaranteed I'll experience a deadlock.
- if I send SIGTERM to one worker at a time, it eventually gets unstuck.
- pushing a task from a second client to explicitly targeted workers also hangs forever.
I have a high degree of confidence that there are no race conditions in my application code.
I have spent a considerable amount of effort trying to reproduce the issue in a POC and failed.
I have no idea how to debug it and I'm open to suggestions...