Executive summary
The original design intent of worker pause was to cope with a spill disk that is much slower than the rate at which tasks produce managed memory.
However, if a worker reaches the paused state either because of unmanaged memory alone or because it runs out of spill disk space, it will never get out of it.
Any tasks that are queued on the worker will remain there waiting forever, causing whole computations to hang.
This issue proposes that pausing a worker should cause all tasks queued on that worker (as opposed to those currently running) to be immediately returned to the scheduler, so that they can be computed somewhere else.
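For reference, pause is governed by the worker memory thresholds in the distributed configuration. A sketch of the relevant keys, shown with their documented default fractions purely as an illustration:

```yaml
distributed:
  worker:
    memory:
      target: 0.60     # start spilling managed memory to disk
      spill: 0.70      # spill more aggressively
      pause: 0.80      # pause task execution on the worker
      terminate: 0.95  # nanny kills the worker
```

A worker that crosses the `pause` fraction because of unmanaged memory, or that cannot spill because the disk is full, currently has no path back below the threshold.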
Original ticket
- distributed: git tip
- Python 3.8
- Linux x64
- CPU: 16 cores / 32 threads
I'm using a dask distributed cluster running on localhost to perform a leaf-to-root aggregation of a large hierarchical tree with ~420k nodes, each node being a dask.delayed function that partially blocks the GIL for 1-100ms. The client retrieves the very small output of the root node.
My dask workers are running with 1 thread per worker.
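The workload can be sketched roughly as follows. These are plain-Python stand-ins (`node_work` and `aggregate` are hypothetical names, not from the real application); in the actual run each node function is wrapped in `dask.delayed` and submitted to the cluster, and only the tiny root result is retrieved by the client.

```python
import time

def node_work(child_results):
    # Stand-in for a node that partially holds the GIL for 1-100 ms.
    time.sleep(0.001)
    return sum(child_results) + 1

def aggregate(depth, fanout=2):
    # Leaf-to-root aggregation: each node combines its children's results.
    if depth == 0:
        return node_work([])  # leaf node
    children = [aggregate(depth - 1, fanout) for _ in range(fanout)]
    return node_work(children)

# A full binary tree of depth 3 has 2**4 - 1 = 15 nodes, so the root
# result counts every node exactly once.
```

In the real workload the tree has ~420k nodes, so the scheduler is handling hundreds of thousands of small, GIL-holding tasks.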
My computation is getting randomly stuck. Some things I noticed:
- the issue gets substantially less frequent if I optimize the graph with fuse using an extremely aggressive ave_width=50, thus reducing the number of nodes to ~27k
- it gets even less frequent, but doesn't disappear completely, if I pass fuse_subgraphs=True to fuse. This leads me to strongly suspect that the time it takes to pickle, unpickle, or traverse each node makes a difference.
- it gets substantially worse if I feed multiple copies of the problem (with different, randomized keys) to the scheduler in parallel. E.g. if I feed 10 copies of the tree in parallel I'm pretty much guaranteed I'll experience a deadlock.
- if I send SIGTERM to one worker at a time, it eventually gets unstuck.
- pushing a task from a second client to explicitly targeted workers also hangs forever.
I have a high degree of confidence that there are no race conditions in my application code.
I have spent a considerable amount of effort trying to reproduce the issue in a POC and failed.
I have no idea how to debug it and I'm open to suggestions...