Description
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS): dev
IO workers are used both to spill objects to external storage and to restore them from it. If an IO worker is trying to restore an object but there is not enough space in the object store, another object must first be spilled to make room. This can deadlock, because the same IO worker that is blocked on the restore is also needed to perform the spill.
We could fix this by queuing spill/restore requests at the IO worker, so that the worker can process multiple requests in parallel.
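To make the failure mode and the proposed fix concrete, here is a toy asyncio model. It is only a sketch and does not use Ray's code or APIs: `handle`, `serial_io_worker`, `concurrent_io_worker`, and `run_restore` are hypothetical names, and the assumption is that a restore always has to wait for one spill to finish.

```python
import asyncio


async def handle(kind, done, queue):
    """Process one request (toy model, not Ray's implementation)."""
    if kind == "restore":
        # Assumption: there is no free space, so the restore must first
        # request a spill and wait for it to complete.
        spilled = asyncio.get_running_loop().create_future()
        await queue.put(("spill", spilled))
        await spilled
    if not done.done():
        done.set_result(kind)


async def serial_io_worker(queue):
    # Handles one request start to finish before taking the next one.
    # A restore blocks here waiting for a spill that is still in the queue.
    while True:
        kind, done = await queue.get()
        await handle(kind, done, queue)


async def concurrent_io_worker(queue):
    # Keeps draining the queue and runs each request as its own task,
    # so a spill can proceed while a restore is still waiting on it.
    pending = set()
    while True:
        kind, done = await queue.get()
        task = asyncio.create_task(handle(kind, done, queue))
        pending.add(task)
        task.add_done_callback(pending.discard)


async def run_restore(worker_fn, timeout=0.5):
    """Send a single restore request to one worker and report the outcome."""
    queue = asyncio.Queue()
    worker = asyncio.create_task(worker_fn(queue))
    restored = asyncio.get_running_loop().create_future()
    await queue.put(("restore", restored))
    try:
        await asyncio.wait_for(restored, timeout)
        return "restored"
    except asyncio.TimeoutError:
        # The timeout stands in for "stuck forever" in this model.
        return "deadlocked"
    finally:
        worker.cancel()


if __name__ == "__main__":
    print("serial:", asyncio.run(run_restore(serial_io_worker)))
    print("concurrent:", asyncio.run(run_restore(concurrent_io_worker)))
```

In this model the serial worker never reaches the queued spill request, while the worker that accepts requests concurrently completes the restore. A real fix would also have to bound the number of in-flight requests per worker, which the sketch leaves out.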
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
The unit test test_object_spilling.py::test_spill_during_get shows this issue if max_io_workers is set to 1.