[Object spilling] Object spilling can deadlock if there aren't enough IO workers #11789

@stephanie-wang

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS): dev

IO workers are used both to spill objects to external storage and to restore them from it. If an IO worker is trying to restore an object but there is not enough space for it, another object must be spilled first to make room. When every IO worker is occupied by such a restore (for example, when there is only one), this deadlocks: the spill needs an IO worker, and the only ones that exist are blocked waiting for that spill.
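To make the failure mode concrete, here is a toy model of it. It does not use Ray at all: a `ThreadPoolExecutor` stands in for the IO worker pool, and `restore_needs_spill`, `spill`, and `restore` are hypothetical names made up for this sketch. A timeout is used so that the "deadlock" is reported instead of hanging forever.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def restore_needs_spill(num_io_workers: int, timeout_s: float = 0.5) -> str:
    """Toy model: a restore that must wait for a spill on the same worker pool."""
    # The thread pool is a stand-in for the IO worker pool (assumption, not Ray code).
    pool = ThreadPoolExecutor(max_workers=num_io_workers)

    def spill() -> str:
        # Pretend to write an object to external storage.
        return "spilled"

    def restore() -> str:
        # Not enough space: ask the same pool to spill another object first.
        spill_done = pool.submit(spill)
        try:
            # With a single worker, this thread IS the only worker, so the
            # spill can never start while we are blocked here.
            spill_done.result(timeout=timeout_s)
        except FutureTimeout:
            return "deadlock"
        return "restored"

    try:
        return pool.submit(restore).result()
    finally:
        pool.shutdown(wait=True)
```

In this model, `restore_needs_spill(1)` returns `"deadlock"` and `restore_needs_spill(2)` returns `"restored"`, which mirrors the dependence on the number of IO workers described above.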

We could fix this by queuing spill/restore requests at the IO worker, so that the worker can process multiple requests in parallel.
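One way the queuing idea could look, again as a toy model rather than Ray code: a single worker owns a request queue and starts each request as its own task, so a restore that is waiting for space does not stop the same worker from picking up the spill. `io_worker`, `demo`, and the handler names are hypothetical, and `asyncio` is only one possible way to get this kind of concurrency inside one worker.

```python
import asyncio


async def io_worker(requests: asyncio.Queue) -> None:
    """Hypothetical IO worker: dequeue requests and run each as its own task."""
    running = set()
    while True:
        handler = await requests.get()
        if handler is None:  # shutdown sentinel
            break
        # Start the request without waiting for it to finish, so the loop
        # can accept the next request immediately.
        task = asyncio.create_task(handler())
        running.add(task)
        task.add_done_callback(running.discard)
    if running:
        await asyncio.gather(*running)


async def demo() -> list:
    requests: asyncio.Queue = asyncio.Queue()
    events = []
    restored = asyncio.Event()

    async def restore() -> None:
        space_freed = asyncio.Event()

        async def spill() -> None:
            events.append("spill")
            space_freed.set()

        # Not enough space: queue a spill on the SAME worker, then wait for it.
        await requests.put(spill)
        await asyncio.wait_for(space_freed.wait(), timeout=1.0)
        events.append("restore")
        restored.set()

    worker = asyncio.create_task(io_worker(requests))
    await requests.put(restore)
    await asyncio.wait_for(restored.wait(), timeout=2.0)
    await requests.put(None)  # stop the worker
    await worker
    return events
```

Running `asyncio.run(demo())` returns `["spill", "restore"]`: the single worker completes the spill while the restore is still pending, so the restore can then finish.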

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

The unit test test_object_spilling.py::test_spill_during_get shows this issue if max_io_workers is set to 1.

Metadata

Labels

P1: Issue that should be fixed within a few weeks
enhancement: Request for new feature and/or capability
