
[Object spilling] Add policy to automatically spill objects on OutOfMemory #11673

Merged
ericl merged 18 commits into ray-project:master from stephanie-wang:automatic-spill on Nov 2, 2020


Conversation

@stephanie-wang
Contributor

Why are these changes needed?

This adds a callback that chooses objects to spill when an object store node runs out of memory, after first attempting to make space by evicting objects that are not currently referenced. Since spilling is asynchronous, the object store client must try to create the object again. Once the object spilling is complete, the object creation will succeed.
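The flow above can be illustrated with a minimal sketch. This is not Ray's actual API; `FakeObjectStore`, `spill_objects_callback`, and `create_with_retry` are hypothetical names standing in for the plasma store, the new spill callback, and the client's retry loop:

```python
# Hypothetical sketch of the create/spill/retry flow, not Ray's actual code.
import time


class FakeObjectStore:
    def __init__(self, capacity, spill_objects_callback):
        self.capacity = capacity
        self.used = 0
        self.spill_objects_callback = spill_objects_callback

    def create(self, size):
        if self.used + size > self.capacity:
            # Spilling is asynchronous: request space, then fail this attempt.
            self.spill_objects_callback(size)
            return False
        self.used += size
        return True


def create_with_retry(store, size, retries=10, delay_s=0.01):
    # The client retries the create call until spilling has freed space.
    for _ in range(retries):
        if store.create(size):
            return True
        time.sleep(delay_s)
    return False
```

In the real system the callback kicks off spilling on IO workers; here a test can simply free bytes synchronously to exercise the retry path.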

A TODO is to modify the object store to respond to the client asynchronously. Then, in the case where we can definitely make enough space by spilling other objects, the client would not have to retry the create call on a timer, and the object store would not be blocked while the objects are being spilled.

This PR also introduces some changes to the way configs are passed around the cluster to accommodate passing around the object spilling config, which is a JSON string. Long-term, we should have a less brittle way to pass around arbitrary config values.

Related issue number

Closes #9849.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl
Contributor

ericl commented Oct 29, 2020

I think there are some missing BUILD files (doesn't compile).

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 29, 2020
@stephanie-wang stephanie-wang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 29, 2020
// TODO: Only respond to the client with OutOfMemory if we could not
// make enough space through spilling. If we could make enough space,
// respond to the plasma client once spilling is complete.
static_cast<void>(spill_objects_callback_(space_needed));
Contributor


I might misunderstand here, but doesn't this mean the raylet can try creating or pulling new objects "before" all objects are actually spilled, which could lead to OOM in some scenarios?

Contributor


Or is it handled from the object store side?

Contributor Author


Yes, it's possible that if the spilling is too slow, the client will still receive an OOM even though enough space will be made eventually.

// See the License for the specific language governing permissions and
// limitations under the License.

namespace ray {
Contributor


#include <functional>?

@ericl
Contributor

ericl commented Oct 29, 2020

This is pretty cool. One thing I noticed is that

import json
import numpy as np
import ray

ray.init(
    object_store_memory=100 * 1024 * 1024,
    _system_config={
        "automatic_object_spilling_enabled": True,
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/tmp/spill"}},
            separators=(",", ":")
        )
    },
)

@ray.remote
def f():
    return np.zeros(10 * 1024 * 1024)

ids = []
for _ in range(10):
    x = f.remote()
    ids.append(x)

for x in ids:
    print(ray.get(x).shape)

This kind of workload will hang because the retries do not wait long enough. I guess this will be addressed with the async RPC, but in the interim, can we return a special return code that allows indefinite retries with a low delay (e.g., retry every 10 ms)? This would allow spilling to be used for real workloads with pretty good performance.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 29, 2020
stephanie-wang and others added 6 commits October 29, 2020 16:24
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
@stephanie-wang
Contributor Author

This kind of workload will hang because the retries do not wait long enough. I guess this will be addressed with the async RPC, but in the interim, can we return a special return code that allows indefinite retries with a low delay (e.g., retry every 10 ms)? This would allow spilling to be used for real workloads with pretty good performance.

Hmm so I tried this and it actually still has the same problem. I added a new error code for the object store being transiently out of memory, and the client will retry quickly after receiving this error code. This is to prevent indefinite retries in the case where two clients need to create an object and there are pending spills, but only one of the clients will be able to create its object after the spills complete. If we just did an infinite retry loop on the second client, we could hang and never throw OutOfMemory.

There are two issues in the script you posted, both pretty subtle:

  1. Since there are multiple clients trying to create an object at the same time, only one of them will succeed after an object has been spilled. But then since all the clients retry at around the same time, the others retry while the one that succeeded is still creating its object (so its object can't be evicted or spilled). I managed to get this to work by resetting the normal OutOfMemory retries if the client ever receives the transient error, but I don't think that will work in all cases.
  2. During the script, there is one client (the driver) trying to get objects. With only one IO worker, this will hang because the IO worker will try to restore the driver's object, but that requires the other object to be spilled, which also requires an IO worker. I confirmed that it works once max IO workers is set to 2.

I'll push what I have, but it seems like there's a lot more to do here.
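The retry policy described above can be sketched as follows. The names (`TRANSIENT_OOM`, `create_with_retry`) are hypothetical, not Ray's actual identifiers: a transient error code triggers a fast retry and resets the hard-OOM budget, while a hard OOM consumes it.

```python
# Hypothetical sketch of the retry policy described above, not Ray's code.
import time

OK = "OK"
TRANSIENT_OOM = "TransientOutOfMemory"  # spill in progress; retry quickly
HARD_OOM = "OutOfMemory"                # no space and nothing left to spill


def create_with_retry(try_create, oom_retries=3, delay_s=0.01):
    remaining = oom_retries
    while remaining > 0:
        status = try_create()
        if status == OK:
            return True
        if status == TRANSIENT_OOM:
            # A pending spill may free space (possibly for another client):
            # reset the hard-OOM budget and retry after a short delay.
            remaining = oom_retries
        else:
            remaining -= 1
        time.sleep(delay_s)
    return False
```

Note the caveat from the comment above: if the transient error keeps recurring (e.g., another client always wins the freed space), this loop never raises a hard OutOfMemory, so resetting the budget does not work in all cases.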

@stephanie-wang stephanie-wang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 31, 2020
}

int64_t LocalObjectManager::SpillObjectsOfSize(int64_t num_bytes_required) {
if (RayConfig::instance().object_spilling_config().empty() ||
Contributor


Btw, aren't the plasma store and object store running in a separate thread? If so, don't we need to lock this method (since it is called within store.cc)?

Contributor Author


Oh hmm I think you're right about that. I'll add a lock. It seems weird that this wasn't triggered in any of the tests yet.
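The locking being discussed can be illustrated with a small Python sketch. The structure is hypothetical (the real code is C++ in `local_object_manager.cc`); the point is that `spill_objects_of_size` may be called from the plasma store's thread, so the map of pinned objects must be guarded:

```python
# Hypothetical illustration of guarding the spill entry point with a lock.
import threading


class LocalObjectManager:
    def __init__(self):
        self._lock = threading.Lock()
        self._pinned = {}  # object_id -> size in bytes

    def pin_object(self, object_id, size):
        with self._lock:
            self._pinned[object_id] = size

    def spill_objects_of_size(self, num_bytes_required):
        # May be called from the plasma store thread, so take the lock
        # before touching the pinned-object map.
        with self._lock:
            freed = 0
            for object_id in list(self._pinned):
                if freed >= num_bytes_required:
                    break
                freed += self._pinned.pop(object_id)
            return freed
```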

@rkooo567
Contributor

rkooo567 commented Nov 2, 2020

@stephanie-wang I don't have enough familiarity with the codebase, so this may not be possible, but why don't we just queue up the creation requests and invoke them after the object spilling requests are completed? Is it tricky to implement?

@stephanie-wang
Contributor Author

@stephanie-wang I don't have enough familiarity with the codebase, so this may not be possible, but why don't we just queue up the creation requests and invoke them after the object spilling requests are completed? Is it tricky to implement?

It is possible, and I think we should do it, but it's a bit complicated to do right now without refactoring the plasma store. I thought it'd be better to leave it for a separate PR.
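The queueing alternative discussed here could look roughly like the following sketch. All names (`QueueingStore`, `on_spill_complete`, `request_spill`) are hypothetical; the real change would require the plasma store refactoring mentioned above:

```python
# Hypothetical sketch of queueing creates until spilling frees space.
from collections import deque


class QueueingStore:
    def __init__(self, capacity, request_spill):
        self.capacity = capacity
        self.used = 0
        self.request_spill = request_spill
        self.pending = deque()  # (size, on_ready) waiting for space

    def create(self, size, on_ready):
        if self.used + size <= self.capacity:
            self.used += size
            on_ready()
        else:
            # Queue the request instead of failing it; answer it later.
            self.pending.append((size, on_ready))
            self.request_spill(size)

    def on_spill_complete(self, freed_bytes):
        self.used = max(0, self.used - freed_bytes)
        # Serve queued creates now that space has been freed.
        while self.pending and self.used + self.pending[0][0] <= self.capacity:
            size, on_ready = self.pending.popleft()
            self.used += size
            on_ready()
```

This removes the client-side retry timer entirely: the store answers each create exactly once, when space is actually available.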


Development

Successfully merging this pull request may close these issues.

[Object Spilling] Automatically spill objects on OutOfMemory

4 participants