[core] Move out-of-memory handling into the plasma store and support async object creation#12186

Merged
stephanie-wang merged 41 commits into ray-project:master from stephanie-wang:plasma-oom
Dec 2, 2020

Conversation

@stephanie-wang
Contributor

Why are these changes needed?

This moves out-of-memory handling into the plasma store instead of letting the plasma clients retry, to improve the system's ability to detect and respond to OOM conditions (#11772). All retry logic, LRU eviction logic, etc. is now handled inside the plasma store.
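As a rough illustration of what store-side LRU eviction means (a minimal sketch with made-up names, not Ray's actual internals): when a create request doesn't fit, the store itself evicts least-recently-used evictable objects until the request fits or nothing is left to evict, rather than the client retrying blindly.

```python
from collections import OrderedDict

def evict_until_fits(lru, used, capacity, needed):
    """Evict least-recently-used objects until `needed` bytes fit.

    lru: OrderedDict mapping object_id -> size, oldest entry first.
    Returns (used, evicted_ids). Eviction stops early if even a full
    sweep of the LRU list cannot make room.
    """
    evicted = []
    while used + needed > capacity and lru:
        object_id, size = lru.popitem(last=False)  # pop the oldest entry
        used -= size
        evicted.append(object_id)
    return used, evicted
```

Running eviction inside the store's event loop like this is what lets the store make a global decision (evict, queue, or fail) instead of each client independently retrying.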

This also requires making the object Create request to the object store async so that:

  1. The raylet thread does not get blocked when trying to create an object that cannot be created immediately (although this PR does not address the blocking IPC request-reply to the plasma store).
  2. Workers do not deadlock because one thread is stuck in Create while another thread is trying to release memory.

To fix 1, this PR adds an option to fulfill or fail a creation request immediately. To fix 2, this PR adds a "request ID" to creation replies that the worker uses to contact the plasma store again.
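The request-ID handshake above can be sketched roughly as follows (illustrative names like `ObjectStore` and `CreateReply`, not Ray's actual C++ API): the store either fulfills a create immediately, fails it immediately when asked to, or queues it and returns a request ID that the client uses to re-contact the store.

```python
from dataclasses import dataclass
from collections import OrderedDict
from itertools import count

@dataclass
class CreateReply:
    status: str          # "ok", "retry", or "out_of_memory"
    request_id: int = 0  # nonzero only when status == "retry"

class ObjectStore:
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.next_id = count(1)
        self.pending = OrderedDict()  # request_id -> (object_id, size)

    def create(self, object_id, size, fail_if_full=False):
        if size > self.capacity:
            return CreateReply("out_of_memory")      # can never fit
        if self.used + size <= self.capacity:
            self.used += size
            return CreateReply("ok")
        if fail_if_full:                             # "fulfill or fail immediately" (fix 1)
            return CreateReply("out_of_memory")
        rid = next(self.next_id)                     # queue and hand back a request ID (fix 2)
        self.pending[rid] = (object_id, size)
        return CreateReply("retry", request_id=rid)

    def retry_create(self, request_id):
        object_id, size = self.pending[request_id]
        if self.used + size <= self.capacity:
            del self.pending[request_id]
            self.used += size
            return CreateReply("ok")
        return CreateReply("retry", request_id=request_id)

    def release(self, size):                         # e.g. another thread frees memory
        self.used -= size
```

Because the client holds only a request ID between attempts, no thread sits blocked inside `create` while another thread releases memory, which is what avoids the deadlock in point 2.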

Related issue number

Closes #11772 and #11994.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

I thought about ways to break this PR down, but it's a little challenging because I don't think the system would be stable if we only moved OOM handling into the plasma store without also addressing 1 or 2.

Contributor

@ericl ericl left a comment


This looks awesome!

@ericl
Contributor

ericl commented Nov 20, 2020

//python/ray/tests:test_reference_counting TIMEOUT in 3 out of 3 in 315.0s
Stats over 3 runs: max = 315.0s, min = 315.0s, avg = 315.0s, dev = 0.0s
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_reference_counting/test.log
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_reference_counting/test_attempts/attempt_1.log
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_reference_counting/test_attempts/attempt_2.log
//python/ray/tests:test_memory_scheduling FAILED in 3 out of 3 in 55.9s
Stats over 3 runs: max = 55.9s, min = 48.6s, avg = 51.2s, dev = 3.3s
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_memory_scheduling/test.log
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_memory_scheduling/test_attempts/attempt_1.log
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_memory_scheduling/test_attempts/attempt_2.log
//python/ray/tests:test_object_spilling FAILED in 3 out of 3 in 75.1s
Stats over 3 runs: max = 75.1s, min = 74.2s, avg = 74.8s, dev = 0.4s
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_object_spilling/test.log
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_object_spilling/test_attempts/attempt_1.log
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_object_spilling/test_attempts/attempt_2.log

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 20, 2020
@stephanie-wang
Contributor Author

bash src/ray/test/run_object_manager_tests.sh failed on the macOS CI on the initial run but passed on a second run. Merging for now, but I'm not sure whether this is a pre-existing issue.

@stephanie-wang stephanie-wang merged commit 443339a into ray-project:master Dec 2, 2020
@stephanie-wang stephanie-wang deleted the plasma-oom branch December 2, 2020 18:26
@ConeyLiu
Contributor

ConeyLiu commented Dec 3, 2020

Hi @stephanie-wang, I got the following errors when calling ray.init().

[2020-12-03 09:04:47,136 C 20925 20939] store.cc:1089:  Check failed: 0
[2020-12-03 09:04:47,136 E 20925 20939] logging.cc:414: *** Aborted at 1606957487 (unix time) try "date -d @1606957487" if you are using GNU date ***
[2020-12-03 09:04:47,137 E 20925 20939] logging.cc:414: PC: @                0x0 (unknown)
[2020-12-03 09:04:47,137 E 20925 20939] logging.cc:414: *** SIGABRT (@0x3e8000051bd) received by PID 20925 (TID 0x7f3cdbfff700) from PID 20925; stack trace: ***
[2020-12-03 09:04:47,138 E 20925 20939] logging.cc:414:     @     0x7f3cf4f148a0 (unknown)
[2020-12-03 09:04:47,138 E 20925 20939] logging.cc:414:     @     0x7f3cf472ff47 gsignal
[2020-12-03 09:04:47,139 E 20925 20939] logging.cc:414:     @     0x7f3cf47318b1 abort
[2020-12-03 09:04:47,141 E 20925 20939] logging.cc:414:     @     0x55bdf139c015 ray::RayLog::~RayLog()
[2020-12-03 09:04:47,142 E 20925 20939] logging.cc:414:     @     0x55bdf10bce03 plasma::PlasmaStore::ProcessMessage()
[2020-12-03 09:04:47,144 E 20925 20939] logging.cc:414:     @     0x55bdf10ae451 std::_Function_handler<>::_M_invoke()
[2020-12-03 09:04:47,144 E 20925 20939] logging.cc:414:     @     0x55bdf10d2c5e _ZNSt17_Function_handlerIFvSt10shared_ptrIN3ray16ClientConnectionEElRKSt6vectorIhSaIhEEEZN6plasma6Client6CreateESt8functionIFNS1_6StatusES0_ISB_ENSA_7flatbuf11MessageTypeES8_EEON5boost4asio19basic_stream_socketINSK_7generic15stream_protocolENSK_8executorEEEEUlS3_lS8_E_E9_M_invokeERKSt9_Any_dataOS3_OlS8_
[2020-12-03 09:04:47,145 E 20925 20939] logging.cc:414:     @     0x55bdf1357833 ray::ClientConnection::ProcessMessage()
[2020-12-03 09:04:47,146 E 20925 20939] logging.cc:414:     @     0x55bdf1353eec boost::asio::detail::read_op<>::operator()()
[2020-12-03 09:04:47,147 E 20925 20939] logging.cc:414:     @     0x55bdf1354ef1 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
[2020-12-03 09:04:47,147 E 20925 20939] logging.cc:414:     @     0x55bdf1710661 boost::asio::detail::scheduler::do_run_one()
[2020-12-03 09:04:47,148 E 20925 20939] logging.cc:414:     @     0x55bdf1711041 boost::asio::detail::scheduler::run()
[2020-12-03 09:04:47,149 E 20925 20939] logging.cc:414:     @     0x55bdf17147f3 boost::asio::io_context::run()
[2020-12-03 09:04:47,151 E 20925 20939] logging.cc:414:     @     0x55bdf10be38b plasma::PlasmaStoreRunner::Start()
[2020-12-03 09:04:47,152 E 20925 20939] logging.cc:414:     @     0x55bdf1067acc std::thread::_State_impl<>::_M_run()
[2020-12-03 09:04:47,153 E 20925 20939] logging.cc:414:     @     0x7f3cf558fd80 (unknown)
[2020-12-03 09:04:47,153 E 20925 20939] logging.cc:414:     @     0x7f3cf4f096db start_thread
[2020-12-03 09:04:47,153 E 20925 20939] logging.cc:414:     @     0x7f3cf4812a3f clone

@kfstorm
Member

kfstorm commented Dec 3, 2020

@stephanie-wang I believe this PR breaks Java CI. I see this in the CI results of the last commit of this PR:

RETRIED: testGlobalGcWhenFullWithPut
io.ray.runtime.exception.RayException: Failed to put object ffffffffffffffffffffffff0100000008000000 in object store because it is full. Object size is 83886139 bytes.
	at io.ray.runtime.object.NativeObjectStore.nativePut(Native Method)
	at io.ray.runtime.object.NativeObjectStore.putRaw(NativeObjectStore.java:34)
	at io.ray.runtime.object.ObjectStore.put(ObjectStore.java:55)
	at io.ray.runtime.AbstractRayRuntime.put(AbstractRayRuntime.java:80)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at io.ray.runtime.RayRuntimeProxy.invoke(RayRuntimeProxy.java:45)
	at com.sun.proxy.$Proxy8.put(Unknown Source)
	at io.ray.api.Ray.put(Ray.java:73)
	at io.ray.test.GlobalGcTest.testGlobalGcWhenFull(GlobalGcTest.java:77)
	at io.ray.test.GlobalGcTest.testGlobalGcWhenFullWithPut(GlobalGcTest.java:90)

@stephanie-wang
Contributor Author

> Hi @stephanie-wang, I got the following errors when call ray.init().
> […]

Can you post an issue with more information to reproduce?

@kfstorm
Member

kfstorm commented Dec 3, 2020

@stephanie-wang Please take a look at the Java CI in this link: https://github.com/ray-project/ray/runs/1482982242, or you can find similar failures in recent master builds.

stephanie-wang added a commit to stephanie-wang/ray that referenced this pull request Dec 3, 2020
@ConeyLiu
Contributor

ConeyLiu commented Dec 3, 2020

> Can you post an issue with more information to reproduce?

After cleaning up the bazel cache and rebuilding, it works now.

@kfstorm kfstorm mentioned this pull request Dec 10, 2020
6 tasks


Development

Successfully merging this pull request may close these issues.

[Object Spilling] Avoid client starvation when spilling
