[core] Move out-of-memory handling into the plasma store and support async object creation#12186

Merged
stephanie-wang merged 41 commits into ray-project:master from stephanie-wang:plasma-oom
Dec 2, 2020

Conversation

@stephanie-wang
Contributor

Why are these changes needed?

This moves out-of-memory handling into the plasma store instead of letting the plasma clients retry, to improve the system's ability to detect and respond to OOM conditions (#11772). All retry logic, LRU eviction logic, etc. is now handled inside the plasma store.
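As a rough illustration of what store-side LRU eviction means (a minimal sketch with made-up names, not Ray's actual internals): when a create request doesn't fit, the store itself evicts least-recently-used evictable objects until the request fits or nothing is left to evict, rather than the client retrying blindly.

```python
from collections import OrderedDict

def evict_until_fits(lru, used, capacity, needed):
    """Evict least-recently-used objects until `needed` bytes fit.

    lru: OrderedDict mapping object_id -> size, oldest entry first.
    Returns (used, evicted_ids). Eviction stops early if even a full
    sweep of the LRU list cannot make room.
    """
    evicted = []
    while used + needed > capacity and lru:
        object_id, size = lru.popitem(last=False)  # pop the oldest entry
        used -= size
        evicted.append(object_id)
    return used, evicted
```

Running eviction inside the store's event loop like this is what lets the store make a global decision (evict, queue, or fail) instead of each client independently retrying.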

This also requires making the object Create request to the object store async so that:

  1. The raylet thread does not get blocked when trying to create an object that cannot be created immediately (although this PR does not address the blocking IPC request-reply to the plasma store).
  2. Workers do not deadlock because one thread is stuck in Create while another thread is trying to release memory.

To fix 1, this PR adds an option to fulfill or fail a creation request immediately. To fix 2, this PR adds a "request ID" to creation replies that the worker uses to contact the plasma store again.
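The request-ID handshake above can be sketched roughly as follows (illustrative names like `ObjectStore` and `CreateReply`, not Ray's actual C++ API): the store either fulfills a create immediately, fails it immediately when asked to, or queues it and returns a request ID that the client uses to re-contact the store.

```python
from dataclasses import dataclass
from collections import OrderedDict
from itertools import count

@dataclass
class CreateReply:
    status: str          # "ok", "retry", or "out_of_memory"
    request_id: int = 0  # nonzero only when status == "retry"

class ObjectStore:
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.next_id = count(1)
        self.pending = OrderedDict()  # request_id -> (object_id, size)

    def create(self, object_id, size, fail_if_full=False):
        if size > self.capacity:
            return CreateReply("out_of_memory")      # can never fit
        if self.used + size <= self.capacity:
            self.used += size
            return CreateReply("ok")
        if fail_if_full:                             # "fulfill or fail immediately" (fix 1)
            return CreateReply("out_of_memory")
        rid = next(self.next_id)                     # queue and hand back a request ID (fix 2)
        self.pending[rid] = (object_id, size)
        return CreateReply("retry", request_id=rid)

    def retry_create(self, request_id):
        object_id, size = self.pending[request_id]
        if self.used + size <= self.capacity:
            del self.pending[request_id]
            self.used += size
            return CreateReply("ok")
        return CreateReply("retry", request_id=request_id)

    def release(self, size):                         # e.g. another thread frees memory
        self.used -= size
```

Because the client holds only a request ID between attempts, no thread sits blocked inside `create` while another thread releases memory, which is what avoids the deadlock in point 2.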

Related issue number

Closes #11772 and #11994.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

I thought about ways to break this PR down, but it's a little challenging because I don't think the system would be stable if we only moved OOM handling into the plasma store without also addressing 1 or 2.

Contributor

@ericl ericl left a comment


This looks awesome!

@ericl
Contributor

ericl commented Nov 20, 2020

//python/ray/tests:test_reference_counting TIMEOUT in 3 out of 3 in 315.0s
Stats over 3 runs: max = 315.0s, min = 315.0s, avg = 315.0s, dev = 0.0s
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_reference_counting/test.log
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_reference_counting/test_attempts/attempt_1.log
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_reference_counting/test_attempts/attempt_2.log
//python/ray/tests:test_memory_scheduling FAILED in 3 out of 3 in 55.9s
Stats over 3 runs: max = 55.9s, min = 48.6s, avg = 51.2s, dev = 3.3s
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_memory_scheduling/test.log
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_memory_scheduling/test_attempts/attempt_1.log
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_memory_scheduling/test_attempts/attempt_2.log
//python/ray/tests:test_object_spilling FAILED in 3 out of 3 in 75.1s
Stats over 3 runs: max = 75.1s, min = 74.2s, avg = 74.8s, dev = 0.4s
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_object_spilling/test.log
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_object_spilling/test_attempts/attempt_1.log
/home/travis/.cache/bazel/_bazel_travis/b88c129a127452fc94033a29d9f90e20/execroot/com_github_ray_project_ray/bazel-out/k8-opt/testlogs/python/ray/tests/test_object_spilling/test_attempts/attempt_2.log

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 20, 2020
@stephanie-wang
Contributor Author

bash src/ray/test/run_object_manager_tests.sh failed on the macOS CI on the initial run but passed on a second run. Merging for now, but I'm not sure whether this is a pre-existing issue.

@stephanie-wang stephanie-wang merged commit 443339a into ray-project:master Dec 2, 2020
@stephanie-wang stephanie-wang deleted the plasma-oom branch December 2, 2020 18:26
@ConeyLiu
Contributor

ConeyLiu commented Dec 3, 2020

Hi @stephanie-wang, I got the following errors when calling ray.init().

[2020-12-03 09:04:47,136 C 20925 20939] store.cc:1089:  Check failed: 0
[2020-12-03 09:04:47,136 E 20925 20939] logging.cc:414: *** Aborted at 1606957487 (unix time) try "date -d @1606957487" if you are using GNU date ***
[2020-12-03 09:04:47,137 E 20925 20939] logging.cc:414: PC: @                0x0 (unknown)
[2020-12-03 09:04:47,137 E 20925 20939] logging.cc:414: *** SIGABRT (@0x3e8000051bd) received by PID 20925 (TID 0x7f3cdbfff700) from PID 20925; stack trace: ***
[2020-12-03 09:04:47,138 E 20925 20939] logging.cc:414:     @     0x7f3cf4f148a0 (unknown)
[2020-12-03 09:04:47,138 E 20925 20939] logging.cc:414:     @     0x7f3cf472ff47 gsignal
[2020-12-03 09:04:47,139 E 20925 20939] logging.cc:414:     @     0x7f3cf47318b1 abort
[2020-12-03 09:04:47,141 E 20925 20939] logging.cc:414:     @     0x55bdf139c015 ray::RayLog::~RayLog()
[2020-12-03 09:04:47,142 E 20925 20939] logging.cc:414:     @     0x55bdf10bce03 plasma::PlasmaStore::ProcessMessage()
[2020-12-03 09:04:47,144 E 20925 20939] logging.cc:414:     @     0x55bdf10ae451 std::_Function_handler<>::_M_invoke()
[2020-12-03 09:04:47,144 E 20925 20939] logging.cc:414:     @     0x55bdf10d2c5e _ZNSt17_Function_handlerIFvSt10shared_ptrIN3ray16ClientConnectionEElRKSt6vectorIhSaIhEEEZN6plasma6Client6CreateESt8functionIFNS1_6StatusES0_ISB_ENSA_7flatbuf11MessageTypeES8_EEON5boost4asio19basic_stream_socketINSK_7generic15stream_protocolENSK_8executorEEEEUlS3_lS8_E_E9_M_invokeERKSt9_Any_dataOS3_OlS8_
[2020-12-03 09:04:47,145 E 20925 20939] logging.cc:414:     @     0x55bdf1357833 ray::ClientConnection::ProcessMessage()
[2020-12-03 09:04:47,146 E 20925 20939] logging.cc:414:     @     0x55bdf1353eec boost::asio::detail::read_op<>::operator()()
[2020-12-03 09:04:47,147 E 20925 20939] logging.cc:414:     @     0x55bdf1354ef1 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
[2020-12-03 09:04:47,147 E 20925 20939] logging.cc:414:     @     0x55bdf1710661 boost::asio::detail::scheduler::do_run_one()
[2020-12-03 09:04:47,148 E 20925 20939] logging.cc:414:     @     0x55bdf1711041 boost::asio::detail::scheduler::run()
[2020-12-03 09:04:47,149 E 20925 20939] logging.cc:414:     @     0x55bdf17147f3 boost::asio::io_context::run()
[2020-12-03 09:04:47,151 E 20925 20939] logging.cc:414:     @     0x55bdf10be38b plasma::PlasmaStoreRunner::Start()
[2020-12-03 09:04:47,152 E 20925 20939] logging.cc:414:     @     0x55bdf1067acc std::thread::_State_impl<>::_M_run()
[2020-12-03 09:04:47,153 E 20925 20939] logging.cc:414:     @     0x7f3cf558fd80 (unknown)
[2020-12-03 09:04:47,153 E 20925 20939] logging.cc:414:     @     0x7f3cf4f096db start_thread
[2020-12-03 09:04:47,153 E 20925 20939] logging.cc:414:     @     0x7f3cf4812a3f clone

@kfstorm
Member

kfstorm commented Dec 3, 2020

@stephanie-wang I believe this PR breaks Java CI. I see this in the CI results of the last commit of this PR:

RETRIED: testGlobalGcWhenFullWithPut
io.ray.runtime.exception.RayException: Failed to put object ffffffffffffffffffffffff0100000008000000 in object store because it is full. Object size is 83886139 bytes.
	at io.ray.runtime.object.NativeObjectStore.nativePut(Native Method)
	at io.ray.runtime.object.NativeObjectStore.putRaw(NativeObjectStore.java:34)
	at io.ray.runtime.object.ObjectStore.put(ObjectStore.java:55)
	at io.ray.runtime.AbstractRayRuntime.put(AbstractRayRuntime.java:80)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at io.ray.runtime.RayRuntimeProxy.invoke(RayRuntimeProxy.java:45)
	at com.sun.proxy.$Proxy8.put(Unknown Source)
	at io.ray.api.Ray.put(Ray.java:73)
	at io.ray.test.GlobalGcTest.testGlobalGcWhenFull(GlobalGcTest.java:77)
	at io.ray.test.GlobalGcTest.testGlobalGcWhenFullWithPut(GlobalGcTest.java:90)

@stephanie-wang
Contributor Author

> Hi @stephanie-wang, I got the following errors when call ray.init().
> […]

Can you post an issue with more information to reproduce?

@kfstorm
Member

kfstorm commented Dec 3, 2020

@stephanie-wang Please take a look at the Java CI in this link: https://github.com/ray-project/ray/runs/1482982242, or you can find similar failures in recent master builds.

stephanie-wang added a commit to stephanie-wang/ray that referenced this pull request Dec 3, 2020
@ConeyLiu
Contributor

ConeyLiu commented Dec 3, 2020

> Can you post an issue with more information to reproduce?

After cleaning up the bazel cache and rebuilding, it works now.

@kfstorm kfstorm mentioned this pull request Dec 10, 2020
6 tasks


Development

Successfully merging this pull request may close these issues.

[Object Spilling] Avoid client starvation when spilling
