
[core] Introduce fetch_local to ray.wait #12526

Merged
ericl merged 25 commits into ray-project:master from fishbone:fetch_local
Dec 17, 2020

Conversation

@fishbone
Contributor

@fishbone fishbone commented Dec 1, 2020

Why are these changes needed?

This PR introduces a new fetch_local argument for the ray.wait method, indicating that the call may return once the object is ready anywhere in the cluster.
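
To illustrate the intended semantics, here is a toy pure-Python model (not Ray's implementation; `ToyObject` and `toy_wait` are made-up names for illustration):

```python
# Toy model of the proposed semantics: an object can be "created"
# somewhere in the cluster before a copy is "local" to the caller.
class ToyObject:
    def __init__(self):
        self.created = False   # exists on some node in the cluster
        self.local = False     # copy fetched to the calling node

def toy_wait(objs, fetch_local=True):
    """Split objs into (ready, not_ready) under the fetch_local rule."""
    ready, not_ready = [], []
    for o in objs:
        # fetch_local=True: ready only once the bytes are on this node.
        # fetch_local=False: ready as soon as it exists anywhere.
        if (o.local if fetch_local else o.created):
            ready.append(o)
        else:
            not_ready.append(o)
    return ready, not_ready

obj = ToyObject()
obj.created = True          # produced on a remote node, not fetched yet
r1, _ = toy_wait([obj], fetch_local=False)
r2, _ = toy_wait([obj], fetch_local=True)
assert len(r1) == 1 and len(r2) == 0
```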

Related issue number

#12190

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone
Contributor Author

fishbone commented Dec 1, 2020

2020-12-01 07:55:33,331	INFO worker.py:657 -- Connecting to existing Ray cluster at address: 172.17.0.2:6379
local ObjectRef(ffffffffffffffffffffffff0100000001000000)
remote ObjectRef(df5a1a828c9685d3ffffffff0100000001000000)
----
(pid=raylet) [2020-12-01 07:55:33,869 E 656 671] store.cc:300: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=41943287, metadata_size=6, will send a reply of PlasmaError::OutOfMemory
(pid=raylet) [2020-12-01 07:55:33,889 E 656 671] store.cc:300: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=41943287, metadata_size=6, will send a reply of PlasmaError::OutOfMemory
(pid=raylet) [2020-12-01 07:55:33,897 E 656 671] store.cc:300: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=41943287, metadata_size=6, will send a reply of PlasmaError::OutOfMemory
(pid=raylet) [2020-12-01 07:55:33,900 E 656 671] store.cc:300: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=41943287, metadata_size=6, will send a reply of PlasmaError::OutOfMemory
(pid=raylet) [2020-12-01 07:55:33,900 E 656 671] store.cc:300: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=41943287, metadata_size=6, will send a reply of PlasmaError::OutOfMemory
(pid=raylet) [2020-12-01 07:55:33,902 E 656 671] store.cc:300: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=41943287, metadata_size=6, will send a reply of PlasmaError::OutOfMemory
(pid=raylet) [2020-12-01 07:55:33,907 E 656 671] store.cc:300: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=41943287, metadata_size=6, will send a reply of PlasmaError::OutOfMemory
(pid=raylet) [2020-12-01 07:55:33,907 E 656 671] store.cc:300: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=41943287, metadata_size=6, will send a reply of PlasmaError::OutOfMemory
(pid=raylet) [2020-12-01 07:55:33,910 E 656 671] store.cc:300: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=41943287, metadata_size=6, will send a reply of PlasmaError::OutOfMemory
----
END

@fishbone fishbone marked this pull request as ready for review December 4, 2020 07:47
@fishbone fishbone changed the title from "Fetch local" to "[core] Introduce fetch_local to ray.wait" Dec 4, 2020
Contributor

@ericl ericl left a comment


@fishbone can you fix the test? I think you want to schedule the task on a remote node (perhaps by setting num_cpus=0 on the head node, or using custom node resources to force scheduling on the remote node). Then the remote object can only be awaited with fetch_local=True.


ray.wait([remote_ref], timeout=1, fetch_local=False)
del local_ref
ray.get(remote_ref)
Contributor


What is being tested here?

Contributor


I think we could test calling both versions of ray.wait. So check that ray.wait(fetch_local=True) times out and ray.wait(fetch_local=False) returns the remote ref as ready.
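
A rough shape for that two-sided check, modeled with a pure-Python stand-in (the `model_wait` helper is hypothetical, not the Ray API; in the real test the object would live on a remote node and ray.wait would be called with both flag values):

```python
import time

# Scenario from the review: the object exists on a remote node but a
# local copy never arrives, so fetch_local=True should time out while
# fetch_local=False should report the ref as ready.
def model_wait(is_created, is_local, timeout, fetch_local):
    """Return the list of "ready" refs after at most `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        done = is_local if fetch_local else is_created
        if done:
            return ["ref"]
        time.sleep(0.01)
    return []

ready_no_fetch = model_wait(True, False, timeout=0.1, fetch_local=False)
ready_fetch = model_wait(True, False, timeout=0.1, fetch_local=True)
assert ready_no_fetch == ["ref"]   # ready as soon as it exists anywhere
assert ready_fetch == []           # timed out waiting for the local copy
```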

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 6, 2020
num_returns (int): The number of object refs that should be returned.
timeout (float): The maximum amount of time in seconds to wait before
returning.
fetch_local (bool): Fetch the object to local node if it's not.
Contributor


Can you expand on this comment? I think it's clearer to say that if this is true, the call will only return a ref as ready once the object is fetched to the local node.
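
For instance, the expanded docstring could read something like the following (a suggested wording carried by a toy stub, not the final text of the PR):

```python
def wait(object_refs, num_returns=1, timeout=None, fetch_local=True):
    """Toy stub carrying a suggested docstring wording for ray.wait.

    Args:
        num_returns (int): The number of object refs that should be
            returned.
        timeout (float): The maximum amount of time in seconds to wait
            before returning.
        fetch_local (bool): If True, an object ref is only returned as
            ready once the underlying object has been fetched to the
            local node. If False, it is returned as ready as soon as
            the object is available anywhere in the cluster.
    """
    raise NotImplementedError  # illustration only
```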

@fishbone
Contributor Author

fishbone commented Dec 6, 2020

@ericl I have some questions about running the Python test cases. I tried using bazel test to run them, and it complains that a fixture is not found. I then used pytest to run them directly, and it looks fine. Am I running them the right way?
Also, is there any way to check the test log of a specific failed test case?
For example

//python/ray/tests:test_joblib                                          TIMEOUT in 3 out of 3 in 300.2s
  Stats over 3 runs: max = 300.2s, min = 300.2s, avg = 300.2s, dev = 0.0s
  C:/users/runneradmin/_bazel_runneradmin/vlncal46/execroot/com_github_ray_project_ray/bazel-out/x64_windows-opt/testlogs/python/ray/tests/test_joblib/test.log
  C:/users/runneradmin/_bazel_runneradmin/vlncal46/execroot/com_github_ray_project_ray/bazel-out/x64_windows-opt/testlogs/python/ray/tests/test_joblib/test_attempts/attempt_1.log
  C:/users/runneradmin/_bazel_runneradmin/vlncal46/execroot/com_github_ray_project_ray/bazel-out/x64_windows-opt/testlogs/python/ray/tests/test_joblib/test_attempts/attempt_2.log

Is there a way to get C:/users/runneradmin/_bazel_runneradmin/vlncal46/execroot/com_github_ray_project_ray/bazel-out/x64_windows-opt/testlogs/python/ray/tests/test_joblib/test_attempts/attempt_2.log ?

@ericl
Contributor

ericl commented Dec 6, 2020 via email

@fishbone
Contributor Author

fishbone commented Dec 7, 2020

I rebased onto upstream's trunk, and it looks like the default behavior changed. I ran the following code on the trunk branch; it used to hang there.

import numpy as np

import ray
from ray.cluster_utils import Cluster
import pytest


cluster = Cluster()
# Force task onto the second node.
cluster.add_node(num_cpus=0, object_store_memory=75 * 1024 * 1024)
cluster.add_node(object_store_memory=75 * 1024 * 1024)

ray.init(cluster.address)

@ray.remote
def put():
    return np.random.rand(5 * 1024 * 1024)  # 40 MB data

local_ref = ray.put(np.random.rand(5 * 1024 * 1024))
print("local", local_ref)
remote_ref = put.remote()
print("remote", remote_ref)
print("Waiting")
ray.wait([remote_ref], num_returns=1)
print("Wait done")

Running this code will output

2020-12-07 09:18:57,456	INFO worker.py:654 -- Connecting to existing Ray cluster at address: 172.17.0.2:6379
local ObjectRef(ffffffffffffffffffffffff0100000001000000)
remote ObjectRef(df5a1a828c9685d3ffffffff0100000001000000)
Waiting
Wait done

Are you aware of anything that changed? @stephanie-wang

@stephanie-wang
Contributor

Are you aware of anything that changed? @stephanie-wang

Hmm the script still hangs for me on master. Are you sure your version of Ray isn't pointing to your new branch?

@fishbone
Contributor Author

fishbone commented Dec 7, 2020

I just double-checked, and it's upstream's trunk. I also built it in a new env to test, and it's the same.

I did a bisect and found it's this PR #12499, which skips the wait. cc @ffbin

@stephanie-wang
Contributor

Hmm I still can't reproduce it. I tried installing with ray install-nightly (latest wheels, should be close or equivalent to master) and the script still hangs. I took a quick look at the other PR, but it doesn't seem related. Can you try with ray install-nightly?

@fishbone
Contributor Author

fishbone commented Dec 7, 2020

I made a mistake about that PR. I did the bisect again and found that it might be related to this one: #12186

With this commit, it outputs the following:

2020-12-07 23:25:29,805	INFO worker.py:634 -- Connecting to existing Ray cluster at address: 172.17.0.2:6379
local ObjectRef(ffffffffffffffffffffffff0100000001000000)
remote ObjectRef(df5a1a828c9685d3ffffffff0100000001000000)
Waiting
Wait done
(pid=raylet) [2020-12-07 23:25:35,283 E 10908 10917] create_request_queue.cc:119: Not enough memory to create object df5a1a828c9685d3ffffffff0100000001000000 after 5 tries, will return OutOfMemory to the client
Aborted

And without this commit, it'll output

2020-12-07 23:19:45,551	INFO worker.py:634 -- Connecting to existing Ray cluster at address: 172.17.0.2:6379
local ObjectRef(ffffffffffffffffffffffff0100000001000000)
remote ObjectRef(df5a1a828c9685d3ffffffff0100000001000000)
Waiting
(pid=raylet) [2020-12-07 23:19:46,065 E 10172 10187] store.cc:305: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=41943287, metadata_size=6, will send a reply of PlasmaError::OutOfMemory
(pid=raylet) [2020-12-07 23:19:46,094 E 10172 10187] store.cc:305: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=41943287, metadata_size=6, will send a reply of PlasmaError::OutOfMemory
(pid=raylet) [2020-12-07 23:19:46,096 E 10172 10187] store.cc:305: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=41943287, metadata_size=6, will send a reply of PlasmaError::OutOfMemory
(pid=raylet) [2020-12-07 23:19:46,101 E 10172 10187] store.cc:305: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=41943287, metadata_size=6, will send a reply of PlasmaError::OutOfMemory
(pid=raylet) [2020-12-07 23:19:46,101 E 10172 10187] store.cc:305: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=41943287, metadata_size=6, will send a reply of PlasmaError::OutOfMemory
(pid=raylet) [2020-12-07 23:19:46,102 E 10172 10187] store.cc:305: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=41943287, metadata_size=6, will send a reply of PlasmaError::OutOfMemory
(pid=raylet) [2020-12-07 23:19:46,105 E 10172 10187] store.cc:305: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=41943287, metadata_size=6, will send a reply of PlasmaError::OutOfMemory
(pid=raylet) [2020-12-07 23:19:46,106 E 10172 10187] store.cc:305: Not enough memory to create the object df5a1a828c9685d3ffffffff0100000001000000, data_size=41943287, metadata_size=6, will send a reply of PlasmaError::OutOfMemory

I tried the nightly build. Pretty weird things are happening, actually. I created a new env and did ray install-nightly. It downloaded https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-1.1.0.dev0-cp38-cp38-manylinux1_x86_64.whl, and the running log is similar to the one before that PR.

Somehow, after I did some bisecting across commits, ray install-nightly downloaded a different wheel: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-1.1.0.dev0-cp38-cp38-manylinux2014_x86_64.whl

With this one, the log matches the first case.

My question is: is there any way to check which commit a wheel was built from? I suspect the first nightly wheel is not based on the trunk.

@fishbone
Contributor Author

fishbone commented Dec 8, 2020

Besides this, the test-case timeout looks related to deleting this code. I'll put it back for now.

I'll update the test case once we figure out why wait is not blocking.

@stephanie-wang
Contributor

Besides this, the test-case timeout looks related to deleting this code. I'll put it back for now.

I'll update the test case once we figure out why wait is not blocking.

Hmm looking at the code again, it's probably because the callback that replies to the client no longer gets called. Perhaps we could just call that code directly? If that doesn't work, we can just not delete it. Could you leave a TODO that we no longer need the call to the object manager though?

@fishbone
Contributor Author

Updated and thanks for the review.

@ericl
Contributor

ericl commented Dec 11, 2020

Seems a C++ unit test is broken:

[2020-12-11 04:47:00,628 W 14255 14255] object_manager.cc:350: Invalid Push request ObjectID: e7ab9917445d7046ca8357b0bc32b97200000000 after waiting for 1000 ms.
[2020-12-11 04:47:01,644 C 14255 14255] object_manager_test.cc:304:  Check failed: found.size() == 1 
*** StackTrace Information ***
    @     0x557590e7d6da  google::GetStackTraceToString()
    @     0x557590e4fdc4  ray::GetCallTrace()
    @     0x557590e6ff70  ray::RayLog::~RayLog()
    @     0x557590b23644  _ZZN3ray17TestObjectManager23TestWaitWhileSubscribedENS_8UniqueIDENS_8ObjectIDES2_ENKUlRKSt6vectorIS2_SaIS2_EES7_E_clES7_S7_
    @     0x557590b4a239  ray::ObjectManager::WaitComplete()
    @     0x557590b4b9a8  _ZN5boost4asio6detail12wait_handlerIZN3ray13ObjectManager29SubscribeRemainingWaitObjectsERKNS3_8UniqueIDEEUlRKNS_6system10error_codeEE0_NS1_18io_object_executorINS0_8executorEEEE11do_completeEPvPNS1_19scheduler_operationESB_m
    @     0x557590e9cec1  boost::asio::detail::scheduler::do_run_one()
    @     0x557590e9d571  boost::asio::detail::scheduler::run()
    @     0x557590ea01e3  boost::asio::io_context::run()
    @     0x557590b1f9b6  ray::TestObjectManager_StartTestObjectManager_Test::TestBody()
    @     0x557590f8176d  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @     0x557590f819a2  testing::Test::Run()
    @     0x557590f81c44  testing::TestInfo::Run()
    @     0x557590f81f37  testing::TestCase::Run()
    @     0x557590f823d0  testing::internal::UnitTestImpl::RunAllTests()
    @     0x557590f8251d  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @     0x557590f82726  testing::UnitTest::Run()
    @     0x557590b101a5  main
    @     0x7fae984c8b97  __libc_start_main
    @     0x557590b1935a  _start

@fishbone
Contributor Author

fishbone commented Dec 13, 2020

The broken unit test is due to removing wait_local in the test itself: there is no Push/Pull of data, so no data will ever be ready. I added a Push and increased the timeout from 1000 ms to 1500 ms to give extra time to process the Pull request. Local testing looks fine; I'll give it a try.

@fishbone
Contributor Author

Timeout in wait, introduced after merging with the base branch. I'll take another look.

@fishbone
Contributor Author

After investigating the hanging issue, I realized that it's because the failed push never got a chance to be re-pushed. I have a draft here: #12872
I tested the failing test case with this one, and it looks OK.

The fix has some problems too, which I detailed in that PR. Please have a look and give me some comments. Since that one blocks this one, I plan to fix it first.
cc @ericl @stephanie-wang

@ericl
Contributor

ericl commented Dec 15, 2020

Interesting, I think there is also supposed to be a higher-level retry in PullManager (10 second delay). Is that not triggering a re-push for some reason?

@fishbone
Contributor Author

fishbone commented Dec 15, 2020

The new test case I added worked before, and it started timing out after I fixed the broken C++ unit test, which doesn't touch anything other than the test case. I tested it without the merge, and it looks fine, so it might be related to code changes from the trunk. I suppose the direction here should be to make the pull retry work. I'll take another look tomorrow.

Btw, how about the second case in PR #12872, where the wait was not blocked but an exception got thrown in the end? Any idea about that?

@ericl
Contributor

ericl commented Dec 15, 2020 via email

@fishbone
Contributor Author

I've closed that PR since, after installing a nightly build from scratch, I can never reproduce that issue :( I hope the nightly build I used has the latest commit; it might be better to include some version info in the ray package. I can still reproduce it with the trunk built on my side, though.

It may be related to my dev environment. I'm using a Docker image of Ubuntu with a local volume mounted.
Sorry for keeping this PR open for such a long time; a lot of weird issues here somehow. I'll set up a new environment tomorrow and speed things up on my side.

@ericl
Contributor

ericl commented Dec 15, 2020 via email

@fishbone
Contributor Author

Thanks @ericl, that's a super useful feature.
I took a look at the nightly build I'm using, and it's from way back on Nov 3:

>>> import ray
>>> ray.__commit__
'9527220a868fbd57aa7617fa9a5eacecd56602db'

$ git log 9527220a868fbd57aa7617fa9a5eacecd56602db

commit 9527220a868fbd57aa7617fa9a5eacecd56602db
Author: Ian Rodney <ian.rodney@gmail.com>
Date:   Tue Nov 3 16:54:16 2020 -0800

    [serve] Fix Controller Crashes on Win (#11792)

So the two cases mentioned in this PR #12872 should still exist. I'll take a look later today.

@ericl
Contributor

ericl commented Dec 15, 2020

cc @wuisawesome for pullmanager

@fishbone
Contributor Author

Temporarily reverted #12335 for CI testing. I'll put it back once the issue is fixed upstream.

@ericl
Contributor

ericl commented Dec 17, 2020

Looks good to merge? Btw, consider pulling in master instead of rebasing; it works more nicely with GitHub.

@fishbone
Contributor Author

I feel it's good to go; I can't reproduce the failed test case.
Do you mean I should pull in master and merge, instead of rebasing the current branch onto master?

@ericl ericl merged commit 4003254 into ray-project:master Dec 17, 2020
@ericl
Contributor

ericl commented Dec 17, 2020

Yep: it keeps all commits stable, makes conflicts easier to handle locally, and helps GitHub diffs. Everything gets squashed in the end, so rebasing doesn't help anyway.
