[core] Introduce fetch_local to ray.wait #12526
ericl merged 25 commits into ray-project:master from fishbone:fetch_local

Conversation
ericl left a comment

@fishbone can you fix the test? I think you want to schedule the task on a remote node (perhaps by setting num_cpus=0 on the head node, or by using custom node resources to force scheduling on the remote node). Then the remote object can only be awaited with fetch_local=True.
python/ray/tests/test_basic.py (Outdated)

    ray.wait([remote_ref], timeout=1, fetch_local=False)
    del local_ref
    ray.get(remote_ref)
I think we could test calling both versions of ray.wait. So check that ray.wait(fetch_local=True) times out and ray.wait(fetch_local=False) returns the remote ref as ready.
python/ray/worker.py (Outdated)

    num_returns (int): The number of object refs that should be returned.
    timeout (float): The maximum amount of time in seconds to wait before
        returning.
    fetch_local (bool): Fetch the object to local node if it's not.
Can you expand on this comment? I think it's clearer to say that if this is true, the call will only return a ref as ready once the object is fetched to the local node.
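One way to expand that wording, sketched here rather than quoted from the final diff:

```
fetch_local (bool): If True (the default), ray.wait() only returns an
    object ref as ready once the underlying object has been fetched to
    the local node. If False, a ref may be returned as ready as soon as
    the object is available anywhere in the cluster, without triggering
    a transfer to the local node.
```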
@ericl I do have some questions about running Python's test cases. I tried to use bazel test to run them and it complains that a fixture is not found. I then used pytest to run them directly and that looks fine. Am I running them the right way? Besides, is there a way to check the test log of a specific failed test case? For example, //python/ray/tests:test_joblib timed out in 3 out of 3 runs at 300.2s, and Bazel lists test.log plus per-attempt logs (attempt_1.log, attempt_2.log) under the testlogs directory. Is there a way to get attempt_2.log?
You can run the test locally with pytest, or by running the file directly.
There are also a bunch of flaky tests (see the test dashboard).
I rebased the codebase onto upstream's trunk, and it looks like the default behavior got updated. I ran the following code on the trunk branch, and it used to hang there; now running it produces output. Are you aware of anything that changed? @stephanie-wang
Hmm, the script still hangs for me on master. Are you sure your version of Ray isn't pointing to your new branch?
Hmm, I still can't reproduce it. I tried installing with
I made one mistake about that PR. I tried to bisect again and found that it might be related to this one: #12186. With this commit, it outputs the following: And without this commit, it'll output I tried the nightly build. Pretty weird things happening, actually. I created a new env and did Somehow, after I did some bisect checks on commits, it later tried to download another wheel when I ran For this one, it'll show the same log as the first one. My question here is: is there any way to check which commit the wheel was built from? I think maybe the first nightly build is not based on the trunk.

Besides this, the timeout of the test case looks related to deleting this code. I'll put it back for now, and I'll update the test case once we figure out why.
Hmm, looking at the code again, it's probably because the callback that replies to the client no longer gets called. Perhaps we could just call that code directly? If that doesn't work, we can just not delete it. Could you leave a TODO noting that we no longer need the call to the object manager, though?
Updated, and thanks for the review.
Seems a C++ unit test is broken:
The broken unit test is due to removing that code.
Timeout in
After investigating the hanging issue, I realized that it's because a failed push never gets a chance to be re-pushed. I have a draft here: #12872. The fix has some problems too, which I detailed in that PR. Please have a look and give me some comments. Since that one blocks this one, I plan to fix it first.
Interesting, I think there is also supposed to be a higher-level retry in PullManager (10-second delay). Is that not triggering a re-push for some reason?
The new test case I added worked before, and it started timing out after I fixed the broken C++ unit test, which doesn't touch anything other than the test case. I tested it without the merge, and it looks fine, so it might be related to code changes from the trunk. I suppose the direction here should be to make the pull retry work; I'll take another look tomorrow. Btw, how about the second case in PR #12872, where the wait was not blocked but an exception got thrown in the end? Any idea about that?
It might be related to the PullManager refactoring Alex recently merged. If reverting that patch fixes the issue, it would indicate a regression from the refactoring. Do you mind trying the revert out?
I've closed that PR, since when I installed a nightly build from scratch I can never reproduce that issue :( I hope the nightly build I use has the latest commit; it might be better to include some version info in ray.pkg. I can still reproduce it with the trunk built on my side, though. It may be related to my dev environment; I'm using a Docker image of Ubuntu with a local volume mounted. Sorry for keeping this PR open for so long; a lot of weird issues here somehow. I'll get a new environment tomorrow and speed things up on my side.
Btw, you can check ray.__commit__ to see the git SHA for the build.
Thanks @ericl, that's a super useful feature. So the two cases mentioned in PR #12872 should still exist; I'll take a look later today.
cc @wuisawesome for PullManager
Temporarily reverted #12335 for CI testing. I'll put it back once the issue gets fixed upstream.
Looks good to merge? Btw, consider pulling in master instead of rebasing; it works nicer with GitHub.
I feel it's good to go. I can't reproduce the failed test case.
Yep, that keeps all commits stable, makes conflicts easier to handle locally, and helps GitHub diffs. They all get squashed in the end, so the rebasing doesn't help anyway.
Why are these changes needed?
This PR introduces a new fetch_local argument for the ray.wait method to indicate that it can return once the object is ready anywhere in the cluster.

Related issue number
#12190

Checks
I've run scripts/format.sh to lint the changes in this PR.