[Object Manager] Pull Manager refactor by wuisawesome · Pull Request #12335 · ray-project/ray

wuisawesome · 2020-11-24T04:34:32Z

Why are these changes needed?

This PR refactors the object manager's pull request duties into a PullManager class (analogous to the PushManager).

This PR should introduce no behavior changes, with one slight exception: we now use a global retry timer instead of one timer per request.

Related issue number

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

wuisawesome · 2020-11-26T04:03:15Z

src/ray/object_manager/object_directory.h

  uint16_t port;
 };

+/// Callback for object location notifications.


I'm fine with moving this to common.h too, but it needs a larger scope than just the ObjectDirectory.

ericl

The overall structure looks good, but it seems the retry behavior from before isn't implemented yet.

ericl · 2020-11-28T00:31:51Z

src/ray/object_manager/pull_manager.cc

+    return false;
+  }
+
+  pull_requests_.emplace(object_id, PullRequest());


Shouldn't we call TryPull() here?

I don't think we can do that, because we don't have the object's locations yet: OnLocationChange needs to run it->second.client_locations = std::vector<NodeID>(client_ids.begin(), client_ids.end()); first.

Got it. Can you add a comment saying that return true means the caller will notify on location available (including current locs)?

Hmmm can you take a look at the return value doc in the header? Lemme know if that needs clarification.
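To make the ordering discussed in this thread concrete, here is a minimal sketch of the flow: Pull() only registers the request, and the first pull attempt happens once OnLocationChange has stored the locations. It is not the PR's actual code. PullFlowSketch, the string typedefs, the attempt counter, and the assumption that the early return false covers an already-pending request are all hypothetical.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for ray::ObjectID and ray::NodeID.
using ObjectID = std::string;
using NodeID = std::string;

struct PullRequest {
  std::vector<NodeID> client_locations;
};

class PullFlowSketch {
 public:
  // Returns true for a new request: the caller is then expected to subscribe
  // to the object's locations and forward them to OnLocationChange.
  // Returns false if the object is already being pulled (assumed condition).
  bool Pull(const ObjectID &object_id) {
    if (pull_requests_.count(object_id) > 0) {
      return false;
    }
    // No locations are known yet, so no pull can be attempted here.
    pull_requests_.emplace(object_id, PullRequest());
    return true;
  }

  void OnLocationChange(const ObjectID &object_id,
                        const std::vector<NodeID> &client_ids) {
    auto it = pull_requests_.find(object_id);
    if (it == pull_requests_.end()) {
      return;  // Not an active pull; ignore the notification.
    }
    it->second.client_locations =
        std::vector<NodeID>(client_ids.begin(), client_ids.end());
    TryPull(object_id);
  }

  int num_pull_attempts() const { return num_pull_attempts_; }

 private:
  void TryPull(const ObjectID &object_id) {
    auto it = pull_requests_.find(object_id);
    assert(it != pull_requests_.end());
    if (it->second.client_locations.empty()) {
      return;  // Nothing to pull from yet.
    }
    ++num_pull_attempts_;  // Stands in for sending a real pull request.
  }

  std::unordered_map<ObjectID, PullRequest> pull_requests_;
  int num_pull_attempts_ = 0;
};
```

In this sketch a second Pull() for the same object returns false and triggers nothing, which is the contract the requested comment would need to spell out.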

ericl · 2020-11-28T00:35:49Z

src/ray/object_manager/pull_manager.cc

+    // the next Pull attempt since there are no more clients to try.
+    if (it->second.retry_timer != nullptr) {
+      it->second.retry_timer->cancel();
+      it->second.timer_set = false;


It seems we never initialize the timer now, is this to be implemented?

Hmm this timer can probably be cleaned up now that we will have the global timer

rkooo567 · 2020-11-29T07:51:08Z

Does this PR actually throttle the number of in-flight pull requests? (I couldn't find it). Is it purely for refactoring now?

wuisawesome · 2020-11-30T08:54:25Z

Does this PR actually throttle the number of in-flight pull requests? (I couldn't find it). Is it purely for refactoring now?

Yeah, I just updated the description. It's a pure refactor.

rkooo567

Btw, is there any possible performance regression from using the global timer? Is there any way to test this?

stephanie-wang

Thanks, looks good! Left some minor suggestions.

stephanie-wang · 2020-11-30T22:35:41Z

src/ray/object_manager/pull_manager.cc

+namespace ray {
+
+PullManager::PullManager(
+    NodeID &self_node_id, const std::function<bool(const ObjectID &)> object_is_local,


Consider passing in a const ref to the local objects hashmap here. Not a strong preference, but I think it's a bit nicer when reading the code.

IMO passing a lambda makes the dependency narrower here, which is a win for readability / testing.

Hmm I don't feel strongly about it, but I think const ref is as narrow as a lambda here. It makes it clear that this class has read-only access.

You have the ability to do other things like iterate over the map, etc. which is a much broader set of APIs than a single function.

I don't feel strongly about this at all, but one minor advantage of object_is_local is that it provides proof of existence only, which means we don't have to mock out/know things about the LocalObjectInfo.
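For readers following the lambda versus const-ref discussion, a minimal sketch of the narrow-dependency option is below. It assumes a simplified class that takes only an object_is_local predicate. PullManagerSketch, ShouldPull, and the string ObjectID are hypothetical names, not the PR's API.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for ray::ObjectID.
using ObjectID = std::string;

// The class can only ask "is this object local?". It cannot iterate over or
// otherwise inspect the owner's local-object map.
class PullManagerSketch {
 public:
  explicit PullManagerSketch(
      std::function<bool(const ObjectID &)> object_is_local)
      : object_is_local_(std::move(object_is_local)) {
    assert(object_is_local_ != nullptr);
  }

  // A pull is only worth attempting when the object is not already local.
  bool ShouldPull(const ObjectID &object_id) const {
    return !object_is_local_(object_id);
  }

 private:
  std::function<bool(const ObjectID &)> object_is_local_;
};
```

A test can then pass any predicate, for example one that always returns a fixed bool, without constructing a LocalObjectInfo map at all.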

stephanie-wang · 2020-11-30T22:38:18Z

src/ray/object_manager/test/pull_manager_test.cc

+        pull_manager_(self_node_id_,
+                      [this](const ObjectID &object_id) { return object_is_local_; },
+                      [this](const ObjectID &object_id, const NodeID &node_id) {
+                        num_send_pull_request_calls_++;


It would be nice to check something about which node IDs are requested. Something like, "all node IDs are requested once" (I'm not sure what the actual invariant is).

So I would love to add an invariant like that, but I don't think the current behavior does that. If the timer ticks multiple times, it is possible to send multiple pull requests to the same worker (not sure if that's a feature or bug tbh...)

Btw please let me know if there's some invariant that I just don't see right now.

Is there an invariant when the timer ticks multiple times for a set of locations (and the set doesn't change in between)?

I don't think so. We just do a simple random sample (with replacement). IMO this is one of the first low hanging fruit that we should deal with though.

Gotcha, thanks for explaining!
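The "random sample with replacement" behavior described in this thread can be sketched as follows. PickRandomLocation and SimulateTicks are hypothetical helpers written for illustration, and the string NodeID is a stand-in type. The point is only that nothing remembers earlier picks, so repeated ticks can target the same node.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical stand-in for ray::NodeID.
using NodeID = std::string;

// Picks one location uniformly at random. No state about earlier picks is
// kept, so successive calls sample with replacement.
const NodeID &PickRandomLocation(const std::vector<NodeID> &locations,
                                 std::mt19937 &gen) {
  assert(!locations.empty());
  std::uniform_int_distribution<std::size_t> dist(0, locations.size() - 1);
  return locations[dist(gen)];
}

// Records which node each of num_ticks retry ticks would send a request to,
// assuming the location set does not change in between.
std::vector<NodeID> SimulateTicks(const std::vector<NodeID> &locations,
                                  int num_ticks, unsigned seed) {
  std::mt19937 gen(seed);
  std::vector<NodeID> requested;
  for (int i = 0; i < num_ticks; ++i) {
    requested.push_back(PickRandomLocation(locations, gen));
  }
  return requested;
}
```

With two locations and three ticks, at least one node necessarily receives two requests, so an "each node is requested once" invariant cannot hold under this scheme.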

ericl

@wuisawesome one problem I realized with the global timer is that a pull request can be duplicated immediately if a Tick() happens right after it. To mitigate this, can we track the timestamp of the last pull and only pull again after the retry time has elapsed?

Then we can also call Tick() much more frequently and that doesn't affect the actual timeout of the retry.

For testing, we can pass in a fake GetTime() function for mock time.
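A rough sketch of this suggestion, under stated assumptions, might look like the following. RetryGateSketch, its constructor parameters, and the use of seconds as a double are all hypothetical. Only the idea of storing a per-request time and injecting a GetTime()-style function for tests comes from the comment above.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for ray::ObjectID.
using ObjectID = std::string;

class RetryGateSketch {
 public:
  RetryGateSketch(std::function<double()> get_time, double retry_timeout_s,
                  std::function<void(const ObjectID &)> send_pull)
      : get_time_(std::move(get_time)),
        retry_timeout_s_(retry_timeout_s),
        send_pull_(std::move(send_pull)) {
    assert(retry_timeout_s_ > 0);
  }

  // Sends the first pull right away and records when a retry becomes due.
  void Pull(const ObjectID &object_id) {
    next_retry_time_[object_id] = get_time_() + retry_timeout_s_;
    send_pull_(object_id);
  }

  // May be called as often as desired: a request is only re-sent once its
  // own retry timeout has elapsed, so a Tick() right after Pull() is a no-op.
  void Tick() {
    const double now = get_time_();
    for (auto &pair : next_retry_time_) {
      if (now >= pair.second) {
        send_pull_(pair.first);
        pair.second = now + retry_timeout_s_;
      }
    }
  }

 private:
  std::function<double()> get_time_;
  double retry_timeout_s_;
  std::function<void(const ObjectID &)> send_pull_;
  std::unordered_map<ObjectID, double> next_retry_time_;
};
```

In a test, get_time can be a lambda that returns a variable the test advances by hand, so no real waiting is needed.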

stephanie-wang · 2020-12-09T20:35:12Z

src/ray/object_manager/pull_manager.h

  /// \return Void.
  void TryPull(const ObjectID &object_id);
+
+  int GetRandInt(int upper_bound);


Doc? Also, do we need a separate method for this?

wuisawesome · 2020-12-09T21:04:29Z

I realized might happen with the global timer is a pull request can be duplicated immediately if a Tick() happens right after

I think in general, this could've happened with the old code too (since the retry timer and new object locations are independent events) right?

To mitigate this, can we track the timestamp...

This feels a little bit like we're reinventing the wheel on a busy wait timer here. I'd propose we mitigate this by just always skipping the first tick instead.

Thoughts?

ericl · 2020-12-09T21:09:14Z

I think tracking time is pretty standard and easy to reason about, do you have any reason not to do that? The potential regression is multiple pulls from the same node which could be troublesome.

wuisawesome · 2020-12-09T21:16:06Z

I think tracking time is pretty standard and easy to reason about

Only if your tick interval << retry time. For example, imagine a pull request happens 1ms after the timer tick (because we send the pull requests serially). Now when the next tick comes in, we may not issue a retry because it has only been 9.999 seconds.

Btw, I agree that this behavior is bad, but I'm not sure it's a regression since the existing code can also send multiple pull requests to the same node right?

I think the natural follow up PR is to implement a better node picking algorithm to pick a node which we don't have a pending request to, but that's probably out of scope for this PR.

ericl · 2020-12-09T21:19:15Z

Why not set the timer frequency to 100ms then? I don't think we can merge this without taking care of that edge case, it's true that duplicates can happen but they never happened *immediately* before.
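The granularity trade-off in this exchange can be worked through with a small helper. This is an illustrative calculation only. RetryDelayMs is a hypothetical function, and it assumes ticks fire at exact multiples of the tick period and that a retry fires at the first tick at or after the retry timeout has elapsed.

```cpp
#include <cassert>
#include <cstdint>

// Returns how long after the original request the retry actually fires, in
// milliseconds. Ticks are assumed to occur at 0, P, 2P, ... where P is
// tick_period_ms, and the request is sent request_offset_ms after a tick.
std::int64_t RetryDelayMs(std::int64_t tick_period_ms,
                          std::int64_t retry_timeout_ms,
                          std::int64_t request_offset_ms) {
  assert(tick_period_ms > 0);
  assert(retry_timeout_ms > 0);
  assert(request_offset_ms >= 0 && request_offset_ms < tick_period_ms);
  // Earliest moment at which the retry is allowed.
  const std::int64_t due_ms = request_offset_ms + retry_timeout_ms;
  // First tick at or after that moment (ceiling division).
  const std::int64_t first_tick_ms =
      ((due_ms + tick_period_ms - 1) / tick_period_ms) * tick_period_ms;
  return first_tick_ms - request_offset_ms;
}
```

With a 10 s tick and a 10 s retry timeout, a request sent 1 ms after a tick is retried 19,999 ms later, because it is only 9.999 s old at the next tick. With a 100 ms tick the same request is retried after 10,099 ms, so the overshoot stays under one tick period.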

wuisawesome · 2020-12-09T21:33:15Z

Why not set the timer frequency to 100ms then?

Because we're doing lots of unnecessary busy waiting. I'm not saying that would be the end of the world, but if it's easier to avoid busy waiting than it is to do the busy waiting, then why not?

I don't think we can merge this without taking care of that edge case

Sure, but to be clear, I'm not suggesting we ignore the edge case, I'm suggesting we handle it without busy waiting.

they never happened immediately before.

I think they can because there's no rate limit on new object locations appearing. (In practice this could happen often if you quickly create multiple tasks which require the same object). Yes, without a mitigation, this could happen a little more often (so it still makes sense to talk about mitigations), but these mitigations don't solve the underlying problem (we need rate limiting).

ericl · 2020-12-11T19:55:20Z

src/ray/object_manager/pull_manager.cc

+  for (auto &pair : pull_requests_) {
+    const auto &object_id = pair.first;
+    auto &request = pair.second;
+    const auto time = get_time_();


nit: can call time at the top of the tick outside the loop

ericl

Thanks for making the change, this looks great! One nit

wuisawesome · 2020-12-11T19:57:40Z

Thanks for making the change, this looks great! One nit

Sure, I guess I'll do that in the next PR where real rate limiting is introduced since this is already merged?

Alex added 3 commits November 24, 2020 00:40

.

6300209

.

f0d900e

.

d004673

ericl reviewed Nov 24, 2020

Alex added 7 commits November 26, 2020 01:10

onlocationchanged

11c7147

cancel pull

753a425

cancel pull

f2f0d1a

numrequests

162cfef

needs tests

6cc84e0

test done

ae83c8c

Cleanup

2156d6e

wuisawesome commented Nov 26, 2020

docs

54d6a50

wuisawesome changed the title ~~[WIP][Object Manager] Pull Manager refactor~~ [Object Manager] Pull Manager refactor Nov 26, 2020

wuisawesome assigned ericl Nov 26, 2020

whoops

910f704

ericl requested changes Nov 28, 2020

ericl assigned stephanie-wang Nov 28, 2020

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 28, 2020

Alex added 5 commits November 30, 2020 06:52

merge

aa104dd

.

1909986

.

4ec1b52

.

9d97dc7

done?

28335cf

wuisawesome force-pushed the pull_manager_2 branch from 910f704 to 28335cf Compare November 30, 2020 08:49

rkooo567 reviewed Nov 30, 2020

stephanie-wang requested changes Nov 30, 2020

Alex added 7 commits November 30, 2020 23:43

.

d9753a1

.

c630645

builds properly

e9f48ba

good?

82293bb

.

6cf3175

CR + lint

b1fb7e9

CR + lint

e5fb748

ericl removed the @author-action-required label Dec 9, 2020

ericl requested changes Dec 9, 2020

ericl added the @author-action-required label Dec 9, 2020

stephanie-wang approved these changes Dec 9, 2020

Alex Wu added 3 commits December 10, 2020 21:08

Merge branch 'pull_manager_2' of github.com:wuisawesome/ray into pull_manager_2

fe554b5

With time

b0ca5aa

Merge branch 'pull_manager_2' of github.com:wuisawesome/ray into pull_manager_2

aa39618

wuisawesome added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required labels Dec 11, 2020

ericl reviewed Dec 11, 2020

ericl approved these changes Dec 11, 2020

ericl merged commit 676ec36 into ray-project:master Dec 11, 2020

This was referenced Dec 15, 2020

Retry when push failed #12872

Closed

[core] Introduce fetch_local to ray.wait #12526

Merged

wuisawesome mentioned this pull request Dec 16, 2020

Fix pull manager retry #12907

Merged
