Skip to content

[Object Manager] Pull Manager refactor#12335

Merged
ericl merged 28 commits intoray-project:masterfrom
wuisawesome:pull_manager_2
Dec 11, 2020
Merged

[Object Manager] Pull Manager refactor#12335
ericl merged 28 commits intoray-project:masterfrom
wuisawesome:pull_manager_2

Conversation

@wuisawesome
Copy link
Copy Markdown
Contributor

@wuisawesome wuisawesome commented Nov 24, 2020

Why are these changes needed?

This PR is a refactor fo the object manager's pull request duties to a PullManager class (analogous to the PushManager).

This PR should introduce no behavior changes with the slight exception that we are now using a global retry timer instead of a timer per request.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

uint16_t port;
};

/// Callback for object location notifications.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with moving this to common.h too, but it needs a larger scope than just the ObjectDirectory.

@wuisawesome wuisawesome changed the title [WIP][Object Manager] Pull Manager refactor [Object Manager] Pull Manager refactor Nov 26, 2020
Copy link
Copy Markdown
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall structure looks good, but it seems the retry behavior from before isn't implemented yet.

return false;
}

pull_requests_.emplace(object_id, PullRequest());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we call TryPull() here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can do that because we don't have object's locations still since OnLocationChange needs to do it->second.client_locations = std::vector<NodeID>(client_ids.begin(), client_ids.end()); first

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Can you add a comment saying that return true means the caller will notify on location available (including current locs)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm can you take a look at the return value doc in the header? Lemme know if that needs clarification.

// the next Pull attempt since there are no more clients to try.
if (it->second.retry_timer != nullptr) {
it->second.retry_timer->cancel();
it->second.timer_set = false;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we never initialize the timer now, is this to be implemented?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this timer can probably be cleaned up now that we will have the global timer

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 28, 2020
@rkooo567
Copy link
Copy Markdown
Contributor

Does this PR actually throttle the number of in-flight pull requests? (I couldn't find it). Is it purely for refactoring now?

@wuisawesome
Copy link
Copy Markdown
Contributor Author

Does this PR actually throttle the number of in-flight pull requests? (I couldn't find it). Is it purely for refactoring now?

Yeah, just updated the description, but yeah pure refactor.

Copy link
Copy Markdown
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, isn't there any possible performance regression of using the global timer? Is there any way to test this?

Copy link
Copy Markdown
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good! Left some minor suggestions.

namespace ray {

PullManager::PullManager(
NodeID &self_node_id, const std::function<bool(const ObjectID &)> object_is_local,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider passing in a const ref to the local objects hashmap here. Not a strong preference, but I think it's a bit nicer when reading the code.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO passing a lambda makes the dependency narrower here, which is a win for readability / testing.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I don't feel strongly about it, but I think const ref is as narrow as a lambda here. It makes it clear that this class has read-only access.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have the ability to do other things like iterate over the map, etc. which is a much broader set of APIs than a single function.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly about this at all, but one minor advantage of object_is_local is that it provides proof of existence only, which means we don't have to mock out/know things about the LocalObjectInfo.

pull_manager_(self_node_id_,
[this](const ObjectID &object_id) { return object_is_local_; },
[this](const ObjectID &object_id, const NodeID &node_id) {
num_send_pull_request_calls_++;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to check something about which node IDs are requested. Something like, "all node IDs are requested once" (I'm not sure what the actual invariant is).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I would love to add an invariant like that, but I don't think the current behavior does that. If the timer ticks multiple times, it is possible to send multiple pull requests to the same worker (not sure if that's a feature or bug tbh...)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw please let me know if there's some invariant that I just don't see right now.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an invariant when the timer ticks multiple times for a set of locations (and the set doesn't change in between)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. We just do a simple random sample (with replacement). IMO this is one of the first low hanging fruit that we should deal with though.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, thanks for explaining!

@ericl ericl removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 9, 2020
Copy link
Copy Markdown
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wuisawesome one problem I realized might happen with the global timer is a pull request can be duplicated immediately if a Tick() happens right after. To mitigate this, can we track the timestamp of the last pull and only pull after that time has elapsed?

Then we can also call Tick() much more frequently and that doesn't affect the actual timeout of the retry.

For testing, we can pass in a fake GetTime() function for mock time.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 9, 2020
/// \return Void.
void TryPull(const ObjectID &object_id);

int GetRandInt(int upper_bound);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc? Also, do we need a separate method for this?

@wuisawesome
Copy link
Copy Markdown
Contributor Author

I realized might happen with the global timer is a pull request can be duplicated immediately if a Tick() happens right after

I think in general, this could've happened with the old code too (since the retry timer and new object locations are independent events) right?

To mitigate this, can we track the timestamp...

This feels a little bit like we're reinventing the wheel on a busy wait timer here. I'd propose we mitigate this by just always skipping the first tick instead.

Thoughts?

@ericl
Copy link
Copy Markdown
Contributor

ericl commented Dec 9, 2020 via email

@wuisawesome
Copy link
Copy Markdown
Contributor Author

I think tracking time is pretty standard and easy to reason about

Only if your update frequency << retry time. For example, imagine a pull request happens 1ms after the timer tick (because we send the pull requests serially). Now when the next tick comes in, we may not issue a retry because it has only been 9.999 seconds.

Btw, I agree that this behavior is bad, but I'm not sure it's a regression since the existing code can also send multiple pull requests to the same node right?

I think the natural follow up PR is to implement a better node picking algorithm to pick a node which we don't have a pending request to, but that's probably out of scope for this PR.

@ericl
Copy link
Copy Markdown
Contributor

ericl commented Dec 9, 2020 via email

@wuisawesome
Copy link
Copy Markdown
Contributor Author

Why not set the timer frequency to 100ms then?

Because we're doing lots of unnecessary busy waiting. I'm not saying that would be the end of the world, but if it's easier to avoid busy waiting than it is to do the busy waiting, then why not?

I don't think we can merge this without taking care of that edge case

Sure, but to be clear, I'm not suggesting we ignore the edge case, I'm suggesting we handle it without busy waiting.

they never happened immediately before.

I think they can because there's no rate limit on new object locations appearing. (In practice this could happen often if you quickly create multiple tasks which require the same object). Yes, without a mitigation, this could happen a little more often (so it still makes sense to talk about mitigations), but these mitigations don't solve the underlying problem (we need rate limiting).

@wuisawesome wuisawesome added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Dec 11, 2020
for (auto &pair : pull_requests_) {
const auto &object_id = pair.first;
auto &request = pair.second;
const auto time = get_time_();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can call time at the top of the tick outside the loop

Copy link
Copy Markdown
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the change, this looks great! One nit

@ericl ericl merged commit 676ec36 into ray-project:master Dec 11, 2020
@wuisawesome
Copy link
Copy Markdown
Contributor Author

Thanks for making the change, this looks great! One nit

Sure, I guess I'll do that in the next PR where real rate limiting is introduced since this is already merged?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tests-ok The tagger certifies test failures are unrelated and assumes personal liability.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants