[Object Manager] Pull Manager refactor#12335
Conversation
uint16_t port;
};

/// Callback for object location notifications.
I'm fine with moving this to common.h too, but it needs a larger scope than just the ObjectDirectory.
ericl
left a comment
The overall structure looks good, but it seems the retry behavior from before isn't implemented yet.
return false;
}

pull_requests_.emplace(object_id, PullRequest());
Shouldn't we call TryPull() here?
I don't think we can do that, because we don't have the object's locations yet; OnLocationChange needs to run it->second.client_locations = std::vector<NodeID>(client_ids.begin(), client_ids.end()); first.
Got it. Can you add a comment saying that returning true means the caller will be notified when locations become available (including the current ones)?
Hmmm can you take a look at the return value doc in the header? Lemme know if that needs clarification.
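A minimal sketch of the contract being discussed, assuming simplified types (ObjectID/NodeID as strings) and hypothetical names, not the actual Ray API: returning true from Pull() means the request was registered and the caller is expected to feed locations back via OnLocationChange().

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical simplified types for illustration only.
using ObjectID = std::string;
using NodeID = std::string;

struct PullRequest {
  std::vector<NodeID> client_locations;
};

class PullManagerSketch {
 public:
  // Returns true if a new pull request was registered; the caller must then
  // notify the manager of locations (including current ones) via
  // OnLocationChange(). Returns false if a request already exists.
  bool Pull(const ObjectID &object_id) {
    return pull_requests_.emplace(object_id, PullRequest()).second;
  }

  // Only once the object directory reports locations can we actually attempt
  // a pull, since Pull() had no locations to work with.
  void OnLocationChange(const ObjectID &object_id,
                        const std::vector<NodeID> &client_ids) {
    auto it = pull_requests_.find(object_id);
    if (it == pull_requests_.end()) return;
    it->second.client_locations =
        std::vector<NodeID>(client_ids.begin(), client_ids.end());
  }

  size_t NumLocations(const ObjectID &object_id) const {
    auto it = pull_requests_.find(object_id);
    return it == pull_requests_.end() ? 0 : it->second.client_locations.size();
  }

 private:
  std::unordered_map<ObjectID, PullRequest> pull_requests_;
};
```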
// the next Pull attempt since there are no more clients to try.
if (it->second.retry_timer != nullptr) {
  it->second.retry_timer->cancel();
  it->second.timer_set = false;
It seems we never initialize the timer now; is this still to be implemented?
Hmm, this timer can probably be cleaned up now that we have the global timer.
Does this PR actually throttle the number of in-flight pull requests? (I couldn't find it.) Or is it purely a refactor for now?
(Force-pushed 910f704 to 28335cf.)
Yeah, just updated the description; it's a pure refactor.
rkooo567
left a comment
Btw, isn't there any possible performance regression of using the global timer? Is there any way to test this?
stephanie-wang
left a comment
Thanks, looks good! Left some minor suggestions.
namespace ray {

PullManager::PullManager(
    NodeID &self_node_id, const std::function<bool(const ObjectID &)> object_is_local,
Consider passing in a const ref to the local objects hashmap here. Not a strong preference, but I think it's a bit nicer when reading the code.
IMO passing a lambda makes the dependency narrower here, which is a win for readability / testing.
Hmm I don't feel strongly about it, but I think const ref is as narrow as a lambda here. It makes it clear that this class has read-only access.
You have the ability to do other things like iterate over the map, etc. which is a much broader set of APIs than a single function.
I don't feel strongly about this at all, but one minor advantage of object_is_local is that it provides proof of existence only, which means we don't have to mock out/know things about the LocalObjectInfo.
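For illustration, a sketch of the lambda-based dependency under discussion (class and method names are hypothetical): the manager only learns whether an object exists locally, and cannot iterate the map or see anything about LocalObjectInfo.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical simplified type for illustration only.
using ObjectID = std::string;

class NarrowPullManager {
 public:
  // The injected predicate is the manager's ONLY view of local object state:
  // proof of existence, nothing more.
  explicit NarrowPullManager(
      std::function<bool(const ObjectID &)> object_is_local)
      : object_is_local_(std::move(object_is_local)) {}

  // A pull is only needed when the object is not already local.
  bool NeedsPull(const ObjectID &id) const { return !object_is_local_(id); }

 private:
  std::function<bool(const ObjectID &)> object_is_local_;
};
```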
pull_manager_(self_node_id_,
              [this](const ObjectID &object_id) { return object_is_local_; },
              [this](const ObjectID &object_id, const NodeID &node_id) {
                num_send_pull_request_calls_++;
It would be nice to check something about which node IDs are requested. Something like, "all node IDs are requested once" (I'm not sure what the actual invariant is).
So I would love to add an invariant like that, but I don't think the current behavior provides one. If the timer ticks multiple times, it is possible to send multiple pull requests to the same worker (not sure if that's a feature or a bug, tbh...).
Btw please let me know if there's some invariant that I just don't see right now.
Is there an invariant when the timer ticks multiple times for a set of locations (and the set doesn't change in between)?
I don't think so. We just do a simple random sample (with replacement). IMO this is one of the first pieces of low-hanging fruit we should deal with, though.
Gotcha, thanks for explaining!
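A sketch of the sampling behavior described above: each attempt draws a node uniformly at random, with replacement, so repeated ticks can pick the same node. The helper name is hypothetical, not the actual Ray code.

```cpp
#include <cassert>
#include <random>
#include <string>
#include <vector>

// Hypothetical simplified type for illustration only.
using NodeID = std::string;

// Draw one node uniformly at random from the known locations. Because each
// call samples independently (with replacement), two consecutive retries can
// target the same node; locations must be non-empty.
NodeID PickNode(const std::vector<NodeID> &locations, std::mt19937 &rng) {
  std::uniform_int_distribution<size_t> dist(0, locations.size() - 1);
  return locations[dist(rng)];
}
```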
ericl
left a comment
@wuisawesome, one problem I realized might happen with the global timer is that a pull request can be duplicated immediately if a Tick() happens right after it. To mitigate this, can we track the timestamp of the last pull and only pull again after that interval has elapsed?
Then we can also call Tick() much more frequently without affecting the actual retry timeout.
For testing, we can pass in a fake GetTime() function for mock time.
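A sketch of the suggested mitigation, on a hypothetical simplified manager: the last pull timestamp is tracked per request, retries are gated on the elapsed interval, and the clock is injected so tests can supply mock time.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical simplified type for illustration only.
using ObjectID = std::string;

class TimedRetrySketch {
 public:
  // get_time is injected so tests can drive a fake clock.
  TimedRetrySketch(std::function<double()> get_time, double retry_interval)
      : get_time_(std::move(get_time)), retry_interval_(retry_interval) {}

  // Seed the timestamp so the very first Tick() issues a pull immediately.
  void Register(const ObjectID &id) { last_pull_[id] = -retry_interval_; }

  // Issue a pull only for requests whose interval has elapsed; returns the
  // number of pulls issued. Ticking frequently no longer causes immediate
  // duplicates, because the timestamp gate rejects early retries.
  int Tick() {
    const double now = get_time_();
    int pulls = 0;
    for (auto &pair : last_pull_) {
      if (now - pair.second >= retry_interval_) {
        pair.second = now;
        ++pulls;
      }
    }
    return pulls;
  }

 private:
  std::function<double()> get_time_;
  double retry_interval_;
  std::unordered_map<ObjectID, double> last_pull_;
};
```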
/// \return Void.
void TryPull(const ObjectID &object_id);

int GetRandInt(int upper_bound);
Doc? Also, do we need a separate method for this?
I think in general this could've happened with the old code too (since the retry timer and new object locations are independent events), right?
This feels a little bit like we're reinventing the wheel on a busy-wait timer here. I'd propose we mitigate this by just always skipping the first tick instead. Thoughts?
I think tracking time is pretty standard and easy to reason about; do you have any reason not to do that? The potential regression is multiple pulls from the same node, which could be troublesome.
Only if your update frequency << retry time. For example, imagine a pull request happens 1ms after the timer tick (because we send the pull requests serially). When the next tick comes in, we may not issue a retry because it has only been 9.999 seconds.

Btw, I agree that this behavior is bad, but I'm not sure it's a regression, since the existing code can also send multiple pull requests to the same node, right? I think the natural follow-up PR is to implement a better node-picking algorithm that picks a node we don't already have a pending request to, but that's probably out of scope for this PR.
Why not set the timer frequency to 100ms then?

I don't think we can merge this without taking care of that edge case; it's true that duplicates can happen, but they never happened *immediately* before.
Because we're doing lots of unnecessary busy waiting. I'm not saying that would be the end of the world, but if it's easier to avoid busy waiting than it is to do the busy waiting, then why not?
Sure, but to be clear, I'm not suggesting we ignore the edge case, I'm suggesting we handle it without busy waiting.
I think they can, because there's no rate limit on new object locations appearing. (In practice this could happen often if you quickly create multiple tasks that require the same object.) Yes, without a mitigation this could happen a little more often (so it still makes sense to talk about mitigations), but these mitigations don't solve the underlying problem: we need rate limiting.
for (auto &pair : pull_requests_) {
  const auto &object_id = pair.first;
  auto &request = pair.second;
  const auto time = get_time_();
nit: get_time_() can be called once at the top of Tick(), outside the loop.
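A sketch of the nit, with hypothetical names: the clock is read once per Tick() rather than once per loop iteration, so every request is compared against the same timestamp.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Scan the pending requests against a single clock reading. Returns how many
// requests were scanned; the clock is invoked exactly once per call.
size_t TickScan(const std::vector<int> &pull_requests,
                const std::function<double()> &get_time) {
  const double now = get_time();  // hoisted: one read per tick
  size_t scanned = 0;
  for (int req : pull_requests) {
    (void)req;
    (void)now;  // each request would compare against the same `now`
    ++scanned;
  }
  return scanned;
}
```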
ericl
left a comment
Thanks for making the change, this looks great! One nit
Sure, I guess I'll do that in the next PR, where real rate limiting is introduced, since this is already merged.
Why are these changes needed?
This PR refactors the object manager's pull-request duties into a PullManager class (analogous to the PushManager).
This PR should introduce no behavior changes, with the slight exception that we now use a global retry timer instead of a timer per request.
Related issue number
Checks
scripts/format.sh to lint the changes in this PR.