[Object Spilling] Multi node file spilling V2. #13542
ericl merged 17 commits into ray-project:master
Conversation
   << "LookupLocations returns an empty list of locations.";
io_service_.post([callback, object_id]() {
-  callback(object_id, std::unordered_set<NodeID>(), "");
+  callback(object_id, std::unordered_set<NodeID>(), "", NodeID::Nil());
It looks like the ownership-based object directory didn't implement spilling yet.
cc @stephanie-wang Can you quickly look at the structure before I clean it up?
   << spilled_node_id << " Object id: " << object_id;
}
});
// Q: Maybe we shouldn't update the timer and rate limit the restore requests within a
cc @stephanie-wang @ericl This can be a potential place for performance degradation; would love to hear suggestions.
I think we should probably just never update the timer and make the RPC be sent at most once per second or so (and implement this within the AsyncRestoreObject method).
I think it's premature to try to optimize this since we haven't seen this situation yet. It probably makes sense to reset the timer if the restore failed, though (#13514 adds a similar reset when a needed object is evicted). Then, it should be safe to try again since we shouldn't have any other pull/restores in flight.
I think this can cause some noticeable delay in some scenarios though. For example, here's a typical object restoration path
1. Node A sends a restore request to node B.
2. Node B restores the object from external storage (restore_object).
3. Node B updates the object's location in GCS.
4. The location update is published to node A.
5. Node A sends a pull request to node B.
and if this path takes a long time (which can happen in some scenarios, since we have two slow paths: restore_object and the GCS location update), the delay can be pretty significant because we retry with exponential backoff. For example, if the restore path takes just over 700ms, the next retry doesn't fire until 1270ms (10+20+40+80+160+320+640), adding roughly half a second of delay, whereas if we handle this properly it'd take only about 10ms.
But I agree this type of scenario hasn't been seen in real life yet. How about this: I will benchmark the multi-node spilling performance (with cluster utils) against the multi-node file spilling V1 PR (which won't have this issue), and if the performance is not that different, I will just leave a comment about it.
Also, I will follow your suggestion about resetting it upon failures.
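The backoff arithmetic above can be sketched as follows (hypothetical helper names, assuming a 10 ms base delay that doubles on each retry — this is an illustration of the argument, not Ray's actual pull-retry code):

```python
def retry_times_ms(base_ms=10, attempts=7):
    """Cumulative times (ms) at which each retry fires, doubling the delay."""
    times, t, delay = [], 0, base_ms
    for _ in range(attempts):
        t += delay
        times.append(t)
        delay *= 2
    return times

def extra_delay_ms(ready_at_ms, base_ms=10):
    """How long after the object becomes ready the next retry actually fires."""
    t, delay = 0, base_ms
    while t < ready_at_ms:
        t += delay
        delay *= 2
    return t - ready_at_ms

# Retries fire at 10, 30, 70, 150, 310, 630, 1270 ms. If the restore path
# takes 701 ms, the pull only succeeds at the 1270 ms retry: 569 ms of
# extra waiting, versus ~10 ms with a fixed fast retry.
print(retry_times_ms(), extra_delay_ms(701))
```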
I agree with a fast retry here (10ms?). The only reason we had the exponential backoff previously was to avoid duplicate pull requests. It seems like this isn't an issue here, since we de-dup restores and restore requests all go to the same node, right?
Yeah, though this can still spam the node. Without some logic to suppress requests, the node that spilled the object will receive 100 * (number of nodes requesting restoration) requests per second.
Btw, I just verified that without removing this, the streaming shuffle (4 nodes) doesn't work.
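One way to suppress duplicate restore requests, as discussed above, is caller-side de-duplication: keep at most one restore RPC in flight per object and queue the other callbacks. This is a hypothetical sketch — `RestoreManager` and `send_restore_rpc` are illustrative names, not Ray's actual API:

```python
class RestoreManager:
    def __init__(self, send_restore_rpc):
        # send_restore_rpc(object_id, done_cb) issues the actual RPC (assumed).
        self._send = send_restore_rpc
        self._inflight = {}  # object_id -> list of callbacks awaiting completion

    def restore(self, object_id, callback):
        if object_id in self._inflight:
            # A restore for this object is already in flight; just wait for it.
            self._inflight[object_id].append(callback)
            return
        self._inflight[object_id] = [callback]
        self._send(object_id, lambda ok: self._on_done(object_id, ok))

    def _on_done(self, object_id, ok):
        # Fire every queued callback once the single in-flight RPC completes.
        for cb in self._inflight.pop(object_id, []):
            cb(object_id, ok)
```

With this, N concurrent pulls of the same spilled object produce one RPC to the spilling node instead of N, which is the pressure-reduction being argued for.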
stephanie-wang left a comment:
This looks a lot cleaner! I had some questions about the semantics for the restore callbacks.
num_times_fired++;
});
// Make sure the callback wasn't called.
ASSERT_EQ(num_times_fired, 1);
We actually do want the callback to fire though, right?
You are right! This assert should've been moved to the end of the test as well (that's why it succeeded). Fixed!
So, there are a couple of issues I discovered;
But surprisingly, the long-running test stability was much better. This PR succeeded for 45 iterations (then crashed with an object-lost error), whereas the current master never ran more than 16~17 iterations IIRC.
The cpp test failures are from object_manager_test, which I believe we will remove in Stephanie's PR.
I don't think we should use this protocol for other types of external storage. S3 and others are accessible by any node, so we shouldn't have to request to the original node to restore it. Another way of thinking about this is that ideally, we don't want any state left over on the original node once the object has been spilled. This is to improve stability. Can you update this to only report/use the node ID if we're spilling to disk?
+1 definitely shouldn't let code for this leak into other storage strategies.
@ericl @stephanie-wang Sounds good. Here is the leftover. I will try completing them by tomorrow night.
Can we not do (5) please? It seems an unneeded optimization that will add complexity.
> @ericl @stephanie-wang Sounds good. Here is the leftover. I will try completing them by tomorrow night.
> 1. Will make this work only for disk spilling.
> 2. Will add proper unit tests for when this is not disk spilling.
> 3. Fix (raylet) [2021-01-20 00:39:45,804 E 88131 1490662] object_buffer_pool.cc:62: Failed to get object.
> 4. Will remove RetryUpdate for remote restore requests (streaming shuffle fails otherwise).
> 5. I think we should dedup remote restore requests on both the caller and callee sides. Otherwise, with a workload like object broadcasting plus object spilling, the spilling node could come under high pressure (theoretically 100 * num_objects * num_nodes RPCs per second). @ericl @stephanie-wang If you don't think we should do this now, please comment; in that case, I will prepare a workload that demonstrates the issue and make a follow-up PR.
@ericl Without this, there could easily be thousands of RPCs per second just to restore a single object though? Are you saying that this much overhead is small enough that it's not worth adding complexity? Or do you think this sort of situation would be rare?
Well, it's not significant, but also not a problem right now. Let's defer this until it's actually a bottleneck in a workload; there are bigger issues to solve.
py_test_module_list(
    files = [
        "test_placement_group.py",
        "test_object_spilling.py",
This is temporary. I will make a PR to break this test down into two (single node vs multi node) after this PR is merged.
//python/ray/tests:test_object_spilling TIMEOUT in 3 out of 3 in 903.3s
@ericl Sorry, it must be fixed now.
Why are these changes needed?
This implements the multi-node file spilling. Unlike the previous iteration, this will keep the abstraction barrier of LocalObjectManager by having all the spilling/restoration logic within the class.
Most of the code is boilerplate to add a spilled_node_id field to the GCS server. I could've just added the node id to the URL, but the code would become a little ugly (we'd need three types of URLs: URLs the core recognizes, URLs generated by IO workers, and the URL without an offset), so I decided to go with this approach.
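As a hypothetical illustration of that trade-off (all names here are made up for the sketch, not Ray's actual structures): keeping the node id as a separate field means consumers never have to parse or rewrite URL strings, and shared storage can simply leave the field nil.

```python
from dataclasses import dataclass

@dataclass
class SpilledLocation:
    # Separate-field design: the URL stays in one canonical form, e.g.
    # "file:///tmp/ray/spill/obj?offset=0". The alternative would encode the
    # node into the URL ("file://<node_id>/tmp/..."), forcing every consumer
    # to parse it back out in several variants.
    url: str
    spilled_node_id: str  # node holding the file; "Nil" for shared storage

def needs_remote_restore(loc: SpilledLocation, nil_id: str = "Nil") -> bool:
    # Disk spills are only readable on the node that wrote them; shared
    # storage (e.g. S3) is reachable from anywhere, so no node id is kept.
    return loc.spilled_node_id != nil_id
```

Under this design, only disk spilling carries a real node id, matching the request above to keep the restore-to-original-node protocol out of other storage strategies.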
Related issue number
#13336 => Another impl
#13224 => Issue
Checks
I've run scripts/format.sh to lint the changes in this PR.