[Object Spilling] Multi node file spilling V2. #13542

Merged
ericl merged 17 commits into ray-project:master from
rkooo567:multi-node-file-spilling-step
Jan 24, 2021

Conversation

@rkooo567
Contributor

@rkooo567 rkooo567 commented Jan 19, 2021

Why are these changes needed?

This implements the multi-node file spilling. Unlike the previous iteration, this will keep the abstraction barrier of LocalObjectManager by having all the spilling/restoration logic within the class.

Most of the code is boilerplate to add a spilled_node_id field to the GCS server. I could've just added the node ID to the URL, but it turns out the code would become a little ugly that way (we'd need three types of URLs: URLs the core recognizes, URLs generated by IO workers, and URLs without an offset), so I decided to go with this approach.
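To illustrate the tradeoff described above, here is a hypothetical Python sketch (not Ray's actual data model; `SpilledLocation` and `parse_url_encoded` are made-up names) contrasting a dedicated spilled_node_id field with encoding the node ID inside the URL string:

```python
from typing import NamedTuple


class SpilledLocation(NamedTuple):
    """Approach taken in this PR: the URL stays opaque, and the node ID
    travels as its own field, so consumers never parse the URL."""
    url: str              # e.g. "file:///tmp/spill/obj?offset=0"
    spilled_node_id: str


def parse_url_encoded(url_with_node: str):
    """Rejected alternative: bake the node ID into the URL string.

    Every consumer must now know this encoding, and the system ends up
    with several URL flavors (with/without node ID, with/without offset).
    """
    node_id, _, url = url_with_node.partition("|")
    return node_id, url


loc = SpilledLocation("file:///tmp/spill/obj?offset=0", "node-abc")
assert loc.spilled_node_id == "node-abc"
assert parse_url_encoded("node-abc|file:///tmp/spill/obj") == (
    "node-abc", "file:///tmp/spill/obj")
```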

Related issue number

#13336 => Another impl
#13224 => Issue

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

<< "LookupLocations returns an empty list of locations.";
io_service_.post([callback, object_id]() {
callback(object_id, std::unordered_set<NodeID>(), "");
callback(object_id, std::unordered_set<NodeID>(), "", NodeID::Nil());
Contributor Author

It looks like the ownership-based object directory hasn't implemented spilling yet.

@rkooo567
Contributor Author

cc @stephanie-wang Can you quickly look at the structure before I clean it up?

@stephanie-wang stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 19, 2021
@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 20, 2021
<< spilled_node_id << " Object id: " << object_id;
}
});
// Q: Maybe we shouldn't update the timer and rate limit the restore requests within a
Contributor Author

cc @stephanie-wang @ericl This is a potential place for performance degradation. I'd love to hear suggestions.

Contributor Author

@rkooo567 rkooo567 Jan 20, 2021

I think we should probably just never update the timer and instead cap the RPC rate at, say, one per second at most (implementing this within the AsyncRestoreObject method).

Contributor

I think it's premature to try to optimize this since we haven't seen this situation yet. It probably makes sense to reset the timer if the restore failed, though (#13514 adds a similar reset when a needed object is evicted). Then, it should be safe to try again since we shouldn't have any other pull/restores in flight.

Contributor Author

I think this can cause some noticeable delay in some scenarios though. For example, here's a typical object restoration path

   node A                        node B
1. restore request        =>
2.                               restore_object from external storage
3.                               update the location in GCS
4. message published
5. send a pull request    =>

and if this path takes a long time (which can happen in some scenarios, since it contains two slow steps: restore_object and the GCS location update), the delay can be pretty significant because we retry with exponential backoff. For example, if the path takes more than 700ms, the extra delay could be as long as half a second (retry intervals of 10+20+40+80+160+320+640 ms), whereas if we handle this properly it would take only 10ms.
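The backoff arithmetic above can be sketched as follows (a toy model of the retry schedule, not Ray's actual pull-retry code; function names are illustrative):

```python
def retry_times_ms(base_ms: int = 10, attempts: int = 7) -> list[int]:
    """Cumulative retry timestamps under exponential backoff:
    intervals 10, 20, 40, ... ms give retries at 10, 30, 70, ... ms."""
    times, delay, t = [], base_ms, 0
    for _ in range(attempts):
        t += delay
        times.append(t)
        delay *= 2
    return times


def extra_wait_ms(ready_at_ms: int, times: list[int]) -> int:
    """Avoidable wait between the object becoming ready and the next retry."""
    for t in times:
        if t >= ready_at_ms:
            return t - ready_at_ms
    return 0


times = retry_times_ms()           # [10, 30, 70, 150, 310, 630, 1270]
extra = extra_wait_ms(700, times)  # 570 ms of avoidable waiting
```

So an object that becomes restorable at 700 ms sits idle for another ~570 ms before the next retry fires, which matches the "half a second" estimate above.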

But I agree this type of scenario hasn't been seen in real life yet. How about this: I will benchmark the multi-node spilling performance (with cluster utils) against the multi-node file spilling V1 PR (which won't have this issue), and if the performance is not that different, I will just leave a comment about it.

Also, I will follow your suggestion about resetting it upon failures.

Contributor

I agree with a fast retry here (10ms?). The only reason we had the exponential backoff previously was to avoid duplicate pull requests. It seems like this isn't an issue here, since we de-dup restores and restore requests all go to the same node, right?

Contributor Author

Yeah. This can still spam the node, though. Without some logic to suppress requests, the node that spilled the object would receive 100 * (number of nodes requesting restoration) requests per second.

Btw, I just verified that without removing this, the streaming shuffle (4 nodes) doesn't work.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 20, 2021
@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 20, 2021
Contributor

@stephanie-wang stephanie-wang left a comment

This looks a lot cleaner! I had some questions about the semantics for the restore callbacks.

num_times_fired++;
});
// Make sure the callback wasn't called.
ASSERT_EQ(num_times_fired, 1);
Contributor

We actually do want the callback to fire though, right?

Contributor Author

You are right! This assert should've been moved to the end of the test as well (that's why it succeeded). Fixed!

@stephanie-wang stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 20, 2021
@rkooo567
Contributor Author

rkooo567 commented Jan 20, 2021

So, there are a couple of issues I discovered:

  1. When running the streaming shuffle, there's a weird new error, (raylet) [2021-01-20 00:39:45,804 E 88131 1490662] object_buffer_pool.cc:62: Failed to get object

  2. The streaming shuffle performance (4 nodes cluster util) seems to be almost 2X slower (without retryUpdate)

But surprisingly, the long-running test stability was much better. This PR succeeded for 45 iterations (before crashing with an object-lost error), whereas the current master never ran more than 16~17 iterations IIRC.

@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 20, 2021
@rkooo567
Contributor Author

The cpp test failures are from object_manager_test, which will be removed in Stephanie's PR, I believe.

@stephanie-wang
Contributor

I don't think we should use this protocol for other types of external storage. S3 and the like are accessible from any node, so we shouldn't have to ask the original node to restore the object. Another way of thinking about this: ideally, we don't want any state left over on the original node once the object has been spilled. This improves stability.

Can you update this to only report/use the node ID if we're spilling to disk?
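The requested conditional could look like the following sketch (illustrative names, not Ray's actual API; `NIL_NODE_ID` stands in for NodeID::Nil()): only report the spilling node's ID for filesystem spilling, since S3-style storage can be restored from any node.

```python
NIL_NODE_ID = ""  # stand-in for NodeID::Nil()


def spilled_location_to_report(url: str, storage_type: str, node_id: str):
    """Attach the spilling node's ID only for local-disk spilling;
    object stores like S3 are reachable from every node, so no node
    affinity (and no leftover state on the spiller) is needed."""
    if storage_type == "filesystem":
        return url, node_id
    return url, NIL_NODE_ID


assert spilled_location_to_report(
    "file:///tmp/spill/x", "filesystem", "n1") == ("file:///tmp/spill/x", "n1")
assert spilled_location_to_report(
    "s3://bucket/x", "s3", "n1") == ("s3://bucket/x", NIL_NODE_ID)
```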

@stephanie-wang stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 21, 2021
@ericl
Contributor

ericl commented Jan 21, 2021 via email

@rkooo567
Contributor Author

rkooo567 commented Jan 21, 2021

@ericl @stephanie-wang Sounds good. Here are the remaining items. I will try to complete them by tomorrow night.

  1. Will make this work only for disk spilling.
  2. Will add proper unit tests when this is not disk spilling.
  3. Fix (raylet) [2021-01-20 00:39:45,804 E 88131 1490662] object_buffer_pool.cc:62: Failed to get object.
  4. I will remove RetryUpdate for remote restore requests (streaming shuffle fails otherwise)
  5. I think we should dedup remote restore requests on both the caller and callee sides. Otherwise, with a workload like object broadcasting combined with object spilling, the spilling node could come under high pressure (theoretically 100 * num_object * num_nodes RPCs per second). @ericl @stephanie-wang If you don't think we should do this now, please comment! In that case, I will prepare a workload that demonstrates the issue and make a PR after that.
    => Deprioritized until issues found in real workload.
  6. Resolve the merge conflict.
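The caller-side half of the dedup idea in item 5 could be sketched like this (a hypothetical model, not Ray's API; `RestoreRequestDeduper` is a made-up name): concurrent restore requests for the same object are coalesced into a single in-flight RPC, and all waiters are notified when it completes.

```python
class RestoreRequestDeduper:
    """Coalesce concurrent restore requests for the same object into one RPC."""

    def __init__(self, send_rpc):
        self._send_rpc = send_rpc
        self._in_flight = {}  # object_id -> callbacks waiting on the restore

    def request_restore(self, object_id, callback):
        if object_id in self._in_flight:
            # An RPC for this object is already in flight; just queue the callback.
            self._in_flight[object_id].append(callback)
            return
        self._in_flight[object_id] = [callback]
        self._send_rpc(object_id)

    def on_restore_done(self, object_id, status):
        # Fire every coalesced callback once the single restore completes.
        for cb in self._in_flight.pop(object_id, []):
            cb(object_id, status)


sent = []
dedup = RestoreRequestDeduper(send_rpc=sent.append)
results = []
for _ in range(3):
    dedup.request_restore("obj", lambda oid, ok: results.append((oid, ok)))
dedup.on_restore_done("obj", True)
assert sent == ["obj"]      # one RPC despite three concurrent requests
assert len(results) == 3    # every requester still gets its callback
```

A symmetric structure on the callee side (the spilling node) would cap the per-object restore load regardless of how many nodes are pulling.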

@ericl
Contributor

ericl commented Jan 21, 2021 via email

@rkooo567
Contributor Author

rkooo567 commented Jan 21, 2021

@ericl Without this, there could easily be thousands of RPCs per second just to restore a single object, though. Are you saying this much overhead is small enough that it's not worth the added complexity? Or do you think this sort of situation would be rare?

@ericl
Contributor

ericl commented Jan 21, 2021 via email

@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 22, 2021
py_test_module_list(
files = [
"test_placement_group.py",
"test_object_spilling.py",
Contributor Author

This is temporary. I will make a PR to break this test down into two (single node vs. multi node) after this PR is merged.

@ericl
Contributor

ericl commented Jan 22, 2021

//python/ray/tests:test_object_spilling TIMEOUT in 3 out of 3 in 903.3s
Stats over 3 runs: max = 903.3s, min = 903.2s, avg = 903.3s, dev = 0.1s
/private/var/tmp/_bazel_travis/7c557718f3877739c657a427203800b1/execroot/com_github_ray_project_ray/bazel-out/darwin-opt/testlogs/python/ray/tests/test_object_spilling/test.log
/private/var/tmp/_bazel_travis/7c557718f3877739c657a427203800b1/execroot/com_github_ray_project_ray/bazel-out/darwin-opt/testlogs/python/ray/tests/test_object_spilling/test_attempts/attempt_1.log
/private/var/tmp/_bazel_travis/7c557718f3877739c657a427203800b1/execroot/com_github_ray_project_ray/bazel-out/darwin-opt/testlogs/python/ray/tests/test_object_spilling/test_attempts/attempt_2.log

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 22, 2021
@rkooo567
Contributor Author

@ericl Sorry, it should be fixed now.

@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 23, 2021
@ericl ericl merged commit edbb293 into ray-project:master Jan 24, 2021
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
* done.

* done.

* Fix a mistake.

* Ready.

* Fix issues.

* fix.

* Finished the first round of code review.

* formatting.

* In progress.

* Formatting.

* Addressed code review.

* Formatting

* Fix tests.

* fix bugs.

* Skip flaky tests for now.
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021