[Object Spilling] Multi node file spilling V2. #13542

Merged
ericl merged 17 commits into ray-project:master from
rkooo567:multi-node-file-spilling-step
Jan 24, 2021

Conversation

@rkooo567
Contributor

@rkooo567 rkooo567 commented Jan 19, 2021

Why are these changes needed?

This implements the multi-node file spilling. Unlike the previous iteration, this will keep the abstraction barrier of LocalObjectManager by having all the spilling/restoration logic within the class.

Most of the code is boilerplate to add a spilled_node_id field to the GCS server. I could've just added the node ID to the URL, but it turns out the code would become a little ugly that way (we'd need three types of URLs: URLs the core recognizes, URLs generated by IO workers, and URLs without an offset), so I decided to go with this approach.
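To illustrate the tradeoff described above, here is a hypothetical Python sketch (not Ray's actual data model; `SpilledLocation` and `parse_url_encoded` are made-up names) contrasting a dedicated spilled_node_id field with encoding the node ID inside the URL string:

```python
from typing import NamedTuple


class SpilledLocation(NamedTuple):
    """Approach taken in this PR: the URL stays opaque, and the node ID
    travels as its own field, so consumers never parse the URL."""
    url: str              # e.g. "file:///tmp/spill/obj?offset=0"
    spilled_node_id: str


def parse_url_encoded(url_with_node: str):
    """Rejected alternative: bake the node ID into the URL string.

    Every consumer must now know this encoding, and the system ends up
    with several URL flavors (with/without node ID, with/without offset).
    """
    node_id, _, url = url_with_node.partition("|")
    return node_id, url


loc = SpilledLocation("file:///tmp/spill/obj?offset=0", "node-abc")
assert loc.spilled_node_id == "node-abc"
assert parse_url_encoded("node-abc|file:///tmp/spill/obj") == (
    "node-abc", "file:///tmp/spill/obj")
```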

Related issue number

#13336 => Another impl
#13224 => Issue

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

<< "LookupLocations returns an empty list of locations.";
io_service_.post([callback, object_id]() {
callback(object_id, std::unordered_set<NodeID>(), "");
callback(object_id, std::unordered_set<NodeID>(), "", NodeID::Nil());
Contributor Author

It looks like the ownership-based object directory hasn't implemented spilling yet.

@rkooo567
Contributor Author

cc @stephanie-wang Can you quickly look at the structure before I clean it up?

@stephanie-wang stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 19, 2021
@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 20, 2021
<< spilled_node_id << " Object id: " << object_id;
}
});
// Q: Maybe we shouldn't update the timer and rate limit the restore requests within a
Contributor Author

cc @stephanie-wang @ericl This is a potential place for performance degradation. I'd love to hear suggestions.

Contributor Author

@rkooo567 rkooo567 Jan 20, 2021

I think we should probably just never update the timer and instead cap the RPC rate at, say, one per second at most (implementing this within the AsyncRestoreObject method).

Contributor

I think it's premature to try to optimize this since we haven't seen this situation yet. It probably makes sense to reset the timer if the restore failed, though (#13514 adds a similar reset when a needed object is evicted). Then, it should be safe to try again since we shouldn't have any other pull/restores in flight.

Contributor Author

I think this can cause some noticeable delay in some scenarios though. For example, here's a typical object restoration path

   node A                        node B
1. restore request        =>
2.                               restore_object from external storage
3.                               update the location in GCS
4. message published
5. send a pull request    =>

and if this path takes a long time (which can happen in some scenarios, since it contains two slow steps: restore_object and the GCS location update), the delay can be pretty significant because we retry with exponential backoff. For example, if the path takes more than 700ms, the extra delay could be as long as half a second (retry intervals of 10+20+40+80+160+320+640 ms), whereas if we handle this properly it would take only 10ms.
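The backoff arithmetic above can be sketched as follows (a toy model of the retry schedule, not Ray's actual pull-retry code; function names are illustrative):

```python
def retry_times_ms(base_ms: int = 10, attempts: int = 7) -> list[int]:
    """Cumulative retry timestamps under exponential backoff:
    intervals 10, 20, 40, ... ms give retries at 10, 30, 70, ... ms."""
    times, delay, t = [], base_ms, 0
    for _ in range(attempts):
        t += delay
        times.append(t)
        delay *= 2
    return times


def extra_wait_ms(ready_at_ms: int, times: list[int]) -> int:
    """Avoidable wait between the object becoming ready and the next retry."""
    for t in times:
        if t >= ready_at_ms:
            return t - ready_at_ms
    return 0


times = retry_times_ms()           # [10, 30, 70, 150, 310, 630, 1270]
extra = extra_wait_ms(700, times)  # 570 ms of avoidable waiting
```

So an object that becomes restorable at 700 ms sits idle for another ~570 ms before the next retry fires, which matches the "half a second" estimate above.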

But I agree this type of scenario hasn't been seen in real life yet. How about this: I will benchmark the multi-node spilling performance (with cluster utils) against the multi-node file spilling V1 PR (which won't have this issue), and if the performance is not that different, I will just leave a comment about it.

Also, I will follow your suggestion about resetting it upon failures.

Contributor

I agree with a fast retry here (10ms?). The only reason we had the exponential backoff previously was to avoid duplicate pull requests. It seems like this isn't an issue here, since we de-dup restores and restore requests all go to the same node, right?

Contributor Author

Yeah. This can still spam the node, though. Without some logic to suppress requests, the node that spilled the object would receive 100 * (number of nodes requesting restoration) requests per second.

Btw, I just verified that without removing this, the streaming shuffle (4 nodes) doesn't work.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 20, 2021
@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 20, 2021
Contributor

@stephanie-wang stephanie-wang left a comment

This looks a lot cleaner! I had some questions about the semantics for the restore callbacks.

num_times_fired++;
});
// Make sure the callback wasn't called.
ASSERT_EQ(num_times_fired, 1);
Contributor

We actually do want the callback to fire though, right?

Contributor Author

You are right! This assert should've been moved to the end of the test as well (that's why it succeeded). Fixed!

@stephanie-wang stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 20, 2021
@rkooo567
Contributor Author

rkooo567 commented Jan 20, 2021

So, there are a couple of issues I discovered:

  1. When running the streaming shuffle, there's a weird new error, (raylet) [2021-01-20 00:39:45,804 E 88131 1490662] object_buffer_pool.cc:62: Failed to get object

  2. The streaming shuffle performance (4 nodes cluster util) seems to be almost 2X slower (without retryUpdate)

But surprisingly, the long-running test stability was much better. This PR succeeded for 45 iterations (before crashing with an object-lost error), whereas the current master never ran more than 16~17 iterations IIRC.

@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 20, 2021
@rkooo567
Contributor Author

The cpp test failures are from object_manager_test, which will be removed in Stephanie's PR, I believe.

@stephanie-wang
Contributor

I don't think we should use this protocol for other types of external storage. S3 and the like are accessible from any node, so we shouldn't have to ask the original node to restore the object. Another way of thinking about this: ideally, we don't want any state left over on the original node once the object has been spilled. This improves stability.

Can you update this to only report/use the node ID if we're spilling to disk?
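The requested conditional could look like the following sketch (illustrative names, not Ray's actual API; `NIL_NODE_ID` stands in for NodeID::Nil()): only report the spilling node's ID for filesystem spilling, since S3-style storage can be restored from any node.

```python
NIL_NODE_ID = ""  # stand-in for NodeID::Nil()


def spilled_location_to_report(url: str, storage_type: str, node_id: str):
    """Attach the spilling node's ID only for local-disk spilling;
    object stores like S3 are reachable from every node, so no node
    affinity (and no leftover state on the spiller) is needed."""
    if storage_type == "filesystem":
        return url, node_id
    return url, NIL_NODE_ID


assert spilled_location_to_report(
    "file:///tmp/spill/x", "filesystem", "n1") == ("file:///tmp/spill/x", "n1")
assert spilled_location_to_report(
    "s3://bucket/x", "s3", "n1") == ("s3://bucket/x", NIL_NODE_ID)
```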

@stephanie-wang stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 21, 2021
@ericl
Contributor

ericl commented Jan 21, 2021 via email

@rkooo567
Contributor Author

rkooo567 commented Jan 21, 2021

@ericl @stephanie-wang Sounds good. Here are the remaining items. I will try to complete them by tomorrow night.

  1. Will make this work only for disk spilling.
  2. Will add proper unit tests when this is not disk spilling.
  3. Fix (raylet) [2021-01-20 00:39:45,804 E 88131 1490662] object_buffer_pool.cc:62: Failed to get object.
  4. I will remove RetryUpdate for remote restore requests (streaming shuffle fails otherwise)
  5. I think we should dedup remote restore requests on both the caller and callee sides. Otherwise, with a workload like object broadcasting combined with object spilling, the spilling node could come under high pressure (theoretically 100 * num_object * num_nodes RPCs per second). @ericl @stephanie-wang If you don't think we should do this now, please comment! In that case, I will prepare a workload that demonstrates the issue and make a PR after that.
    => Deprioritized until issues found in real workload.
  6. Resolve the merge conflict.
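The caller-side half of the dedup idea in item 5 could be sketched like this (a hypothetical model, not Ray's API; `RestoreRequestDeduper` is a made-up name): concurrent restore requests for the same object are coalesced into a single in-flight RPC, and all waiters are notified when it completes.

```python
class RestoreRequestDeduper:
    """Coalesce concurrent restore requests for the same object into one RPC."""

    def __init__(self, send_rpc):
        self._send_rpc = send_rpc
        self._in_flight = {}  # object_id -> callbacks waiting on the restore

    def request_restore(self, object_id, callback):
        if object_id in self._in_flight:
            # An RPC for this object is already in flight; just queue the callback.
            self._in_flight[object_id].append(callback)
            return
        self._in_flight[object_id] = [callback]
        self._send_rpc(object_id)

    def on_restore_done(self, object_id, status):
        # Fire every coalesced callback once the single restore completes.
        for cb in self._in_flight.pop(object_id, []):
            cb(object_id, status)


sent = []
dedup = RestoreRequestDeduper(send_rpc=sent.append)
results = []
for _ in range(3):
    dedup.request_restore("obj", lambda oid, ok: results.append((oid, ok)))
dedup.on_restore_done("obj", True)
assert sent == ["obj"]      # one RPC despite three concurrent requests
assert len(results) == 3    # every requester still gets its callback
```

A symmetric structure on the callee side (the spilling node) would cap the per-object restore load regardless of how many nodes are pulling.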

@ericl
Contributor

ericl commented Jan 21, 2021 via email

@rkooo567
Contributor Author

rkooo567 commented Jan 21, 2021

@ericl Without this, there could easily be thousands of RPCs per second just to restore a single object, though. Are you saying this much overhead is small enough that it's not worth the added complexity? Or do you think this sort of situation would be rare?

@ericl
Contributor

ericl commented Jan 21, 2021 via email

@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 22, 2021
py_test_module_list(
files = [
"test_placement_group.py",
"test_object_spilling.py",
Contributor Author

This is temporary. I will make a PR to break this test down into two (single node vs. multi node) after this PR is merged.

@ericl
Contributor

ericl commented Jan 22, 2021

//python/ray/tests:test_object_spilling TIMEOUT in 3 out of 3 in 903.3s
Stats over 3 runs: max = 903.3s, min = 903.2s, avg = 903.3s, dev = 0.1s
/private/var/tmp/_bazel_travis/7c557718f3877739c657a427203800b1/execroot/com_github_ray_project_ray/bazel-out/darwin-opt/testlogs/python/ray/tests/test_object_spilling/test.log
/private/var/tmp/_bazel_travis/7c557718f3877739c657a427203800b1/execroot/com_github_ray_project_ray/bazel-out/darwin-opt/testlogs/python/ray/tests/test_object_spilling/test_attempts/attempt_1.log
/private/var/tmp/_bazel_travis/7c557718f3877739c657a427203800b1/execroot/com_github_ray_project_ray/bazel-out/darwin-opt/testlogs/python/ray/tests/test_object_spilling/test_attempts/attempt_2.log

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 22, 2021
@rkooo567
Contributor Author

@ericl Sorry, it should be fixed now.

@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 23, 2021
@ericl ericl merged commit edbb293 into ray-project:master Jan 24, 2021
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
* done.

* done.

* Fix a mistake.

* Ready.

* Fix issues.

* fix.

* Finished the first round of code review.

* formatting.

* In progress.

* Formatting.

* Addressed code review.

* Formatting

* Fix tests.

* fix bugs.

* Skip flaky tests for now.
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021