[autoscaler] rsync to node by richardliaw · Pull Request #11122 · ray-project/ray

richardliaw · 2020-09-29T21:09:31Z

Why are these changes needed?

This PR allows users to rsync to or from a specific node in the cluster. This should work both from a laptop to the cluster and from the head node to a worker node.

This PR enables this functionality by making rsync take an IP address, which is resolved internally to the node ID.

This PR also changes the provider behavior so that a created provider is automatically cached within an interpreter session. This should have no impact on existing autoscaler users (as node providers shouldn't be changed).
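
A minimal sketch of per-session provider caching, assuming a module-level dictionary keyed by the serialized provider config and the cluster name. `ToyProvider`, `get_provider`, and `use_cache` are hypothetical names used for illustration; they are not necessarily what this PR implements.

```python
import json
from typing import Any, Dict, Tuple


class ToyProvider:
    """Hypothetical stand-in for a node provider (not Ray's class)."""

    def __init__(self, provider_config: Dict[str, Any], cluster_name: str):
        self.provider_config = provider_config
        self.cluster_name = cluster_name


# Module-level cache: it lives as long as the interpreter session does.
_provider_cache: Dict[Tuple[str, str], ToyProvider] = {}


def get_provider(provider_config: Dict[str, Any],
                 cluster_name: str,
                 use_cache: bool = True) -> ToyProvider:
    """Return a provider, reusing a previously created one when possible."""
    # sort_keys makes the key independent of dict insertion order.
    key = (json.dumps(provider_config, sort_keys=True), cluster_name)
    if use_cache and key in _provider_cache:
        return _provider_cache[key]
    provider = ToyProvider(provider_config, cluster_name)
    if use_cache:
        _provider_cache[key] = provider
    return provider
```

With this shape, two calls with the same config and cluster name return the same object, so any IP caches held by the provider survive between calls in one session.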

Each NodeProvider now exposes a method to resolve node_ip to node_id. It does so by caching IP addresses and refreshing the cache when an address is not found.

The open question here is whether a machine's IP address can change while it is running, which would result in a stale node ID being returned.
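
The IP-to-node-ID lookup described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `CachedLookupProvider`, `node_ip`, and `list_calls` are hypothetical, and the node inventory is a plain in-memory dictionary rather than a cloud API.

```python
from typing import Dict, List


class CachedLookupProvider:
    """Hypothetical provider that resolves an IP address to a node ID."""

    def __init__(self, nodes: Dict[str, str]):
        # Assumed inventory: node_id -> IP address.
        self._nodes = dict(nodes)
        self._ip_cache: Dict[str, str] = {}
        self.list_calls = 0  # counts (simulated) expensive listing calls

    def non_terminated_nodes(self) -> List[str]:
        self.list_calls += 1
        return list(self._nodes)

    def node_ip(self, node_id: str) -> str:
        return self._nodes[node_id]

    def get_node_id(self, ip_address: str) -> str:
        node_id = self._ip_cache.get(ip_address)
        if node_id is None:
            # Cache miss: rebuild the IP -> node ID map from the live nodes.
            self._ip_cache = {
                self.node_ip(n): n for n in self.non_terminated_nodes()
            }
            node_id = self._ip_cache.get(ip_address)
        if node_id is None:
            raise ValueError(f"No node found with IP {ip_address}")
        return node_id
```

Note that a cache hit is returned without re-validation in this sketch, so if an IP were reassigned to a different node, the old node ID would still be returned until a miss triggers a refresh.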

TODOs:

This is not yet exposed in the CLI.
This is not yet implemented for the Staroid node provider (which seems like it should just subclass KubernetesNodeProvider).

Related issue number

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

python/ray/autoscaler/_private/aws/node_provider.py

ericl

My main comment is about code complexity.

python/ray/autoscaler/_private/commands.py

addressed

ericl · 2020-09-30T21:32:16Z

@wuisawesome can you review this?

wuisawesome

Left a few minor comments, but mostly looks good!

wuisawesome · 2020-09-30T21:57:04Z

python/ray/autoscaler/_private/commands.py

          target: Optional[str],
          override_cluster_name: Optional[str],
          down: bool,
+          ip_address: Optional[str] = None,


document pls

this function is actually getting kinda confusing. if i'm reading it correctly, ip_address and all_nodes=True should be mutually exclusive, and if ip_address is not set, then we default to syncing to the head node only?

I'm not advocating for refactoring this rn, but we should probably write it down at least.

documented, thanks for the catch!
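
The rule discussed in this thread, as the reviewer reads it, could be written down roughly like this. It is a sketch under assumptions: `select_sync_targets` and its parameters other than `ip_address` and `all_nodes` are hypothetical, and the real rsync command does considerably more than pick targets.

```python
from typing import Dict, List, Optional


def select_sync_targets(head_node: str,
                        worker_nodes: List[str],
                        ip_to_node_id: Dict[str, str],
                        ip_address: Optional[str] = None,
                        all_nodes: bool = False) -> List[str]:
    """Pick the node IDs to sync with.

    - ip_address set: sync only with the node that owns that IP.
    - all_nodes=True: sync with the head node and every worker.
    - neither: sync with the head node only.
    ip_address and all_nodes=True are treated as mutually exclusive.
    """
    if ip_address is not None and all_nodes:
        raise ValueError("ip_address and all_nodes are mutually exclusive")
    if ip_address is not None:
        if ip_address not in ip_to_node_id:
            raise ValueError(f"No node found with IP {ip_address}")
        return [ip_to_node_id[ip_address]]
    if all_nodes:
        return [head_node] + list(worker_nodes)
    return [head_node]
```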

wuisawesome · 2020-09-30T22:02:02Z

python/ray/autoscaler/_private/commands.py

-            # and _get_head_node does this too
-            nodes = _get_worker_nodes(config, override_cluster_name)
-
        head_node = _get_head_node(


Can we move this into the else since it's potentially super expensive/could have side effects if Tune uses rsync without the autoscaler?

I originally had it there, but you want to do special processing for the head node case (required for Docker to do some other processing, IIRC).

Tune won't use this command without the autoscaler (since it requires the config file in the first place)

wuisawesome · 2020-09-30T22:12:56Z

python/ray/autoscaler/node_provider.py

+        if not find_node_id():
+            all_nodes = self.non_terminated_nodes({})
+            for node_id in all_nodes:
+                self._external_ip_cache[self.external_ip(node_id)] = node_id


Can we put an if statement here and only do the call that's necessary?

self.external_ip and self.internal_ip sometimes make API calls (for example on AWS).

Are you sure that's preferable?

For our 3 major cloud providers, self.non_terminated_nodes({}) already populates the node cache, which means external_ip will (almost) never be making an API call.

Actually there's a bigger problem here, which is that this breaks on node providers without a concept of an external IP (like k8s and staroid), since the unused external call will throw.

Also, some people have projects with custom node providers which aren't cached, so if it's not too much trouble, it would be nice to support it.

pushed an update.

FYI we actually override this for kubernetes, and custom node providers without external IPs need to provide a custom implementation anyways because use_internal_ip=False will not work.
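
One way to make only the necessary call, as requested in this thread, is to choose the cache and the IP accessor up front based on `use_internal_ip`. The sketch below is an assumption-laden illustration, not the pushed update: `SplitCacheProvider`, the call counters, and the in-memory inventory are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class SplitCacheProvider:
    """Hypothetical provider with separate internal/external IP caches."""

    def __init__(self, nodes: Dict[str, Tuple[str, Optional[str]]]):
        # Assumed inventory: node_id -> (internal_ip, external_ip or None).
        self._nodes = dict(nodes)
        self._internal_ip_cache: Dict[str, str] = {}
        self._external_ip_cache: Dict[str, str] = {}
        self.internal_calls = 0
        self.external_calls = 0

    def non_terminated_nodes(self, tag_filters: Dict[str, str]) -> List[str]:
        return list(self._nodes)

    def internal_ip(self, node_id: str) -> str:
        self.internal_calls += 1
        return self._nodes[node_id][0]

    def external_ip(self, node_id: str) -> str:
        self.external_calls += 1
        ip = self._nodes[node_id][1]
        if ip is None:
            # Mimics providers that have no notion of an external IP.
            raise NotImplementedError("this provider has no external IPs")
        return ip

    def get_node_id(self, ip_address: str,
                    use_internal_ip: bool = False) -> str:
        # Select the matching cache and accessor so the other kind of
        # IP lookup is never invoked.
        if use_internal_ip:
            cache, ip_of = self._internal_ip_cache, self.internal_ip
        else:
            cache, ip_of = self._external_ip_cache, self.external_ip
        if ip_address not in cache:
            for node_id in self.non_terminated_nodes({}):
                cache[ip_of(node_id)] = node_id
        if ip_address not in cache:
            raise ValueError(f"No node found with IP {ip_address}")
        return cache[ip_address]
```

In this sketch, a provider whose `external_ip` raises can still resolve nodes as long as callers pass `use_internal_ip=True`, because the external accessor is never touched on that path.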

richardliaw added 2 commits September 29, 2020 14:03

update-node-providers

db4ff14

update

8b0a1ee

richardliaw marked this pull request as draft September 29, 2020 21:42

richardliaw mentioned this pull request Sep 29, 2020

[tune] docker syncer #11035

Merged

6 tasks

ericl reviewed Sep 29, 2020

python/ray/autoscaler/_private/aws/node_provider.py (outdated)

ericl previously requested changes Sep 29, 2020

richardliaw added 7 commits September 29, 2020 16:09

revert-changes

466be6f

update

dbc6262

minimize

f3d4981

cli-logger

2071f3f

move

7d29454

try-to-add-test

5135846

clearu

b8279ba

richardliaw commented Sep 30, 2020

python/ray/autoscaler/_private/commands.py

richardliaw added 5 commits September 30, 2020 12:54

Merge branch 'master' into node-id-rsync

a717876

clear-cache

c058887

fix

9422510

update

36ed250

fix

6dc6a8e

richardliaw marked this pull request as ready for review September 30, 2020 21:21

richardliaw requested a review from ericl September 30, 2020 21:21

richardliaw assigned ericl Sep 30, 2020

ericl assigned wuisawesome and unassigned ericl Sep 30, 2020

wuisawesome approved these changes Sep 30, 2020

richardliaw added 3 commits September 30, 2020 15:56

fix-docs

8424090

case

7349564

case

c96800d

richardliaw added 5 commits September 30, 2020 16:07

staroid

8b1ad13

Merge branch 'master' into node-id-rsync

5787d2f

fix

f6a37ab

cache-fix

82023a9

fix

88ead20

richardliaw merged commit 0d93b1d into ray-project:master Oct 1, 2020

richardliaw deleted the node-id-rsync branch October 1, 2020 08:42

3 participants