Skip to content

[autoscaler] rsync to node#11122

Merged
richardliaw merged 22 commits intoray-project:masterfrom
richardliaw:node-id-rsync
Oct 1, 2020
Merged

[autoscaler] rsync to node#11122
richardliaw merged 22 commits intoray-project:masterfrom
richardliaw:node-id-rsync

Conversation

@richardliaw
Copy link
Copy Markdown
Contributor

Why are these changes needed?

This PR allows users to rsync to/from a node in the cluster. This should work for both laptop-cluster and headnode-workernode.

This PR enables this functionality by making rsync taking in a IP address, which is resolved internally to the node id.

This PR also includes a change for the provider behavior, so that a created provider is automatically cached within an interpreter session. This should have no impact on existing autoscaler users (as node providers shouldn't be changed).

Each NodeProvider now exposes a method to resolve node_ip to node_id. This is done so by caching ip addresses, and updating them if not found.

The open question here is whether the ip addresses of a machine can change in its runtime, resulting in a faulty node-id being returned.

TODOs:

  • This currently is not exposed to the CLI
  • This currently does not implement for Staroid Node Provider (which seems like it should just subclass KubernetesNodeProvider).

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
@richardliaw richardliaw marked this pull request as draft September 29, 2020 21:42
@richardliaw richardliaw mentioned this pull request Sep 29, 2020
6 tasks
ericl
ericl previously requested changes Sep 29, 2020
Copy link
Copy Markdown
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main comment is about code complexity concerns

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
@richardliaw richardliaw marked this pull request as ready for review September 30, 2020 21:21
@richardliaw richardliaw requested a review from ericl September 30, 2020 21:21
@ericl ericl assigned wuisawesome and unassigned ericl Sep 30, 2020
@ericl
Copy link
Copy Markdown
Contributor

ericl commented Sep 30, 2020

@wuisawesome can you review this?

Copy link
Copy Markdown
Contributor

@wuisawesome wuisawesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few minor comments, but mostly looks good!

target: Optional[str],
override_cluster_name: Optional[str],
down: bool,
ip_address: Optional[str] = None,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

document pls

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function is actually getting kinda confusing. if i'm reading it correctly, ip_address and all_nodes=True should be mutually exclusive and if ip_address is not set, then then we default to syncing to the head node only?

I'm not advocating for refactoring this rn, but we should probably write it down at least.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

documented, thanks for the catch!

# and _get_head_node does this too
nodes = _get_worker_nodes(config, override_cluster_name)

head_node = _get_head_node(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this into the else since it's potentially super expensive/could have side effects if Tune uses rsync without the autoscaler?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally had it there, but you want to do special processing for the head node case (required for docker to do some other processing iir).

Tune won't use this command without the autoscaler (since it requires the config file in the first place)

if not find_node_id():
all_nodes = self.non_terminated_nodes({})
for node_id in all_nodes:
self._external_ip_cache[self.external_ip(node_id)] = node_id
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put an if statement here and only do the call that's necessary?

self.external_ip and self.internal_ip sometimes make API calls (for example on AWS).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure that's preferable?

For our 3 major cloud providers, self.non_terminated_nodes({}) already populates the node cache, which means external_ip will (almost) never be making an API call.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there's a bigger problem here, which is that this breaks on node providers without a concept of external node provider (like k8s and staroid), since the unused external call will throw.

Also, some people have projects with custom node providers which aren't cached, so if it's not too much trouble, it would be nice to support it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushed an update.

FYI we actually override this for kubernetes, and custom node providers without external IPs need to provide a custom implementation anyways because use_internal_ip=False will not work.

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
@richardliaw richardliaw merged commit 0d93b1d into ray-project:master Oct 1, 2020
@richardliaw richardliaw deleted the node-id-rsync branch October 1, 2020 08:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants