[autoscaler] rsync to node#11122
Conversation
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
ericl
left a comment
There was a problem hiding this comment.
Main comment is about code complexity concerns
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
|
@wuisawesome can you review this? |
wuisawesome
left a comment
There was a problem hiding this comment.
Left a few minor comments, but mostly looks good!
| target: Optional[str], | ||
| override_cluster_name: Optional[str], | ||
| down: bool, | ||
| ip_address: Optional[str] = None, |
There was a problem hiding this comment.
this function is actually getting kinda confusing. if i'm reading it correctly, ip_address and all_nodes=True should be mutually exclusive and if ip_address is not set, then then we default to syncing to the head node only?
I'm not advocating for refactoring this rn, but we should probably write it down at least.
There was a problem hiding this comment.
documented, thanks for the catch!
| # and _get_head_node does this too | ||
| nodes = _get_worker_nodes(config, override_cluster_name) | ||
|
|
||
| head_node = _get_head_node( |
There was a problem hiding this comment.
Can we move this into the else since it's potentially super expensive/could have side effects if Tune uses rsync without the autoscaler?
There was a problem hiding this comment.
I originally had it there, but you want to do special processing for the head node case (required for docker to do some other processing iir).
Tune won't use this command without the autoscaler (since it requires the config file in the first place)
| if not find_node_id(): | ||
| all_nodes = self.non_terminated_nodes({}) | ||
| for node_id in all_nodes: | ||
| self._external_ip_cache[self.external_ip(node_id)] = node_id |
There was a problem hiding this comment.
Can we put an if statement here and only do the call that's necessary?
self.external_ip and self.internal_ip sometimes make API calls (for example on AWS).
There was a problem hiding this comment.
Are you sure that's preferable?
For our 3 major cloud providers, self.non_terminated_nodes({}) already populates the node cache, which means external_ip will (almost) never be making an API call.
There was a problem hiding this comment.
Actually there's a bigger problem here, which is that this breaks on node providers without a concept of external node provider (like k8s and staroid), since the unused external call will throw.
Also, some people have projects with custom node providers which aren't cached, so if it's not too much trouble, it would be nice to support it.
There was a problem hiding this comment.
pushed an update.
FYI we actually override this for kubernetes, and custom node providers without external IPs need to provide a custom implementation anyways because use_internal_ip=False will not work.
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Why are these changes needed?
This PR allows users to rsync to/from a node in the cluster. This should work for both laptop-cluster and headnode-workernode.
This PR enables this functionality by making rsync taking in a IP address, which is resolved internally to the node id.
This PR also includes a change for the provider behavior, so that a created provider is automatically cached within an interpreter session. This should have no impact on existing autoscaler users (as node providers shouldn't be changed).
Each NodeProvider now exposes a method to resolve node_ip to node_id. This is done so by caching ip addresses, and updating them if not found.
The open question here is whether the ip addresses of a machine can change in its runtime, resulting in a faulty node-id being returned.
TODOs:
Related issue number
Checks
scripts/format.shto lint the changes in this PR.