[tune] Sync logs from workers and improve tensorboard reporting #1567
richardliaw merged 13 commits into ray-project:master
Conversation
Test PASSed.
Test FAILed.
python/ray/tune/log_sync.py (outdated diff)
```python
def _refresh_worker_ip(self):
    if self.worker_ip_fut:
        try:
            self.worker_ip = ray.get(self.worker_ip_fut)
```
todo: this can block forever if the trial is dead
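A minimal sketch of the bounded-wait fix this todo suggests, modeled with `concurrent.futures` rather than Ray's API (the helper name and timeout value are illustrative, not the PR's code):

```python
from concurrent.futures import Future, TimeoutError

def refresh_worker_ip(fut, timeout_s=10.0):
    """Resolve a pending worker-IP future without blocking forever."""
    # Bound the wait so a dead trial cannot hang the driver; with Ray this
    # would correspond to waiting on the object with a timeout instead of
    # an unbounded ray.get.
    try:
        return fut.result(timeout=timeout_s)
    except TimeoutError:
        return None  # trial likely dead; caller keeps the stale IP or skips
```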
Test FAILed.
@richardliaw ready for review (though I have yet to test on a cluster)
Tested; seems to work fine.
Test FAILed.
jenkins retest this please
Test PASSed.
richardliaw left a comment:
The dependence on autoscaler functionality rather than particular abstractions seems brittle.
Also, separately (not in this PR), it would make sense to have some sort of command-building tool that takes care of things like Docker environments.
```python
ssh_key = get_ssh_key()
ssh_user = get_ssh_user()
if ssh_key is None or ssh_user is None:
    print(
```
why not allow arbitrary clusters to use this?
cluster_info.py will have to support that. I'm not sure how we can infer the SSH key unless it's set up by Ray.
```python
print("Error: log sync requires rsync to be installed.")
return
worker_to_local_sync_cmd = (
    ("""rsync -avz -e "ssh -i '{}' -o ConnectTimeout=120s """
```
this also doesn't work with Docker btw
Yeah, it would be nice to have a more generalized way of running remote commands. Probably when Kubernetes support lands we can expose a common run_cmd interface from the autoscaler.
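As an illustration of the kind of command building being discussed, here is a hypothetical helper that assembles a worker-to-local rsync command; the function name, flags, and quoting are a sketch, not the PR's actual code:

```python
import shlex

def build_worker_sync_cmd(ssh_key, ssh_user, worker_ip, remote_dir, local_dir):
    """Assemble an rsync-over-ssh command that pulls logs from a worker."""
    # Quote the key path and the whole -e payload so spaces survive the shell.
    ssh_opts = "ssh -i {} -o ConnectTimeout=120s".format(shlex.quote(ssh_key))
    return "rsync -avz -e {} {}@{}:{}/ {}/".format(
        shlex.quote(ssh_opts), ssh_user, worker_ip,
        shlex.quote(remote_dir), shlex.quote(local_dir))
```

Centralizing command assembly in one helper is what would later make it possible to swap the ssh transport for a Docker- or Kubernetes-aware one without touching call sites.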
```python
values = []
for attr in result.keys():
    value = result[attr]
    if value is not None:
```
Less verbose to just do `for attr, value in result.items()`.
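For illustration, the suggested form with a made-up result dict:

```python
# Hypothetical result payload; the field names are illustrative only.
result = {"mean_accuracy": 0.92, "checkpoint_path": None,
          "training_iteration": 10}

values = []
for attr, value in result.items():  # no separate result[attr] lookup needed
    if value is not None:
        values.append(value)
```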
```python
for attr in result.keys():
    value = result[attr]
    if value is not None:
        if type(value) in [int, float]:
```
Perhaps also check for `long`.
I don't think that's a real type?
Eh, it's only for py2, so this is fine.
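A hedged sketch of a numeric check that sidesteps the `long` question entirely: `numbers.Number` covers `int`, `float`, and Python 2's `long`, while `bool` is excluded explicitly since it subclasses `int`. The helper name is made up, not the PR's code:

```python
import numbers

def is_loggable_scalar(value):
    """True if value is a plain numeric scalar worth reporting."""
    # bool is a subclass of int, so exclude it explicitly if booleans
    # should not be reported as scalars.
    return isinstance(value, numbers.Number) and not isinstance(value, bool)
```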
```python
        self.worker_ip = None
        print("Created LogSyncer for {} -> {}".format(local_dir, remote_dir))

    def set_worker_ip(self, worker_ip):
```
It seems like `sync_now(force=True)` is called when forced to flush and when the trial stops. In both cases, why do you only care about the last worker IP?
Separately, it might also make sense to split this into functions for local -> remote sync and local <- worker sync.
The worker IP can change over time if the trial is restarted. Unfortunately, Ray doesn't have an API to get the worker IP synchronously, so we piggyback on the IP reported by the last result.
Note that syncing is per-trial, so we actually track the IP per trial.
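The per-trial tracking described above can be sketched roughly like this; the class, method, and `node_ip` field names are hypothetical, not the PR's code:

```python
class TrialIPTracker:
    """Remember the last worker IP reported per trial."""

    def __init__(self):
        self._last_ip = {}  # trial id -> IP piggybacked on the last result

    def on_result(self, trial_id, result):
        ip = result.get("node_ip")  # assumed field carrying the reporter's IP
        if ip is not None:
            self._last_ip[trial_id] = ip  # a restart may move the trial

    def get_ip(self, trial_id):
        return self._last_ip.get(trial_id)
```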
```python
def get_ssh_key():
    """Returns ssh key for connecting to cluster workers."""

    path = os.path.expanduser("~/ray_bootstrap_key.pem")
```
hardcoding this seems brittle
I added a TODO that this only supports this type of cluster for now.
Test FAILed.
Test PASSed.
Test failure (Valgrind and custom resources) unrelated.
What do these changes do?
…info object and have them show up in TensorBoard.
Related issue number
#1546