Description
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
- Ray installed from (source or binary): binary
- Ray version: 0.6.4
- Python version: 3.6.7
Describe the problem
In `_LogSyncer`, `sync_to_worker_if_possible` and `sync_now` use `rsync` to transfer logs between the local node and the worker. This breaks when using Docker, since:
- If the local node is in Docker, it will typically have the `root` username, and so this is what `get_ssh_user` will return. But we cannot typically log in to the worker node as `root`.
- The `local_dir` on the worker is inside the Docker container, and may not even be visible outside. If it is bound, then it will typically be at a different path.
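To make the failure mode concrete, here is a minimal sketch of the kind of command that gets built (the variable values and the `getpass` lookup are illustrative stand-ins for what `get_ssh_user` and the log-sync path resolution effectively do; they are not the actual Ray source):

```python
import getpass

# Inside a Docker container the process usually runs as root, so a
# "current user" lookup returns "root" -- not a user we can typically
# SSH to the worker node as.
ssh_user = getpass.getuser()

# local_dir is a path *inside* the container; even if it is bind-mounted,
# the host-side path usually differs, so this rsync source is wrong too.
local_dir = "/tmp/ray/session_latest/logs"  # hypothetical path
worker_ip = "10.0.0.2"  # hypothetical worker address

rsync_cmd = [
    "rsync", "-avz",
    "{}@{}:{}/".format(ssh_user, worker_ip, local_dir),
    local_dir,
]
print(" ".join(rsync_cmd))
```

Both the `ssh_user` and the remote path in this command are wrong whenever either side runs in Docker, which is exactly the breakage described above.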
An unrelated issue: if `self.sync_func` is non-None, it will get executed before the `worker_to_local_sync_cmd`, which I think is wrong.
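For clarity, the ordering I'd expect is sketched below (class and method names follow the issue text, not Ray's actual source; the `calls` list is just there to make the order observable):

```python
class LogSyncerSketch:
    """Hypothetical minimal sketch of the ordering fix described above."""

    def __init__(self, sync_func=None):
        self.sync_func = sync_func
        self.calls = []  # records call order, for illustration only

    def worker_to_local_sync_cmd(self):
        # Stand-in for pulling logs down from the worker.
        self.calls.append("worker_to_local")

    def sync_now(self):
        # Pull logs down from the worker *first*...
        self.worker_to_local_sync_cmd()
        # ...and only then run the user-supplied sync function, so it
        # sees the freshly synced logs.
        if self.sync_func is not None:
            self.sync_func()
            self.calls.append("sync_func")
```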
I'd be happy to take a stab at a PR, but I'd appreciate some suggestions on the right way of fixing this, as it's been a while since I've looked at Ray internals. This also feels like a problem that is likely to recur with slight variations; e.g. this bug is similar to #4183
Perhaps we can make the autoscaler provide an abstract sync interface that Tune and other consumers can use. This could map to `rsync` in the standard case, and something more complex in the Docker case (e.g. `docker cp` followed by `rsync`)? `ray.autoscaler.commands.rsync` is already something along these lines -- would this be an appropriate place to modify?
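Roughly, the interface I have in mind could look like this (all class names, the `staging_dir` default, and the `docker cp`-then-`rsync` flow are assumptions for the sake of discussion, not existing Ray APIs):

```python
from abc import ABC, abstractmethod
import subprocess


class SyncClient(ABC):
    """Hypothetical interface the autoscaler could expose to Tune."""

    @abstractmethod
    def sync_down(self, remote_dir, local_dir):
        """Copy remote_dir on the worker to local_dir on this node."""


class RsyncClient(SyncClient):
    """Standard case: plain rsync over SSH."""

    def __init__(self, ssh_user, worker_ip):
        self.ssh_user = ssh_user
        self.worker_ip = worker_ip

    def sync_down(self, remote_dir, local_dir):
        subprocess.check_call([
            "rsync", "-avz",
            "{}@{}:{}/".format(self.ssh_user, self.worker_ip, remote_dir),
            local_dir,
        ])


class DockerRsyncClient(RsyncClient):
    """Docker case: docker cp out of the container, then rsync as usual."""

    def __init__(self, ssh_user, worker_ip, container,
                 staging_dir="/tmp/ray_sync"):
        super().__init__(ssh_user, worker_ip)
        self.container = container
        self.staging_dir = staging_dir

    def sync_down(self, remote_dir, local_dir):
        # Copy logs out of the container to a host-visible staging path
        # on the worker, then rsync the staging path down as usual.
        stage_cmd = "docker cp {}:{} {}".format(
            self.container, remote_dir, self.staging_dir)
        subprocess.check_call(
            ["ssh", "{}@{}".format(self.ssh_user, self.worker_ip),
             stage_cmd])
        super().sync_down(self.staging_dir, local_dir)
```

Tune would then only ever talk to `SyncClient`, and the autoscaler would hand it the right implementation based on the cluster config.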
A more hacky solution would be to make `get_ssh_user` return the right value and make the Docker volume binding line up, so that we can just ignore the difference between Docker and non-Docker instances.
Source code / logs
An MWE for this is hard to provide, but if the above description is insufficient I can try to come up with one.