Support PyTorch Lightning 1.6#163
Conversation
ray-project#161 the training results can be pulled to the main process ray-project#162
    precision=16 if use_gpu else 32,
    callbacks=[CUDACallback()] if use_gpu else [],
    plugins=plugin,
    strategy=strategy,
JiahaoYao
left a comment
ready to review again
Local test on GPU passed.
amogkam
left a comment
Thanks @JiahaoYao, overall looks good!
Left some comments on how we can have better maintainability in the future, but we can do this in a follow up!
    if trainer is None:
        raise NotImplementedError(
            "Ray launcher does not support trainer is None!")
what exactly does this error message mean to the user?
Under what situations would users run into this error?
Makes sense. In most cases it should not be None, because the strategy is always passed to the trainer.
    **kwargs: Any):
    """Run the function on the workers and collect the results.

    `executor.run_remote` is used to launch multiple ray remote tasks
    to train the model in a distributed fashion using the horovod backend.
Make first sentence of docstring 1 line please!
sounds good to me!
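For reference, a one-line first sentence as requested could look like the following hypothetical sketch (the function body here is a stand-in, not the actual `ray_lightning` implementation):

```python
def run_function_on_workers(function, *args, **kwargs):
    """Run the function on the workers and collect the results.

    Details follow the one-line summary: in the PR this launches
    multiple Ray remote tasks to train the model in a distributed
    fashion with the Horovod backend; here the body is a stand-in.
    """
    return function(*args, **kwargs)

print(run_function_on_workers(lambda x: x * 2, 5))  # → 10
```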
    model = trainer.model
    model_ref = ray.put(model)
    trainer.model = None
    new_args = tuple([None] + list(args[1:]))
will the model always be at the 0th position in the args?
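The pattern under discussion can be sketched in isolation. This is a hypothetical minimal version that assumes the model does sit at position 0 of `args` (the reviewer's concern), and uses a plain local handle where the PR uses `ray.put`:

```python
# Hypothetical sketch: detach the model from the trainer before the args
# are shipped to remote tasks, assuming the model occupies args[0].

class Trainer:
    def __init__(self, model):
        self.model = model

def strip_model_from_args(trainer, args):
    """Replace the model (assumed at args[0]) with None for serialization."""
    model = trainer.model          # local handle; the PR uses ray.put here
    trainer.model = None           # avoid serializing the model twice
    new_args = tuple([None] + list(args[1:]))
    return model, new_args

t = Trainer(model="net")
model, new_args = strip_model_from_args(t, ("net", 1, 2))
print(model, new_args)             # → net (None, 1, 2)
```

If the model could appear at another position, the hard-coded `args[1:]` slice would silently drop the wrong argument, which is why the reviewer's question matters.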
    self._strategy.set_remote(True)
    # `function` is a trainer's class method
    # in the ray remote tasks, its object `trainer` will also
Should it be "its bound instance trainer" instead of "its object"?
    # does not fulfill our purpose. Ray remote tasks will
    # create another copy of trainer so that
    # `function.__self__ != trainer`, in which the side effect only
    # happens to `function.__self__` when running
Thanks for leaving this comment describing the problem!
How is this problem being resolved currently?
got it, filled in
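The problem described in the comment can be demonstrated with a small sketch. This hypothetical example uses `copy.deepcopy` to stand in for the serialization that Ray performs when shipping a bound method to a remote task:

```python
import copy

class Trainer:
    def __init__(self):
        self.result = None

    def fit(self):
        # side effect: mutates the instance the method is bound to
        self.result = "done"

trainer = Trainer()

# Simulate what happens when a bound method is shipped to a Ray remote
# task: serialization produces a *copy* of the trainer, so the method
# ends up bound to that copy, not the original.
remote_fit = copy.deepcopy(trainer).fit
remote_fit()

print(remote_fit.__self__ is trainer)  # False: bound to the copy
print(trainer.result)                  # None: original is untouched
```

This is why the results have to be shipped back explicitly to the driver-side trainer rather than relying on in-place mutation.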
    @@ -0,0 +1,251 @@
    from typing import Callable, Any, Optional
All the comments in this file apply to ray_launcher as well
sounds good to me
    # `RayPlugin.execute_remote`.
    return super().execute_remote(
        model=self._model, global_rank=global_rank, queue=queue)
    class RayShardedStrategy(RayStrategy, DDPSpawnShardedStrategy):
let's make sure this class name change is reflected in the examples!
Yes, the name will be changed due to
    strategy_name = "ddp_sharded_ray"

    from ray_lightning.launchers.utils import _RayOutput, get_executable_cls
    class RayHorovodLauncher(_Launcher):
Seems like there is a lot of overlap between this launcher and RayLauncher. Would we be able to consolidate a lot of this overlap into a common superclass?
Actually, I like this idea!
Also, let's make sure to follow up on the previous review on adding comments for the following.
Can we do this for both the Launchers and the Strategies?
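The consolidation suggested above could take a shape like this. Everything here is illustrative: the class and method names (`_BaseRayLauncher`, `_run_remote`, `_collect`) are hypothetical, not the actual `ray_lightning` API, and the "remote" call is a local stand-in:

```python
from abc import ABC, abstractmethod

class _BaseRayLauncher(ABC):
    """Shared launch/collect skeleton for the two launchers (sketch)."""

    def launch(self, function, *args, **kwargs):
        futures = self._run_remote(function, *args, **kwargs)
        return self._collect(futures)

    @abstractmethod
    def _run_remote(self, function, *args, **kwargs):
        """Backend-specific: schedule the function on remote workers."""

    def _collect(self, futures):
        # shared result-gathering logic (ray.get in the real code)
        return list(futures)

class RayLauncher(_BaseRayLauncher):
    def _run_remote(self, function, *args, **kwargs):
        return [function(*args, **kwargs)]  # stand-in for Ray remote tasks

print(RayLauncher().launch(lambda x: x + 1, 41))  # → [42]
```

A `RayHorovodLauncher` subclass would then only override `_run_remote` with the Horovod-specific scheduling, keeping the result collection in one place.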


ray_ddp√