This repository was archived by the owner on Nov 3, 2023. It is now read-only.

Support PyTorch Lightning 1.6#163

Merged
amogkam merged 113 commits into ray-project:main from JiahaoYao:main on Jul 28, 2022
Conversation

@JiahaoYao (Contributor):

  • ray_ddp

@JiahaoYao (Contributor, Author):

  • ddp_shard test passed √

    precision=16 if use_gpu else 32,
    callbacks=[CUDACallback()] if use_gpu else [],
    plugins=plugin,
    strategiesies=strategygygy,
Reviewer:
Seems to be a typo.

@JiahaoYao (Contributor, Author) left a comment:

ready to review again

@JiahaoYao (Contributor, Author):

local test on gpu passed

@JiahaoYao (Contributor, Author):

[screenshots of the passing test output attached]

@amogkam changed the title from "bump pytorch lightning to 1.16" to "Support PyTorch Lightning 1.6" on Jul 28, 2022
@amogkam (Collaborator) left a comment:

Thanks @JiahaoYao, overall looks good!

Left some comments on how we can improve maintainability in the future, but we can do this in a follow-up!


    if trainer is None:
        raise NotImplementedError(
            "Ray launcher does not support trainer is None!")
@amogkam (Collaborator):

What exactly does this error message mean to the user?

Under what situations would users run into this error?

@JiahaoYao (Contributor, Author):

Makes sense. In most cases it should not be None, because the strategy is always passed to the trainer.

    **kwargs: Any):
        """Run the function on the workers and collect the results.
        `executor.run_remote` is used to launch multiple ray remote tasks
        to distributed training the model using the horovod backend.
@amogkam (Collaborator):

Make the first sentence of the docstring one line, please!

@JiahaoYao (Contributor, Author):

sounds good to me!
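The requested style keeps the docstring summary on a single line, followed by a blank line and the details, per PEP 257. An illustrative rewrite of the docstring under review (the grammar of the second sentence is smoothed as well):

```python
def run_remote(*args, **kwargs):
    """Run the function on the workers and collect the results.

    `executor.run_remote` launches multiple Ray remote tasks that
    train the model in a distributed fashion using the Horovod backend.
    """
```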

    model = trainer.model
    model_ref = ray.put(model)
    trainer.model = None
    new_args = tuple([None] + list(args[1:]))
@amogkam (Collaborator):

Will the model always be at the 0th position in the args?

@JiahaoYao (Contributor, Author):

Yes.

    self._strategy.set_remote(True)

    # `function` is a trainer's class method
    # in the ray remote tasks, its object `trainer` will also
@amogkam (Collaborator):

Should it be "its bound instance trainer" instead of "its object"?

@JiahaoYao (Contributor, Author):

Yes

    # does not fulfill our purpose. Ray remote tasks will
    # create another copy of trainer so that
    # `function.__self__ != trainer`, in which the side effect only
    # happens to `function.__self__` when running
@amogkam (Collaborator):

Thanks for leaving this comment describing the problem!

How is this problem being resolved currently?

@JiahaoYao (Contributor, Author):

Got it, filled in.
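The comment under discussion describes a subtle pitfall: when a bound trainer method is shipped to a worker, deserialization copies the instance behind `__self__`, so any side effects land on the copy rather than on the driver's trainer. A standalone illustration using `copy.deepcopy` as a stand-in for Ray's serialization round trip:

```python
import copy

class Trainer:
    def __init__(self):
        self.state = "initialized"

    def fit(self):
        self.state = "fitted"  # side effect on self

trainer = Trainer()
function = trainer.fit

# Simulate shipping the bound method to a remote task: the instance
# behind `function.__self__` is copied along with the method.
remote_function = copy.deepcopy(function)
remote_function()

# The side effect happened on the copy, not on the driver's trainer:
# trainer.state is still "initialized".
```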

    @@ -0,0 +1,251 @@
    from typing import Callable, Any, Optional
@amogkam (Collaborator):

All the comments in this file apply to ray_launcher as well.

@JiahaoYao (Contributor, Author):

sounds good to me

    # `RayPlugin.execute_remote`.
    return super().execute_remote(
        model=self._model, global_rank=global_rank, queue=queue)

    class RayShardedStrategy(RayStrategy, DDPSpawnShardedStrategy):
@amogkam (Collaborator):

Let's make sure this class name change is reflected in the examples!

@JiahaoYao (Contributor, Author):

Yes, the name will change due to:

    strategy_name = "ddp_sharded_ray"

    from ray_lightning.launchers.utils import _RayOutput, get_executable_cls


    class RayHorovodLauncher(_Launcher):
@amogkam (Collaborator):

Seems like there is a lot of overlap between this launcher and RayLauncher. Would we be able to consolidate a lot of this overlap into a common superclass?

@JiahaoYao (Contributor, Author):

Actually, I like this idea!
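One way to consolidate the overlap, as the reviewer suggests, is to hoist shared driver-side orchestration into a base class and leave only backend-specific scheduling to subclasses. A hypothetical sketch of that layout (class and method names are illustrative, not the repository's actual hierarchy; remote execution is simplified to local calls):

```python
class RayLauncherBase:
    """Shared driver-side logic for Ray-based launchers."""

    def launch(self, function, *args):
        payload = self.prepare(function, args)  # shared pre-processing
        results = self.run_remote(payload)      # backend-specific scheduling
        return self.collect(results)            # shared post-processing

    def prepare(self, function, args):
        return {"function": function, "args": args}

    def collect(self, results):
        return results

    def run_remote(self, payload):
        raise NotImplementedError

class RayLauncher(RayLauncherBase):
    def run_remote(self, payload):
        # Plain Ray tasks, one per worker (two workers, simplified here).
        return [payload["function"](*payload["args"]) for _ in range(2)]

class RayHorovodLauncher(RayLauncherBase):
    def run_remote(self, payload):
        # The Horovod executor would schedule these; single call here.
        return [payload["function"](*payload["args"])]
```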

@amogkam (Collaborator) commented on Jul 28, 2022:

Also, let's make sure to follow up on the previous review by adding comments for the following:

  1. Which methods are overridden from PyTorch Lightning vs. which methods are brand new.
  2. Which methods run remotely vs. which run on the driver.

Can we do this for both the Launchers and the Strategies?
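The requested annotations could be applied uniformly with a small marker, for example a decorator that records where each method runs and whether it overrides PyTorch Lightning. This is a hypothetical convention sketch (the `annotate` helper and class are illustrative, not project code):

```python
def annotate(runs_on, overrides=False):
    """Tag a method with where it executes and whether it overrides PL."""
    def wrap(fn):
        fn._runs_on = runs_on        # "driver" or "remote"
        fn._overrides_pl = overrides
        return fn
    return wrap

class RayStrategySketch:
    @annotate(runs_on="driver", overrides=True)
    def setup(self):
        """Overridden from PyTorch Lightning; runs on the driver."""

    @annotate(runs_on="remote")
    def execute_remote(self):
        """New method; runs inside a Ray remote task."""
```

The tags are then machine-readable, so a unit test (or docs generator) can enforce that every public method on the launchers and strategies carries one.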

@amogkam amogkam merged commit 299a776 into ray-project:main Jul 28, 2022
JiahaoYao added a commit to JiahaoYao/ray_lightning that referenced this pull request Aug 11, 2022
amogkam added a commit that referenced this pull request on Aug 16, 2022 (#195)

* adding the change (based on #163)

* Update ray_lightning/launchers/ray_horovod_launcher.py

* Update ray_lightning/launchers/ray_launcher.py

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>

Labels: none yet
Projects: none yet
4 participants