This repository was archived by the owner on Nov 3, 2023. It is now read-only.

Support PyTorch Lightning 1.6#163

Merged
amogkam merged 113 commits into ray-project:main from JiahaoYao:main on Jul 28, 2022
Conversation

@JiahaoYao (Contributor):

  • ray_ddp

@JiahaoYao (Contributor, Author):

  • ddp_shard test passed √

    precision=16 if use_gpu else 32,
    callbacks=[CUDACallback()] if use_gpu else [],
    plugins=plugin,
    strategiesies=strategygygy,
Reviewer:
Seems to be a typo.

@JiahaoYao (Contributor, Author) left a comment:

ready to review again

@JiahaoYao (Contributor, Author):

local test on gpu passed

@JiahaoYao (Contributor, Author):

[screenshots of the passing test output attached]

@amogkam changed the title from "bump pytorch lightning to 1.16" to "Support PyTorch Lightning 1.6" on Jul 28, 2022
@amogkam (Collaborator) left a comment:

Thanks @JiahaoYao, overall looks good!

Left some comments on how we can improve maintainability in the future, but we can do this in a follow-up!


    if trainer is None:
        raise NotImplementedError(
            "Ray launcher does not support trainer is None!")
@amogkam (Collaborator):

What exactly does this error message mean to the user?

Under what situations would users run into this error?

@JiahaoYao (Contributor, Author):

Makes sense. In most cases it should not be None, because the strategy is always passed to the trainer.

    **kwargs: Any):
        """Run the function on the workers and collect the results.
        `executor.run_remote` is used to launch multiple ray remote tasks
        to distributed training the model using the horovod backend.
@amogkam (Collaborator):

Make the first sentence of the docstring one line, please!

@JiahaoYao (Contributor, Author):

sounds good to me!
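The requested style keeps the docstring summary on a single line, followed by a blank line and the details, per PEP 257. An illustrative rewrite of the docstring under review (the grammar of the second sentence is smoothed as well):

```python
def run_remote(*args, **kwargs):
    """Run the function on the workers and collect the results.

    `executor.run_remote` launches multiple Ray remote tasks that
    train the model in a distributed fashion using the Horovod backend.
    """
```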

    model = trainer.model
    model_ref = ray.put(model)
    trainer.model = None
    new_args = tuple([None] + list(args[1:]))
@amogkam (Collaborator):

Will the model always be at the 0th position in the args?

@JiahaoYao (Contributor, Author):

Yes.

    self._strategy.set_remote(True)

    # `function` is a trainer's class method
    # in the ray remote tasks, its object `trainer` will also
@amogkam (Collaborator):

Should it be "its bound instance trainer" instead of "its object"?

@JiahaoYao (Contributor, Author):

Yes

    # does not fulfill our purpose. Ray remote tasks will
    # create another copy of trainer so that
    # `function.__self__ != trainer`, in which the side effect only
    # happens to `function.__self__` when running
@amogkam (Collaborator):

Thanks for leaving this comment describing the problem!

How is this problem being resolved currently?

@JiahaoYao (Contributor, Author):

Got it, filled in.
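The comment under discussion describes a subtle pitfall: when a bound trainer method is shipped to a worker, deserialization copies the instance behind `__self__`, so any side effects land on the copy rather than on the driver's trainer. A standalone illustration using `copy.deepcopy` as a stand-in for Ray's serialization round trip:

```python
import copy

class Trainer:
    def __init__(self):
        self.state = "initialized"

    def fit(self):
        self.state = "fitted"  # side effect on self

trainer = Trainer()
function = trainer.fit

# Simulate shipping the bound method to a remote task: the instance
# behind `function.__self__` is copied along with the method.
remote_function = copy.deepcopy(function)
remote_function()

# The side effect happened on the copy, not on the driver's trainer:
# trainer.state is still "initialized".
```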

    @@ -0,0 +1,251 @@
    from typing import Callable, Any, Optional
@amogkam (Collaborator):

All the comments in this file apply to ray_launcher as well.

@JiahaoYao (Contributor, Author):

sounds good to me

    # `RayPlugin.execute_remote`.
    return super().execute_remote(
        model=self._model, global_rank=global_rank, queue=queue)

    class RayShardedStrategy(RayStrategy, DDPSpawnShardedStrategy):
@amogkam (Collaborator):

Let's make sure this class name change is reflected in the examples!

@JiahaoYao (Contributor, Author):

Yes, the name will change due to:

    strategy_name = "ddp_sharded_ray"

    from ray_lightning.launchers.utils import _RayOutput, get_executable_cls


    class RayHorovodLauncher(_Launcher):
@amogkam (Collaborator):

Seems like there is a lot of overlap between this launcher and RayLauncher. Would we be able to consolidate a lot of this overlap into a common superclass?

@JiahaoYao (Contributor, Author):

Actually, I like this idea!
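One way to consolidate the overlap, as the reviewer suggests, is to hoist shared driver-side orchestration into a base class and leave only backend-specific scheduling to subclasses. A hypothetical sketch of that layout (class and method names are illustrative, not the repository's actual hierarchy; remote execution is simplified to local calls):

```python
class RayLauncherBase:
    """Shared driver-side logic for Ray-based launchers."""

    def launch(self, function, *args):
        payload = self.prepare(function, args)  # shared pre-processing
        results = self.run_remote(payload)      # backend-specific scheduling
        return self.collect(results)            # shared post-processing

    def prepare(self, function, args):
        return {"function": function, "args": args}

    def collect(self, results):
        return results

    def run_remote(self, payload):
        raise NotImplementedError

class RayLauncher(RayLauncherBase):
    def run_remote(self, payload):
        # Plain Ray tasks, one per worker (two workers, simplified here).
        return [payload["function"](*payload["args"]) for _ in range(2)]

class RayHorovodLauncher(RayLauncherBase):
    def run_remote(self, payload):
        # The Horovod executor would schedule these; single call here.
        return [payload["function"](*payload["args"])]
```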

@amogkam (Collaborator) commented on Jul 28, 2022:

Also, let's make sure to follow up on the previous review by adding comments for the following:

  1. Which methods are overridden from PyTorch Lightning vs. which methods are brand new.
  2. Which methods run remotely vs. which run on the driver.

Can we do this for both the Launchers and the Strategies?
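The requested annotations could be applied uniformly with a small marker, for example a decorator that records where each method runs and whether it overrides PyTorch Lightning. This is a hypothetical convention sketch (the `annotate` helper and class are illustrative, not project code):

```python
def annotate(runs_on, overrides=False):
    """Tag a method with where it executes and whether it overrides PL."""
    def wrap(fn):
        fn._runs_on = runs_on        # "driver" or "remote"
        fn._overrides_pl = overrides
        return fn
    return wrap

class RayStrategySketch:
    @annotate(runs_on="driver", overrides=True)
    def setup(self):
        """Overridden from PyTorch Lightning; runs on the driver."""

    @annotate(runs_on="remote")
    def execute_remote(self):
        """New method; runs inside a Ray remote task."""
```

The tags are then machine-readable, so a unit test (or docs generator) can enforce that every public method on the launchers and strategies carries one.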

@amogkam amogkam merged commit 299a776 into ray-project:main Jul 28, 2022
JiahaoYao added a commit to JiahaoYao/ray_lightning that referenced this pull request Aug 11, 2022
amogkam added a commit that referenced this pull request on Aug 16, 2022 (#195)

* adding the change (based on #163)

* Update ray_lightning/launchers/ray_horovod_launcher.py

* Update ray_lightning/launchers/ray_launcher.py

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>

Labels: none yet
Projects: none yet
4 participants