This repository was archived by the owner on Nov 3, 2023. It is now read-only.

Ptl 1.5 support #115

Merged
amogkam merged 12 commits into ray-project:main from amogkam:ptl-1.5-support
Jan 20, 2022

Conversation

@amogkam (Collaborator) commented Jan 6, 2022

Various changes to support PyTorch Lightning 1.5.

The major change is the removal of RayClusterEnvironment. In PTL 1.5, plugins lean on the cluster environment more heavily to determine the node rank, local rank, global rank, and world size. With the RayPlugin, however, these ranks are not known until after the actors are launched and training begins, so the cluster environment ended up adding more complexity than it was worth. This PR removes it and handles rank bookkeeping internally in the plugin.

Other changes include supporting various renames, refactorings, and signature changes in DDPSpawnPlugin and DDPSpawnShardedPlugin.

Closes #108, #101
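A rough sketch of the approach described above. The class and method names here are hypothetical, for illustration only, and are not the actual RayPlugin API: the plugin records ranks itself once workers report in, rather than relying on a cluster environment that must know them up front.

```python
# Hedged sketch of the idea described in this PR; RankTrackingPlugin and
# on_worker_started are hypothetical names, not the real RayPlugin API.

class RankTrackingPlugin:
    def __init__(self, num_workers: int):
        # Ranks are unknown at construction time; start with driver-side
        # defaults rather than asking a cluster environment for them.
        self.world_size = num_workers
        self.global_rank = 0
        self.local_rank = 0
        self.node_rank = 0
        self._is_remote = False  # becomes True inside a launched worker

    def on_worker_started(self, global_rank: int, local_rank: int,
                          node_rank: int) -> None:
        # Called from each worker after the actors are launched, which is
        # the first point at which the real ranks are known.
        self._is_remote = True
        self.global_rank = global_rank
        self.local_rank = local_rank
        self.node_rank = node_rank
```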

dependabot[bot] and others added 5 commits November 22, 2021 04:03
Bumps [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) from 1.4.7 to 1.5.2.
- [Release notes](https://github.com/PyTorchLightning/pytorch-lightning/releases)
- [Changelog](https://github.com/PyTorchLightning/pytorch-lightning/blob/master/CHANGELOG.md)
- [Commits](Lightning-AI/pytorch-lightning@1.4.7...1.5.2)

---
updated-dependencies:
- dependency-name: pytorch-lightning
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
@amogkam (Collaborator, Author) commented Jan 6, 2022

Blocked on #71

@matthewdeng (Contributor) commented:

Can you update the description with some more context about this? Major changes, things to pay attention to while reviewing, etc.

@amogkam (Collaborator, Author) commented Jan 6, 2022

@matthewdeng updated the description. The main thing to review is a quick check that everything looks roughly correct and that the comments make sense.

For a more thorough review, you probably want to first read through the source code of DDPSpawnPlugin in pytorch-lightning.

@amogkam (Collaborator, Author) left a comment:

Thanks for the review! Can you take another look please?

@amogkam amogkam requested a review from matthewdeng January 19, 2022 04:39


@pytest.fixture
def ray_start_cluster_2_node_2_cpu():
(Contributor) commented:

Not used?

@amogkam (Collaborator, Author) replied:

Ah, that's right: I removed the test that was using this, but I'll leave the fixture in for future use.


def set_world_ranks(self, process_idx: int = 0):
"""Set the appropriate rank attributes for the trainer."""
if self._is_remote:
(Contributor) commented:

What happens if this is False? Should an error be raised?

@amogkam (Collaborator, Author) replied:

It should be a no-op; I'll add a comment about this. The reason this is needed is that DDPSpawnPlugin calls set_world_ranks during __init__, on the assumption that in a multi-node setup the user has already set the correct address and port environment variables before running the script. In our case, however, we only want this method to take effect after the workers have been initialized.
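A minimal sketch of that guard, with simplified, hypothetical class names standing in for the real plugin classes (the actual DDPSpawnPlugin is more involved):

```python
class SpawnPluginBase:
    """Stands in for DDPSpawnPlugin: it calls set_world_ranks in
    __init__, before any workers exist."""

    def __init__(self):
        self._is_remote = False
        self.global_rank = None
        self.set_world_ranks()  # runs before workers are launched

    def set_world_ranks(self, process_idx: int = 0):
        self.global_rank = process_idx


class GuardedPlugin(SpawnPluginBase):
    def set_world_ranks(self, process_idx: int = 0):
        # No-op on the driver: ranks are only meaningful once we are
        # running inside a launched worker.
        if not self._is_remote:
            return
        self.global_rank = process_idx
```

The call made from `__init__` silently returns, and ranks are only assigned once the worker flips `_is_remote` and calls the method again.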


# All methods below are only executed in remote Ray workers.

def set_world_ranks(self, process_idx: int = 0):
(Contributor) commented:

nit: What's the reason for the default value?

@amogkam (Collaborator, Author) replied Jan 19, 2022:

This is to match the signature of the superclass. If we don't have a default value, then when the superclass calls self.set_world_ranks() with no arguments, it will fail with a TypeError.
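A compact illustration of why the default matters. The class names here are hypothetical, not the PTL source:

```python
class Base:
    def configure(self):
        # The superclass invokes the hook with no arguments.
        self.set_world_ranks()

    def set_world_ranks(self, process_idx: int = 0):
        self.rank = process_idx


class Override(Base):
    # The default value mirrors the superclass signature; without it,
    # Base.configure()'s zero-argument call would raise TypeError.
    def set_world_ranks(self, process_idx: int = 0):
        self.rank = process_idx
```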



Development

Successfully merging this pull request may close these issues.

Upgrade to work with pytorch lightning 1.5 (and 1.6?)

2 participants